The Pavilion · A Catalogue in Eight Plates
Reasoning with Machines Lab

A research practice, catalogued in the open.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Standing direction

Reasoning systems that scientists, clinicians, and the public can trust — measured by what they actually do, not what they claim.

Founded
Oxford
Papers · 2025–26: 10 cited
Venues: Nature Med · ICML · ICLR
Status: Open for partners
Plate II · The four standing directions

Four hands at the same table.

Each direction is a standing condition the lab is working toward. Together they describe the field we want AI evaluation, safety, and reasoning to become.

Direction 01

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

Toward

A field where every benchmark publishes its construct validity, and where evaluation is treated as an experimental science.

Tag · Evaluation · Panel Mauve
Direction 02

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

Toward

A practice of measuring real harms (bias, toxicity, agentic misalignment) at the level of the neuron and of the deployment, before they reach the public.

Tag · Safety · Panel Sage
Direction 03

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

Toward

Scientific agents that synthesise knowledge reliably enough that a researcher can act on their output, and transparently enough that the researcher can audit it.

Tag · Agentic · Panel Rose
Direction 04

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

Toward

Decisions made with AI in healthcare, law, and policy that are studied empirically — not assumed safe because the model is impressive.

Tag · Human-AI · Panel Brass

“There is hope in honest error; none in the icy perfections of the mere stylist.” — J. D. Sedding, after the Glasgow Four

Four directions · One table
Plate III · The Catalogue

Ten plates, each a small case.

Ten papers from the past eighteen months. Nature Medicine, ICML (with a Spotlight), ICLR, NeurIPS, EMNLP, Information Fusion. Each plate is one finished cycle — built, peer-reviewed, published in the open.

Plate · No. · Title · Venue & date
= currently on view
10 of 10 plates · catalogue revised quarterly
A full bibliography is held by the lab; ask for the long form.
Plate IV · The Hands

Fifteen hands, one room.

A Principal Investigator, ten DPhil students, two MSc students, a visiting fellow, and a research affiliate. Each is introduced by what they are currently working on — the focus is the headline; the role is the caption.

Master of the Pavilion · Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Currently practising

Coaching the lab's research cycles across evaluation, safety, agentic AI, and human–AI interaction — measuring what we ship before we ship it.

01 / 14

DPhil Student

Felix Krones

Multimodal evaluation across imaging and clinical text.

02 / 14

DPhil Student

Djavan De Clercq

LLMs applied to food-security data and policy questions.

03 / 14

DPhil Student

Andrew M. Bean

LLM evaluations that capture how people actually use models.

04 / 14

DPhil Student

Yushi Yang

Post-training for LLM and agentic alignment, at the neuron level.

05 / 14

DPhil Student

Harry Mayne

Interpretability and safety-relevant LLM evaluations.

06 / 14

DPhil Student

Jessica Rodrigues

Knowledge-graph methods for metascience and research synthesis.

07 / 14

DPhil Student

Guy Parsons

Healthcare AI evaluation grounded in clinical workflow.

08 / 14

DPhil Student

Karolina Korgul

Agentic-AI safety and web-agent persuasion attacks.

09 / 14

DPhil Student

Ryan Othniel Kearns

The science of evals — measuring reasoning honestly.

10 / 14

DPhil Student

Shreyansh Padarha

Agentic systems for science, with safety and eval rigour.

11 / 14

MSc Student

Mia Kussman

Studies of human–LLM interaction and LLM evaluation.

12 / 14

MSc Student

Caleb Tan

LLM evaluation and reasoning benchmarks.

13 / 14

Visiting Policy Fellow

Sebastian Petric

LLMs applied to financial time series, at the policy boundary.

14 / 14

Research Affiliate

Tristan Naidoo

Public-health AI and LLM evaluations grounded in epidemiology.

One sensei · Fourteen hands · Open to visiting researchers
The room is the architecture; the hands are the work.
Plate V · The Reading Room

What the lab is reading.

The reading-room ledger — papers accepted, conferences attended, honours noted. The most recent entry takes the rose tag. The ledger is kept in the open so collaborators always know what is current.

Date · Mark · Entry · Category
  1. May 2026 · Newest · Three OxRML papers accepted at ICML 2026, including a Spotlight · Paper
  2. April 2026 · - · OxRML presenting at ICLR 2026 · Convening
  3. February 2026 · - · New paper in Nature Medicine on LLMs as medical assistants · Paper
  4. February 2026 · - · Ryan Othniel Kearns wins MSc Thesis Prize · Honour
  5. December 2025 · - · OxRML at NeurIPS 2025 · Convening
  6. November 2025 · - · OxRML at EMNLP 2025 · Convening
  7. June 2025 · - · Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025 · Honour
  8. February 2025 · - · New review paper in Information Fusion · Paper
  9. September 2024 · - · Winners of the 2024 PhysioNet Challenge · Honour

9 entries · ledger held in the open

Plate VI · The Salon

A room for long conversations.

We work with foundations, governments, hyperscalers, and global corporates who want AI evaluation, safety, and reasoning treated with the same care as the systems they ship. Three formats — three cadences — one table.

Offering · 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Cadence

Half-day to multi-week. On-site in your office or in Oxford; bespoke to the team and the question.

Book a workshop
Offering · 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Cadence

Quarterly cycles with a partner studio. Joint roadmaps, shared evals, shipped tooling.

See our builds
Offering · 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Cadence

Multi-year. Named programmes, dedicated DPhil studentships, shared scientific direction.

Start a conversation

The invitation

If your team is shipping AI into a high-stakes setting — healthcare, finance, public infrastructure — there is a chair at the table.

Write directly. A brief note about the problem you are working on, the stakes, and the question you want answered. We'll reply within the week.

Plate VII · Honour Roll

The institutions and venues we are part of.

Universities, journals, and conferences where the lab's cycles have been hosted, peer-reviewed, and published.

  • 01 · University of Oxford · Host institution
  • 02 · Oxford Internet Institute · Affiliated department
  • 03 · Nature Medicine · Published 2026
  • 04 · ICML · Spotlight & papers, 2026
  • 05 · NeurIPS · Datasets & Benchmarks, 2025
  • 06 · ICLR · Accepted, 2026
  • 07 · EMNLP · Multiple, 2025

Plate VIII · The Visitors' Book

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.