Radcliffe Camera, Oxford
OXFORD · 51.7548°N 1.2544°W
OxRML/Reasoning with Machines Lab
Field 2026/Open to long collaborations
A research practice, considered

We build the instruments by which AI is measured.

An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built from them.

Posture

We are interested in the measurement of intelligence the way a metrologist is interested in the kilogram: patiently, in the open, at the level of mechanism.

10 · Featured cycles since 2024
15 · Researchers in the atelier
06 · Top venues this season
01 · Nature Medicine, Feb 2026
Pl. 02 · The Practice

Four instruments the lab is building.

A single research practice composed of four interlocking studies, each one a partition that defines a region of the lab. Together they describe the field we want AI evaluation, safety, agentic AI, and human–AI interaction to become.

No. 01 · Instrument

Evaluation

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

What this partition defines

The science of how a benchmark is built, what it claims to measure, and what it actually captures.

In active practice

No. 02 · Instrument

Safety

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

What this partition defines

A field that measures real harms (bias, toxicity, agentic misalignment) at both the neuron level and the deployment level.

In active practice

No. 03 · Instrument

Agentic

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

What this partition defines

Agentic systems that synthesise knowledge with enough rigour for a working scientist to act on them.

In active practice

No. 04 · Instrument

Human-AI

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

What this partition defines

High-stakes decisions made with AI, studied empirically rather than assumed safe because the model is impressive.

In active practice

Each instrument is, in the Bouroullec sense, a micro-architecture: modest by itself, complete in concert with the others.

Pl. 03 · The Drawings

Ten drawings that resolved into work.

A reservoir of deliberate experiments. Each paper began as a question and resolved into an instrument the field can use. The most recent sits at the top of the plate.

Plate · Drawing · Venue · Date
  1. Pl. 03.02

    A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

    A positive result on the faithfulness of LLM self-explanations: measured carefully, they can be used to predict model behaviour.

    H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel

    AI Safety and Alignment · Benchmarks and Evaluation

    ICML

    May 2026

  2. Pl. 03.03

    It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

    A web-agent benchmark for resistance to social engineering. Frontier agents fall for the persuasion attack.

    K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

    Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment

    ICML

    May 2026

  3. Pl. 03.04

    Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

    A preregistered, randomised Nature Medicine study on the public’s use of LLMs as medical assistants.

    AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi

    AI in Healthcare · Benchmarks and Evaluation

    Nature Medicine

    February 2026

  4. Pl. 03.05

    Measuring what matters: Construct validity in large language model benchmarks

    A construct-validity audit of the benchmarks the field treats as ground truth.

    AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.

    Benchmarks and Evaluation

    NeurIPS Datasets and Benchmarks

    November 2025

  5. Pl. 03.06

    Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

    How LLM-as-a-judge degrades, sometimes catastrophically, across language, modality, and domain.

    S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale

    Benchmarks and Evaluation · AI Safety and Alignment

    NeurIPS LLM Lifecycle Workshop

    November 2025

  6. Pl. 03.07

    How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

    A neuron-level account of how Direct Preference Optimization reduces toxicity.

    Y Yang, F Sondej, H Mayne, A Lee, A Mahdi

    AI Safety and Alignment

    EMNLP

    November 2025

  7. Pl. 03.08

    LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

    Counterfactual self-explanations are fluent but systematically mis-locate the model's decision boundary.

    H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi

    AI Safety and Alignment · Benchmarks and Evaluation

    EMNLP

    September 2025

  8. Pl. 03.09

    Review of multimodal machine learning approaches in healthcare

    A survey mapping the multimodal healthcare ML landscape, from fusion strategies to deployment realities.

    F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi

    AI in Healthcare

    Information Fusion

    February 2025

  9. Pl. 03.10

    LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

    A reasoning benchmark that disentangles inference from memorisation by templatised orthographic obfuscation.

    J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi

    Benchmarks and Evaluation

    ICLR

    April 2026

10 drawings · catalogued in order of recency

Pl. 04 · The Atelier

Fifteen tiles in the room.

A field of individual practitioners. Each is introduced by what they are currently practising and what they are working toward; roles sit underneath, as a matter of record.

Prof. Adam Mahdi

Tile 00 · Principal investigator

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Currently practising

Coaching the lab’s research cycles across evaluation, safety, agentic AI, and human–AI interaction.

Working toward

A field where AI evaluation is treated as a science, with the same care as the systems it studies.

Felix Krones

Tile 01

DPhil Student

Currently

Multimodal evaluation across imaging and clinical text.

Toward

Digital-health systems measured before they are deployed.

Djavan De Clercq

Tile 02

DPhil Student

Currently

LLMs applied to food-security data and policy questions.

Toward

Decision tools for food systems that hold up under audit.

Andrew M. Bean

Tile 03

DPhil Student

Currently

LLM evals that capture how people actually use models.

Toward

A standard for evaluating LLMs in real human contexts of use.

Yushi Yang

Tile 04

DPhil Student

Currently

Post-training for LLM and agentic alignment, at the neuron level.

Toward

Alignment interventions whose mechanism we can explain.

Harry Mayne

Tile 05

DPhil Student

Currently

Interpretability and safety-relevant LLM evaluations.

Toward

Interpretability methods that practitioners can trust under shift.

Jessica Rodrigues

Tile 06

DPhil Student

Currently

Knowledge graphs for metascience and research synthesis.

Toward

Synthesis tools scientists treat as collaborators, not search engines.

Guy Parsons

Tile 07

DPhil Student

Currently

Healthcare AI evaluation grounded in clinical workflow.

Toward

Digital-health products that earn the trust of clinicians and patients.

Karolina Korgul

Tile 08

DPhil Student

Currently

Agentic-AI safety, including web-agent persuasion attacks.

Toward

Web agents that resist social engineering as a default behaviour.

Ryan Othniel Kearns

Tile 09

DPhil Student

Currently

The science of evals: measuring reasoning honestly.

Toward

A field where every benchmark publishes its construct validity.

Shreyansh Padarha

Tile 10

DPhil Student

Currently

Agentic systems for science, with safety and eval rigour.

Toward

Scientific agents auditable enough to act on in real research.

Mia Kussman

Tile 11

MSc Student

Currently

Studies of human–LLM interaction and LLM evaluation.

Toward

Interaction patterns that improve, rather than substitute for, human judgement.

Caleb Tan

Tile 12

MSc Student

Currently

LLM evaluation and reasoning benchmarks.

Toward

Reasoning evals that separate genuine inference from recall.

Sebastian Petric

Tile 13

Visiting Policy Fellow

Currently

LLMs applied to financial time series, at the policy boundary.

Toward

Honest characterisation of LLM utility in high-stakes financial settings.

Tristan Naidoo

Tile 14

Research Affiliate

Currently

Public-health AI and LLM evaluations grounded in epidemiology.

Toward

Public-health AI evaluated like a health intervention.

Pl. 05 · The Field Log

The atelier’s daybook.

What we saw at the board this season: papers accepted, conferences reached, awards noted. Kept in order, kept in the open.

Date · Entry · Category
  1. May 2026

    Three OxRML papers accepted at ICML 2026 — including a Spotlight

    Paper
  2. April 2026

    OxRML presenting at ICLR 2026

    Conference
  3. February 2026

    New paper in Nature Medicine on LLMs as medical assistants

    Paper
  4. February 2026

    Ryan Othniel Kearns wins MSc Thesis Prize

    Award
  5. December 2025

    OxRML at NeurIPS 2025

    Conference
  6. November 2025

    OxRML at EMNLP 2025

    Conference
  7. June 2025

    Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025

    Award
  8. February 2025

    New review paper in Information Fusion

    Paper
  9. September 2024

    Winners of the 2024 PhysioNet Challenge

    Award

9 entries · kept in order of event

Pl. 06 · The Workshop

We work with industry the way the Bouroullecs work with Vitra: patiently, materially, and for a long time.

Three formats for partnering with the lab. Each is a real commitment: we are picky about partners and slow to ramp up, because shared roadmaps need shared standards.

Format · 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Cadence: Half-day to multi-week
Format: On-site sessions for product and ML teams
What we commit: We bring the eval and safety toolkit we use in the lab and adapt it to the team in the room.
Book a workshop

Format · 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Cadence: Quarterly cycles
Format: Joint roadmaps with first-class development studios
What we commit: Lab breakthroughs become production tools: eval harnesses, safety dashboards, agentic-research platforms.
See our builds

Format · 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Cadence: Multi-year, named
Format: Foundations, governments, global corporates
What we commit: Named labs, dedicated DPhil studentships, shared scientific direction. We work with partners who care about getting it right.
Start a conversation

We work with foundations, governments, and corporates who want AI evaluation, safety, and reasoning treated with the same care as the systems they ship.

Partner with us

We reply to every serious enquiry

Pl. 07 · The Catalogue

Where the work has been placed.

Universities, journals, and conferences where the lab’s work has been hosted, peer-reviewed, and published.

  • 01 · University of Oxford · Host institution
  • 02 · Oxford Internet Institute · Affiliated department
  • 03 · Nature Medicine · Published 2026
  • 04 · ICML · Spotlight & papers, 2026
  • 05 · NeurIPS · Datasets & Benchmarks, 2025
  • 06 · ICLR · Accepted, 2026
  • 07 · EMNLP · Multiple, 2025

Pl. 08 · The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.