OXFORD · 51.7548°N 1.2544°W

OxRML / Reasoning with Machines Lab

Active 2026 · Open to collaborations

About the lab

Reasoning with Machines Lab

@ Oxford University

Led by Prof. Adam Mahdi, our lab advances the science of AI evaluation, benchmarking, safety and security. Through rigorous empirical research, we study how LLMs and agentic systems reason, interact with humans and drive scientific discovery.

Mission

We advance the science of AI evaluation, safety, and reasoning through rigorous empirical research, in the open and over the long term.


10 featured papers since 2024
15 researchers in the lab
6 top venues this year
1 paper in Nature Medicine, Feb 2026

§ 02 · Research Themes

Four research themes the lab works across.

Four areas of focused work: AI evaluation, AI safety and security, agentic AI for science, and human–AI interaction.

Theme 01

Evaluation

Benchmarks and Evaluation

We develop the science of LLM evaluation, setting the standard for rigorous assessment and identifying hidden risks before they matter.

Focus

How benchmarks are built, what they claim to measure, and what they actually capture.

In active practice

Theme 02

Safety

AI Safety and Security

From bias and toxicity to agentic misalignment, we study the full spectrum of AI risk and develop the technical and governance tools to address it.

Focus

The full spectrum of AI risk — bias, toxicity, agentic misalignment — studied at the neuron and at deployment.

In active practice

Theme 03

Agentic

Agentic AI for Science

We build agentic systems that automate scientific knowledge synthesis and discovery, with a focus on agents that are reliable, transparent and domain-grounded.

Focus

Agentic systems that synthesise scientific knowledge reliably enough to act on.

In active practice

Theme 04

Human-AI

Human–AI Interaction

We run large-scale empirical studies on how people use AI for high-stakes decisions, from healthcare and law to policy and beyond.

Focus

How people use AI in high-stakes decisions — healthcare, law, policy — studied empirically.

In active practice

Each theme runs on a multi-year horizon and feeds the others.

§ 03 · Publications

Ten recent papers from the lab.

Recent peer-reviewed work across evaluation, safety, agentic AI, and human–AI interaction. The most recent paper sits at the top.

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that distinguishes document-collection navigation from stochastic search in multimodal agents.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

Benchmarks and Evaluation · Agentic AI


No. · Paper · Venue · Date

No. 02
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
A positive result on LLM self-explanation faithfulness, usable as a behavioural predictor when the measurement is set up right.
H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel
AI Safety and Alignment · Benchmarks and Evaluation
ICML
May 2026
No. 03
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
A web-agent benchmark for resistance to social engineering. Frontier agents fall for the persuasion attack.
K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi
Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment
ICML
May 2026
No. 04
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
A preregistered, randomised Nature Medicine study on the public’s use of LLMs as medical assistants.
AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi
AI in Healthcare · Benchmarks and Evaluation
Nature Medicine
February 2026
No. 05
Measuring what matters: Construct validity in large language model benchmarks
A construct-validity audit of the benchmarks the field treats as ground truth.
AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.
Benchmarks and Evaluation
NeurIPS Datasets and Benchmarks
November 2025
No. 06
Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
How LLM-as-a-judge degrades, sometimes catastrophically, across language, modality, and domain.
S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale
Benchmarks and Evaluation · AI Safety and Alignment
NeurIPS LLM Lifecycle Workshop
November 2025
No. 07
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
A neuron-level account of how Direct Preference Optimization reduces toxicity.
Y Yang, F Sondej, H Mayne, A Lee, A Mahdi
AI Safety and Alignment
EMNLP
November 2025
No. 08
LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Counterfactual self-explanations are fluent but systematically mis-locate the model's decision boundary.
H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi
AI Safety and Alignment · Benchmarks and Evaluation
EMNLP
September 2025
No. 09
Review of multimodal machine learning approaches in healthcare
A survey mapping the multimodal healthcare ML landscape: fusion strategies, deployment realities.
F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi
AI in Healthcare
Information Fusion
February 2025
No. 10
LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
A reasoning benchmark that disentangles inference from memorisation by templatised orthographic obfuscation.
J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi
Benchmarks and Evaluation
ICLR
April 2026

10 papers · most recent first

§ 04 · Team

The researchers of the lab.

DPhil and MSc students, fellows, and affiliates. Each is introduced by what they are currently working on and what they are working toward.

Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks.

Currently

Leading the lab’s research across evaluation, safety, agentic AI, and human–AI interaction.

Working toward

Establishing AI evaluation as a rigorous science, built with the same care as the systems it studies.

No. 01

Felix Krones

DPhil Student

Currently

Multimodal evaluation across imaging and clinical text.

Toward

Digital-health systems measured before they are deployed.

No. 02

Djavan De Clercq

DPhil Student

Currently

LLMs applied to food-security data and policy questions.

Toward

Decision tools for food systems that hold up under audit.

No. 03

Andrew M. Bean

DPhil Student

Currently

LLM evals that capture how people use models in practice.

Toward

A standard for evaluating LLMs in real human contexts of use.

No. 04

Yushi Yang

DPhil Student

Currently

Post-training for LLM and agentic alignment, at the neuron level.

Toward

Alignment interventions whose mechanism we can explain.

No. 05

Harry Mayne

DPhil Student

Currently

Interpretability and safety-relevant LLM evaluations.

Toward

Interpretability methods that practitioners can trust under shift.

No. 06

Jessica Rodrigues

DPhil Student

Currently

Knowledge graphs for metascience and research synthesis.

Toward

Synthesis tools scientists treat as collaborators, not search engines.

No. 07

Guy Parsons

DPhil Student

Currently

Healthcare AI evaluation grounded in clinical workflow.

Toward

Digital-health products that earn the trust of clinicians and patients.

No. 08

Karolina Korgul

DPhil Student

Currently

Agentic-AI safety, including web-agent persuasion attacks.

Toward

Web agents that resist social engineering as a default behaviour.

No. 09

Ryan Othniel Kearns

DPhil Student

Currently

The science of evals: measuring reasoning honestly.

Toward

A field where every benchmark publishes its construct validity.

No. 10

Shreyansh Padarha

DPhil Student

Currently

Agentic systems for science, with safety and eval rigour.

Toward

Scientific agents auditable enough to act on in real research.

No. 11

Mia Kussman

MSc Student

Currently

Studies of human–LLM interaction and LLM evaluation.

Toward

Interaction patterns that improve human judgement rather than substitute for it.

No. 12

Caleb Tan

MSc Student

Currently

LLM evaluation and reasoning benchmarks.

Toward

Reasoning evals that separate genuine inference from recall.

No. 13

Sebastian Petric

Visiting Policy Fellow

Currently

LLMs applied to financial time series, at the policy boundary.

Toward

Honest characterisation of LLM utility in high-stakes financial settings.

No. 14

Tristan Naidoo

Research Affiliate

Currently

Public-health AI and LLM evaluations grounded in epidemiology.

Toward

Public-health AI evaluated like a health intervention.

No. 15

Josh Lawman

Entrepreneur in Residence

Currently

Research-to-product translation.

Toward

Steady work on research-to-product translation.

§ 05 · News

Recent news from the lab.

Papers accepted, conferences attended, and awards received. Most recent first.

Date · Entry · Category

May 2026
Papers accepted at ICML 2026!
Paper
April 2026
OxRML at ICLR 2026
Conference
February 2026
Ryan Othniel Kearns Wins MSc Thesis Prize
Award
February 2026
New Paper in Nature Medicine!
Paper
December 2025
OxRML @ NeurIPS 2025
Conference
November 2025
OxRML @ EMNLP 2025
Conference
June 2025
Prof. Adam Mahdi Wins Teaching Excellence Award 2025
Award
February 2025
New Paper in Information Fusion!
Paper
December 2024
OxRML @ NeurIPS 2024
Conference
September 2024
Winners of 2024 PhysioNet Challenge
Award
July 2024
OxRML @ ICML 2024
Conference

11 entries · most recent first

§ 06 · Engage with the Lab

Three ways to work with the lab — long-term, and on shared standards.

Three formats for partnering with the lab. Each is a real commitment — we are selective about partners and ramp slowly, because shared roadmaps need shared standards.

Format · 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Cadence: Half-day to multi-week
Format: On-site sessions for product and ML teams
What we commit: We bring the eval and safety toolkit we use in the lab and adapt it to the team in the room.

Book a workshop

Format · 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Cadence: Quarterly cycles
Format: Joint roadmaps with first-class development studios
What we commit: Lab breakthroughs become production tools: eval harnesses, safety dashboards, agentic-research platforms.

See our builds

Format · 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Cadence: Multi-year, named
Format: Foundations, governments, global corporates
What we commit: Named labs, dedicated DPhil studentships, shared scientific direction. We work with partners who care about getting it right.

Start a conversation

We work with foundations, governments, and companies who want AI evaluation, safety, and reasoning treated as a science.

Read our research

We reply to every serious enquiry.

§ 07 · Venues

Where the work has been published.

Universities, journals, and conferences where the lab’s work has been hosted, peer-reviewed, and published.

01 · University of Oxford · Host institution
02 · Oxford Internet Institute · Affiliated department
03 · Nature Medicine · Published 2026
04 · ICML · Spotlight & papers, 2026
05 · NeurIPS · Datasets & Benchmarks, 2025
06 · ICLR · Accepted, 2026
07 · EMNLP · Multiple, 2025

§ 08 · AI News

From the OxRML lab.

New papers, open positions, partnership opportunities, and what we have been reading.
