University of Oxford · Oxford Internet Institute

Reasoning with Machines Lab

@ University of Oxford

Led by Prof. Adam Mahdi, our lab advances the science of AI evaluation, benchmarking, safety and security. Through rigorous empirical research, we study how LLMs and agentic systems reason, interact with humans and drive scientific discovery.

Read our research · Engage with OxRML

Lab Lead · Associate Professor

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks.

Recognised by

University of Oxford · Host institution

Oxford Internet Institute · Affiliated department

Nature Medicine · Published 2026

ICML · Spotlight & papers, 2026

NeurIPS · Datasets & Benchmarks, 2025

ICLR · Accepted, 2026

EMNLP · Multiple, 2025

The latest

From the lab

New paper·May 2026
Papers accepted at ICML 2026!
At conference·April 2026
OxRML at ICLR 2026
Award·February 2026
Ryan Othniel Kearns Wins MSc Thesis Prize
New paper·February 2026
New Paper in Nature Medicine!
At conference·December 2025
OxRML @ NeurIPS 2025
At conference·November 2025
OxRML @ EMNLP 2025

Research Themes

We study the science of how machines reason.

Evaluation · 01 / 04

Benchmarks and Evaluation

We develop the science of LLM evaluation, setting the standard for rigorous assessment and identifying hidden risks before they matter.

Safety · 02 / 04

AI Safety and Security

From bias and toxicity to agentic misalignment, we study the full spectrum of AI risk and develop the technical and governance tools to address it.

Agentic · 03 / 04

Agentic AI for Science

We build agentic systems that automate scientific knowledge synthesis and discovery, with a focus on agents that are reliable, transparent and domain-grounded.

Human–AI · 04 / 04

Human–AI Interaction

We run large-scale empirical studies on how people use AI for high-stakes decisions, from healthcare and law to policy and beyond.

The work

Peer-reviewed, published, cited.

A selection from the last eighteen months at ICML, NeurIPS, ICLR, EMNLP, and Nature Medicine. Each one measures something concrete about what current models do.

Featured · ICML (Spotlight)·May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

ICML·May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

ICML·May 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

Nature Medicine·February 2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

NeurIPS Datasets and Benchmarks·November 2025

Measuring what matters: Construct validity in large language model benchmarks

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

NeurIPS LLM Lifecycle Workshop·November 2025

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

EMNLP·November 2025

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

EMNLP·September 2025

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

Information Fusion·February 2025

Review of multimodal machine learning approaches in healthcare

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

ICLR·April 2026

LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.

The people

A small group of careful people.

DPhil students, MSc researchers, visiting fellows, and affiliates. We come from machine learning, statistics, clinical medicine, and policy.

Felix Krones
DPhil Student
Multimodal AI, digital health
Djavan De Clercq
DPhil Student
AI and food security, LLMs
Andrew M. Bean
DPhil Student
LLM evaluations, human–LLM interaction
Yushi Yang
DPhil Student
LLM & agentic post-training, AI alignment
Harry Mayne
DPhil Student
LLM interpretability, AI safety, LLM evaluations
Jessica Rodrigues
DPhil Student
Knowledge graphs, metascience
Guy Parsons
DPhil Student
Healthcare AI, digital health
Karolina Korgul
DPhil Student
AI safety, agentic AI
Ryan Othniel Kearns
DPhil Student
Science of evals, reasoning in LLMs
Shreyansh Padarha
DPhil Student
AI for science, AI safety, LLM evaluations
Mia Kussman
MSc Student
Human–LLM interaction, LLM evaluations
Caleb Tan
MSc Student
LLM evaluations, reasoning
Sebastian Petric
Visiting Policy Fellow
LLMs and financial time series
Tristan Naidoo
Research Affiliate
Public health AI, LLM evaluations
Josh Lawman
Entrepreneur in Residence
Research-to-product translation

Plus collaborators across the Oxford Internet Institute, the Department of Engineering Science, and the Big Data Institute.

Work with us

Three ways to partner.

We pick collaborators with care. If you are building AI into a setting where being wrong has a cost, talk to us.

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Book a workshop

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

See our builds

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Start a conversation

Or just say hello

hello@oxrml.com
