Benchmarks and Evaluation
An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built on LLMs.

Prof. Adam Mahdi
Oxford Internet Institute, University of Oxford
Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
The latest
From the lab
Three OxRML papers accepted at ICML 2026 — including a Spotlight
OxRML presenting at ICLR 2026
New paper in Nature Medicine on LLMs as medical assistants
Ryan Othniel Kearns wins MSc Thesis Prize
OxRML at NeurIPS 2025
OxRML at EMNLP 2025
Four directions
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
The work
A selection from the last eighteen months at ICML, NeurIPS, ICLR, EMNLP, and Nature Medicine. Each one measures something concrete about what current models do.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.
LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behaviour.

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.
The people
DPhil students, MSc researchers, visiting fellows, and affiliates. We come from machine learning, statistics, clinical medicine, and policy.

Felix Krones
DPhil Student
Multimodal AI, digital health

Djavan De Clercq
DPhil Student
AI and food security, LLMs

Andrew M. Bean
DPhil Student
LLM evaluations, human–LLM interaction

Yushi Yang
DPhil Student
LLM & agentic post-training, AI alignment

Harry Mayne
DPhil Student
LLM interpretability, AI safety, LLM evaluations

Jessica Rodrigues
DPhil Student
Knowledge graphs, metascience

Guy Parsons
DPhil Student
Healthcare AI, digital health

Karolina Korgul
DPhil Student
AI safety, agentic AI

Ryan Othniel Kearns
DPhil Student
Science of evals, reasoning in LLMs

Shreyansh Padarha
DPhil Student
AI for science, AI safety, LLM evaluations

Mia Kussman
MSc Student
Human–LLM interaction, LLM evaluations

Caleb Tan
MSc Student
LLM evaluations, reasoning

Sebastian Petric
Visiting Policy Fellow
LLMs and financial time series

Tristan Naidoo
Research Affiliate
Public health AI, LLM evaluations
Plus collaborators across the Oxford Internet Institute, the Department of Engineering Science, and the Big Data Institute.
Work with us
We pick collaborators with care. If you are building AI into a setting where being wrong has a cost, talk to us.
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.
Or just say hello
hello@oxrml.com