Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
Oxford Internet Institute·Reasoning with Machines Lab
Benchmarks, safety audits, agentic systems, and human-AI studies from the Oxford Internet Institute. Fourteen researchers; ten papers in 2025-26 at Nature Medicine, ICML, NeurIPS, ICLR, and EMNLP.

Principal Investigator
Prof. Adam Mahdi
Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
Oxford Internet Institute, University of Oxford

ICML (Spotlight)May 2026
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

ICMLMay 2026
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

ICLRApril 2026
LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Selected from 10 publications · See the full list ↓
Research · four lines
Empirical work on language models and the agents built from them, across evaluation, safety, agentic science, and human-AI interaction. Each is a long-running programme.
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
Publications · 2025-2610 papers
An ICML Spotlight, three further ICML accepts, work at NeurIPS, ICLR, and EMNLP, and a Nature Medicine paper on LLMs as medical assistants, all within the last twelve months.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
Team14 members · Oxford
A small group of DPhil students, MSc researchers, and visiting fellows. Everyone is listed here with what they work on.

Principal Investigator
Prof. Adam Mahdi
AI · reasoning · evaluation

DPhil Student
Multimodal AI, digital health

DPhil Student
AI and food security, LLMs

DPhil Student
LLM evaluations, human–LLM interaction

DPhil Student
LLM & agentic post-training, AI alignment

DPhil Student
LLM interpretability, AI safety, LLM evaluations

DPhil Student
Knowledge graphs, metascience

DPhil Student
Healthcare AI, digital health

DPhil Student
AI safety, agentic AI

DPhil Student
Science of evals, reasoning in LLMs

DPhil Student
AI for science, AI safety, LLM evaluations

MSc Student
Human–LLM interaction, LLM evaluations
MSc Student
LLM evaluations, reasoning
Work with us3 routes
Applied work for foundations, governments, and companies with hard AI problems. We pair research methods with engineering partners who can ship.
Affiliations & venues
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.