Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built on them.

Oxford Internet Institute, University of Oxford
Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Four themes the lab works across. Every paper, collaboration, and DPhil project sits somewhere in this frame.
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale
DPhils, MSc students, and visiting researchers. Each focuses on one of the four areas above.

DPhil Student
Multimodal AI, digital health

DPhil Student
AI and food security, LLMs

DPhil Student
LLM evaluations, human–LLM interaction

DPhil Student
LLM & agentic post-training, AI alignment

DPhil Student
LLM interpretability, AI safety, LLM evaluations

DPhil Student
Knowledge graphs, metascience

DPhil Student
Healthcare AI, digital health

DPhil Student
AI safety, agentic AI
We work with foundations, governments, and enterprises that take AI seriously and have patience for empirical work.
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.
Where the work has been received
The lab newsletter
New papers, open positions, partnership opportunities, and what we have been reading.