I. Holding space
Benchmarks and Evaluation
“What is true about this system, under load?”
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
we measure, we wait, we record
You already know your model can do remarkable things. You already sense where it falls short. We are the lab you sit beside while you sort the one from the other, patiently and on the record.
An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Principal Investigator
Oxford Internet Institute, University of Oxford
Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
In good company
From the journal
Small markers of what has arrived in the last few seasons.
A paper
May 2026Three OxRML papers accepted at ICML 2026 — including a Spotlight
A gathering
April 2026OxRML presenting at ICLR 2026
A paper
February 2026New paper in Nature Medicine on LLMs as medical assistants
A milestone
February 2026Ryan Othniel Kearns wins MSc Thesis Prize
A gathering
December 2025OxRML at NeurIPS 2025
A gathering
November 2025OxRML at EMNLP 2025
Four conditions
Four conditions our group cultivates. Each one is a long act of looking at models, at the people who use them, and at the gap between what is claimed and what is true. None of them are solutions. They are postures.
I. Holding space
“What is true about this system, under load?”
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
we measure, we wait, we record
II. Tending the edges
“Where does it break, and who is hurt when it does?”
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
we watch, we name, we protect
III. Building a midwife
“Can an agent help science itself emerge faster?”
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
we design, we ground, we test in the open
IV. Listening to people
“What happens when a stranger holds your decision?”
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
we run trials, we collect testimony, we publish
Recent harvest
Each paper began as a question someone in the group could not put down. Follow the links to the full work.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.
H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.
K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.
AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.
AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale
4 more in the journal, ripening.
The circle
Twelve are pictured below; two more are listed at the bottom. Each member of OxRML is here because they are chasing a question, a method, or a stubborn intuition. They are here to find out.

DPhil Student
Multimodal AI, digital health

DPhil Student
AI and food security, LLMs

DPhil Student
LLM evaluations, human–LLM interaction

DPhil Student
LLM & agentic post-training, AI alignment

DPhil Student
LLM interpretability, AI safety, LLM evaluations

DPhil Student
Knowledge graphs, metascience

DPhil Student
Healthcare AI, digital health

DPhil Student
AI safety, agentic AI

DPhil Student
Science of evals, reasoning in LLMs

DPhil Student
AI for science, AI safety, LLM evaluations

MSc Student
Human–LLM interaction, LLM evaluations
MSc Student
LLM evaluations, reasoning
And also, holding from a distance
Sebastian Petric · Tristan Naidoo
LLMs and financial time series · Public health AI, LLM evaluations
For decision-makers at Gates Foundation, Schwarz Group, and the rest of you
Three ways we sit alongside organisations whose decisions matter. Each one is a different rhythm of involvement. None of them are us doing the work for you.
You bring
A team that ships AI into places where mistakes are costly.
We hold
Half-day to multi-week formats on evaluation, safety, and agent reliability, adapted to your stack, your stakes, and your timeline.
What arrives
Your team leaves with shared language, working harnesses, and a clearer view of what you are shipping.
You bring
A research idea that needs to become a tool other people can use.
We hold
We partner with first-class development studios to turn a paper or prototype into a production-grade tool: evaluation harnesses, safety dashboards, and agentic-research platforms.
What arrives
A piece of software that carries the research methodology and survives contact with real users.
You bring
A foundation, government, or company that wants AI to go well in your domain.
We hold
Multi-year applied programmes with shared roadmaps, dedicated DPhil studentships, named labs, and joint publications. We pick partners who care about getting AI right.
What arrives
A body of work that outlasts a single project, and a relationship built on the public record.
Talk to us about the work that matters most this year.
Begin a conversationThe lab newsletter
New papers, open positions, partnership opportunities, and what we have been reading.
Unsubscribe in one click. We never share your email.