Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
> Abstract // An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built on LLMs. We publish in Nature Medicine, ICML, NeurIPS, ICLR and EMNLP. We partner with the organisations that ship AI into rooms where being wrong is unacceptable.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behaviour.
A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.
A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.
A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.
How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.
Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.
A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.
A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.
> Hiring philosophy // OxRML is staffed in a proportion that should alarm a traditional VC: a heavy concentration of DPhils, research engineers, and visiting fellows. The ratio of scientists to anything else is the point.

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
The hardest technical problems produce the most defensible products. If a competitor can replicate your evaluation stack in a hackathon, you have a feature, not a moat. We work with the organisations that understand the difference.
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.
New papers, open positions, partnership opportunities, and what we have been reading.