OxRML is an empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built from LLMs. Our research is built to stay responsive under pressure and useful to the teams shipping AI into the world.

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
Each research direction publishes to a topic. Partners and students subscribe independently. Failures stay isolated; load scales horizontally.
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
Each publication is a message: produced by the lab, consumed by venues, acknowledged by the field. Ten in-flight, six pinned below.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
Fourteen researchers, partitioned into consumer groups by role. Each subscribes to multiple research streams; together they keep the lab elastic and resilient.

subscribes :: Multimodal AI, digital health

subscribes :: AI and food security, LLMs

subscribes :: LLM evaluations, human–LLM interaction

subscribes :: LLM & agentic post-training, AI alignment

subscribes :: LLM interpretability, AI safety, LLM evaluations

subscribes :: Knowledge graphs, metascience

subscribes :: Healthcare AI, digital health

subscribes :: AI safety, agentic AI

subscribes :: Science of evals, reasoning in LLMs

subscribes :: AI for science, AI safety, LLM evaluations

subscribes :: Human–LLM interaction, LLM evaluations

subscribes :: LLM evaluations, reasoning

subscribes :: LLMs and financial time series

subscribes :: Public health AI, LLM evaluations
We design partnerships the way reactive systems are designed: bounded scope, clear contracts, graceful escalation. Pick a latency that fits.
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.
New papers, open positions, partnership opportunities, and what we have been reading.