Benchmarks and Evaluation
Reasoning with Machines Lab, a research group at the University of Oxford; see also the Oxford Internet Institute.
An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built on LLMs.
Four research currents, parsed below as noun phrases. The tree structure is not decorative: it shows what modifies what, and the order of attachment. Each line runs over years, not quarters.
We develop the science of LLM evaluation: how to measure what models do, where current benchmarks mislead, and how to build ones that hold up.
Bias, toxicity, agentic misalignment. We study where AI fails and build the technical and governance tools that address those failures.
We build agentic systems for scientific knowledge synthesis and discovery. The work is on keeping these agents reliable, transparent, and grounded in their domain.
We run empirical studies on how people use AI for high-stakes decisions in healthcare, law, and policy.
Below: three featured papers with a full morphological parse of their load-bearing terms, then a denser reference list of the recent corpus.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.
The lab roster, set as a field-journal informant register. Each entry is tagged with a feature bundle parsed from the member's research focus, plus a register code: DPH (DPhil), MSC (MSc), VIS (visiting), AFF (affiliate).

Three modalities by which industry, government, and foundation partners work with the lab.
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.