Filed Hilary Term · University of Oxford

We measure what the
field forgets to measure.

An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built from them.

Editor’s note. A research group working in the open: benchmarks that hold up under scrutiny, safety work that survives contact with deployment, and agentic systems for the high-stakes domains where we are already trusted: healthcare, policy, and science.

Partner with us →
Read our research ↗

The Index

In this issue.

Six sections. Use this to skip ahead to whatever you came for.

§02 Four research themes, one method ↗
Evaluation · Safety · Agentic Science · Human-AI
§03 The Folio. Featured publications. ↗
ICML · ICLR · NeurIPS · EMNLP · Nature Medicine
§04 The Roster ↗
DPhil students, MSc, policy fellows, affiliates
§05 Work with us. Three doors in. ↗
Workshops · Co-built tech · Research partnerships
§06 Dispatches. Recent news. ↗
Awards · Conferences · New papers

§02

Four pillars

We study what the
models won’t tell us themselves.

Read across the four columns below. Each one is a portfolio, not a project.

No. 01 · Evaluation
Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
Cross-ref. → Folio §03, Roster §04
No. 02 · Safety
AI Safety and Security
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Cross-ref. → Folio §03, Roster §04
No. 03 · Agentic
Agentic AI for Science
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Cross-ref. → Folio §03, Roster §04
No. 04 · Human-AI
Human–AI Interaction
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
Cross-ref. → Folio §03, Roster §04

§03

Selected works

The Folio. Ten papers, written in plain English.

10 papers · 2025-26

Lead · fig. 02

ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Read the paper →

Benchmarks and Evaluation
Agentic AI

fig. 03 · May 2026
ICML
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel
LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.
Read full text →
fig. 04 · May 2025
ICML
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi
A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.
Read full text →
fig. 05 · February 2026
Nature Medicine
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi
A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.
Read full text →
fig. 06 · November 2025
NeurIPS Datasets and Benchmarks
Measuring what matters: Construct validity in large language model benchmarks
AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.
A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.
Read full text →
fig. 07 · November 2025
NeurIPS LLM Lifecycle Workshop
Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale
How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
Read full text →

§04

The contributors

The Roster. Fourteen names on the masthead.

DPhil students, MSc researchers, visiting fellows and affiliates. Each portrait is one column of the gazette.

No. 01
Felix Krones
DPhil Student
Multimodal AI, digital health
No. 02
Djavan De Clercq
DPhil Student
AI and food security, LLMs
No. 03
Andrew M. Bean
DPhil Student
LLM evaluations, human–LLM interaction
No. 04
Yushi Yang
DPhil Student
LLM & agentic post-training, AI alignment
No. 05
Harry Mayne
DPhil Student
LLM interpretability, AI safety, LLM evaluations
No. 06
Jessica Rodrigues
DPhil Student
Knowledge graphs, metascience
No. 07
Guy Parsons
DPhil Student
Healthcare AI, digital health
No. 08
Karolina Korgul
DPhil Student
AI safety, agentic AI
No. 09
Ryan Othniel Kearns
DPhil Student
Science of evals, reasoning in LLMs
No. 10
Shreyansh Padarha
DPhil Student
AI for science, AI safety, LLM evaluations
No. 11
Mia Kussman
MSc Student
Human–LLM interaction, LLM evaluations
No. 12
Caleb Tan
MSc Student
LLM evaluations, reasoning

Also contributing this term.

Sebastian Petric · Visiting Policy Fellow
Tristan Naidoo · Research Affiliate

§05

Three doors in

Work with us. Pick a door.

Foundations, governments, and global operators have hired us to do three things. Read the briefs below.

I · No. 01 / III
Workshops for industry teams
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
Book a workshop →
II · No. 02 / III
Tools co-built with engineering partners
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
See our builds →
III · No. 03 / III
Research partnerships
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.
Start a conversation →

In good company.

University of Oxford · Host institution
Oxford Internet Institute · Affiliated department
Nature Medicine · Published 2026
ICML · Spotlight & papers, 2026
NeurIPS · Datasets & Benchmarks, 2025
ICLR · Accepted, 2026
EMNLP · Multiple, 2025

§06 · Stop press

Dispatches from the lab.

Nine recent items, newest first, no filler. If you want the longer read, the Folio is next door.

¶ paper: 3
◇ conference: 3
★ award: 3
· other: 0

May 2026
Three OxRML papers accepted at ICML 2026, including a Spotlight
paper
April 2026
OxRML presenting at ICLR 2026
conference
February 2026
New paper in Nature Medicine on LLMs as medical assistants
paper
February 2026
Ryan Othniel Kearns wins MSc Thesis Prize
award
December 2025
OxRML at NeurIPS 2025
conference
November 2025
OxRML at EMNLP 2025
conference
June 2025
Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025
award
February 2025
New review paper in Information Fusion
paper
September 2024
Winners of the 2024 PhysioNet Challenge
award

A quarterly note from the lab. Nothing else.
