Reasoning · with · Machines

We study
evaluation, safety & reasoning
in machines.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Partner with us · Read our research

Prof. Adam Mahdi · Principal Investigator

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Radcliffe Camera · Oxford

51.7548° N, 1.2544° W

§ 01 · Four research areas

Our research.

Four themes the lab works across. Every paper, collaboration, and DPhil project sits somewhere in this frame.

E · Dawn, first principles

01/04

Evaluation

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

S · Sky, open horizon

02/04

Safety

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

W · Evening, sustained inquiry

03/04

Agentic

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

N · Night, the long watch

04/04

Human-AI

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

§ 02 · Selected works

Recent publications.

See all papers

Featured · ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

ICML · May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

ICML · May 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

Nature Medicine · February 2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

NeurIPS Datasets and Benchmarks · November 2025

Measuring what matters: Construct validity in large language model benchmarks

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

NeurIPS LLM Lifecycle Workshop · November 2025

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale

§ 03 · The team

The people behind the work.

DPhils, MSc students, and visiting researchers. Each focuses on one of the four areas above.

Meet the full lab

Felix Krones

DPhil Student

Multimodal AI, digital health

Djavan De Clercq

DPhil Student

AI and food security, LLMs

Andrew M. Bean

DPhil Student

LLM evaluations, human–LLM interaction

Yushi Yang

DPhil Student

LLM & agentic post-training, AI alignment

Harry Mayne

DPhil Student

LLM interpretability, AI safety, LLM evaluations

Jessica Rodrigues

DPhil Student

Knowledge graphs, metascience

Guy Parsons

DPhil Student

Healthcare AI, digital health

Karolina Korgul

DPhil Student

AI safety, agentic AI

§ 04 · Work with us

Three ways to work with the lab.

We work with foundations, governments, and enterprises that take AI seriously and have patience for empirical work.

Pillar 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Book a workshop

Pillar 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

See our builds

Pillar 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Start a conversation

Where the work has been received

University of Oxford · Host institution
Oxford Internet Institute · Affiliated department
Nature Medicine · Published 2026
ICML · Spotlight & papers, 2026
NeurIPS · Datasets & Benchmarks, 2025
ICLR · Accepted, 2026
EMNLP · Multiple, 2025

The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.
