Oxford Internet Institute · Reasoning with Machines Lab

How language models reason, and where they fail.

Benchmarks, safety audits, agentic systems, and human-AI studies from the Oxford Internet Institute. Fourteen researchers; ten papers in 2025-26 at Nature Medicine, ICML, NeurIPS, ICLR, and EMNLP.

Principal Investigator

Prof. Adam Mahdi

Adam leads the Reasoning with Machines Lab (OxRML). The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Oxford Internet Institute, University of Oxford

Recent work · 2025-26 · 04 / 10

ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

ICML · May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

ICLR · April 2026

LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

Selected from 10 publications.

ICML · NeurIPS · ICLR · EMNLP · Nature Medicine · PhysioNet Challenge

Research · four lines

Four lines of research.

Empirical work on language models and the agents built from them, across evaluation, safety, agentic science, and human-AI interaction. Each is a long-running programme.

01 · Evaluation

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

02 · Safety

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

03 · Agentic

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

04 · Human-AI

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

Publications · 2025-26 · 10 papers

Recent papers.

An ICML Spotlight, three further ICML accepts, work at NeurIPS, ICLR, and EMNLP, and a Nature Medicine paper on LLMs as medical assistants, all within the last twelve months.

Featured paper · ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

Benchmarks and Evaluation · Agentic AI

ICML · May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

ICML · May 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

Nature Medicine · February 2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

A preregistered randomized study of how reliably LLMs serve the general public as medical assistants.

NeurIPS Datasets and Benchmarks · November 2025

Measuring what matters: Construct validity in large language model benchmarks

A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.

NeurIPS LLM Lifecycle Workshop · November 2025

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

Team · 14 members · Oxford

The team.

A small group of DPhil students, MSc researchers, and visiting fellows. Everyone is listed here with what they work on.

Principal Investigator

Prof. Adam Mahdi

AI · reasoning · evaluation

Felix Krones

DPhil Student

Multimodal AI, digital health

Djavan De Clercq

DPhil Student

AI and food security, LLMs

Andrew M. Bean

DPhil Student

LLM evaluations, human–LLM interaction

Yushi Yang

DPhil Student

LLM & agentic post-training, AI alignment

Harry Mayne

DPhil Student

LLM interpretability, AI safety, LLM evaluations

Jessica Rodrigues

DPhil Student

Knowledge graphs, metascience

Guy Parsons

DPhil Student

Healthcare AI, digital health

Karolina Korgul

DPhil Student

AI safety, agentic AI

Ryan Othniel Kearns

DPhil Student

Science of evals, reasoning in LLMs

Shreyansh Padarha

DPhil Student

AI for science, AI safety, LLM evaluations

Mia Kussman

MSc Student

Human–LLM interaction, LLM evaluations

Caleb Tan

MSc Student

LLM evaluations, reasoning

Work with us · 3 routes

Ways to work with us.

Applied work for foundations, governments, and companies with hard AI problems. We pair research methods with engineering partners who can ship.

Affiliations & venues

✦ University of Oxford · Host institution
✦ Oxford Internet Institute · Affiliated department
✦ Nature Medicine · Published 2026
✦ ICML · Spotlight & papers, 2026
✦ NeurIPS · Datasets & Benchmarks, 2025
✦ ICLR · Accepted, 2026
✦ EMNLP · Multiple, 2025

01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

The lab newsletter · Quarterly · low noise

Quarterly updates from the lab.

New papers, open positions, partnership opportunities, and what we have been reading.