Oxford Internet Institute · Reasoning with Machines Lab

How language models reason, and where they fail.

Benchmarks, safety audits, agentic systems, and human-AI studies from the Oxford Internet Institute. Fourteen researchers; ten papers in 2025-26 at Nature Medicine, ICML, NeurIPS, ICLR, and EMNLP.

Principal Investigator

Prof. Adam Mahdi

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Oxford Internet Institute, University of Oxford

ICML · NeurIPS · ICLR · EMNLP · Nature Medicine · PhysioNet Challenge

Research · four lines

Four lines of research.

Empirical work on language models and the agents built from them, across evaluation, safety, agentic science, and human-AI interaction. Each is a long-running programme.

01 · Evaluation

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

02 · Safety

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

03 · Agentic

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

04 · Human-AI

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

Publications · 2025-26 · 10 papers

Recent papers.

An ICML Spotlight, three further ICML accepts, work at NeurIPS, ICLR, and EMNLP, and a Nature Medicine paper on LLMs as medical assistants, all within the last twelve months.

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Featured paper · ICML (Spotlight) · May 2026

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

Benchmarks and Evaluation · Agentic AI

Team · 14 members · Oxford

The team.

A small group of DPhil students, MSc researchers, and visiting fellows. Everyone is listed here with what they work on.

Principal Investigator

Prof. Adam Mahdi

AI · reasoning · evaluation

Felix Krones

DPhil Student

Multimodal AI, digital health

Djavan De Clercq

DPhil Student

AI and food security, LLMs

Andrew M. Bean

DPhil Student

LLM evaluations, human–LLM interaction

Yushi Yang

DPhil Student

LLM & agentic post-training, AI alignment

Harry Mayne

DPhil Student

LLM interpretability, AI safety, LLM evaluations

Jessica Rodrigues

DPhil Student

Knowledge graphs, metascience

Guy Parsons

DPhil Student

Healthcare AI, digital health

Karolina Korgul

DPhil Student

AI safety, agentic AI

Ryan Othniel Kearns

DPhil Student

Science of evals, reasoning in LLMs

Shreyansh Padarha

DPhil Student

AI for science, AI safety, LLM evaluations

Mia Kussman

MSc Student

Human–LLM interaction, LLM evaluations

Caleb Tan

MSc Student

LLM evaluations, reasoning

Work with us · 3 routes

Ways to work with us.

Applied work for foundations, governments, and companies with hard AI problems. We pair research methods with engineering partners who can ship.

Affiliations & venues

  • University of Oxford · Host institution
  • Oxford Internet Institute · Affiliated department
  • Nature Medicine · Published 2026
  • ICML · Spotlight & papers, 2026
  • NeurIPS · Datasets & Benchmarks, 2025
  • ICLR · Accepted, 2026
  • EMNLP · Multiple, 2025

01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

The lab newsletter · Quarterly · low noise

Quarterly updates from the lab.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.