Vol. XII · No. 05 · Oxford, United Kingdom
The Oxford Reasoning Quarterly · OxRML · Reasoning with Machines Lab

Filed Hilary Term · University of Oxford

We measure what the
field forgets to measure.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Editor’s note. A research group working in the open: benchmarks that hold up under scrutiny, safety work that survives contact with deployment, and agentic systems for healthcare, policy, and science, the high-stakes domains that already trust us.

The Index

In this issue.

Six sections, page numbers included. Use this to skip ahead to whatever you came for.

§02

Four pillars

We study what the
models won’t tell us themselves.

Read across the four columns below. Each one is a portfolio, not a project.

  1. No. 01 · Evaluation

    Benchmarks and Evaluation

    We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

    Cross-ref. → Folio §03, Roster §04

  2. No. 02 · Safety

    AI Safety and Security

    We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

    Cross-ref. → Folio §03, Roster §04

  3. No. 03 · Agentic

    Agentic AI for Science

    Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

    Cross-ref. → Folio §03, Roster §04

  4. No. 04 · Human-AI

    Human–AI Interaction

    Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

    Cross-ref. → Folio §03, Roster §04

§03

Selected works

The Folio. Ten papers, written in plain English.

10 papers · 2025-26

Lead · fig. 02

ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Read the paper →
  • Benchmarks and Evaluation
  • Agentic AI

§04

The contributors

The Roster. Fourteen names on the masthead.

DPhil students, MSc researchers, visiting fellows and affiliates. Each portrait is one column of the gazette.

  • No. 01

    Felix Krones

    DPhil Student

    Multimodal AI, digital health

  • No. 02

    Djavan De Clercq

    DPhil Student

    AI and food security, LLMs

  • No. 03

    Andrew M. Bean

    DPhil Student

    LLM evaluations, human–LLM interaction

  • No. 04

    Yushi Yang

    DPhil Student

    LLM & agentic post-training, AI alignment

  • No. 05

    Harry Mayne

    DPhil Student

    LLM interpretability, AI safety, LLM evaluations

  • No. 06

    Jessica Rodrigues

    DPhil Student

    Knowledge graphs, metascience

  • No. 07

    Guy Parsons

    DPhil Student

    Healthcare AI, digital health

  • No. 08

    Karolina Korgul

    DPhil Student

    AI safety, agentic AI

  • No. 09

    Ryan Othniel Kearns

    DPhil Student

    Science of evals, reasoning in LLMs

  • No. 10

    Shreyansh Padarha

    DPhil Student

    AI for science, AI safety, LLM evaluations

  • No. 11

    Mia Kussman

    MSc Student

    Human–LLM interaction, LLM evaluations

  • No. 12

    Caleb Tan

    MSc Student

    LLM evaluations, reasoning

Also contributing this term.

  • Sebastian Petric · Visiting Policy Fellow
  • Tristan Naidoo · Research Affiliate

§05

Three doors in

Work with us. Pick a door.

Foundations, governments, and global operators have hired us to do three things. Read the briefs below.

  1. I · No. 01 / III

    Workshops for industry teams

    On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

    Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

    Book a workshop
  2. II · No. 02 / III

    Tools co-built with engineering partners

    We work with engineering partners to turn lab work into tools other teams can run.

    Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

    See our builds
  3. III · No. 03 / III

    Research partnerships

    Applied research collaborations with foundations, governments, and large companies.

    Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

    Start a conversation

In good company.

  • University of Oxford · Host institution
  • Oxford Internet Institute · Affiliated department
  • Nature Medicine · Published 2026
  • ICML · Spotlight & papers, 2026
  • NeurIPS · Datasets & Benchmarks, 2025
  • ICLR · Accepted, 2026
  • EMNLP · Multiple, 2025

§06 · Stop press

Dispatches from the lab.

Nine recent items, newest first, no filler. If you want the longer read, the Folio is next door.

paper 3 · conference 3 · award 3 · other 0

  1. May 2026

    Three OxRML papers accepted at ICML 2026, including a Spotlight

    paper
  2. April 2026

    OxRML presenting at ICLR 2026

    conference
  3. February 2026

    New paper in Nature Medicine on LLMs as medical assistants

    paper
  4. February 2026

    Ryan Othniel Kearns wins MSc Thesis Prize

    award
  5. December 2025

    OxRML at NeurIPS 2025

    conference
  6. November 2025

    OxRML at EMNLP 2025

    conference
  7. June 2025

    Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025

    award
  8. February 2025

    New review paper in Information Fusion

    paper
  9. September 2024

    Winners of the 2024 PhysioNet Challenge

    award
The Subscription Card

The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.

Cadence: Quarterly

Length: ≈ 4 min read

Archive: Open access