UNIVERSITY OF OXFORD·OXFORD, UNITED KINGDOM
RECORD No. 24 / 24·OPEN ENTRY
STEP 01 · UNDERSTAND THE DIRECTION

A research practice, recorded as kata.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built on LLMs. Every paper is a cycle of plan, do, check, act: a small, deliberate experiment toward the standard we want AI evaluation, safety, and reasoning to meet.

DIRECTION

Reasoning systems that scientists, clinicians, and the public can trust, measured by what they do rather than what they claim.

STEP 02 · ESTABLISH THE TARGET CONDITION

Four directions of practice.

Each direction is a standard the lab is working toward. Together they describe the field we want AI evaluation, safety, and reasoning to become.

01

DIRECTION · EVALUATION

Benchmarks and Evaluation

TARGET CONDITION

A field where every benchmark publishes its construct validity, and where evaluation is treated as an experimental science.

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

CYCLE STATE · IN PRACTICE · ◯ → ◐ → ●

02

DIRECTION · SAFETY

AI Safety and Security

TARGET CONDITION

A practice of measuring real harms (bias, toxicity, agentic misalignment) at both the neuron level and the deployment level, before they reach the public.

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

CYCLE STATE · IN PRACTICE · ◯ → ◐ → ●

03

DIRECTION · AGENTIC

Agentic AI for Science

TARGET CONDITION

Scientific agents that synthesise knowledge reliably enough for a researcher to act on their output, and transparently enough for that researcher to audit it.

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

CYCLE STATE · IN PRACTICE · ◯ → ◐ → ●

04

DIRECTION · HUMAN-AI

Human–AI Interaction

TARGET CONDITION

Decisions made with AI in healthcare, law, and policy, studied empirically rather than assumed safe because the model is impressive.

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

CYCLE STATE · IN PRACTICE · ◯ → ◐ → ●

STEP 03 · EXPERIMENT TOWARD THE TARGET

Ten cycles, recorded.

Each paper below is one full kata cycle: target, actual, obstacle, next. The most recent cycles carry the lacquer NOW stamp. Completed cycles carry the bamboo mark.

NOW · current cycle
COMPLETED · cycle closed
REFERENCE · still cited
10 OF 10 ENTRIES

ENTRY 01 / 10·MAY 2026·ICML (SPOTLIGHT)
今 NOW

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

BENCHMARKS AND EVALUATION · AGENTIC AI

FIG. 01 · MADQA

TARGET

WHAT THE CYCLE AIMS AT

A benchmark that tells document-collection navigation apart from stochastic search in multimodal agents.

ACTUAL

WHAT WE OBSERVE TODAY

A spotlight at ICML 2026 showing agents and humans diverge sharply on document QA. Agents look fluent, but their search pattern is not strategic.

OBSTACLE

WHAT BLOCKS THE TARGET

Existing document benchmarks reward fluency over navigation, so failure modes were invisible. The task itself had to be redesigned.

NEXT

THE STEP WE WILL RUN

Extend MADQA to dynamic collections and longer horizons; pair it with traces, so we measure the search policy, not the final answer.

◆ CYCLE STATE · NOW

ENTRY 02 / 10·MAY 2026·ICML
今 NOW

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel

AI SAFETY AND ALIGNMENT · BENCHMARKS AND EVALUATION

FIG. 02 · FAITHFULNESS

TARGET

WHAT THE CYCLE AIMS AT

A measurement of LLM self-explanation faithfulness that holds up well enough to use in evaluation pipelines.

ACTUAL

WHAT WE OBSERVE TODAY

A positive ICML result: self-explanations, treated with care, predict downstream model behaviour better than chance, sometimes substantially.

OBSTACLE

WHAT BLOCKS THE TARGET

The literature had assumed self-explanations were noise. Demonstrating utility required isolating the conditions in which they carry signal.

NEXT

THE STEP WE WILL RUN

Integrate the faithfulness probe into safety evals; test whether it survives distribution shift and adversarial prompting.

◆ CYCLE STATE · NOW

ENTRY 03 / 10·MAY 2025·ICML
REFERENCE

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

BENCHMARKS AND EVALUATION · AGENTIC AI · AI SAFETY AND ALIGNMENT

FIG. 03 · TRAP

TARGET

WHAT THE CYCLE AIMS AT

A benchmark that measures whether a web agent can be socially engineered into abandoning its user, before that happens in deployment.

ACTUAL

WHAT WE OBSERVE TODAY

TRAP at ICML: across frontier web agents, persuasion attacks redirect tasks at high rates. The vulnerability is not a corner case.

OBSTACLE

WHAT BLOCKS THE TARGET

Web agents are tested for capability, not for resistance to manipulation. There was no shared eval harness for adversarial redirection.

NEXT

THE STEP WE WILL RUN

Open-source the harness; partner with web-agent labs to track resistance as a first-class metric, alongside task success.

◆ CYCLE STATE · REFERENCE

ENTRY 04 / 10·FEBRUARY 2026·NATURE MEDICINE
COMPLETED

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi

AI IN HEALTHCARE · BENCHMARKS AND EVALUATION

FIG. 04 · HELP·MED

TARGET

WHAT THE CYCLE AIMS AT

Empirical evidence on how the general public uses LLMs as medical assistants: preregistered, randomised, replicable.

ACTUAL

WHAT WE OBSERVE TODAY

Nature Medicine paper: people make worse triage decisions with current LLM assistants than without, in randomised trials.

OBSTACLE

WHAT BLOCKS THE TARGET

Most prior work tested clinicians or simulated patients, not the public. The methodology had to be built from epidemiology, not ML.

NEXT

THE STEP WE WILL RUN

Iterate on assistive UI patterns and re-run the trial; share the protocol so other teams can replicate across populations.

◆ CYCLE STATE · COMPLETED

ENTRY 05 / 10·NOVEMBER 2025·NEURIPS DATASETS AND BENCHMARKS
COMPLETED

Measuring what matters: Construct validity in large language model benchmarks

AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.

BENCHMARKS AND EVALUATION

FIG. 05 · MEASURING·WHAT·MATTERS

TARGET

WHAT THE CYCLE AIMS AT

A construct-validity standard that every public LLM benchmark can be audited against.

ACTUAL

WHAT WE OBSERVE TODAY

NeurIPS Datasets & Benchmarks paper: a structured audit of widely-used benchmarks, with concrete gaps between claim and measurement.

OBSTACLE

WHAT BLOCKS THE TARGET

Benchmarks are shipped without measurement-theory scaffolding. Adopting construct validity from psychometrics took translation work.

NEXT

THE STEP WE WILL RUN

Release the audit framework as a checklist; lobby venues to require construct-validity statements for new benchmarks.

◆ CYCLE STATE · COMPLETED

ENTRY 06 / 10·NOVEMBER 2025·NEURIPS LLM LIFECYCLE WORKSHOP
COMPLETED

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale

BENCHMARKS AND EVALUATION · AI SAFETY AND ALIGNMENT

FIG. 06 · JUDGE·POLYVIS

TARGET

WHAT THE CYCLE AIMS AT

A characterisation of LLM-as-a-judge across the conditions under which it is deployed: multilingual, multimodal, multi-domain.

ACTUAL

WHAT WE OBSERVE TODAY

NeurIPS workshop paper: judge reliability drops, sometimes catastrophically, when you push outside the conditions it was calibrated on.

OBSTACLE

WHAT BLOCKS THE TARGET

Judge-LLM evals are typically reported on English text. Multilingual and multimodal coverage required new datasets and rubrics.

NEXT

THE STEP WE WILL RUN

Publish a multilingual judge-stress-test suite; use it to recalibrate workflows that depend on LLM-as-a-judge.

◆ CYCLE STATE · COMPLETED

ENTRY 07 / 10·NOVEMBER 2025·EMNLP
COMPLETED

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Y Yang, F Sondej, H Mayne, A Lee, A Mahdi

AI SAFETY AND ALIGNMENT

FIG. 07 · DPO·TOXICITY

TARGET

WHAT THE CYCLE AIMS AT

An account of how Direct Preference Optimization changes a model, not just whether it changes outputs.

ACTUAL

WHAT WE OBSERVE TODAY

EMNLP paper: DPO suppresses toxicity through a small set of identifiable neurons, a localised intervention rather than a global rewrite.

OBSTACLE

WHAT BLOCKS THE TARGET

Behavioural metrics tell you DPO worked; they do not tell you what was edited. Neuron-level attribution required new probing tooling.

NEXT

THE STEP WE WILL RUN

Extend to other preference signals; ask whether the same neurons carry capability, and what the trade-off looks like.

◆ CYCLE STATE · COMPLETED

ENTRY 08 / 10·SEPTEMBER 2025·EMNLP
COMPLETED

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi

AI SAFETY AND ALIGNMENT · BENCHMARKS AND EVALUATION

FIG. 08 · DECISION·BOUNDARIES

TARGET

WHAT THE CYCLE AIMS AT

A diagnostic that distinguishes LLM introspection from confabulated counterfactual explanations.

ACTUAL

WHAT WE OBSERVE TODAY

EMNLP paper: self-generated counterfactuals systematically mis-locate the model's decision boundary. The explanation is fluent but wrong.

OBSTACLE

WHAT BLOCKS THE TARGET

Counterfactual explanations were treated as an interpretability win. Showing them unreliable needed careful controlled experiments.

NEXT

THE STEP WE WILL RUN

Look for self-explanation formats that DO track the boundary; quantify the gap so practitioners can calibrate trust.

◆ CYCLE STATE · COMPLETED

ENTRY 09 / 10·FEBRUARY 2025·INFORMATION FUSION
REFERENCE

Review of multimodal machine learning approaches in healthcare

F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi

AI IN HEALTHCARE

FIG. 09 · MULTIMODAL·HEALTHCARE

TARGET

WHAT THE CYCLE AIMS AT

A coherent map of multimodal ML in healthcare: fusion strategies, deployment realities, and where the field has and has not delivered.

ACTUAL

WHAT WE OBSERVE TODAY

Information Fusion review: a synthesis spanning fusion architectures, clinical workflows, and the gaps between lab metrics and care.

OBSTACLE

WHAT BLOCKS THE TARGET

Healthcare multimodal work is scattered across imaging, EHR, and signal-processing literatures with little shared vocabulary.

NEXT

THE STEP WE WILL RUN

Use the map to scope the lab’s next clinical collaborations; identify modalities where careful evaluation is most overdue.

◆ CYCLE STATE · REFERENCE

ENTRY 10 / 10·APRIL 2026·ICLR
COMPLETED

LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi

BENCHMARKS AND EVALUATION

FIG. 10 · LINGOLY·TOO

TARGET

WHAT THE CYCLE AIMS AT

A reasoning benchmark you can trust when LLMs are trained on most of the internet: one that separates reasoning from recall.

ACTUAL

WHAT WE OBSERVE TODAY

ICLR paper: templatised orthographic obfuscation breaks memorisation while preserving the underlying reasoning task; scores drop sharply.

OBSTACLE

WHAT BLOCKS THE TARGET

Standard reasoning benchmarks leak into training data. Designing an obfuscation that preserves the problem was the hard part.

NEXT

THE STEP WE WILL RUN

Scale LingOly-TOO to more languages; use the same obfuscation idea on math and code reasoning benchmarks.

◆ CYCLE STATE · COMPLETED

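The obfuscation idea in Entry 10 can be illustrated with a toy sketch. This is not the LingOly-TOO pipeline, and the paper's templatised rulesets are not reproduced here. It assumes the simplest possible case: a seeded one-to-one letter substitution applied consistently across a puzzle, so that surface forms a model may have memorised disappear while the repetition patterns a solver reasons over stay intact. The function names and the sample word forms are hypothetical.

```python
import random
import string


def make_letter_map(seed: int) -> dict[str, str]:
    """Build a seeded one-to-one substitution over lowercase ASCII letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def obfuscate(text: str, letter_map: dict[str, str]) -> str:
    """Rewrite each letter through the map; keep case, digits, and punctuation."""
    out = []
    for ch in text:
        mapped = letter_map.get(ch.lower())
        if mapped is None:
            out.append(ch)              # not a mapped letter: leave untouched
        elif ch.isupper():
            out.append(mapped.upper())  # preserve capitalisation
        else:
            out.append(mapped)
    return "".join(out)


def letter_pattern(word: str) -> list[int]:
    """Index of each character's first occurrence, e.g. 'kala' -> [0, 1, 2, 1]."""
    first_seen: dict[str, int] = {}
    return [first_seen.setdefault(ch, i) for i, ch in enumerate(word)]


if __name__ == "__main__":
    letter_map = make_letter_map(seed=7)
    puzzle = ["kala", "kalan", "mata"]  # hypothetical word forms, not real data
    hidden = [obfuscate(word, letter_map) for word in puzzle]
    print(hidden)
    # The structure a solver relies on is unchanged by the substitution.
    print([letter_pattern(w) for w in puzzle] == [letter_pattern(w) for w in hidden])
```

Because the substitution is a bijection applied uniformly, shared stems and affixes remain shared after obfuscation. That is the property the entry describes as preserving the underlying reasoning task.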
STEP 04 · PRACTITIONERS

Who is at the board today.

Fifteen people: one sensei and fourteen practitioners. Each is introduced by what they are currently practising and what they are working toward; their role is the footnote.

Prof. Adam Mahdi

先生 · SENSEI · PRINCIPAL INVESTIGATOR

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

CURRENTLY PRACTISING

Coaching the lab’s research cycles across evaluation, safety, agentic AI, and human–AI interaction.

WORKING TOWARD

A field where AI evaluation is treated as a science, with the same care as the systems it studies.

Felix Krones

PRACTITIONER · 01

DPHIL STUDENT

CURRENTLY

Multimodal evaluation across imaging and clinical text.

TOWARD

Digital-health systems that are measured before they are deployed.

Djavan De Clercq

PRACTITIONER · 02

DPHIL STUDENT

CURRENTLY

LLMs applied to food-security data and policy questions.

TOWARD

Decision tools for food systems that hold up under audit.

Andrew M. Bean

PRACTITIONER · 03

DPHIL STUDENT

CURRENTLY

Designing LLM evals that capture how people actually use models.

TOWARD

A standard for evaluating LLMs in genuine human contexts of use.

Yushi Yang

PRACTITIONER · 04

DPHIL STUDENT

CURRENTLY

Post-training for LLM and agentic alignment, at the neuron level.

TOWARD

Alignment interventions whose mechanism we can explain, not just observe.

Harry Mayne

PRACTITIONER · 05

DPHIL STUDENT

CURRENTLY

Interpretability and safety-relevant LLM evaluations.

TOWARD

Interpretability methods that practitioners can trust under shift.

Jessica Rodrigues

PRACTITIONER · 06

DPHIL STUDENT

CURRENTLY

Knowledge-graph methods for metascience and research synthesis.

TOWARD

Synthesis tools that scientists treat as collaborators, not search engines.

Guy Parsons

PRACTITIONER · 07

DPHIL STUDENT

CURRENTLY

Healthcare AI evaluation grounded in clinical workflow.

TOWARD

Digital-health products that earn the trust of clinicians and patients.

Karolina Korgul

PRACTITIONER · 08

DPHIL STUDENT

CURRENTLY

Agentic-AI safety, including web-agent persuasion attacks.

TOWARD

Web agents that resist social engineering as a default behaviour.

Ryan Othniel Kearns

PRACTITIONER · 09

DPHIL STUDENT

CURRENTLY

The science of evals: how to measure reasoning honestly.

TOWARD

A field where every benchmark publishes its construct validity.

Shreyansh Padarha

PRACTITIONER · 10

DPHIL STUDENT

CURRENTLY

Agentic systems for science, with safety and eval rigour.

TOWARD

Scientific agents auditable enough to act on in real research.

Mia Kussman

PRACTITIONER · 11

MSC STUDENT

CURRENTLY

Studies of human–LLM interaction and LLM evaluation.

TOWARD

Interaction patterns that improve human judgement rather than substitute for it.

Caleb Tan

PRACTITIONER · 12

MSC STUDENT

CURRENTLY

LLM evaluation and reasoning benchmarks.

TOWARD

Reasoning evals that separate genuine inference from recall.

Sebastian Petric

PRACTITIONER · 13

VISITING POLICY FELLOW

CURRENTLY

LLMs applied to financial time series, at the policy boundary.

TOWARD

Honest characterisation of LLM utility in high-stakes financial settings.

Tristan Naidoo

PRACTITIONER · 14

RESEARCH AFFILIATE

CURRENTLY

Public-health AI and LLM evaluations grounded in epidemiology.

TOWARD

Public-health AI that is evaluated like a health intervention.

STEP 05 · WHEN CAN WE SEE WHAT WE LEARNED

Dojo log.

Each entry is something we saw at the board: a paper accepted, a conference reached, an award noted. The log is the lab's memory of cycles closed and evidence kept.

DATE · ENTRY · CATEGORY
  1. MAY 2026 · Three OxRML papers accepted at ICML 2026, including a Spotlight · PAPER
  2. APRIL 2026 · OxRML presenting at ICLR 2026 · CONF.
  3. FEBRUARY 2026 · New paper in Nature Medicine on LLMs as medical assistants · PAPER
  4. FEBRUARY 2026 · Ryan Othniel Kearns wins MSc Thesis Prize · AWARD
  5. DECEMBER 2025 · OxRML at NeurIPS 2025 · CONF.
  6. NOVEMBER 2025 · OxRML at EMNLP 2025 · CONF.
  7. JUNE 2025 · Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025 · AWARD
  8. FEBRUARY 2025 · New review paper in Information Fusion · PAPER
  9. SEPTEMBER 2024 · Winners of the 2024 PhysioNet Challenge · AWARD

9 ENTRIES · LOGGED BY EVENT DATE, MOST RECENT FIRST

STEP 06 · COACHING KATA

How we work with industry.

We coach teams shipping LLMs into high-stakes settings. Three formats, three cadences, each built around the daily question of what the target condition is and what blocks reaching it.

FORMAT · 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

CADENCE

Half-day to multi-week. Daily standups at the board, weekly review of obstacles.

FORMAT · 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

CADENCE

Quarterly cycles. Joint roadmaps, shared evals, shipped tooling: research at engineering velocity.

FORMAT · 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

CADENCE

Multi-year. Named labs, dedicated DPhils, shared scientific direction.

We work with foundations, governments, and corporates who want AI evaluation, safety, and reasoning treated with the same care as the systems they ship.

HONOUR ROLL · CYCLES VENUED & PUBLISHED

The institutions and venues we have worked with.

Universities, journals, and conferences where the lab's cycles have been hosted, peer-reviewed, and published.

  • 01 · University of Oxford · HOST INSTITUTION
  • 02 · Oxford Internet Institute · AFFILIATED DEPARTMENT
  • 03 · Nature Medicine · PUBLISHED 2026
  • 04 · ICML · SPOTLIGHT & PAPERS, 2026
  • 05 · NeurIPS · DATASETS & BENCHMARKS, 2025
  • 06 · ICLR · ACCEPTED, 2026
  • 07 · EMNLP · MULTIPLE, 2025

QUARTERLY CADENCE · THE LAB NEWSLETTER

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.