The Pavilion · A Catalogue in Eight Plates

University of Oxford · Oxford, United Kingdom · Plate I · Frontispiece

Reasoning with Machines Lab

A research practice, catalogued in the open.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Standing direction

Reasoning systems that scientists, clinicians, and the public can trust — measured by what they actually do, not what they claim.

Partner with us · Read our research

Founded: Oxford
Papers · 2025–26: 10 cited
Venues: Nature Med · ICML · ICLR
Status: Open for partners

Plate II · The four standing directions

Four hands at the same table.

Each direction is a standing condition the lab is working toward. Together they describe the field we want AI evaluation, safety, and reasoning to become.

Direction 01

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

Toward

A field where every benchmark publishes its construct validity, and where evaluation is treated as an experimental science.

Tag · Evaluation · Panel Mauve

Direction 02

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

Toward

A practice of measuring real harms (bias, toxicity, agentic misalignment) at the neuron level and at deployment, before they reach the public.

Tag · Safety · Panel Sage

Direction 03

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

Toward

Scientific agents that synthesise knowledge reliably enough for a researcher to act on it, and transparently enough for that researcher to audit it.

Tag · Agentic · Panel Rose

Direction 04

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

Toward

Decisions made with AI in healthcare, law, and policy that are studied empirically — not assumed safe because the model is impressive.

Tag · Human-AI · Panel Brass

“There is hope in honest error; none in the icy perfection of the mere stylist.”
J. D. Sedding, after the Glasgow Four

Four directions · One table

Plate III · The Catalogue

Ten plates, each a small case.

Ten papers from the past eighteen months. Nature Medicine, ICML (with a Spotlight), ICLR, NeurIPS, EMNLP, Information Fusion. Each plate is one finished cycle — built, peer-reviewed, published in the open.

Plate · No. · Title · Venue & date

Currently = currently on view

Plate 01 · Currently

ICML (Spotlight) · May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

Benchmarks and Evaluation · Agentic AI

Read the plate

Plate 02 · Currently

ICML · May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel

AI Safety and Alignment · Benchmarks and Evaluation

Read the plate

Plate 03

ICML · May 2026

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment

Read the plate

Plate 04

Nature Medicine · February 2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi

AI in Healthcare · Benchmarks and Evaluation

Read the plate

Plate 05

NeurIPS Datasets and Benchmarks · November 2025

Measuring what matters: Construct validity in large language model benchmarks

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.

Benchmarks and Evaluation

Read the plate

Plate 06

NeurIPS LLM Lifecycle Workshop · November 2025

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale

Benchmarks and Evaluation · AI Safety and Alignment

Read the plate

Plate 07

EMNLP · November 2025

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

Y Yang, F Sondej, H Mayne, A Lee, A Mahdi

AI Safety and Alignment

Read the plate

Plate 08

EMNLP · September 2025

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi

AI Safety and Alignment · Benchmarks and Evaluation

Read the plate

Plate 09

Information Fusion · February 2025

Review of multimodal machine learning approaches in healthcare

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi

AI in Healthcare

Read the plate

Plate 10

ICLR · April 2026

LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.

J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi

Benchmarks and Evaluation

Read the plate

10 of 10 plates · catalogue revised quarterly
A full bibliography is held by the lab; ask for the long form.

Plate IV · The Hands

Fifteen hands, one room.

A Principal Investigator, ten DPhil students, two MSc students, a visiting fellow, and a research affiliate. Each is introduced by what they are currently working on — the focus is the headline; the role is the caption.

Master of the Pavilion · Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Currently practising

Coaching the lab's research cycles across evaluation, safety, agentic AI, and human–AI interaction — measuring what we ship before we ship it.

01 / 14

DPhil Student

Felix Krones

Multimodal evaluation across imaging and clinical text.

02 / 14

DPhil Student

Djavan De Clercq

LLMs applied to food-security data and policy questions.

03 / 14

DPhil Student

Andrew M. Bean

LLM evaluations that capture how people actually use models.

04 / 14

DPhil Student

Yushi Yang

Post-training for LLM and agentic alignment, at the neuron level.

05 / 14

DPhil Student

Harry Mayne

Interpretability and safety-relevant LLM evaluations.

06 / 14

DPhil Student

Jessica Rodrigues

Knowledge-graph methods for metascience and research synthesis.

07 / 14

DPhil Student

Guy Parsons

Healthcare AI evaluation grounded in clinical workflow.

08 / 14

DPhil Student

Karolina Korgul

Agentic-AI safety and web-agent persuasion attacks.

09 / 14

DPhil Student

Ryan Othniel Kearns

The science of evals — measuring reasoning honestly.

10 / 14

DPhil Student

Shreyansh Padarha

Agentic systems for science, with safety and eval rigour.

11 / 14

MSc Student

Mia Kussman

Studies of human–LLM interaction and LLM evaluation.

12 / 14

MSc Student

Caleb Tan

LLM evaluation and reasoning benchmarks.

13 / 14

Visiting Policy Fellow

Sebastian Petric

LLMs applied to financial time series, at the policy boundary.

14 / 14

Research Affiliate

Tristan Naidoo

Public-health AI and LLM evaluations grounded in epidemiology.

One sensei · Fourteen hands · Open to visiting researchers
The room is the architecture; the hands are the work.

Plate V · The Reading Room

What the lab is reading.

The reading-room ledger — papers accepted, conferences attended, honours noted. The most recent entry takes the rose tag. The ledger is kept in the open so collaborators always know what is current.

Date · Mark · Entry · Category

May 2026 · Newest · Three OxRML papers accepted at ICML 2026, including a Spotlight · Paper
April 2026 · - · OxRML presenting at ICLR 2026 · Convening
February 2026 · - · New paper in Nature Medicine on LLMs as medical assistants · Paper
February 2026 · - · Ryan Othniel Kearns wins MSc Thesis Prize · Honour
December 2025 · - · OxRML at NeurIPS 2025 · Convening
November 2025 · - · OxRML at EMNLP 2025 · Convening
June 2025 · - · Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025 · Honour
February 2025 · - · New review paper in Information Fusion · Paper
September 2024 · - · Winners of the 2024 PhysioNet Challenge · Honour

9 entries · ledger held in the open

Plate VI · The Salon

A room for long conversations.

We work with foundations, governments, hyperscalers, and global corporates who want AI evaluation, safety, and reasoning treated with the same care as the systems they ship. Three formats — three cadences — one table.

Offering · 01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Cadence

Half-day to multi-week. On-site in your office or in Oxford; bespoke to the team and the question.

Book a workshop

Offering · 02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Cadence

Quarterly cycles with a partner studio. Joint roadmaps, shared evals, shipped tooling.

See our builds

Offering · 03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Cadence

Multi-year. Named programmes, dedicated DPhil studentships, shared scientific direction.

Start a conversation

The invitation

If your team is shipping AI into a high-stakes setting — healthcare, finance, public infrastructure — there is a chair at the table.

Write directly. A brief note about the problem you are working on, the stakes, and the question you want answered. We'll reply within the week.

Direct line: hello@oxrml.com · Partner with us

Plate VII · Honour Roll

The institutions and venues we are part of.

Universities, journals, and conferences where the lab's cycles have been hosted, peer-reviewed, and published.

01 · University of Oxford · Host institution
02 · Oxford Internet Institute · Affiliated department
03 · Nature Medicine · Published 2026
04 · ICML · Spotlight & papers, 2026
05 · NeurIPS · Datasets & Benchmarks, 2025
06 · ICLR · Accepted, 2026
07 · EMNLP · Multiple, 2025

Plate VIII · The Visitors' Book

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.