OxRML · Patent Specification for a Reasoning-with-Machines Laboratory

Abstract

A LABORATORY [100] FOR THE EMPIRICAL STUDY OF MACHINE REASONING.

Disclosed herein is the Reasoning with Machines Lab [100], a research apparatus assembled at the University of Oxford [107] and directed by a Principal Investigator [101] overseeing a cohort [102] of DPhil and MSc students. The lab operates along four research axes [103], the outputs of which flow through a publication stream [104] and a partner channel [105].

An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built from them.

Inventor / PI
[101]

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

FIG. 1 · Laboratory Schema

SCALE: 1 / N · ORTHOGRAPHIC

Fig. 1 illustrates the lab apparatus. [101] Principal Investigator; [102] cohort (14 researchers); [103] four research themes (vide infra, Claims 1–4); [104] publication stream (10+ outputs, 2024–26); [105] partnership channel; [106] reference structure (Radcliffe Camera, Oxford); [107] coordinate anchor, 51°45′N · 01°15′W.

SHEET 2 / 7 · FIG. 2 · RESEARCH THEMES / CLAIMS 1–4

Claims 1–4

WHAT THE LAB CLAIMS TO PRACTISE.

The four research axes [103] introduced in Fig. 1 are here disclosed as Claims 1–4. Each claim is reduced to practice through the shared empirical substrate [205]: open datasets, preregistered designs, mechanistic analyses, and independently auditable code.

CLAIM 1. Ref. 201
Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
CLAIM 2. Ref. 202
AI Safety and Security
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
CLAIM 3. Ref. 203
Agentic AI for Science
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
CLAIM 4. Ref. 204
Human–AI Interaction
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

FIG. 2 · Research Themes (Claims 1–4)

EXPLODED VIEW

Four axes radiate from a common substrate [205]: [201] Benchmarks & Evaluation, [202] AI Safety & Security, [203] Agentic AI for Science, [204] Human–AI Interaction. Edges denote known cross-axis coupling (e.g. agent reliability [203] ↔ safety [202]).

SHEET 3 / 7 · DETAILED DESCRIPTION · PRIOR ART OF OUR OWN

Detailed Description

REDUCED TO PRACTICE.

The following enumerated outputs [104] are the prior art generated by the laboratory itself: each is a working embodiment of one or more claims [201] [202] [203] [204].

References [301]–[310] index the entries listed below.

Legend
§ Section of specification (claim ref.)
¶ Venue of publication
Δ Date of reduction

301
§01
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.
A benchmark that tells real navigation apart from stochastic search when agents work over document collections.
¶ ICML (Spotlight) · Δ May 2026 · § Benchmarks and Evaluation, Agentic AI
302
§02
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel
LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.
¶ ICML · Δ May 2026 · § AI Safety and Alignment, Benchmarks and Evaluation
303
§03
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi
A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.
¶ ICML · Δ May 2025 · § Benchmarks and Evaluation, Agentic AI, AI Safety and Alignment
304
§04
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi
A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.
¶ Nature Medicine · Δ February 2026 · § AI in Healthcare, Benchmarks and Evaluation
305
§05
Measuring what matters: Construct validity in large language model benchmarks
AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.
A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.
¶ NeurIPS Datasets and Benchmarks · Δ November 2025 · § Benchmarks and Evaluation
306
§06
Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale
How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.
¶ NeurIPS LLM Lifecycle Workshop · Δ November 2025 · § Benchmarks and Evaluation, AI Safety and Alignment
307
§07
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
Y Yang, F Sondej, H Mayne, A Lee, A Mahdi
Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.
¶ EMNLP · Δ November 2025 · § AI Safety and Alignment
308
§08
LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi
Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.
¶ EMNLP · Δ September 2025 · § AI Safety and Alignment, Benchmarks and Evaluation
309
§09
Review of multimodal machine learning approaches in healthcare
F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi
A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.
¶ Information Fusion · Δ February 2025 · § AI in Healthcare
310
§10
LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi
A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.
¶ ICLR · Δ April 2026 · § Benchmarks and Evaluation

SHEET 5 / 7 · FIG. 3 · PERSONNEL LAYOUT

Personnel

THE COHORT, ENUMERATED.

The lab [100] is staffed by a Principal Investigator [400] and a cohort of 14 researchers [102]. Each researcher is enumerated below with their primary research axis. Photographs are reproduced in monochrome per drawing conventions.

Fig. 3 enumerates personnel of the lab. [400] denotes the Principal Investigator; [401]–[414] denote the cohort. Hatched borders [102] indicate the cohort boundary.

Roster Reference

The cohort is multi-disciplinary. Roles include DPhil, MSc, visiting fellow, and research affiliate.

REF.	NAME	ROLE	FOCUS
[400]	Prof. Adam Mahdi	Principal Investigator	Oxford Internet Institute, University of Oxford
[401]	Felix Krones	DPhil Student	Multimodal AI, digital health
[402]	Djavan De Clercq	DPhil Student	AI and food security, LLMs
[403]	Andrew M. Bean	DPhil Student	LLM evaluations, human–LLM interaction
[404]	Yushi Yang	DPhil Student	LLM & agentic post-training, AI alignment
[405]	Harry Mayne	DPhil Student	LLM interpretability, AI safety, LLM evaluations
[406]	Jessica Rodrigues	DPhil Student	Knowledge graphs, metascience
[407]	Guy Parsons	DPhil Student	Healthcare AI, digital health
[408]	Karolina Korgul	DPhil Student	AI safety, agentic AI
[409]	Ryan Othniel Kearns	DPhil Student	Science of evals, reasoning in LLMs
[410]	Shreyansh Padarha	DPhil Student	AI for science, AI safety, LLM evaluations
[411]	Mia Kussman	MSc Student	Human–LLM interaction, LLM evaluations
[412]	Caleb Tan	MSc Student	LLM evaluations, reasoning
[413]	Sebastian Petric	Visiting Policy Fellow	LLMs and financial time series
[414]	Tristan Naidoo	Research Affiliate	Public health AI, LLM evaluations

SHEET 6 / 7 · INDUSTRIAL APPLICABILITY

Industrial Applicability

MODES OF DEPLOYMENT.

The laboratory [100] deploys its outputs through the partner channel [105] in three configurations: workshops [501], co-built technology [502], and research partnerships [503].

Partner with us

Correspondence: hello@oxrml.com

501
MODE 01
Workshops for industry teams
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
502
MODE 02
Tools co-built with engineering partners
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
503
MODE 03
Research partnerships
Applied research collaborations with foundations, governments, and large companies.
Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Filed Affiliations & Citations · References [510]–[516]

[510]
University of Oxford
Host institution
[511]
Oxford Internet Institute
Affiliated department
[512]
Nature Medicine
Published 2026
[513]
ICML
Spotlight & papers, 2026
[514]
NeurIPS
Datasets & Benchmarks, 2025
[515]
ICLR
Accepted, 2026
[516]
EMNLP
Multiple, 2025

SHEET 7 / 7 · NOTICE OF CONTINUING DISCLOSURE

The lab newsletter

Notice

A QUARTERLY NOTE FROM THE LAB. NOTHING ELSE.

New papers, open positions, partnership opportunities, and what we have been reading.

The undersigned [601] agrees to receive quarterly correspondence pursuant to this specification. The publisher [602] shall maintain reasonable care over the data entered.
