APPL. NO. OXRML / 2026 / 001 · FILED OXFORD, GB · SHEET 01 / 07

Field of the Invention

The present specification relates to a laboratory [100] for the empirical study of large language models, agentic systems, and the science of evaluation, established at Oxford, GB, and directed to the production of reproducible findings concerning machine reasoning.

Abstract

A LABORATORY [100] FOR THE EMPIRICAL STUDY OF MACHINE REASONING.

Disclosed herein is the Reasoning with Machines Lab [100], a research apparatus assembled at the University of Oxford [107] and directed by a Principal Investigator [101] overseeing a cohort [102] of DPhil and MSc students. The lab operates along four research axes [103], the outputs of which flow through a publication stream [104] and a partner channel [105].

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.


Inventor / PI
[101]
Prof. Adam Mahdi
Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

FIG. 1 · Laboratory Schema
SCALE: 1 / N · ORTHOGRAPHIC
[Figure: OXRML LABORATORY schema. Labels: EVAL., SAFETY, AGENTIC, HUMAN-AI; COHORT (N=14); PRINC. INV.; PUBL. STREAM; PARTNERSHIPS; 51°45′N 01°15′W. Reference numerals 101–107.]
DWG. NO. OX-RML-2026-01 · DRAWN BY: A. MAHDI
Fig. 1 illustrates the lab apparatus. [101] Principal Investigator; [102] cohort (14 researchers); [103] four research themes (vide infra, Claims 1–4); [104] publication stream (10+ outputs, 2024–26); [105] partnership channel; [106] reference structure (Radcliffe Camera, Oxford); [107] coordinate anchor, 51°45′N · 01°15′W.
SHEET 02 / 07 · FIG. 2 · RESEARCH THEMES / CLAIMS 1–4

Claims 1–4

WHAT THE LAB CLAIMS TO PRACTISE.

The four research axes [103] introduced in Fig. 1 are here disclosed as Claims 1–4. Each claim is reduced to practice through the shared empirical substrate [205]: open datasets, preregistered designs, mechanistic analyses, and independently auditable code.

  1. 201
    CLAIM 1 · Ref. 201

    Benchmarks and Evaluation

    We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

  2. 202
    CLAIM 2 · Ref. 202

    AI Safety and Security

    We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

  3. 203
    CLAIM 3 · Ref. 203

    Agentic AI for Science

    Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

  4. 204
    CLAIM 4 · Ref. 204

    Human–AI Interaction

    Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

FIG. 2 · Research Themes (Claims 1–4)
EXPLODED VIEW
[Figure: empirical substrate (205) with four claims. Claim 1, Benchmarks & Evaluation (201): construct validity · scoring. Claim 2, AI Safety & Security (202): alignment · interpretability. Claim 3, Agentic AI for Science (203): planning · tool-use · verification. Claim 4, Human–AI Interaction (204): high-stakes decisions · field studies.]
DWG. NO. OX-RML-2026-02 · SUBSTRATE REF. 205
Four axes radiate from a common substrate [205]: [201] Benchmarks & Evaluation, [202] AI Safety & Security, [203] Agentic AI for Science, [204] Human–AI Interaction. Edges denote known cross-axis coupling (e.g. agent reliability [203] ↔ safety [202]).
SHEET 03 / 07 · DETAILED DESCRIPTION · PRIOR ART OF OUR OWN

Detailed Description

REDUCED TO PRACTICE.

The following enumerated outputs [104] are the prior art generated by the laboratory itself: each is a working embodiment of one or more claims [201] [202] [203] [204].

References [301]–[310] index the entries below.

Legend

§ Section of specification (claim ref.)
Venue of publication
Δ Date of reduction
  1. 303
    §03
    It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

    K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

    A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

    ICML · Δ May 2025 · § Benchmarks and Evaluation, Agentic AI, AI Safety and Alignment
SHEET 05 / 07 · FIG. 3 · PERSONNEL LAYOUT

Personnel

THE COHORT, ENUMERATED.

The lab [100] is staffed by a Principal Investigator [400] and a cohort of 14 researchers [102]. Each researcher is enumerated below with their primary research axis. Photographs are reproduced in monochrome per drawing conventions.

FIG. 3 · Personnel Layout (Exploded View)
NOT TO SCALE
Prof. Adam Mahdi
Principal Investigator
Oxford Internet Institute, University of Oxford
  • Felix Krones
    DPhil Student
    Multimodal AI, digital health
  • Djavan De Clercq
    DPhil Student
    AI and food security, LLMs
  • Andrew M. Bean
    DPhil Student
    LLM evaluations, human–LLM interaction
  • Yushi Yang
    DPhil Student
    LLM & agentic post-training, AI alignment
  • Harry Mayne
    DPhil Student
    LLM interpretability, AI safety, LLM evaluations
  • Jessica Rodrigues
    DPhil Student
    Knowledge graphs, metascience
  • Guy Parsons
    DPhil Student
    Healthcare AI, digital health
  • Karolina Korgul
    DPhil Student
    AI safety, agentic AI
  • Ryan Othniel Kearns
    DPhil Student
    Science of evals, reasoning in LLMs
  • Shreyansh Padarha
    DPhil Student
    AI for science, AI safety, LLM evaluations
  • Mia Kussman
    MSc Student
    Human–LLM interaction, LLM evaluations
  • Caleb Tan
    MSc Student
    LLM evaluations, reasoning
  • Sebastian Petric
    Visiting Policy Fellow
    LLMs and financial time series
  • Tristan Naidoo
    Research Affiliate
    Public health AI, LLM evaluations

Fig. 3 enumerates personnel of the lab. [400] denotes the Principal Investigator; [401]–[414] denote the cohort. Hatched borders [102] indicate the cohort boundary.

Roster Reference

The cohort is multidisciplinary. Roles include DPhil student, MSc student, visiting policy fellow, and research affiliate.

REF.  | NAME                | ROLE                   | FOCUS
[400] | Prof. Adam Mahdi    | Principal Investigator | Oxford Internet Institute, University of Oxford
[401] | Felix Krones        | DPhil Student          | Multimodal AI, digital health
[402] | Djavan De Clercq    | DPhil Student          | AI and food security, LLMs
[403] | Andrew M. Bean      | DPhil Student          | LLM evaluations, human–LLM interaction
[404] | Yushi Yang          | DPhil Student          | LLM & agentic post-training, AI alignment
[405] | Harry Mayne         | DPhil Student          | LLM interpretability, AI safety, LLM evaluations
[406] | Jessica Rodrigues   | DPhil Student          | Knowledge graphs, metascience
[407] | Guy Parsons         | DPhil Student          | Healthcare AI, digital health
[408] | Karolina Korgul     | DPhil Student          | AI safety, agentic AI
[409] | Ryan Othniel Kearns | DPhil Student          | Science of evals, reasoning in LLMs
[410] | Shreyansh Padarha   | DPhil Student          | AI for science, AI safety, LLM evaluations
[411] | Mia Kussman         | MSc Student            | Human–LLM interaction, LLM evaluations
[412] | Caleb Tan           | MSc Student            | LLM evaluations, reasoning
[413] | Sebastian Petric    | Visiting Policy Fellow | LLMs and financial time series
[414] | Tristan Naidoo      | Research Affiliate     | Public health AI, LLM evaluations
SHEET 06 / 07 · INDUSTRIAL APPLICABILITY

Industrial Applicability

MODES OF DEPLOYMENT.

The laboratory [100] deploys its outputs through the partner channel [105] in three configurations: workshops [501], co-built technology [502], and research partnerships [503].

Partner with us

Correspondence: hello@oxrml.com

  1. 501
    MODE 01

    Workshops for industry teams

    On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

    Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

  2. 502
    MODE 02

    Tools co-built with engineering partners

    We work with engineering partners to turn lab work into tools other teams can run.

    Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

  3. 503
    MODE 03

    Research partnerships

    Applied research collaborations with foundations, governments, and large companies.

    Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.


Filed Affiliations & Citations · References [510]–[516]
  • [510]
    University of Oxford
    Host institution
  • [511]
    Oxford Internet Institute
    Affiliated department
  • [512]
    Nature Medicine
    Published 2026
  • [513]
    ICML
    Spotlight & papers, 2026
  • [514]
    NeurIPS
    Datasets & Benchmarks, 2025
  • [515]
    ICLR
    Accepted, 2026
  • [516]
    EMNLP
    Multiple, 2025
SHEET 07 / 07 · NOTICE OF CONTINUING DISCLOSURE
The lab newsletter

Notice

A QUARTERLY NOTE FROM THE LAB. NOTHING ELSE.

New papers, open positions, partnership opportunities, and what we have been reading.

The undersigned [601] agrees to receive quarterly correspondence pursuant to this specification. The publisher [602] shall maintain reasonable care over the data entered.

Form NL-26 · Subscriber Record · Rev. 04 / 2026

Unsubscribe in one click. We never share your email.


Signature of subscriber
Date (YYYY-MM-DD)