Reasoning · Machines · Lab

Independent research lab · Oxford Internet Institute

Reasoning with
Machines Lab
@ University of Oxford

> Abstract // Led by Prof. Adam Mahdi, our lab advances the science of AI evaluation, benchmarking, safety and security. Through rigorous empirical research, we study how LLMs and agentic systems reason, interact with humans and drive scientific discovery. We work with industry partners deploying AI where reliability matters.

Papers / 2025-26: 10
Researchers: 15
Research themes: 04
Top-tier venues: 06

Citations of record · 7 entries

University of Oxford · Host institution
Oxford Internet Institute · Affiliated department
Nature Medicine · Published 2026
ICML · Spotlight & papers, 2026
NeurIPS · Datasets & Benchmarks, 2025
ICLR · Accepted, 2026
EMNLP · Multiple, 2025

§01

Research themes

Four research themes. Each runs on a multi-year horizon.

P.01 Evaluation

ACTIVE

Benchmarks and Evaluation

We develop the science of LLM evaluation, setting the standard for rigorous assessment and identifying hidden risks before they matter.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028

P.02 Safety

ACTIVE

AI Safety and Security

From bias and toxicity to agentic misalignment, we study the full spectrum of AI risk and develop the technical and governance tools to address it.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028

P.03 Agentic

ACTIVE

Agentic AI for Science

We build agentic systems that automate scientific knowledge synthesis and discovery, with a focus on agents that are reliable, transparent and domain-grounded.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028

P.04 Human-AI

ACTIVE

Human–AI Interaction

We run large-scale empirical studies on how people use AI for high-stakes decisions, from healthcare and law to policy and beyond.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028

§02

Publication ledger

10 entries · last 18 months

A+ · Spotlight / top-venue

A · NeurIPS / ICML / ICLR / Nature

A− · EMNLP / Information Fusion

Sort: chronological ↓

Rank · Title / Authors · Venue · Date · Doc

[01] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Ł Borchmann, J Van Landeghem, M Turski +4

Benchmarks and Evaluation · Agentic AI

ICML (Spotlight)

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

[02] A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

H Mayne, JS Kang, D Gould +3

AI Safety and Alignment · Benchmarks and Evaluation

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

[03] It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

K Korgul, Y Yang, A Drohomirecki +7

Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

[04] Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

AM Bean, RE Payne, G Parsons +8

AI in Healthcare · Benchmarks and Evaluation

Nature Medicine

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

[05] Measuring what matters: Construct validity in large language model benchmarks

AM Bean, RO Kearns, A Romanou +4

Benchmarks and Evaluation

NeurIPS Datasets and Benchmarks

A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.

[06] Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

S Padarha, E Semenova, B Vidgen +2

Benchmarks and Evaluation · AI Safety and Alignment

NeurIPS LLM Lifecycle Workshop

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

[07] How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Y Yang, F Sondej, H Mayne +2

AI Safety and Alignment

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

[08] LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

H Mayne, RO Kearns, Y Yang +4

AI Safety and Alignment · Benchmarks and Evaluation

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

[09] Review of multimodal machine learning approaches in healthcare

F Krones, U Marikkar, G Parsons +2

AI in Healthcare

Information Fusion

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

[10] LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

J Khouja, K Korgul, S Hellsten +6

Benchmarks and Evaluation

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.

END / LEDGER

§03

Personnel roster

16 researchers

> Hiring philosophy // OxRML is staffed in a proportion that should alarm a traditional VC: a heavy concentration of DPhils, research engineers, and visiting fellows. The ratio of scientists to anything else is the point.

Director/001

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks.

DPh/002●

Felix Krones

DPhil Student

Multimodal AI, digital health

DPh/003●

Djavan De Clercq

DPhil Student

AI and food security, LLMs

DPh/004●

Andrew M. Bean

DPhil Student

LLM evaluations, human–LLM interaction

DPh/005●

Yushi Yang

DPhil Student

LLM & agentic post-training, AI alignment

DPh/006●

Harry Mayne

DPhil Student

LLM interpretability, AI safety, LLM evaluations

DPh/007●

Jessica Rodrigues

DPhil Student

Knowledge graphs, metascience

DPh/008●

Guy Parsons

DPhil Student

Healthcare AI, digital health

DPh/009●

Karolina Korgul

DPhil Student

AI safety, agentic AI

DPh/010●

Ryan Othniel Kearns

DPhil Student

Science of evals, reasoning in LLMs

DPh/011●

Shreyansh Padarha

DPhil Student

AI for science, AI safety, LLM evaluations

MSC/012●

Mia Kussman

MSc Student

Human–LLM interaction, LLM evaluations

MSC/013●

Caleb Tan

MSc Student

LLM evaluations, reasoning

VIS/014●

Sebastian Petric

Visiting Policy Fellow

LLMs and financial time series

AFF/015●

Tristan Naidoo

Research Affiliate

Public health AI, LLM evaluations

DPh/016●

Josh Lawman

Entrepreneur in Residence

Research-to-product translation

§04 / MOAT THESIS · FOR / Decision-makers · RE / Strategic AI capability · PREPARED // OXRML

Work with the lab
whose moat is the math.

The hardest technical problems produce the most defensible products. If a competitor can replicate your evaluation stack in a hackathon, you have a feature, not a moat. We work with the organisations that understand the difference.

Who we engage

Foundations deploying AI into high-stakes social settings
Governments writing AI policy under empirical pressure
Global corporates shipping LLM products at scale
Studios building the production layer above lab research

hello@oxrml.com

M.01 TIER-A

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Horizon: ½ day → 12 wk
Format: On-site / hybrid
Example: Evaluation hardening for clinical LLM deployments. 3 days, multi-team

M.02 TIER-S

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Horizon: 3 → 12 months
Format: Lab + studio
Example: Eval harness + safety dashboard. Research IP, production engineering

M.03 TIER-S+

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Horizon: 2 → 5 years
Format: Named programme
Example: Multi-year programme with foundation + dedicated DPhil studentships

Signed / OxRML Leadership · Date / 2026-05-25 · Inquiries welcome

§05

Signal dispatch

Subscribe + recent transmissions

AI News · LIVE

From the OxRML lab.

New papers, open positions, partnership opportunities, and what we have been reading.

Cadence: Quarterly
Channels: Email only
Spam: Never

Lab log / latest · 11 entries

01 · Papers accepted at ICML 2026! · PAPER · May 2026
02 · OxRML at ICLR 2026 · CONF · April 2026
03 · Ryan Othniel Kearns Wins MSc Thesis Prize · AWARD · February 2026
04 · New Paper in Nature Medicine! · PAPER · February 2026
05 · OxRML @ NeurIPS 2025 · CONF · December 2025
06 · OxRML @ EMNLP 2025 · CONF · November 2025
07 · Prof. Adam Mahdi Wins Teaching Excellence Award 2025 · AWARD · June 2025
08 · New Paper in Information Fusion! · PAPER · February 2025
09 · OxRML @ NeurIPS 2024 · CONF · December 2024
10 · Winners of 2024 PhysioNet Challenge · AWARD · September 2024
11 · OxRML @ ICML 2024 · CONF · July 2024