Reasoning with Machines Lab

Reasoning,
evaluated.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, and reasoning, and the agentic systems built on LLMs.

Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads the Reasoning with Machines Lab (OxRML). The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Recognised by

University of Oxford·Host institution
Oxford Internet Institute·Affiliated department
Nature Medicine·Published 2026
ICML·Spotlight & papers, 2026
NeurIPS·Datasets & Benchmarks, 2025
ICLR·Accepted, 2026
EMNLP·Multiple, 2025

The latest

From the lab

  • New paper·May 2026

    Three OxRML papers accepted at ICML 2026, including a Spotlight

  • At conference·April 2026

    OxRML presenting at ICLR 2026

  • New paper·February 2026

    New paper in Nature Medicine on LLMs as medical assistants

  • Award·February 2026

    Ryan Othniel Kearns wins MSc Thesis Prize

  • At conference·December 2025

    OxRML at NeurIPS 2025

  • At conference·November 2025

    OxRML at EMNLP 2025

Four directions

We study the science of how machines reason.

Evaluation·01 / 04

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

Safety·02 / 04

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

Agentic·03 / 04

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

Human-AI·04 / 04

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

The work

Peer-reviewed,
published, cited.

A selection from the last eighteen months at ICML, NeurIPS, ICLR, EMNLP, Nature Medicine, and Information Fusion. Each one measures something concrete about what current models do.

Featured·ICML (Spotlight)·May 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

ICML·May 2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

ICML·May 2026

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

Nature Medicine·February 2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

A preregistered randomized study of how reliably LLMs serve as medical assistants for the general public.

NeurIPS Datasets and Benchmarks·November 2025

Measuring what matters: Construct validity in large language model benchmarks

A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.

NeurIPS LLM Lifecycle Workshop·November 2025

Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

EMNLP·November 2025

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

EMNLP·September 2025

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

Information Fusion·February 2025

Review of multimodal machine learning approaches in healthcare

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

ICLR·April 2026

LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.


The people

A small group
of careful people.

DPhil students, MSc researchers, visiting fellows, and affiliates. We come from machine learning, statistics, clinical medicine, and policy.

  • Felix Krones

    DPhil Student

    Multimodal AI, digital health

  • Djavan De Clercq

    DPhil Student

    AI and food security, LLMs

  • Andrew M. Bean

    DPhil Student

    LLM evaluations, human–LLM interaction

  • Yushi Yang

    DPhil Student

    LLM & agentic post-training, AI alignment

  • Harry Mayne

    DPhil Student

    LLM interpretability, AI safety, LLM evaluations

  • Jessica Rodrigues

    DPhil Student

    Knowledge graphs, metascience

  • Guy Parsons

    DPhil Student

    Healthcare AI, digital health

  • Karolina Korgul

    DPhil Student

    AI safety, agentic AI

  • Ryan Othniel Kearns

    DPhil Student

    Science of evals, reasoning in LLMs

  • Shreyansh Padarha

    DPhil Student

    AI for science, AI safety, LLM evaluations

  • Mia Kussman

    MSc Student

    Human–LLM interaction, LLM evaluations

  • Caleb Tan

    MSc Student

    LLM evaluations, reasoning

  • Sebastian Petric

    Visiting Policy Fellow

    LLMs and financial time series

  • Tristan Naidoo

    Research Affiliate

    Public health AI, LLM evaluations

Plus collaborators across the Oxford Internet Institute, the Department of Engineering Science, and the Big Data Institute.

Work with us

Three ways
to partner.

We pick collaborators with care. If you are building AI into a setting where being wrong has a cost, talk to us.

01

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

02

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

03

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Or just say hello

hello@oxrml.com

The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.