OxRML · University of Oxford · Oxford Internet Institute

Reasoning with Machines Lab

@ University of Oxford

Led by Prof. Adam Mahdi, we work on the science of evaluating, benchmarking and securing modern AI. Our empirical research asks how LLMs and agentic systems reason, collaborate with humans and accelerate scientific discovery. We work alongside industry partners deploying these systems where reliability matters.

Research themes

Benchmarks & Evaluation

AI Safety & Security

Agentic AI for Science

Human–AI Interaction

Benchmarks and Evaluation ◇ AI Safety and Security ◇ Agentic AI for Science ◇ Human–AI Interaction ◇ Mechanistic interpretability ◇ Red-teaming ◇ Capability elicitation ◇ Benchmark design ◇ Bias & toxicity ◇ Domain-grounded agents ◇ Scientific discovery ◇ AI governance ◇ Distributional robustness ◇ Self-explanation ◇ Reasoning under pressure

01·What we do

Four research lines, asking whether we can trust what these systems do next.

01 / 04

Benchmarks and Evaluation

We develop the science of LLM evaluation, setting the standard for rigorous assessment and identifying hidden risks before they matter.

Benchmark design
Statistical evaluation
Capability elicitation
Contamination audits

02 / 04

AI Safety and Security

From bias and toxicity to agentic misalignment, we study the full spectrum of AI risk and develop the technical and governance tools to address it.

Mechanistic interpretability
Red-teaming
Agentic misalignment
Policy translation

03 / 04

Agentic AI for Science

We build agentic systems that automate scientific knowledge synthesis and discovery, with a focus on agents that are reliable, transparent and domain-grounded.

Literature synthesis
Hypothesis generation
Evidence grounding
Domain transfer

04 / 04

Human–AI Interaction

We run large-scale empirical studies on how people use AI for high-stakes decisions, from healthcare and law to policy and beyond.

Field experiments
Decision-aid design
Clinical evaluation
Policy translation

02·Selected work

Recent publications

2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Ł Borchmann, J Van Landeghem, M Turski et al.

ICML

2026

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

H Mayne, JS Kang, D Gould et al.

ICML

2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

K Korgul, Y Yang, A Drohomirecki et al.

ICML

2026

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

AM Bean, RE Payne, G Parsons et al.

Nature Medicine

2025

Measuring what matters: Construct validity in large language model benchmarks

AM Bean, RO Kearns, A Romanou et al.

NeurIPS Datasets and Benchmarks

03·For industry

We help teams ship AI they can defend.

Two ways to work with us: third-party evaluation of your models and agents, or a focused engagement that turns one of our research outputs into a tool you own.

Evaluation

Pre-deployment audits, custom benchmarks, agentic red-teaming.

Co-build

We work with engineering partners to turn lab work into tools other teams can run.

Sectors

SaaS, public sector, financial services, healthcare.

Engagement

12–24 weeks · NDA-friendly · publishable outcomes negotiable.

04·The lab