DOSSIER//OXRML-2026-Q2
UTC 0900
FILE / OXRML.LANDING.12·CLASS / TECHNICAL DOSSIER·PROVENANCE / UNIVERSITY OF OXFORD·STATUS / OPEN-INQUIRY
Deep Tech·Independent research lab·Est. Oxford

The science of how
machines reason,
evaluated to the bone.

> Abstract // An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built on them. We publish in Nature Medicine, ICML, NeurIPS, ICLR and EMNLP. We partner with the organisations that ship AI into rooms where being wrong is unacceptable.

Papers / 2025-26
10
Researchers
14
Program areas
04
Top-tier venues
06
Citations of record
  • University of Oxford
    Host institution
  • Oxford Internet Institute
    Affiliated department
  • Nature Medicine
    Published 2026
  • ICML
    Spotlight & papers, 2026
  • NeurIPS
    Datasets & Benchmarks, 2025
  • ICLR
    Accepted, 2026
  • EMNLP
    Multiple, 2025
§01

Program areas

Four research programs. Each runs on a multi-year horizon.
P.01·Evaluation
ACTIVE

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028
P.02·Safety
ACTIVE

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028
P.03·Agentic
ACTIVE

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028
P.04·Human-AI
ACTIVE

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

Output / Papers · Benchmarks · Tools
Horizon / 2025–2028
§02

Publication ledger

10 entries · last 18 months
A+ · Spotlight / top-venue
A · NeurIPS / ICML / ICLR / Nature Medicine
A- · EMNLP / Information Fusion
A+
[01] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Ł Borchmann, J Van Landeghem, M Turski +4
Benchmarks and Evaluation · Agentic AI
ICML (Spotlight)
May 2026

A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

A
[02] A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
H Mayne, JS Kang, D Gould +3
AI Safety and Alignment · Benchmarks and Evaluation
ICML
May 2026

LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

A
[03] It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
K Korgul, Y Yang, A Drohomirecki +7
Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment
ICML
May 2026

A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

A
[04] Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
AM Bean, RE Payne, G Parsons +8
AI in Healthcare · Benchmarks and Evaluation
Nature Medicine
February 2026

A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

A
[05] Measuring what matters: Construct validity in large language model benchmarks
AM Bean, RO Kearns, A Romanou +4
Benchmarks and Evaluation
NeurIPS Datasets and Benchmarks
November 2025

A construct-validity audit of widely-used LLM benchmarks: what they claim to measure versus what they capture.

A
[06] Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
S Padarha, E Semenova, B Vidgen +2
Benchmarks and Evaluation · AI Safety and Alignment
NeurIPS LLM Lifecycle Workshop
November 2025

How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

A-
[07] How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
Y Yang, F Sondej, H Mayne +2
AI Safety and Alignment
EMNLP
November 2025

Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

A-
[08] LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
H Mayne, RO Kearns, Y Yang +4
AI Safety and Alignment · Benchmarks and Evaluation
EMNLP
September 2025

Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

A-
[09] Review of multimodal machine learning approaches in healthcare
F Krones, U Marikkar, G Parsons +2
AI in Healthcare
Information Fusion
February 2025

A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

A
[10] LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
J Khouja, K Korgul, S Hellsten +6
Benchmarks and Evaluation
ICLR
April 2026

A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.

§03

Personnel roster

15 researchers · PI + 14 affiliated

> Hiring philosophy // OxRML is staffed in a proportion that should alarm a traditional VC: a heavy concentration of DPhil students, research engineers, and visiting fellows. The ratio of scientists to anything else is the point.

Personnel / PI001
Principal Investigator
Prof. Adam Mahdi
Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

DPh/002
Felix Krones
DPhil Student
Multimodal AI, digital health
DPh/003
Djavan De Clercq
DPhil Student
AI and food security, LLMs
DPh/004
Andrew M. Bean
DPhil Student
LLM evaluations, human–LLM interaction
DPh/005
Yushi Yang
DPhil Student
LLM & agentic post-training, AI alignment
DPh/006
Harry Mayne
DPhil Student
LLM interpretability, AI safety, LLM evaluations
DPh/007
Jessica Rodrigues
DPhil Student
Knowledge graphs, metascience
DPh/008
Guy Parsons
DPhil Student
Healthcare AI, digital health
DPh/009
Karolina Korgul
DPhil Student
AI safety, agentic AI
DPh/010
Ryan Othniel Kearns
DPhil Student
Science of evals, reasoning in LLMs
DPh/011
Shreyansh Padarha
DPhil Student
AI for science, AI safety, LLM evaluations
MSC/012
Mia Kussman
MSc Student
Human–LLM interaction, LLM evaluations
MSC/013
Caleb Tan
MSc Student
LLM evaluations, reasoning
VIS/014
Sebastian Petric
Visiting Policy Fellow
LLMs and financial time series
AFF/015
Tristan Naidoo
Research Affiliate
Public health AI, LLM evaluations
§04 / MOAT THESIS·FOR / Decision-makers·RE / Strategic AI capability

Work with the lab
whose moat is the math.

The hardest technical problems produce the most defensible products. If a competitor can replicate your evaluation stack in a hackathon, you have a feature, not a moat. We work with the organisations that understand the difference.

Who we engage
  • Foundations deploying AI into high-stakes social settings
  • Governments writing AI policy under empirical pressure
  • Global corporates shipping LLM products at scale
  • Studios building the production layer above lab research
M.01·TIER-A

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

Horizon
½ day → 12 wk
Format
On-site / hybrid
Example
Evaluation hardening for clinical LLM deployments. 3 days, multi-team
Book a workshop
M.02·TIER-S

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

Horizon
3 → 12 months
Format
Lab + studio
Example
Eval harness + safety dashboard. Research IP, production engineering
See our builds
M.03·TIER-S+

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

Horizon
2 → 5 years
Format
Named programme
Example
Multi-year programme with foundation + dedicated DPhil studentships
Start a conversation
Signed / OxRML Leadership·Date / 2026-05-23·Inquiries welcome
§05

Signal dispatch

Subscribe + recent transmissions
The lab newsletter·LIVE

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.
Cadence
Quarterly
Channels
Email only
Spam
Never
Lab log / latest·9 entries
  1. Three OxRML papers accepted at ICML 2026, including a Spotlight
     PAPER · May 2026
  2. OxRML presenting at ICLR 2026
     CONF · April 2026
  3. New paper in Nature Medicine on LLMs as medical assistants
     PAPER · February 2026
  4. Ryan Othniel Kearns wins MSc Thesis Prize
     AWARD · February 2026
  5. OxRML at NeurIPS 2025
     CONF · December 2025
  6. OxRML at EMNLP 2025
     CONF · November 2025
  7. Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025
     AWARD · June 2025
  8. New review paper in Information Fusion
     PAPER · February 2025
  9. Winners of the 2024 PhysioNet Challenge
     AWARD · September 2024