A research lab run as a gift economy · University of Oxford

All our research is free for individuals.
Teams partner with us to scale it.

We are OxRML, the Reasoning with Machines Lab at the University of Oxford. Every paper, benchmark, dataset, and prompt we produce is open access, forever, with no asterisks. If you work alone or in a small group, that's yours to read, cite, fork, and ship. If your team wants the people behind it, that's where we come in.

Read our research, or jump straight to the list.

Prof. Adam Mahdi
Principal Investigator · Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Papers: 10, open access
Themes: 4, free to read
Team: 14 + PI

Three doors in. Pick the one that fits.

Three tiers by design, and the cheapest tier gets the most. Public science is the product. Workshops and partnerships exist so the lab can keep producing it.

A short note. “Free” here means free: no email wall, no preview-only PDFs, no “contact us for the dataset”. What we publish is what you get.

For individuals

Free forever
£0 / paper, dataset, prompt

Read, cite, fork, and ship anything we publish. Built for researchers, students, hobbyist evaluators, and engineers who want the source. No license gymnastics, no signups, no quota.

  • 10 peer-reviewed publications, open access
  • All 4 research themes
  • Benchmarks & datasets, fully documented
  • Prompts & eval rigs in plain text
  • Lab reading list (quarterly digest)
  • The work of all 14 researchers
Start reading

We mean it: no email required, no “preview” PDFs, and the arXiv, OpenReview, and Nature links all go direct.

For teams

Most common
Workshop

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

  • Half-day to multi-week formats
  • Evaluation, safety, and agent reliability
  • Designed for product & ML teams shipping LLMs
  • On-site or in Oxford, your call
  • Tools co-built with engineering partners, available on request
  • Slack channel with the researchers
Book a workshop

Booked by teams in healthcare, finance, retail, and government that ship LLM products into high-stakes settings.

For enterprises

Multi-year
Partner

Applied research collaborations with foundations, governments, and large companies.

  • Multi-year programmes with shared roadmaps
  • Dedicated DPhil studentships
  • Named labs & co-authored publications
  • Tools co-built with engineering partners, production-grade
  • Foundations, governments & global corporates
  • Quarterly research reviews on site
Start a conversation

We work with partners who care about getting AI right. A small number of slots per year.

How it works. Individual researchers read the work, build with it, and bring it to their team. When the team needs to scale a benchmark, run an evaluation cohort, or audit an agent at production load, it books a workshop. When the workshop turns into a three-year roadmap, it becomes a partnership. We built it in this order on purpose.

Four research themes. Every output free to read.

The questions the lab is built around. If your day job touches any of them, you're in the right place.

  • Theme 01 · evaluation

    Benchmarks and Evaluation

    We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

  • Theme 02 · safety

    AI Safety and Security

    We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

  • Theme 03 · agentic

    Agentic AI for Science

    Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

  • Theme 04 · human-ai

    Human–AI Interaction

    Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

10 papers. Read any of them right now.

ICML (including a Spotlight), Nature Medicine, NeurIPS Datasets & Benchmarks, ICLR, EMNLP. Click through and you're on arXiv or OpenReview, not a request form.

All links resolve directly to the venue.
  1. 01

    Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

    A benchmark that tells real navigation apart from stochastic search when agents work over document collections.

    Ł Borchmann, J Van Landeghem, M Turski, S Padarha, RO Kearns, A Mahdi, et al.

    ICML (Spotlight) · May 2026
    Benchmarks and Evaluation · Agentic AI
    Read →
  2. 02

    A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

    LLM self-explanations are usually dismissed as unreliable. Measured the right way, they predict model behavior.

    H Mayne, JS Kang, D Gould, K Ramchandran, A Mahdi, NY Siegel

    ICML · May 2026
    AI Safety and Alignment · Benchmarks and Evaluation
    Read →
  3. 03

    It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

    A benchmark for whether web agents can be socially engineered into abandoning the user's task. Today's agents fall for it.

    K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P H S Torr, A Mahdi, A Bibi

    ICML · May 2026
    Benchmarks and Evaluation · Agentic AI · AI Safety and Alignment
    Read →
  4. 04

    Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

    A preregistered randomized study in Nature Medicine on how reliably LLMs serve as medical assistants for the general public.

    AM Bean, RE Payne, G Parsons, HR Kirk, J Ciro, R Mosquera-Gómez, S Hincapié, AS Ekanayaka, L Tarassenko, L Rocher, A Mahdi

    Nature Medicine · February 2026
    AI in Healthcare · Benchmarks and Evaluation
    Read →
  5. 05

    Measuring what matters: Construct validity in large language model benchmarks

    A construct-validity audit of widely used LLM benchmarks: what they claim to measure versus what they capture.

    AM Bean, RO Kearns, A Romanou, FS Hafner, H Mayne, J Batzner, et al.

    NeurIPS Datasets and Benchmarks · November 2025
    Benchmarks and Evaluation
    Read →
  6. 06

    Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

    How LLM judges degrade across languages, modalities, and domains, and where the failure modes sit.

    S Padarha, E Semenova, B Vidgen, A Mahdi, S A Hale

    NeurIPS LLM Lifecycle Workshop · November 2025
    Benchmarks and Evaluation · AI Safety and Alignment
    Read →
  7. 07

    How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

    Direct Preference Optimization reduces toxicity. We trace where it acts, neuron by neuron.

    Y Yang, F Sondej, H Mayne, A Lee, A Mahdi

    EMNLP · November 2025
    AI Safety and Alignment
    Read →
  8. 08

    LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

    Ask an LLM "what would change your answer?" and it looks like introspection. It is often confabulation.

    H Mayne, RO Kearns, Y Yang, AM Bean, E Delaney, C Russell, A Mahdi

    EMNLP · September 2025
    AI Safety and Alignment · Benchmarks and Evaluation
    Read →
  9. 09

    Review of multimodal machine learning approaches in healthcare

    A survey of multimodal ML in clinical practice, from data-fusion strategies through to deployment.

    F Krones, U Marikkar, G Parsons, A Szmul, A Mahdi

    Information Fusion · February 2025
    AI in Healthcare
    Read →
  10. 10

    LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

    A benchmark that obfuscates orthography to strip memorised knowledge out of reasoning problems, showing how much "reasoning" was recall.

    J Khouja, K Korgul, S Hellsten, L Yang, V Neacsu, H Mayne, RO Kearns, A Bean, A Mahdi

    ICLR · April 2026
    Benchmarks and Evaluation
    Read →

The 15 people who do the work.

A small team in Oxford. We write the papers, tag the datasets, and answer the workshop emails. When you book the lab, you get these people, not an account exec.

14 researchers + 1 PI
Prof. Adam Mahdi
Principal Investigator

Oxford Internet Institute, University of Oxford

  • Felix Krones

    DPhil Student

    Multimodal AI, digital health

  • Djavan De Clercq

    DPhil Student

    AI and food security, LLMs

  • Andrew M. Bean

    DPhil Student

    LLM evaluations, human–LLM interaction

  • Yushi Yang

    DPhil Student

    LLM & agentic post-training, AI alignment

  • Harry Mayne

    DPhil Student

    LLM interpretability, AI safety, LLM evaluations

  • Jessica Rodrigues

    DPhil Student

    Knowledge graphs, metascience

  • Guy Parsons

    DPhil Student

    Healthcare AI, digital health

  • Karolina Korgul

    DPhil Student

    AI safety, agentic AI

  • Ryan Othniel Kearns

    DPhil Student

    Science of evals, reasoning in LLMs

  • Shreyansh Padarha

    DPhil Student

    AI for science, AI safety, LLM evaluations

  • Mia Kussman

    MSc Student

    Human–LLM interaction, LLM evaluations

  • Caleb Tan

    MSc Student

    LLM evaluations, reasoning

  • Sebastian Petric

    Visiting Policy Fellow

    LLMs and financial time series

  • Tristan Naidoo

    Research Affiliate

    Public health AI, LLM evaluations

What we've been up to.

A working diary: papers shipped, talks given, awards we didn't see coming. Most recent first.

9 entries · updated quarterly
  1. 01 · May 2026
    Paper

    Three OxRML papers accepted at ICML 2026 — including a Spotlight

  2. 02 · April 2026
    Conference

    OxRML presenting at ICLR 2026

  3. 03 · February 2026
    Paper

    New paper in Nature Medicine on LLMs as medical assistants

  4. 04 · February 2026
    Award

    Ryan Othniel Kearns wins MSc Thesis Prize

  5. 05 · December 2025
    Conference

    OxRML at NeurIPS 2025

  6. 06 · November 2025
    Conference

    OxRML at EMNLP 2025

  7. 07 · June 2025
    Award

    Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025

  8. 08 · February 2025
    Paper

    New review paper in Information Fusion

  9. 09 · September 2024
    Award

    Winners of the 2024 PhysioNet Challenge

Where the work has landed.

Venues and institutions that have published or hosted our research. Each one ties to a paper or programme in the diary above.

  • 01 · University of Oxford · Host institution
  • 02 · Oxford Internet Institute · Affiliated department
  • 03 · Nature Medicine · Published 2026
  • 04 · ICML · Spotlight & papers, 2026
  • 05 · NeurIPS · Datasets & Benchmarks, 2025
  • 06 · ICLR · Accepted, 2026
  • 07 · EMNLP · Multiple, 2025
The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.