An atelier for the science of AI

What does it take to bring honest AI into the world?

You already know your model can do remarkable things. You already sense where it falls short. We are the lab you sit beside while you sort the one from the other, patiently and on the record.

An empirical research group at the Oxford Internet Institute. We study the evaluation, safety, and reasoning of LLMs, and the agentic systems built from them.

What would your team measure if no one were watching?
Where does your model already disagree with itself?
Which evaluation would you publish without hedging?
What is trying to emerge in your roadmap this year?

Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Oxford, United Kingdom · hello@oxrml.com

In good company

  • University of Oxford
  • Oxford Internet Institute
  • Nature Medicine
  • ICML
  • NeurIPS
  • ICLR
  • EMNLP

From the journal

Recent passages

  1. A paper

    May 2026

    Three OxRML papers accepted at ICML 2026 — including a Spotlight

  2. A gathering

    April 2026

    OxRML presenting at ICLR 2026

  3. A paper

    February 2026

    New paper in Nature Medicine on LLMs as medical assistants

  4. A milestone

    February 2026

    Ryan Othniel Kearns wins MSc Thesis Prize

  5. A gathering

    December 2025

    OxRML at NeurIPS 2025

  6. A gathering

    November 2025

    OxRML at EMNLP 2025

Four conditions

AI is tended into safety,
by particular people,
in particular places.

Four conditions our group cultivates. Each one is a long act of looking at models, at the people who use them, and at the gap between what is claimed and what is true. None of them are solutions. They are postures.

I. Holding space

Benchmarks and Evaluation

What is true about this system, under load?

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

we measure, we wait, we record

II. Tending the edges

AI Safety and Security

Where does it break, and who is hurt when it does?

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

we watch, we name, we protect

III. Building a midwife

Agentic AI for Science

Can an agent help science itself emerge faster?

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

we design, we ground, we test in the open

IV. Listening to people

Human–AI Interaction

What happens when a stranger holds your decision?

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

we run trials, we collect testimony, we publish

Recent harvest

These came up this season.
Tend a few, take what is useful.

4 more in the journal, ripening.

The circle

The companions who hold this work

Twelve are pictured below; two more are listed at the bottom. Each member of OxRML is here because they are chasing a question, a method, or a stubborn intuition. They are here to find out.

  • Felix Krones

    DPhil Student

    Multimodal AI, digital health

  • Djavan De Clercq

    DPhil Student

    AI and food security, LLMs

  • Andrew M. Bean

    DPhil Student

    LLM evaluations, human–LLM interaction

  • Yushi Yang

    DPhil Student

    LLM & agentic post-training, AI alignment

  • Harry Mayne

    DPhil Student

    LLM interpretability, AI safety, LLM evaluations

  • Jessica Rodrigues

    DPhil Student

    Knowledge graphs, metascience

  • Guy Parsons

    DPhil Student

    Healthcare AI, digital health

  • Karolina Korgul

    DPhil Student

    AI safety, agentic AI

  • Ryan Othniel Kearns

    DPhil Student

    Science of evals, reasoning in LLMs

  • Shreyansh Padarha

    DPhil Student

    AI for science, AI safety, LLM evaluations

  • Mia Kussman

    MSc Student

    Human–LLM interaction, LLM evaluations

  • Caleb Tan

    MSc Student

    LLM evaluations, reasoning

And also, holding from a distance

  • Sebastian Petric

    LLMs and financial time series

  • Tristan Naidoo

    Public health AI, LLM evaluations

For decision-makers at the Gates Foundation, Schwarz Group, and the rest of you

You already have a roadmap.
We bring patient expertise
to the parts that are still becoming.

Three ways we sit alongside organisations whose decisions matter. Each one is a different rhythm of involvement. None of them are us doing the work for you.

  1. First trimester

    Workshops for industry teams

    You bring

    A team that ships AI into places where mistakes are costly.

    We hold

    Half-day to multi-week formats on evaluation, safety, and agent reliability, adapted to your stack, your stakes, and your timeline.

    What arrives

    Your team leaves with shared language, working harnesses, and a clearer view of what you are shipping.

    Book a workshop

  2. Second trimester

    Tools co-built with engineering partners

    You bring

    A research idea that needs to become a tool other people can use.

    We hold

    We partner with first-class development studios to turn a paper or prototype into a production-grade tool: evaluation harnesses, safety dashboards, and agentic-research platforms.

    What arrives

    A piece of software that carries the research methodology and survives contact with real users.

    See our builds

  3. A long gestation

    Research partnerships

    You bring

A foundation, government, or company that wants AI to go well in its domain.

    We hold

    Multi-year applied programmes with shared roadmaps, dedicated DPhil studentships, named labs, and joint publications. We pick partners who care about getting AI right.

    What arrives

    A body of work that outlasts a single project, and a relationship built on the public record.

    Start a conversation

Talk to us about the work that matters most this year.

Begin a conversation

The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.