An atelier for the science of AI

What does it take to bring honest AI into the world?

You already know your model can do remarkable things. You already sense where it falls short. We are the lab you sit beside while you sort the one from the other, patiently and on the record.

An empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them.

Begin a conversation · Read our research

“What would your team measure if no one was watching?”

“Where does your model already disagree with itself?”

“Which evaluation would you publish without hedging?”

“What is trying to emerge in your roadmap this year?”

Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

Oxford, United Kingdom · hello@oxrml.com

In good company

University of Oxford
Oxford Internet Institute
Nature Medicine
ICML
NeurIPS
ICLR
EMNLP

From the journal

Recent passages

Small markers of what has arrived in the last few seasons.

A paper
May 2026
Three OxRML papers accepted at ICML 2026, including a Spotlight
A gathering
April 2026
OxRML presenting at ICLR 2026
A paper
February 2026
New paper in Nature Medicine on LLMs as medical assistants
A milestone
February 2026
Ryan Othniel Kearns wins MSc Thesis Prize
A gathering
December 2025
OxRML at NeurIPS 2025
A gathering
November 2025
OxRML at EMNLP 2025

Four conditions

AI is tended into safety,
by particular people,
in particular places.

Four conditions our group cultivates. Each one is a long act of looking at models, at the people who use them, and at the gap between what is claimed and what is true. None of them are solutions. They are postures.

I. Holding space

Benchmarks and Evaluation

“What is true about this system, under load?”

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

we measure, we wait, we record

II. Tending the edges

AI Safety and Security

“Where does it break, and who is hurt when it does?”

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

we watch, we name, we protect

III. Building a midwife

Agentic AI for Science

“Can an agent help science itself emerge faster?”

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

we design, we ground, we test in the open

IV. Listening to people

Human–AI Interaction

“What happens when a stranger holds your decision?”

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

we run trials, we collect testimony, we publish

Recent harvest

These came up this season.
Tend a few, take what is useful.

Each paper began as a question someone in the group could not put down. Follow the links to the full work.

4 more in the journal, ripening.

The circle

The companions who hold this work

Twelve are pictured below; two more are listed at the bottom. Each member of OxRML is here because they are chasing a question, a method, or a stubborn intuition. They are here to find out.

Felix Krones
DPhil Student
Multimodal AI, digital health
Djavan De Clercq
DPhil Student
AI and food security, LLMs
Andrew M. Bean
DPhil Student
LLM evaluations, human–LLM interaction
Yushi Yang
DPhil Student
LLM & agentic post-training, AI alignment
Harry Mayne
DPhil Student
LLM interpretability, AI safety, LLM evaluations
Jessica Rodrigues
DPhil Student
Knowledge graphs, metascience
Guy Parsons
DPhil Student
Healthcare AI, digital health
Karolina Korgul
DPhil Student
AI safety, agentic AI
Ryan Othniel Kearns
DPhil Student
Science of evals, reasoning in LLMs
Shreyansh Padarha
DPhil Student
AI for science, AI safety, LLM evaluations
Mia Kussman
MSc Student
Human–LLM interaction, LLM evaluations
Caleb Tan
MSc Student
LLM evaluations, reasoning

And also, holding from a distance

Sebastian Petric
LLMs and financial time series
Tristan Naidoo
Public health AI, LLM evaluations

For decision-makers at Gates Foundation, Schwarz Group, and the rest of you

You already have a roadmap.
We bring patient expertise
to the parts that are still becoming.

Three ways we sit alongside organisations whose decisions matter. Each one is a different rhythm of involvement. In none of them do we do the work for you.

First trimester · 01
Workshops for industry teams
You bring
A team that ships AI into places where mistakes are costly.
We hold
Half-day to multi-week formats on evaluation, safety, and agent reliability, adapted to your stack, your stakes, and your timeline.
What arrives
Your team leaves with shared language, working harnesses, and a clearer view of what you are shipping.
Book a workshop
Second trimester · 02
Tools co-built with engineering partners
You bring
A research idea that needs to become a tool other people can use.
We hold
We partner with first-class development studios to turn a paper or prototype into a production-grade tool: evaluation harnesses, safety dashboards, and agentic-research platforms.
What arrives
A piece of software that carries the research methodology and survives contact with real users.
See our builds
A long gestation · 03
Research partnerships
You bring
A foundation, government, or company that wants AI to go well in its domain.
We hold
Multi-year applied programmes with shared roadmaps, dedicated DPhil studentships, named labs, and joint publications. We pick partners who care about getting AI right.
What arrives
A body of work that outlasts a single project, and a relationship built on the public record.
Start a conversation

Talk to us about the work that matters most this year.

Begin a conversation

The lab newsletter

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.