Each direction is a standing condition the lab is working toward. Together they describe the field we want AI evaluation, safety, and reasoning to become.
Direction 01
Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
Toward
A field where every benchmark publishes its construct validity, and where evaluation is treated as an experimental science.
Tag · EvaluationPanel Mauve
Direction 02
AI Safety and Security
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
Toward
A practice of measuring real harms — bias, toxicity, agentic misalignment — at the neuron and the deployment, before they reach the public.
Tag · SafetyPanel Sage
Direction 03
Agentic AI for Science
Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.
Toward
Scientific agents that synthesise knowledge reliably enough that a researcher can act on them — and transparently enough that they can audit them.
Tag · AgenticPanel Rose
Direction 04
Human–AI Interaction
Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.
Toward
Decisions made with AI in healthcare, law, and policy that are studied empirically — not assumed safe because the model is impressive.
Tag · Human-AIPanel Brass
“There is hope in honest error; none in the icy perfection of the mere stylist.”— J. D. Sedding, after the Glasgow Four
Four directions · One table
Plate III · The Catalogue
Ten plates,each a small case.
Ten papers from the past eighteen months. Nature Medicine, ICML (with a Spotlight), ICLR, NeurIPS, EMNLP, Information Fusion. Each plate is one finished cycle — built, peer-reviewed, published in the open.
10 of 10 plates · catalogue revised quarterlyA full bibliography is held by the lab; ask for the long form.
Plate IV · The Hands
Fifteen hands,one room.
A Principal Investigator, ten DPhil students, two MSc students, a visiting fellow, and a research affiliate. Each is introduced by what they are currently working on — the focus is the headline; the role is the caption.
Master of the Pavilion · Principal Investigator
Prof. Adam Mahdi
Oxford Internet Institute, University of Oxford
Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.
Currently practising
Coaching the lab's research cycles across evaluation, safety, agentic AI, and human–AI interaction — measuring what we ship before we ship it.
01 / 14
DPhil Student
Felix Krones
Multimodal evaluation across imaging and clinical text.
02 / 14
DPhil Student
Djavan De Clercq
LLMs applied to food-security data and policy questions.
03 / 14
DPhil Student
Andrew M. Bean
LLM evaluations that capture how people actually use models.
04 / 14
DPhil Student
Yushi Yang
Post-training for LLM and agentic alignment, at the neuron level.
05 / 14
DPhil Student
Harry Mayne
Interpretability and safety-relevant LLM evaluations.
06 / 14
DPhil Student
Jessica Rodrigues
Knowledge-graph methods for metascience and research synthesis.
07 / 14
DPhil Student
Guy Parsons
Healthcare AI evaluation grounded in clinical workflow.
08 / 14
DPhil Student
Karolina Korgul
Agentic-AI safety and web-agent persuasion attacks.
09 / 14
DPhil Student
Ryan Othniel Kearns
The science of evals — measuring reasoning honestly.
10 / 14
DPhil Student
Shreyansh Padarha
Agentic systems for science, with safety and eval rigour.
11 / 14
MSc Student
Mia Kussman
Studies of human–LLM interaction and LLM evaluation.
12 / 14
MSc Student
Caleb Tan
LLM evaluation and reasoning benchmarks.
13 / 14
Visiting Policy Fellow
Sebastian Petric
LLMs applied to financial time series, at the policy boundary.
14 / 14
Research Affiliate
Tristan Naidoo
Public-health AI and LLM evaluations grounded in epidemiology.
One sensei · Fourteen hands · Open to visiting researchersThe room is the architecture; the hands are the work.
Plate V · The Reading Room
What thelab is reading.
The reading-room ledger — papers accepted, conferences attended, honours noted. The most recent entry takes the rose tag. The ledger is kept in the open so collaborators always know what is current.
DateMarkEntryCategory
May 2026✦
Newest
Three OxRML papers accepted at ICML 2026 — including a Spotlight
Paper
April 2026◇
OxRML presenting at ICLR 2026
Convening
February 2026✦
New paper in Nature Medicine on LLMs as medical assistants
Paper
February 2026✻
Ryan Othniel Kearns wins MSc Thesis Prize
Honour
December 2025◇
OxRML at NeurIPS 2025
Convening
November 2025◇
OxRML at EMNLP 2025
Convening
June 2025✻
Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025
Honour
February 2025✦
New review paper in Information Fusion
Paper
September 2024✻
Winners of the 2024 PhysioNet Challenge
Honour
9 entries · ledger held in the open
Plate VI · The Salon
A room forlong conversations.
We work with foundations, governments, hyperscalers, and global corporates who want AI evaluation, safety, and reasoning treated with the same care as the systems they ship. Three formats — three cadences — one table.
Offering · 01
Workshops for industry teams
On-site sessions for product and ML teams on evaluation, safety, and agent reliability.
Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.
Cadence
Half-day to multi-week. On-site in your office or in Oxford; bespoke to the team and the question.
We work with engineering partners to turn lab work into tools other teams can run.
Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.
Cadence
Quarterly cycles with a partner studio. Joint roadmaps, shared evals, shipped tooling.