STREAM-ACTIVE · University of Oxford · Oxford, United Kingdom | messages/sec ≈ 4.2 · uptime 99.97

The lab as a reactive system for AI evaluation, safety & reasoning.

An empirical research group at the Oxford Internet Institute. OxRML studies LLM evaluation, safety, and reasoning, along with the agentic systems built on LLMs. Our research is built to stay responsive under pressure and useful to the teams shipping AI into the world.

partner with us → · read our research ↘

principal_publisher

Prof. Adam Mahdi

Principal Investigator · Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

tail -f oxrml.events · 4 topics · backpressure: ok

topic.evaluation · topic.safety · topic.agentic-science · topic.human-ai

▸ -11d · evaluation · measuring-what-matters → NeurIPS
▸ -14d · safety · dpo-toxicity neuron-level analysis → EMNLP
▸ -21d · agentic · TRAP web-agent persuasion benchmark → ICML
▸ -42d · human-ai · Nature Medicine: LLMs as med assistants
▸ -9d · evaluation · LingOly-TOO disentangles reasoning vs recall
▸ -33d · safety · self-generated counterfactuals unreliable

papers / yr

20+

DPhils + MScs

ICML / NeurIPS / Nature Med

2025-26

subscribers :: University of Oxford · Oxford Internet Institute · Nature Medicine · ICML · NeurIPS · ICLR · EMNLP

live.broadcast

May 2026 · PAPER · Three OxRML papers accepted at ICML 2026, including a Spotlight
April 2026 · CONF · OxRML presenting at ICLR 2026
February 2026 · PAPER · New paper in Nature Medicine on LLMs as medical assistants
February 2026 · AWARD · Ryan Othniel Kearns wins MSc Thesis Prize
December 2025 · CONF · OxRML at NeurIPS 2025
November 2025 · CONF · OxRML at EMNLP 2025
June 2025 · AWARD · Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025
February 2025 · PAPER · New review paper in Information Fusion
September 2024 · AWARD · Winners of the 2024 PhysioNet Challenge

[01] · research.streams

Four streams. One responsive system.

Each research direction publishes to a topic. Partners and students subscribe independently. Failures stay isolated; load scales horizontally.

── flow direction →

topic.evaluation · healthy

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

p99: 17ms · throughput: 4.1 msg/s

partitions

topic.safety · healthy

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

p99: 23ms · throughput: 3.4 msg/s

partitions

topic.agentic-science · backpressure

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

p99: 41ms · throughput: 2.2 msg/s

partitions

topic.human-ai · healthy

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

p99: 29ms · throughput: 1.9 msg/s

partitions

[02] · message.queue

Recent messages, processed.

Each publication is a message: produced by the lab, consumed by venues, acknowledged by the field. Ten in-flight, six pinned below.

producer ━━▶ oxrml.research ━━▶ venue.consumer · offset 10 · ack=all

# · payload · authors · venue · ack

showing 6 of 10 · lag = 0 · consume.all() →

[03] · worker.pool

The team that keeps the system responsive.

Fourteen researchers, partitioned into consumer groups by role. Each subscribes to multiple research streams; together they keep the lab elastic and resilient.

consumer-group.dphil · 10 workers · lag = 0

worker-01 · Felix Krones · DPhil Student · subscribes :: Multimodal AI, digital health
worker-02 · Djavan De Clercq · DPhil Student · subscribes :: AI and food security, LLMs
worker-03 · Andrew M. Bean · DPhil Student · subscribes :: LLM evaluations, human–LLM interaction
worker-04 · Yushi Yang · DPhil Student · subscribes :: LLM & agentic post-training, AI alignment
worker-05 · Harry Mayne · DPhil Student · subscribes :: LLM interpretability, AI safety, LLM evaluations
worker-06 · Jessica Rodrigues · DPhil Student · subscribes :: Knowledge graphs, metascience
worker-07 · Guy Parsons · DPhil Student · subscribes :: Healthcare AI, digital health
worker-08 · Karolina Korgul · DPhil Student · subscribes :: AI safety, agentic AI
worker-09 · Ryan Othniel Kearns · DPhil Student · subscribes :: Science of evals, reasoning in LLMs
worker-10 · Shreyansh Padarha · DPhil Student · subscribes :: AI for science, AI safety, LLM evaluations

consumer-group.msc · 2 workers · lag = 0

worker-01 · Mia Kussman · MSc Student · subscribes :: Human–LLM interaction, LLM evaluations
worker-02 · Caleb Tan · MSc Student · subscribes :: LLM evaluations, reasoning

consumer-group.fellows · 2 workers · lag = 0

worker-01 · Sebastian Petric · Visiting Policy Fellow · subscribes :: LLMs and financial time series
worker-02 · Tristan Naidoo · Research Affiliate · subscribes :: Public health AI, LLM evaluations

[04] · work.with(us)

Three ways to integrate the lab into your roadmap.

We design partnerships the way reactive systems are designed: bounded scope, clear contracts, graceful escalation. Pick a latency that fits.

pillar.01 · open

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

fn workshop(team) → resilience

latency: ½ day → 4 weeks

SLA: on-site · scoped to team

Book a workshop →

pillar.02 · open

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

fn build(lab, studio) → product

latency: 8 → 24 weeks

SLA: engineering partner co-owned

See our builds →

pillar.03 · open

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

fn partner(org) → multi_year

latency: 12 → 36 months

SLA: shared roadmap · named lab

Start a conversation →

The lab newsletter :: subscribe(quarterly)

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.