STREAM-ACTIVE · University of Oxford · Oxford, United Kingdom | messages/sec ≈ 4.2 · uptime 99.97

The lab as a reactive system for AI evaluation, safety & reasoning.

OxRML is an empirical research group at the Oxford Internet Institute. We study LLM evaluation, safety, reasoning, and the agentic systems built from them. Our research is built to stay responsive under pressure and useful to the teams shipping AI into the world.

principal_publisher
Prof. Adam Mahdi
Principal Investigator · Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks. He won the Oxford Teaching Excellence Award in 2025.

tail -f oxrml.events · 4 topics · backpressure: ok
topic.evaluation · topic.safety · topic.agentic-science · topic.human-ai
  • -11d · evaluation · measuring-what-matters → NeurIPS
  • -14d · safety · dpo-toxicity neuron-level analysis → EMNLP
  • -21d · agentic · TRAP web-agent persuasion benchmark → ICML
  • -42d · human-ai · Nature Medicine: LLMs as med assistants
  • -9d · evaluation · LingOly-TOO disentangles reasoning vs recall
  • -33d · safety · self-generated counterfactuals unreliable
papers / yr: 20+
DPhils + MScs: 14
ICML / NeurIPS / Nature Med: 2025-26
subscribers :: University of Oxford · Oxford Internet Institute · Nature Medicine · ICML · NeurIPS · ICLR · EMNLP
  • May 2026 · PAPER · Three OxRML papers accepted at ICML 2026, including a Spotlight
  • April 2026 · CONF · OxRML presenting at ICLR 2026
  • February 2026 · PAPER · New paper in Nature Medicine on LLMs as medical assistants
  • February 2026 · AWARD · Ryan Othniel Kearns wins MSc Thesis Prize
  • December 2025 · CONF · OxRML at NeurIPS 2025
  • November 2025 · CONF · OxRML at EMNLP 2025
  • June 2025 · AWARD · Prof. Adam Mahdi wins Oxford Teaching Excellence Award 2025
  • February 2025 · PAPER · New review paper in Information Fusion
  • September 2024 · AWARD · Winners of the 2024 PhysioNet Challenge
[01] · research.streams

Four streams. One responsive system.

Each research direction publishes to a topic. Partners and students subscribe independently. Failures stay isolated; load scales horizontally.

topic.evaluation · healthy

Benchmarks and Evaluation

We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.

p99: 17ms · throughput: 4.1 msg/s · partitions: 6
topic.safety · healthy

AI Safety and Security

We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.

p99: 23ms · throughput: 3.4 msg/s · partitions: 5
topic.agentic-science · backpressure

Agentic AI for Science

Agentic systems for scientific work. We focus on keeping them reliable, transparent, and grounded in the domain.

p99: 41ms · throughput: 2.2 msg/s · partitions: 4
topic.human-ai · healthy

Human–AI Interaction

Empirical studies of how people use AI in high-stakes settings: healthcare, law, and policy.

p99: 29ms · throughput: 1.9 msg/s · partitions: 4
[02] · message.queue

Recent messages, processed.

Each publication is a message: produced by the lab, consumed by venues, acknowledged by the field. Ten are in flight; six are pinned below.

producer ━━▶ oxrml.research ━━▶ venue.consumer
showing 6 of 10 · lag = 0 · consume.all()
[03] · worker.pool

The team that keeps the system responsive.

Fourteen researchers, partitioned into consumer groups by role. Each subscribes to multiple research streams; together they keep the lab elastic and resilient.

consumer-group.dphil · 10 workers · lag = 0
  • Felix Krones
    worker-01 · DPhil Student
    subscribes :: Multimodal AI, digital health

  • Djavan De Clercq
    worker-02 · DPhil Student
    subscribes :: AI and food security, LLMs

  • Andrew M. Bean
    worker-03 · DPhil Student
    subscribes :: LLM evaluations, human–LLM interaction

  • Yushi Yang
    worker-04 · DPhil Student
    subscribes :: LLM & agentic post-training, AI alignment

  • Harry Mayne
    worker-05 · DPhil Student
    subscribes :: LLM interpretability, AI safety, LLM evaluations

  • Jessica Rodrigues
    worker-06 · DPhil Student
    subscribes :: Knowledge graphs, metascience

  • Guy Parsons
    worker-07 · DPhil Student
    subscribes :: Healthcare AI, digital health

  • Karolina Korgul
    worker-08 · DPhil Student
    subscribes :: AI safety, agentic AI

  • Ryan Othniel Kearns
    worker-09 · DPhil Student
    subscribes :: Science of evals, reasoning in LLMs

  • Shreyansh Padarha
    worker-10 · DPhil Student
    subscribes :: AI for science, AI safety, LLM evaluations

consumer-group.msc · 2 workers · lag = 0
  • Mia Kussman
    worker-01 · MSc Student
    subscribes :: Human–LLM interaction, LLM evaluations

  • Caleb Tan
    worker-02 · MSc Student
    subscribes :: LLM evaluations, reasoning

consumer-group.fellows · 2 workers · lag = 0
  • Sebastian Petric
    worker-01 · Visiting Policy Fellow
    subscribes :: LLMs and financial time series

  • Tristan Naidoo
    worker-02 · Research Affiliate
    subscribes :: Public health AI, LLM evaluations

[04] · work.with(us)

Three ways to integrate the lab into your roadmap.

We design partnerships the way reactive systems are designed: bounded scope, clear contracts, graceful escalation. Pick a latency that fits.

pillar.01 · open

Workshops for industry teams

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats. For teams shipping LLM products in healthcare, finance, retail, and government.

fn workshop(team) → resilience
latency: ½ day → 4 weeks
SLA: on-site · scoped to team
Book a workshop
pillar.02 · open

Tools co-built with engineering partners

We work with engineering partners to turn lab work into tools other teams can run.

Evaluation harnesses, safety dashboards, agentic-research platforms. We build them with partners we trust, carrying the research methods through to the code.

fn build(lab, studio) → product
latency: 8 → 24 weeks
SLA: engineering partner co-owned
See our builds
pillar.03 · open

Research partnerships

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes: shared roadmaps, sponsored DPhil studentships, named labs.

fn partner(org) → multi_year
latency: 12 → 36 months
SLA: shared roadmap · named lab
Start a conversation
The lab newsletter :: subscribe(quarterly)

A quarterly note from the lab. Nothing else.

New papers, open positions, partnership opportunities, and what we have been reading.

Unsubscribe in one click. We never share your email.