Reasoning with Machines Lab
@ University of Oxford

Led by Prof. Adam Mahdi, the Reasoning with Machines Lab advances the science of AI evaluation, benchmarking, safety and security. Through rigorous empirical research, we study how LLMs and agentic systems reason, interact with humans, and drive scientific discovery. We work with industry partners, both teams and enterprises, that deploy AI where reliability matters.

Read our research

Engage with OxRML

Papers: 10; open access
Themes: 4; open access
Team: 15; + PI

Evaluation Safety Agentic Human-AI

Three doors in. Pick the one that fits.

Public science is the product. Workshops and partnerships exist so the lab can keep producing it.

For individuals

Open

Read, cite, fork, and ship anything we publish. Built for researchers, students, hobbyist evaluators, and engineers who want the source. No license gymnastics, no signups, no quota.

10 peer-reviewed publications, open access
All 4 research themes
Benchmarks & datasets, fully documented
Prompts & eval rigs in plain text
Lab reading list (quarterly digest)
The work of all 15 researchers

Start reading

arXiv, OpenReview, and Nature links go direct: read, cite, fork, ship.

For teams

Most common

Workshop

On-site sessions for product and ML teams on evaluation, safety, and agent reliability.

Half-day to multi-week formats
Evaluation, safety, and agent reliability
Designed for product & ML teams shipping LLMs
On-site or in Oxford, your call
Tools co-built with engineering partners, available on request
Slack channel with the researchers

Book a workshop

Booked by teams in healthcare, finance, retail, and government that ship LLM products into high-stakes settings.

For enterprises

Multi-year

Partner

Applied research collaborations with foundations, governments, and large companies.

Multi-year programmes with shared roadmaps
Dedicated DPhil studentships
Named labs & co-authored publications
Tools co-built with engineering partners, production-grade
Foundations, governments & global corporates
Quarterly research reviews on site

Start a conversation

We work with partners who care about getting AI right, and we take on only a small number each year.

How it works. Individual researchers read the work, build with it, and bring it to their team. When the team needs to scale a benchmark, run an evaluation cohort, or audit an agent at production load, it books a workshop. When the workshop turns into a three-year roadmap, it becomes a partnership. We built it in this order on purpose.

Research Themes

The questions the lab is built around. If your day job touches any of them, you're in the right place.

Theme 01: evaluation
Benchmarks and Evaluation
We develop the science of LLM evaluation, setting the standard for rigorous assessment and identifying hidden risks before they matter.
Read papers in this theme →
Theme 02: safety
AI Safety and Security
From bias and toxicity to agentic misalignment, we study the full spectrum of AI risk and develop the technical and governance tools to address it.
Read papers in this theme →
Theme 03: agentic
Agentic AI for Science
We build agentic systems that automate scientific knowledge synthesis and discovery, with a focus on agents that are reliable, transparent and domain-grounded.
Read papers in this theme →
Theme 04: human-ai
Human–AI Interaction
We run large-scale empirical studies on how people use AI for high-stakes decisions, from healthcare and law to policy and beyond.
Read papers in this theme →

10 papers. Read any of them right now.

ICML spotlights, Nature Medicine, NeurIPS Datasets & Benchmarks, ICLR, EMNLP. Click through and you're on arXiv or OpenReview, not a request form.

All links resolve directly to the venue.

The 16 people who do the work.

A small team in Oxford. We write the papers, tag the datasets, and answer the workshop emails. When you book the lab, you get these people, not an account exec.

15 researchers + 1 PI

Principal Investigator

Prof. Adam Mahdi

Oxford Internet Institute, University of Oxford

Adam leads OxRML. The group studies how language models reason, how people work with them, and how agentic systems behave on real scientific and decision-making tasks.