Benchmarks and Evaluation
We work on the science of LLM evaluation: what benchmarks measure, where they mislead, and how to build ones that hold up.
- Benchmark design
- Statistical evaluation
- Capability elicitation
- Contamination audits
Reasoning with Machines Lab is an Oxford research lab. We study how language models and agentic systems behave under pressure, and how to deploy them where the stakes are real.
We work on bias, toxicity, and agentic misalignment, and on the technical and governance tools that address them.
We work on agentic systems for science, with a focus on keeping them reliable, transparent, and grounded in the domain.
Two ways to work with us: third-party evaluation of your models and agents, or a focused engagement that turns one of our research outputs into a tool you own.