The hard problems
of AI safety are unsolved.

Alignment, interpretability, oversight: the technical
foundations that must exist before AI can be scaled safely.

Powerful AI systems can fail in ways that aren't visible from their outputs. Understanding those failure modes is the work.

We work across the core technical problems: understanding model internals, improving alignment methods, stress-testing systems before deployment, and building oversight mechanisms that stay effective as models become more capable.

Mechanistic Interpretability

Finding the circuits and features that implement model behavior

Models implement their computation in specific, identifiable structures: attention heads, MLP layers, residual stream components. We trace them. Not by testing outputs, but by running experiments on model internals: activation patching, sparse autoencoders, direct logit attribution. The goal is to be able to read what a model is doing, not just observe what it says.
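To make activation patching concrete, here is a minimal sketch on a toy two-layer network. Everything in it is an assumption for illustration: the weights, the inputs, and the helper names (`forward`, `patching_effects`) are invented, and real work targets transformer components rather than three hidden units. The idea it shows is the general one: run the model on a clean input and a corrupted input, copy one internal activation from the clean run into the corrupted run, and measure how much of the clean output is restored.

```python
# Toy activation patching. All weights and inputs are hypothetical.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Invented weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [2.0, 0.0, 0.5]

def forward(x, patch=None):
    """Run the toy model. `patch` maps a hidden index to a value to overwrite."""
    hidden = relu(matvec(W1, x))
    if patch:
        for i, value in patch.items():
            hidden[i] = value
    out = sum(w * h for w, h in zip(W2, hidden))
    return out, hidden

clean_x = [1.0, 0.0]    # input on which the behavior of interest appears
corrupt_x = [0.0, 1.0]  # input on which it does not

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

def patching_effects():
    """Fraction of the clean-vs-corrupt output gap restored by each hidden unit."""
    effects = []
    for i in range(len(clean_hidden)):
        patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
        effects.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    return effects

print(patching_effects())
```

In this toy setup, patching hidden unit 0 restores the whole output gap and the other two units restore none of it, which is the kind of evidence used to say that a component carries the behavior. Unit 2 contributes to the output but takes the same value on both inputs, so patching it changes nothing: patching measures what differs between the two runs, not overall importance.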

Alignment Research

Getting models to do what we intend, not what we specify

Reward hacking, goal misgeneralisation, and sycophancy are not edge cases — they are predictable failures of models that optimise for measurable proxies instead of actual intent. We study these failure modes structurally, and work on training methods, oversight mechanisms, and evaluation tools that are harder to game.
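The gap between a measurable proxy and actual intent can be shown with a deliberately tiny sketch. The candidate responses and both sets of scores below are invented for illustration; `proxy` stands in for whatever a reward signal measures and `true` for what the designer actually wanted, a quantity that is usually not directly observable in practice.

```python
# Toy illustration of optimising a proxy instead of intent.
# All names and scores are hypothetical.

candidates = [
    {"name": "honest_answer",     "proxy": 0.70, "true": 0.90},
    {"name": "flattering_answer", "proxy": 0.95, "true": 0.40},
    {"name": "padded_answer",     "proxy": 0.80, "true": 0.50},
]

def best_by(key):
    """Name of the candidate that maximises the given score."""
    return max(candidates, key=lambda c: c[key])["name"]

def regret():
    """True value lost by selecting on the proxy rather than on intent."""
    chosen = max(candidates, key=lambda c: c["proxy"])
    ideal = max(candidates, key=lambda c: c["true"])
    return ideal["true"] - chosen["true"]

print(best_by("proxy"), best_by("true"), regret())
```

Selecting on the proxy picks the flattering answer while selecting on intent picks the honest one. Stronger optimisation pressure on the proxy makes this kind of divergence more likely, which is why we treat these failures as predictable rather than incidental.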

Dangerous Capability Evaluation

Knowing what a model can do before you deploy it

Capability evaluations try to answer a specific question: can this model provide meaningful uplift to someone trying to cause harm, whether through biological threats, cyberattacks, or autonomous operation? We work on evaluation methodology that goes beyond red-teaming prompts, including elicitation techniques that test the upper bound of what a model can do.

Scalable Oversight

Supervising models that know more than the people supervising them

As models become more capable, human oversight becomes less reliable — not because humans stop paying attention, but because the gap between what the model knows and what the supervisor can verify widens. Debate, amplification, and weak-to-strong generalisation are technical attempts to solve this. We study which approaches actually work and under what conditions.

India–UK Research Corridor

Technical collaboration across two nodes of the global safety field

We run joint research programmes, visiting fellowships, and working sessions between the UK AI safety cluster and India's technical talent base. The goal is a wider field with shared methodology and shared results, not parallel silos.

Join the technical community.

We are open to research collaborations, reading groups, and visiting fellowships for technical researchers.

Get Involved
