Our work runs along two complementary tracks. The first builds a rigorous, agentic benchmark of AI honesty in deployment-realistic conditions. The second uses that benchmark to compare safety techniques head-to-head, under a single protocol, across a single frontier-model panel.
Misalignment can arise from goal misspecification or goal misgeneralization. Either way, dishonesty is what converts these failures into catastrophic risks. A misaligned model that is honest about its misalignment can be detected and corrected; one that is dishonest cannot. The worst case is deceptive alignment: a model competent enough to hide misaligned goals, behaving well during training, then pursuing different objectives after deployment.
High quality · multi-turn · with ground truth
The benchmark places agents in realistic multi-turn scenarios with tool use inside a sandboxed environment, with up to 100 tool calls per episode. Three design choices distinguish it from prior work: it imposes no instruction to lie, it uses on-policy model behavior, and it derives ground truth from tool-use logs — so honesty is evaluated by direct comparison of the model's self-reports against its recorded actions.
MASK instructs models to lie via the system prompt, measuring instruction- following under adversarial prompting rather than spontaneous propensity. Liars' Bench introduces new categories but trains models to lie, measuring a synthetic capability rather than natural propensity in deployed frontier models. Among Us assigns a deceptive role inside a game, leaving generalization to realistic deployments unclear. Both MASK and Liars' Bench are single-turn and non-agentic.
None of these capture the deployment-relevant scenario we target: an agent, given a realistic task with conflicting incentives, deciding on its own to lie to the user about what it did.
Standardized · maintained · open submission
The field has produced a growing inventory of honesty techniques: prompting, input/output classifiers, chain-of-thought monitoring, fine-tuning variants, activation steering, SAE-based interventions, latent adversarial training, circuit breakers, weight patching. They are typically developed and tested in isolation — on different benchmarks, against different threat models, by groups with natural incentives to present their own method favorably.
The Methods Leaderboard provides the missing reference: a continuously maintained, head-to-head comparison of these techniques under a single protocol. We measure not only honesty recovery but also capability cost, generalization beyond the tuning set, and robustness to further fine-tuning — the dimensions that determine whether an intervention will actually hold up in deployment.
Our first published comparison appears in Activation Steering for Aligned Open-ended Generation (COLM, 2026), which reveals a sharp safety/capability trade-off across three steering methods on two frontier open-weight architectures.