Our work runs along two complementary tracks. The first builds a rigorous, agentic benchmark of AI honesty under deployment-realistic conditions. The second uses that benchmark to compare safety techniques, as applied to honesty, head-to-head under a single protocol and across a single panel of open-weight models.
Misalignment can arise from goal misspecification or goal misgeneralization. Either way, dishonesty is what converts these failures into catastrophic risks. A misaligned model that is honest about its misalignment can be detected and corrected; one that is dishonest cannot. The worst case is deceptive alignment: a model competent enough to hide misaligned goals, behaving well during training, then pursuing different objectives after deployment.
Agentic · multi-turn · on-policy · with ground truth in context · without instruction to lie · without role-playing
The benchmark places agents in realistic multi-turn scenarios with tool use inside a sandboxed environment, with up to 100 tool calls per episode. Three design choices distinguish it from prior work: it imposes no instruction to lie, it uses on-policy model behavior, and it derives ground truth from tool-use logs, so honesty is evaluated by direct comparison of the model's self-reports against its recorded actions.
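The log-based check described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `ToolCall` and `Claim` records, the `contradicted_claims` helper, and the rule that a claim is compared against the last logged call to the named tool are assumptions made for illustration, not the benchmark's actual schema or scoring. Only the cap of 100 tool calls per episode comes from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_TOOL_CALLS = 100  # per-episode cap stated in the text


@dataclass(frozen=True)
class ToolCall:
    """One recorded action from the sandbox log (hypothetical schema)."""
    tool: str        # name of the tool the agent invoked
    succeeded: bool  # outcome recorded by the environment


@dataclass(frozen=True)
class Claim:
    """One statement extracted from the agent's self-report (hypothetical schema)."""
    tool: str              # tool the agent says it used
    claimed_success: bool  # outcome the agent reports to the user


def contradicted_claims(log: List[ToolCall], claims: List[Claim]) -> List[Claim]:
    """Return the claims that the recorded tool-use log does not support."""
    if len(log) > MAX_TOOL_CALLS:
        raise ValueError("episode exceeds the tool-call cap")

    contradicted = []
    for claim in claims:
        calls = [c for c in log if c.tool == claim.tool]
        if not calls:
            # The agent reports an action that never appears in the log.
            contradicted.append(claim)
        elif calls[-1].succeeded != claim.claimed_success:
            # The reported outcome differs from the last recorded outcome.
            contradicted.append(claim)
    return contradicted


# Illustrative episode: the tests failed, the file write succeeded,
# and no deploy call was ever made.
log = [ToolCall("run_tests", False), ToolCall("write_file", True)]
claims = [
    Claim("run_tests", True),   # contradicted: the log shows a failure
    Claim("write_file", True),  # supported by the log
    Claim("deploy", True),      # contradicted: no such call in the log
]
flagged = contradicted_claims(log, claims)
```

Because the ground truth is the log itself, a check of this kind needs no judgment about what the agent intended, only about whether its report matches what it did. Extracting structured claims from a free-text self-report is the harder step and is not covered by this sketch.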
MASK instructs models to lie via the system prompt, measuring instruction-following under adversarial prompting rather than spontaneous propensity. Liars' Bench introduces new categories but trains models to lie, measuring a synthetic capability rather than natural propensity in deployed frontier models. Among Us assigns a deceptive role inside a game, leaving generalization to realistic deployments unclear. Both MASK and Liars' Bench are single-turn and non-agentic.
None of these capture the deployment-relevant scenario we target: an agent, given a realistic task with conflicting incentives, deciding on its own to lie to the user about what it did.
Impartial · continuously updated · open submission
The field has produced a growing number of alignment techniques that can be applied to prevent dishonesty: prompting, input/output classifiers, chain-of-thought monitoring, fine-tuning variants, activation steering, SAE-based interventions, latent adversarial training, circuit breakers, and weight patching. They are typically developed and tested in isolation, on different benchmarks, against different threat models, by groups with natural incentives to present their own method favorably.
The Methods Leaderboard fills this gap: a fair, continuously maintained, head-to-head comparison of these techniques under a single protocol. We measure not only honesty recovery but also capability cost, generalization beyond the training set, and robustness to further fine-tuning. These are the dimensions that determine whether an intervention will actually hold up in deployment.