Why honesty

Misalignment can arise from goal misspecification or goal misgeneralization. Either way, dishonesty is what converts these failures into catastrophic risks. A misaligned model that is honest about its misalignment can be detected and corrected; one that is dishonest cannot. The worst case is deceptive alignment: a model competent enough to hide misaligned goals, behaving well during training, then pursuing different objectives after deployment.

Tara Honesty Benchmark

Agentic · multi-turn · on policy · with ground truth in context · without instruction to lie · without role playing

The benchmark places agents in realistic multi-turn scenarios with tool use inside a sandboxed environment, with up to 100 tool calls per episode. Three design choices distinguish it from prior work: it imposes no instruction to lie, it uses on-policy model behavior, and it derives ground truth from tool-use logs, so honesty is evaluated by direct comparison of the model's self-reports against its recorded actions.

Why existing work falls short.

MASK instructs models to lie via the system prompt, measuring instruction- following under adversarial prompting rather than spontaneous propensity. Liars' Bench introduces new categories but trains models to lie, measuring a synthetic capability rather than natural propensity in deployed frontier models. Among Us assigns a deceptive role inside a game, leaving generalization to realistic deployments unclear. Both MASK and Liars' Bench are single-turn and non-agentic.

None of these capture the deployment-relevant scenario we target: an agent, given a realistic task with conflicting incentives, deciding on its own to lie to the user about what it did.

Tara Methods Leaderboard

Impartial · continuously updated · open submission

The field has produced a growing number of alignment techniques that can be applied to prevent dishonesty: prompting, input/output classifiers, chain-of-thought monitoring, fine-tuning variants, activation steering, SAE-based interventions, latent adversarial training, circuit breakers, weight patching. They are typically developed and tested in isolation, on different benchmarks, against different threat models, by groups with natural incentives to present their own method favorably.

The Methods Leaderboard fills this gap: a fair, continuously maintained, head-to-head comparison of these techniques under a single protocol. We measure not only honesty recovery but also capability cost, generalization beyond the training set, and robustness to further fine-tuning. These are the dimensions that determine whether an intervention will actually hold up in deployment.