2026
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
A systematic comparison of three steering-vector methods on Llama-3.3-70B and
Qwen-32B, across two threat models (dishonesty, dismissiveness). Introduces
two new conditional methods, Steer-to-Target-Projection (StTP) and
Steer-to-Mirror-Projection (StMP), that recover honesty without the capability
collapse seen in unconditional steering.
arXiv →