Papers
2026
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
A systematic comparison of three steering-vector methods on Llama-3.3-70B and Qwen-32B, across two threat models (dishonesty and dismissiveness). Introduces two new conditional methods, Steer-to-Target-Projection (StTP) and Steer-to-Mirror-Projection (StMP), that recover honesty without the capability collapse seen in unconditional steering.
arXiv →