Inner Alignment

Deception, reward tampering, mesa-optimization, goal misgeneralization, and why learned objectives may diverge from training objectives.

Jun 2026 · Note

To be continued.