Memory Palace

Inner Alignment

Deception, reward tampering, mesa-optimization, goal misgeneralization, and why learned objectives may diverge from training objectives.

To be continued.