To be continued.
Memory Palace
Inner Alignment
Deception, reward tampering, mesa-optimization, goal misgeneralization, and why learned objectives may diverge from training objectives.