MIT AI Safety Fundamentals Notes

These are my notes and takeaways from the MAIA Fellowship, covering recent papers on AI alignment, security, and evaluations. I am keeping this page as a central map for the sequence, so each week can stay readable on its own while still fitting into the larger question I keep circling: what would it mean to make advanced AI systems safer before the stakes become too high to improvise?

I am excluding sessions 0 and 8 here because I want this page to focus on the core arc of the fellowship. Sessions 1 through 7 move from forecasting and alignment foundations toward threat models, oversight, interpretability, evaluations, governance, and liability.

Transformative AI and Current Trajectory

Scaling drivers, capability trends, and time-horizon forecasts for thinking about whether AGI-like systems may arrive sooner than institutions expect.

Outer Alignment

Reward misspecification, specification gaming, RLHF, and the gap between intended objectives and operationalized training signals.

Inner Alignment

Deception, reward tampering, mesa-optimization, goal misgeneralization, and why learned objectives may diverge from training objectives.

Threat Models

Instrumental convergence, power-seeking, bioterrorism, cyberwarfare, and gradual disempowerment as different ways AI systems could create risk.

Control and Scalable Oversight

AI control, resampling, monitoring, weak-to-strong generalization, debate, and oversight strategies for systems humans cannot fully inspect unaided.

Interpretability and Evals

Attribution graphs, linear probes, capability evaluations, propensity evaluations, and alignment auditing as tools for making model behavior legible.

AI Governance and Liability

Tort law, compute governance, export controls, China, institutional accountability, and the regulatory toolbox for governing frontier AI.

Across the sequence, I am especially interested in how technical alignment questions and governance questions keep folding into each other. Things can go wrong because a model's objective is poorly specified, because its learned goal is not what we thought, because oversight is too weak, or because the surrounding institutions cannot respond quickly enough. The fellowship makes those failure modes feel less like separate topics and more like one connected system.

The notes are still evolving, but this page gives me a stable index for the main arc: trajectory, alignment, threat models, oversight, interpretability and evaluations, and governance.