daily shelf

5 jan

Inner Alignment: the new problem introduced by this paper (linked below) - ensuring that a mesa-optimizer's internal objective (the mesa-objective) aligns with the base objective it was trained under.

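When the two objectives diverge, the result is the pseudo-alignment (or "fake alignment") problem: the mesa-optimizer looks aligned on the training data, but its mesa-objective does not robustly match the base objective.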
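
A toy sketch of what pseudo-alignment can look like, to make the note concrete. This is not from the paper: the corridor setup and the names `Episode`, `base_objective`, `mesa_objective`, and `mesa_policy` are all hypothetical, picked only for illustration. The assumption is that the base objective rewards reaching the exit, while the learned optimizer has instead picked up "go to the green cell" as its mesa-objective. In training the exit is always green, so the two objectives agree; off-distribution they come apart.

```python
from dataclasses import dataclass

N_CELLS = 5  # hypothetical 1-D corridor with 5 cells


@dataclass(frozen=True)
class Episode:
    exit_pos: int   # where the real exit is (what the base objective rewards)
    green_pos: int  # where the green marker is (a proxy feature)


def base_objective(final_pos: int, ep: Episode) -> float:
    """What training actually rewards: ending on the exit."""
    return 1.0 if final_pos == ep.exit_pos else 0.0


def mesa_objective(final_pos: int, ep: Episode) -> float:
    """What the learned optimizer internally pursues (assumed): ending on green."""
    return 1.0 if final_pos == ep.green_pos else 0.0


def mesa_policy(ep: Episode) -> int:
    """The mesa-optimizer searches over final positions using its OWN objective."""
    return max(range(N_CELLS), key=lambda pos: mesa_objective(pos, ep))


def mean_base_score(episodes: list[Episode]) -> float:
    """Average base-objective score achieved by the mesa-optimizer's policy."""
    return sum(base_objective(mesa_policy(ep), ep) for ep in episodes) / len(episodes)


# Training distribution: the exit is always the green cell.
train = [Episode(exit_pos=p, green_pos=p) for p in range(N_CELLS)]
# Shifted distribution: the green cell is never the exit.
test = [Episode(exit_pos=p, green_pos=(p + 1) % N_CELLS) for p in range(N_CELLS)]

print(mean_base_score(train))  # 1.0 -> looks perfectly aligned in training
print(mean_base_score(test))   # 0.0 -> mesa-objective and base objective diverge
```

The point of the sketch is that training performance alone cannot tell the two objectives apart: the policy scores perfectly on the base objective in training while optimizing for something else.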

paper link: https://arxiv.org/pdf/1906.01820