
Paul Colognese
Biography
Title: Goal Detection for AI Catastrophe Prevention
Abstract: As AI agents advance in capability, they promise immense benefits but also pose significant risks. If we build powerful AI agents that pursue misaligned goals, the consequences could be catastrophic. Training AI to consistently pursue aligned goals is challenging for several reasons, including the potential for deceptive AI agents to actively mislead oversight processes. This talk explores whether goals might be represented in the computational substrate running the agent and how we might leverage these representations for goal oversight, thereby contributing to the prevention of potential AI-driven catastrophes.
