Neel Nanda On Avoiding An

Media Summary:

When Anthropic tested Claude Sonnet 4.5 for alignment, the model appeared perfectly behaved, but it turned out the model had ...

Warning: This is an ad-libbed talk, and I'm sure I got some facts wrong. This is a talk I gave to my MATS 9.0 training program on ...

We don't know how AIs think or why they do what they do. Or at least, we don't know much. That fact is only becoming more ...


PART 1, a comprehensive update on mechanistic interpretability: At 26, ...

This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability, as I see it (with a focus ...

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at ...

This is a talk I gave to my MATS 9.0 training scholars about the big picture of mech interp: as of Oct 2025, what had changed?