
Neel Nanda is also very active in the field and writes some potentially more approachable articles, if you're interested: https://www.neelnanda.io/mechanistic-interpretability

Much of their work is focused on discovering "circuits" in the interactions between layer activations as a model processes data; these circuits correspond to computations the model has learned. So, as a simple hypothetical example, instead of embedding the answers to a million arbitrary addition problems in its weights, a model might learn a circuit that approximates the operation of addition.
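
To make that hypothetical concrete, here is a minimal sketch (my own toy example, not taken from Olah's or Nanda's work) of how you might distinguish "memorized a lookup table" from "learned an addition-like circuit": train a tiny PyTorch network on a subset of addition problems and check whether it generalizes to held-out pairs it never saw. Low error on unseen pairs suggests the network has learned something algorithm-like rather than memorizing each answer. The numbers (operand range, hidden size, training steps) are arbitrary choices for illustration.

    # Toy sketch: does a small network memorize addition or learn a general rule?
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    N = 50  # operands range over 0..N-1
    pairs = [(a, b) for a in range(N) for b in range(N)]
    perm = torch.randperm(len(pairs))
    split = int(0.7 * len(pairs))
    train_idx, test_idx = perm[:split], perm[split:]

    def encode(idx):
        # Scale operands to [0, 1) and targets to [0, 1) so training is stable.
        ab = torch.tensor([pairs[i] for i in idx], dtype=torch.float32)
        x = ab / N
        y = ab.sum(dim=1, keepdim=True) / (2 * N)
        return x, y

    x_train, y_train = encode(train_idx)
    x_test, y_test = encode(test_idx)

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        opt.step()

    with torch.no_grad():
        test_loss = loss_fn(model(x_test), y_test).item()
    # Low MSE on pairs the model never trained on is evidence it learned
    # something like an addition circuit rather than a per-example lookup.
    print(f"held-out MSE: {test_loss:.6f}")

Mechanistic interpretability then goes a step further than this sketch: rather than just observing generalization, it inspects the weights and activations directly to describe the circuit the model is actually using.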
