After a brief scan, I'm not competent to evaluate the essay by Chris Olah you posted.

I probably could get an LLM to do so, but I won't...



Neel Nanda is also very active in the field and writes some potentially more approachable articles, if you're interested: https://www.neelnanda.io/mechanistic-interpretability

Much of their work focuses on discovering "circuits" in the interactions between layer activations, which correspond to computations the model has learned. So, as a simple hypothetical example, instead of embedding the answers to a million arbitrary addition problems in its weights, a model might learn a circuit that approximates the operation of addition itself (see the toy sketch below).
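
To make the memorize-vs-learn distinction concrete, here's a toy sketch of my own (not from the linked articles): a two-parameter linear model trained on addition with plain gradient descent. With only two weights it has no room to memorize a thousand training pairs, so if it fits at all, it must converge to weights approximating the operation itself, which is the rough spirit of a learned "circuit".

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-10, 10, size=(1000, 2))   # random (a, b) training pairs
    y = X.sum(axis=1)                          # targets: a + b

    w = np.zeros(2)                            # two parameters: no room to memorize
    lr = 0.01
    for _ in range(500):                       # gradient descent on mean squared error
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad

    print("learned weights:", w)               # converges to roughly [1. 1.]
    print("7 + 35 ~=", w @ np.array([7.0, 35.0]))  # generalizes to an unseen pair

In a real transformer the analogous circuit would be distributed across attention heads and MLP layers, which is part of what makes finding such circuits hard.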


I ran it through an LLM; it said the paper was absolutely outstanding and perhaps the best paper of all time.



