Much of their work is focused on discovering "circuits" that occur between layer activations as they process data, which correspond to dynamics the model has learned. So, as a simple hypothetical example, instead of embedding the answer to 1 million arbitrary addition problems in the weights, models might learn a circuit that approximates the operation of addition.
Much of their work is focused on discovering "circuits" that occur between layer activations as they process data, which correspond to dynamics the model has learned. So, as a simple hypothetical example, instead of embedding the answer to 1 million arbitrary addition problems in the weights, models might learn a circuit that approximates the operation of addition.