
I appreciate the explanation, but I don't know which junior dev would understand most of this. I may be just a web developer, but I couldn't follow most of it; I'd still have to read for 30m to grok it all.


Yeah sorry, it still requires math and probably some exposure to ML basics.


I think one hole in the description, for the sake of simplicity, is that "differentiable" isn't an adjective that applies to hash tables.

Differentiable relative to what? What is (x) in the d(hashtable)/d(x) equation?


I think it applies because lookups can be done by multiplying one-hot vectors (or masks) with matrices; this is roughly analogous to what happens when we multiply Q with K^T in a self-attention head.

Read this: https://e2eml.school/transformers.html#table_lookup

And then read: https://e2eml.school/transformers.html#attention

Matrix multiplication is differentiable: it's just sums and products, so you can calculate partial derivatives of these operations. The ability to do that is what allows gradient descent optimization via the chain rule.
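
Toy example (my own, in PyTorch, not from the article): a hash-table-style lookup written as a one-hot vector times a matrix of values. Because it's just a matmul, autograd can push gradients through it.

  import torch

  # A "table" of 4 values, each a 3-dim vector (these play the role of the
  # hashtable's values, and roughly of V in attention).
  table = torch.randn(4, 3, requires_grad=True)

  # A one-hot "key" selects row 2 of the table via a plain matrix multiply.
  key = torch.tensor([0., 0., 1., 0.])
  value = key @ table        # same result as table[2], but written as a matmul

  # Since the lookup is just sums and products, we can take derivatives of a
  # downstream loss with respect to the table entries.
  loss = value.sum()
  loss.backward()
  print(table.grad)          # non-zero only in row 2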

  > Differentiable relative to what?
  > What is (x) in the d(hashtable)/d(x) equation?
I think the derivative we actually care about is `∂L/∂(hashtable-lookup)`, but here the `hashtable-lookup` is actually the matrix multiplication mentioned above. We want to know how much the "loss" (or error) changes with respect to each of our "lookups". Knowing how each of our "lookups" causes our model to err from the output we expect allows us to optimise it.

Note: it's not really a "lookup" in the normal sense of the word. Hashtable lookups are discontinuous, since we either have a value for a particular key or we do not. Our matrix multiplication is more like a fuzzy, probabilistic, continuous form of lookup, in which nearby keys get partially activated based on their similarity to the query, and in which a small change in the query causes continuous changes in the output.
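
A toy illustration of the difference (my own sketch, made-up names): a hard argmax lookup versus a softmax-weighted one, with the `∂L/∂(lookup)`-style gradient flowing back through the soft version.

  import torch

  keys = torch.randn(4, 8)                  # 4 stored keys
  values = torch.randn(4, 3)                # 4 stored values
  query = torch.randn(8, requires_grad=True)

  # Hard lookup: pick the single most similar key. Discontinuous, because a
  # tiny change in the query can flip the argmax to a different row.
  hard = values[torch.argmax(keys @ query)]

  # Soft lookup: every key gets a weight based on its similarity to the
  # query, and the result is a weighted mix of all the values.
  weights = torch.softmax(keys @ query, dim=-1)
  soft = weights @ values

  # The gradient of a loss flows back through the soft lookup to the query
  # (and would flow to the keys/values too, if they required grad).
  loss = soft.sum()
  loss.backward()
  print(query.grad)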

As far as my understanding of the self-attention equation (e.g. `softmax(QK^T / sqrt(d_k))V`) goes, it's actually quite important that we get this fuzzy output in which lots of keys get partially activated for a particular query. If it only picked the maximum similarity and ignored the rest, there would be less information propagating through the network and it'd be harder for the network to learn relationships/interactions between inputs. This is also why we divide `QK^T` by `sqrt(d_k)`: it passes a tighter range of values into the `softmax()` function (which importantly generates probabilities that sum to 1, but contains exponentials that tend to over-emphasize the maximum value and ignore the others if the inputs are too spread out).
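
Sketch of that equation (my own toy shapes, nothing from the article), including what the `1/sqrt(d_k)` factor does to the softmax weights:

  import math
  import torch

  d_k = 64
  Q = torch.randn(5, d_k)              # 5 queries
  K = torch.randn(5, d_k)              # 5 keys
  V = torch.randn(5, 16)               # 5 values

  scores = Q @ K.T                     # raw similarities, typical size ~sqrt(d_k)
  scaled = scores / math.sqrt(d_k)     # tighter range going into softmax

  # Unscaled scores tend to make softmax put almost all the weight on the
  # single largest score; scaled scores keep the weights spread out.
  print(torch.softmax(scores, dim=-1)[0])
  print(torch.softmax(scaled, dim=-1)[0])

  attention = torch.softmax(scaled, dim=-1) @ V   # softmax(QK^T / sqrt(d_k))V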


That's exactly the point, though! It's surprising. A hashtable is a map from keys to values. Making it differentiable means that a small change in the key also makes a small change in the value!


Differentiable relative to the model parameters. The attention mechanism does store(key(x), value(x)), followed by lookup(query(y)), where key(), value(), lookup() and query() are all designed to be differentiable.
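
Rough sketch of what that looks like (the `W_*` names are my own, not from the article): keys, values and queries come from learned projection matrices, and the loss is differentiable with respect to those parameters.

  import math
  import torch

  d_model, d_k = 16, 8
  x = torch.randn(6, d_model)          # the inputs we "store"
  y = torch.randn(6, d_model)          # the inputs we look up with

  # The model parameters everything is differentiable with respect to.
  W_k = torch.randn(d_model, d_k, requires_grad=True)
  W_v = torch.randn(d_model, d_k, requires_grad=True)
  W_q = torch.randn(d_model, d_k, requires_grad=True)

  K, V = x @ W_k, x @ W_v              # store(key(x), value(x))
  Q = y @ W_q                          # query(y)
  out = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1) @ V   # lookup(query(y))

  out.sum().backward()
  print(W_q.grad.shape, W_k.grad.shape, W_v.grad.shape)       # all get gradients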



