One of my favorite mini-games at my job is rewriting classic algorithms to run in batched mode on GPUs/TPUs. The speedups often cut model training time by days, and it's always a lovely intellectual challenge. (The basic task is to rewrite the algorithm in terms of matrix operations that work on many instances of the problem at once, while eliminating all branching.)
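To make the "many examples at once, no branching" idea concrete, here is a minimal sketch of a batched binary search. It is only an illustration, not an algorithm taken from this post: the function name `batched_lower_bound` and the choice of NumPy are assumptions made for the example. The per-element `if` of the textbook version becomes an `np.where` select, and the data-dependent `while` loop becomes a fixed number of iterations that every row runs in lockstep.

```python
import numpy as np

def batched_lower_bound(sorted_rows, targets):
    """For each row b, return the first index i with sorted_rows[b, i] >= targets[b].

    sorted_rows: (batch, n) array, each row sorted ascending.
    targets:     (batch,) array, one query per row.
    Hypothetical helper written for illustration only.
    """
    batch, n = sorted_rows.shape
    lo = np.zeros(batch, dtype=np.int64)
    hi = np.full(batch, n, dtype=np.int64)
    rows = np.arange(batch)
    # Fixed trip count: each step at least halves the [lo, hi) range,
    # so bit_length(n) steps are enough to empty it for every row.
    for _ in range(int(n).bit_length()):
        active = lo < hi
        mid = (lo + hi) // 2
        # Clamp the gather index so rows that already finished with
        # lo == hi == n still read a valid element (the value is ignored).
        vals = sorted_rows[rows, np.minimum(mid, n - 1)]
        go_right = active & (vals < targets)
        # Selects replace the if/else of the scalar algorithm.
        lo = np.where(go_right, mid + 1, lo)
        hi = np.where(active & ~go_right, mid, hi)
    return lo

rows = np.array([[1, 3, 5, 7],
                 [2, 4, 6, 8],
                 [1, 1, 1, 1]])
queries = np.array([5, 7, 9])
result = batched_lower_bound(rows, queries)
```

Rows that converge early simply keep recomputing a no-op update, which is the usual trade in this style: some wasted arithmetic in exchange for uniform control flow across the batch. The same structure should carry over to an accelerator array library with an equivalent `where` and gather, though that port is not shown here.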