The major ML conferences all have pretty tight page limits, so expository sentences are usually the first things to get cut. This also means that papers usually only explain how their work differs from previous work, assuming you are either familiar with the papers they cite or willing to read them.
This means that people with up-to-date knowledge of a given subfield can quickly get a lot out of a new paper. Unfortunately, it also means that it usually takes a pretty decent stack of papers to get up to speed on a new subfield, since you have to read the important sections of the commonly cited papers in order to gain the shared knowledge that new papers are being diffed against.
Traditionally, this issue is solved by textbooks, since the base set of ideas in a given field or subfield is pretty stable. But ML has been moving fast enough in recent years that there is still a sizable gap between the base knowledge required for productive paper reading and what you can get out of a textbook. For example, Goodfellow et al. [1] is a great intro to the core ideas of deep learning, but it was published before transformers were invented, so it doesn't mention them at all.
I think it's worthwhile to distinguish between throughput and latency in these sorts of discussions, rather than just talking about performance: scratchpads are usually better for latency (even best-case latency), while caches are usually better for throughput. Though of course, as in any discussion of computer performance, caveats abound.
I think it's important to note that you are assuming a gradual increase in activity over time as well as adequate food and rest between sessions. In the limit, things like rhabdomyolysis definitely exist, and in the less-extreme case you can run into connective tissue injuries from increasing the load or volume of exercise too fast. Over the long term, you can also have health problems and injuries caused by doing too much exercise relative to how much food and rest you are getting. All of these are real possibilities if someone starts an intense exercise program after 20 years of a completely sedentary adult life.
That said, we are discussing this in the context of walking a bit more, which should be very well tolerated by pretty much everyone who doesn't have other significant health problems, as long as they work up gradually. For example, if you currently take 1,000 steps a day, increase your daily step count by 500-1,000 each week until you reach 10,000; at roughly 750 extra steps per week, that spreads the increase over about three months (and temporarily pause or reverse the increases if you notice anything other than passing muscular discomfort).
This also explains how ultramarathons (and other "extreme" physical activities) can be tolerated without much issue: bodies are very good at adapting to the loads placed on them, as long as the load increases gradually and they get enough food and rest to rebuild after particularly intense bouts of activity.
A further corollary is that being sedentary over the long term is not well tolerated at all, so maintaining a baseline level of physical activity is definitely a good idea. And as long as your body isn't giving you any feedback to the contrary, you shouldn't avoid high loads or intense activity out of an abstract fear of injury.
Jax is a great tool, but it's really best for training and experimentation. The transformations outlined in this post (amongst others) make it easy to turn simple, straightforward code into high-performance parallel code. While this is changing, inference hasn't historically been an area of emphasis for the project, so it wouldn't be my first choice if that were your primary goal.
Yeah, inference is a real weakness. I've used jax2tf to get TF SavedModels and TFLite models, but it was a real journey...
There's some effort going into systems for saving and restoring the computation graph of Jax programs, which will help a lot. I'm surprised it didn't happen sooner, as it seems like quite a natural fit with the Jax architecture.
It really depends on what you want in terms of implementation. On one hand, a big chunk of digital logic is about implementing large, high-speed state machines using circuits.
On the other hand, regular expressions and things like lexers are largely based on finite-state machines, so there is a ton of information related to those, but real implementations often go beyond plain finite-state machines in order to be more expressive.
In terms of implementing them manually yourself, I think the naive implementation based on the mathematical definition of a deterministic finite automaton is pretty good when you have a small number of states, for things like tracking program state. In particular: explicitly list the total set of expected inputs/transition criteria, explicitly list the states, and have a single function that takes the current state and the current transition-relevant data and produces a new state.
This representation is nice because it makes the behavior inspectable in one location, which makes it easier to notice edge cases and prevents the state representation and possible transitions from getting spread all over the code. The transition function can be a switch statement or a table; really, any way of writing a two-input function with a finite number of inputs and outputs will work. Many people have also explored strongly-typed versions of this, which are worth a look as well.
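To make that concrete, here is a minimal sketch of that shape in Haskell (the connection-tracking states and inputs are a made-up example of mine; the point is just the explicit enumeration plus a single transition function):

-- All states and all transition-relevant inputs listed explicitly
-- (hypothetical connection-tracking example).
data State = Idle | Connecting | Connected | Failed
  deriving (Show, Eq)

data Input = Connect | Established | Timeout | Disconnect
  deriving (Show, Eq)

-- A single total function from current state and input to the next state.
-- This could just as easily be a switch statement or a lookup table.
step :: State -> Input -> State
step Idle       Connect     = Connecting
step Connecting Established = Connected
step Connecting Timeout     = Failed
step Connected  Disconnect  = Idle
step s          _           = s  -- inputs that don't apply leave the state unchanged

-- Running the machine over a sequence of inputs is then just a fold.
run :: State -> [Input] -> State
run = foldl step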
The combination of lazy evaluation and state mutation/side effects can be pretty difficult to reason about. For example, if you have a function that changes a global variable as part of a lazy computation, then once that function could have been called, you have no way of knowing if or when that global variable will change in the future. If you have other functions that depend on the value of that variable, their future behavior is now much more challenging to reason about than in a strict language. You can also imagine something akin to a race condition, in which multiple lazy computations could eventually set that variable to different values, and the actual sequence of state transitions depends entirely on the dependency order of a possibly unrelated piece of code. In practice, this means that in languages that are strict by default, lazy computations are often forced to run so that the code can be reasoned about, rather than because the actual results of the computation are required.
Since pure functions compute the same results under lazy or strict evaluation and require that any data dependencies they have are explicitly provided as inputs, they interact with lazy computations in a much more tractable way. This means that adding a strictness operator to a lazy language is much easier than adding a laziness operator to a strict language.
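To illustrate the easy direction in Haskell: `seq` forces its first argument before returning its second, and because the forced expression is pure, forcing it earlier than lazy evaluation would have can change the space/time profile but never the result. A minimal sketch (the sum example is mine):

import Data.List (foldl')

-- Lazy left fold: builds up a chain of unevaluated (+) thunks and only
-- collapses them when the final result is demanded.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- Strict left fold: foldl' uses seq to force each partial sum before
-- recursing. Because (+) is pure, this gives the same answer as lazySum;
-- only the memory behavior differs.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (lazySum [1 .. 100000], strictSum [1 .. 100000])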
An alternate approach is what Python did with generators: there is a data type for lazy computation, but it lives apart from the rest of the language, so it is mostly used for things like stream processing, where a default-lazy approach is conceptually straightforward and is less likely to lead to extremely non-trivial control flow. This approach does, however, basically give up on having a laziness operator that will turn a strict computation into a lazy one.
Funny enough, monoids/semigroups are one of my favorite parts of Haskell and I really miss having an explicit abstraction for them in other languages.
A lot of the code I write is looping over all of the values of a data structure to either check whether a certain condition holds (the any/all boolean monoids), look for an extremal value (the min/max semigroups), or combine all of the values in the data structure (the sum/product/average monoids). The product of two monoids is also a monoid, so if you want to do two or more of those operations at once, the code remains simple and orthogonal while only traversing the structure once.
In particular, all of those operations are nicely expressed as one-liners using foldMap (https://hackage.haskell.org/package/base-4.9.1.0/docs/Data-F...). If you have a data structure called `xs` and each element in the data structure has an integer field called `size`, you can do the following by using foldMap with different Monoid instances.
Sum the sizes:
let (Sum total_size) = foldMap (\x -> Sum (size x)) xs
Check if any size is greater than 5:
let (Any over_threshold) = foldMap (\x -> Any ((size x) > 5)) xs
Do both in one pass:
let (Sum total_size, Any over_threshold) = foldMap (\x -> (Sum (size x), Any ((size x) > 5))) xs
Get the first entry with a size greater than 5:
let (First over_threshold) = foldMap (\x -> First (if (size x) > 5 then Just x else Nothing)) xs
Get all entries with a size greater than 5:
let over_threshold = foldMap (\x -> if (size x) > 5 then [x] else []) xs
Since all of these operations are associative, we can completely change the data structure or run the operations in parallel or concurrently and the code will still behave exactly the same. These nice properties mean that when I'm thinking about these sorts of tasks in any language, I think about what monoid or semigroup captures a given operation, rather than about the mechanics of writing the loop. I find it clarifies my thinking a lot.
But lenses as functional references are very common in the Haskell world, whereas the only place I've heard of anamorphisms (unfolds) being referred to as lenses is in this paper. It also confused me quite a bit when I first ran into it, but the name just comes from the notation used in the paper. It is super hard to google for other references to anamorphisms being called lenses for short though because of yet another unrelated concept: https://en.wikipedia.org/wiki/Anamorphic_format
This is a great paper, but I found it very challenging to read when I first ran into it, even though I had some experience programming in Haskell. As a one-sentence pitch: this paper is to recursion what `if` and `while` are to `goto`. For a more detailed intro to the concepts see: https://blog.sumtypeofway.com/posts/introduction-to-recursio... If you are comfortable with Haskell, it should be relatively accessible. If you aren't, but want a bit more flavor than my one sentence, the intro section of that blog post is still worth a read.
I also remember struggling with this paper quite a bit at first. This is how I would explain catamorphisms to past me:
Suppose you have a recursive `tree` data structure, and another one `tree' x` where `tree' x` is just like `tree`, but with recursive occurrences of `tree` replaced with type x. So maybe something like, in Haskell-ish syntax:
type tree = Leaf int | Node tree tree
type tree' x = Leaf int | Node x x
Then one can easily write a `fold` over trees with type
fold: tree -> (tree' x -> x) -> x
where the second parameter (called an algebra) evaluates the result over a tree' assuming recursive subtrees have been replaced with the result of their evaluation (this is the key).
Now the beauty of recursion schemes is that whenever `tree'` is a functor, you can get `tree` (as a fixpoint: `tree = tree' tree`) and `fold` for free (generally called `cata`), which can be seen by rewriting `fold` above using just `map`.
This skips a fair amount of boilerplate code you no longer have to touch when modifying recursive types.
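To make the "for free" part concrete, here is a minimal sketch of the fold you would otherwise write by hand for this tree (capitalized names so it compiles as real Haskell; `foldTree` and `sumLeaves` are my names for the pieces):

data Tree    = Leaf Int | Node Tree Tree
data TreeF x = LeafF Int | NodeF x x

-- The hand-written fold: fold the subtrees first, then hand the algebra
-- a TreeF whose recursive positions hold the subtree results.
foldTree :: (TreeF x -> x) -> Tree -> x
foldTree alg (Leaf i)   = alg (LeafF i)
foldTree alg (Node l r) = alg (NodeF (foldTree alg l) (foldTree alg r))

-- An example algebra: summing the leaves.
sumLeaves :: Tree -> Int
sumLeaves = foldTree alg
  where
    alg (LeafF i)   = i
    alg (NodeF l r) = l + r

The generic `cata` in the follow-up comments plays the role of `foldTree`, written once for any functor rather than once per recursive type (modulo the `Fix`/`toTreeF` detail discussed below).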
Minor nitpick, you don't get `fold` as written for free. You would still need to provide a function `toTreeF :: tree -> tree'` then compose that with `cata`, which you do get for free: `fold t alg = cata alg (toTreeF t)`.
Sorry my last comment was sort of wrong because I neglected to mention Fix (I'm going to claim that I was tired :) ).
They say a picture is worth a thousand words, so hopefully this clears up the difference between `out` and `toTreeF`:
{-# LANGUAGE DeriveFunctor, LambdaCase #-}
-- | Our basic binary tree type
data Tree a = Leaf a | Node (Tree a) (Tree a)
-- | A pattern functor for our binary tree
data TreeF a r = LeafF a | NodeF r r
  deriving Functor
-- | The fixed point of 'f'.
newtype Fix f = Fix { unfix :: f (Fix f) }
-- | An F-algebra on 'f' collapses one layer of 'f a' into an 'a'.
type Algebra f a = f a -> a
-- | Given an F-algebra on 'f' we have a catamorphism on 'Fix f'
cata :: Functor f => Algebra f a -> Fix f -> a
cata alg fix = alg $ fmap (cata alg) $ unfix fix
-- | We can sum the elements of a 'Fix (TreeF Int)' using 'cata'
sumTreeF :: Fix (TreeF Int) -> Int
sumTreeF = cata $ \case
  LeafF i   -> i
  NodeF l r -> l + r
-- | To operate on our binary tree we need to map from 'Tree a' to
-- 'Fix (TreeF a)'.
toTreeF :: Tree a -> Fix (TreeF a)
toTreeF = \case
  Leaf a   -> Fix (LeafF a)
  Node l r -> Fix (NodeF (toTreeF l) (toTreeF r))
sumTree :: Tree Int -> Int
sumTree = sumTreeF . toTreeF
I didn't compile this code but it should be correct.
out :: Fix (TreeF a) -> TreeF a (Fix (TreeF a))
toTreeF :: Tree a -> Fix (TreeF a)
Yeah you could do that but now you have thrown away the recursive `Tree` type and are always working in `TreeF`. This is totally fine but is a different design decision from where we started.
If I understand the idea of the linked article correctly, I sometimes do a similar thing in Java by defining an Iterator that implements some sort of traversal over a recursive data structure, and then applying whatever operations I want to the returned items using a for loop, an anonymous function applied to each item, etc.
From some of the sibling comments, it seems like there is a strong connection between these recursion schemes and iterator strategies?
so it's structured recursion, i.e. it reifies recursion into values that can then be passed around and modified instead of being implicit/embedded in the control flow?
if so, how is that different from Observable streams or event emitters?
The .NET Rx Observable streams were designed by Erik Meijer, the first author of this paper. The GoF book came up with the "Observer" pattern, but failed to spot and exploit the duality with iterators. Rx and LINQ both take ideas from functional programming.
While the first author of this paper went on to be heavily involved in Microsoft's Reactive Extensions (among many other things), I think it's better to think of it as making common recursive patterns explicit and introducing abstractions for those patterns. For example, a common use of goto is to run a certain block of code until a condition becomes true. This is abstracted into a pattern called "while", which takes a condition and a block of code to run. A common use of recursion is the reduce (also called fold) operation, which, in the case of a list, computes a single value by combining all of the elements of the list with a binary function. As an example, this code sums all of the values in an integer list:
def sum(xs):
    if xs == []:
        return 0
    else:
        x = xs.pop(0)
        return x + sum(xs)
This sort of reduction operation is very broadly useful, so it has been abstracted in frameworks like MapReduce as well as library functions like functools.reduce (https://docs.python.org/3/library/functools.html#functools.r...). Recursion schemes build on explicit reduce functions by being strongly typed (which, in addition to reducing bugs, enables something like a super-powered visitor pattern), very orthogonal (which reduces redundancy and code duplication for users of the abstraction), and very general (which lets you solve a lot of problems with the same small set of programming tools without needing to remember special cases). In particular, the way they look at data structures lets you interleave the recursion across nested data structures in a way that would be a huge pain in the butt with e.g. Python's __iter__ interface. There are some other nifty things that the approach brings to the table as well, but I think those are the major wins from the perspective of someone not already interested in strongly-typed functional programming.
While I covered the case of things that are sort of like reduce (called catamorphisms in the language of recursion schemes), this paper also has analogous abstractions for things that are sort of like itertools.accumulate (https://docs.python.org/3/library/itertools.html#itertools.a..., called anamorphisms in recursion schemes), as well as combinations thereof (called hylomorphisms). They all use a relatively small number of building blocks and combine in a quite beautiful and useful way, but it's hard to describe precisely without leaning quite heavily on the language of strongly-typed functional programming.
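For the list case specifically, the anamorphism already exists in Haskell as `Data.List.unfoldr`, and composing an unfold with a fold gives the hylomorphism shape. A small sketch (the countdown/factorial example is mine):

import Data.List (unfoldr)

-- Anamorphism on lists: grow a structure outward from a seed.
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n <= 0 then Nothing else Just (n, n - 1))
-- countdown 5 == [5,4,3,2,1]

-- Catamorphism on lists: collapse the structure back down to a value.
product' :: [Int] -> Int
product' = foldr (*) 1

-- Hylomorphism shape: unfold then fold (here, factorial).
factorial :: Int -> Int
factorial = product' . countdown
-- factorial 5 == 120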
[1] https://www.deeplearningbook.org/