You've got to be careful with PDFs: we can't see how they're rendered internally for the LLM, so there may be differences in how it treats margins/gutters/bleeds that we ought to account for, but can't.
"Make this better in a loop" is less powerful than using evolution on a population. While it may seem like evolution is just single steps in a loop, something qualitatively different occurs due to the population dynamics - since you get the opportunity for multiple restarts / interpolation (according to an LLM) between examples / and 'novelty' not being instantly rejected.
I think the "Aha" is that the RL caused it to use an anthropomorphic tone.
One difference from the initial step is that the second pass includes both the initial step and the 'aha' comment in the context: it is, after all, just doing token-wise LLM prediction.
OTOH, the RL process means it has potentially learned the impact of the statements it emits on the success of its subsequent generation. That self-direction takes it somewhat beyond vanilla-LLM pattern mimicry, IMHO.
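Schematically, each round is just next-token prediction over a growing transcript (`llm_generate` is a hypothetical completion call, stubbed so the sketch runs):

```python
def llm_generate(transcript: str) -> str:
    """Hypothetical completion call; a real one would hit a model API.
    Stubbed with a canned string so the sketch runs."""
    return "<model continuation>"

# Round 1: the model sees only the problem statement.
transcript = "Problem: ...\n"
step1 = llm_generate(transcript)

# The model's own output, 'aha' comment included, becomes ordinary context.
transcript += step1 + "\nAha, wait. Let me reconsider.\n"

# Round 2: plain token-wise prediction over the whole transcript,
# initial step and aha comment included.
step2 = llm_generate(transcript)
print(step2)
```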
Google Colab gives you a free GPU (usually a 16 GB T4) preloaded with frameworks, ready to run. Later you might be tempted by the Pro(+) tiers, but there's plenty of scope to move up the learning curve before spending any money.
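If you want to confirm what you've been allocated, a cell like this works in a stock Colab runtime (PyTorch is preinstalled there):

```python
import torch

# Reports the allocated accelerator on the free tier (typically a Tesla T4).
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{name}: {mem_gb:.1f} GB")
else:
    print("No GPU allocated - check Runtime > Change runtime type.")
```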
I should check that out. JetBrains just integrated remote management for code and notebooks into their IDEs, and this seems like the perfect way to test it. Thanks for the tip!