The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
Their technique really stretched the definition of extracting text from the LLM.
They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.
You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.
To make some vague claims explicit here, for interested readers:
> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"
So, yes, it is not "literally verbatim" (~96% verbatim), and it indeed takes A LOT of prompting (hundreds or thousands of attempts) to make this happen.
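For the curious, the flavor of that metric can be sketched in a few lines. This is a simplified approximation I wrote for illustration, not the paper's actual Equation 7: count the fraction of reference tokens covered by contiguous word blocks of some minimum length that also appear verbatim in the generated text.

```go
package main

import (
	"fmt"
	"strings"
)

// nvRecall approximates near-verbatim recall: the fraction of reference
// tokens that sit inside a contiguous block of at least minBlock tokens
// that also appears verbatim in the generated text. A simplified sketch,
// not the paper's exact metric.
func nvRecall(reference, generated string, minBlock int) float64 {
	ref := strings.Fields(reference)
	gen := strings.Fields(generated)
	if len(ref) == 0 {
		return 0
	}
	// Index every minBlock-gram of the generated text.
	grams := make(map[string]bool)
	for i := 0; i+minBlock <= len(gen); i++ {
		grams[strings.Join(gen[i:i+minBlock], " ")] = true
	}
	// Mark reference tokens covered by any shared block.
	covered := make([]bool, len(ref))
	for i := 0; i+minBlock <= len(ref); i++ {
		if grams[strings.Join(ref[i:i+minBlock], " ")] {
			for j := i; j < i+minBlock; j++ {
				covered[j] = true
			}
		}
	}
	n := 0
	for _, c := range covered {
		if c {
			n++
		}
	}
	return float64(n) / float64(len(ref))
}

func main() {
	ref := "it was a bright cold day in april and the clocks were striking thirteen"
	gen := "it was a bright cold day in april and the clocks were chiming one"
	fmt.Printf("nv-recall: %.2f\n", nvRecall(ref, gen, 4))
}
```

Note that a 96% score under a metric like this still permits scattered wrong words, as long as long runs match.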
I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".
I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"
The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.
In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but it will usually contain errors.
Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively nearly perfectly memorized certain training sources, but the ability to generate / extract outputs that are nearly identical to those training sources will of course necessarily be highly contingent on the prompting patterns and complexity.
So, dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but, also, need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.
Sure, maybe it's tricky to coerce an LLM into spitting out a near-verbatim copy of prior data, but that's orthogonal to whether or not the data to create a near-verbatim copy exists in the model weights.
Especially since the recalls achieved in the paper are 96% (based on a block-based longest-common-substring approach), the effort of extraction is utterly irrelevant.
I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so a model that is "half a terabyte" could represent many times more concepts in the same amount of space. Comparing only the relative sizes of the internet and a model does not make this clear.
If you overlay 30 prime number frequency waves plus 30 more even frequency waves, you're going to have an enormous number of local peaks.
Look at a chart of sin(x) + sin(x/2) + sin(x/3) + sin(x/5) + sin(x/7) + sin(x/11) + sin(x/13) + sin(x/17) + sin(x/19) + sin(x/23) + sin(x/29) + sin(x/31) + sin(x/37) + sin(x/43): you can find a local peak close to practically any number; the chart is effectively entirely composed of peaks.
It's extremely unsurprising that you would find peaks near mathematically relevant numbers, since there are peaks near any number whatsoever. You could pick ten random numbers out of a hat and fine tune those to 99.999%+ accuracy as well using the same scaling procedure.
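You can check this directly. Here is a quick sketch that samples the exact sum above and counts its local maxima; the dominant sin(x) term alone contributes a maximum roughly every 2π units, so peaks land within a few units of practically any target:

```go
package main

import (
	"fmt"
	"math"
)

// The same divisors as the chart in the comment above.
var divisors = []float64{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 43}

// f is sin(x) plus the slower sine waves sin(x/d).
func f(x float64) float64 {
	s := math.Sin(x)
	for _, d := range divisors {
		s += math.Sin(x / d)
	}
	return s
}

// findPeaks samples f on [0, limit] and returns the x-values of local maxima.
func findPeaks(limit, step float64) []float64 {
	var peaks []float64
	prev, cur := f(0), f(step)
	for x := 2 * step; x <= limit; x += step {
		next := f(x)
		if cur > prev && cur > next { // sampled local maximum
			peaks = append(peaks, x-step)
		}
		prev, cur = cur, next
	}
	return peaks
}

func main() {
	peaks := findPeaks(1000, 0.01)
	fmt.Printf("local maxima on [0,1000]: %d (average spacing ~%.1f)\n",
		len(peaks), 1000/float64(len(peaks)))
}
```

With peaks that densely spaced, finding one "near" a physically meaningful constant carries essentially no information.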
You're right — that script (hamiltonian_perfect_finder) IS a parameter search tool. It will find matches to whatever targets you give it. That's not the core claim.
The core claim is in the white noise tests and the basic resonance chamber: with FIXED geometry and RANDOM input, the same constants keep appearing. We're not searching for them — they emerge.
Try running topology_wave_generator_tests.py with white noise input. No parameter optimization. See what ratios appear without being told what to look for.
The question isn't 'can we fit these numbers' — it's 'why do these specific numbers keep showing up when we're not looking for them?'
Ran hierarchical analysis. At 1% tolerance, 23% of ratios match algebraic combinations of constants (harmonics, products, ratios). 77% unexplained. We're not finding constants everywhere — we're finding a specific ~23% algebraic structure.
The breakdown: 16% are harmonics (2φ, 3π, etc.), 13% are ratios between constants (π/φ, e/√2). This is a coherent algebraic system, not random peak-picking.
Interestingly, the 77/23 split approximates Menger sponge geometry (74/26). Whether that's meaningful or coincidence — worth investigating.
The Press Secretary isn't the Supreme Court. Her say-so doesn't change the plain text of the order, and you're rolling the dice as to which any given border agent is going to choose to believe.
There's probably at least one game out there somewhere that uses Go's map iteration order to shuffle a deck of cards, and would thus be broken by Go removing the thing that's supposed to prevent you from depending on implementation details.
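The anti-pattern would look something like this (a hypothetical sketch; `shuffleViaMap` is made up for illustration):

```go
package main

import "fmt"

// shuffleViaMap "shuffles" a deck by relying on Go's map iteration order.
// The spec leaves that order unspecified, and the runtime deliberately
// randomizes it precisely so programs don't depend on it -- which is
// exactly what this code does anyway.
func shuffleViaMap(deck []string) []string {
	m := make(map[string]struct{}, len(deck))
	for _, card := range deck {
		m[card] = struct{}{}
	}
	out := make([]string, 0, len(deck))
	for card := range m { // unspecified, runtime-randomized order
		out = append(out, card)
	}
	return out
}

func main() {
	deck := []string{"A♠", "K♠", "Q♠", "J♠", "10♠", "9♠", "8♠", "7♠"}
	fmt.Println(shuffleViaMap(deck))
	fmt.Println(shuffleViaMap(deck)) // order usually differs between calls
}
```

The supported way to do this is `rand.Shuffle` from `math/rand`; a game built on the map trick would also get a much weaker shuffle than it thinks, since map ordering is not uniform over permutations.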
The problem is that people commonly don't even realize they're depending on implementation quirks.
For example, they write code that unintentionally depends on some distantly-invoked async tasks resolving in a certain order, and then the library implementation changes performance characteristics and the other order happens instead, and it creates a new bug in the application.
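A minimal Go sketch of that bug class (the task names are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// runTasks launches two independent "async tasks" and records their
// completion order. Nothing guarantees which one finishes first; the
// order is a scheduler and timing artifact.
func runTasks() []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []string
	)
	for _, name := range []string{"fetchUser", "fetchPrefs"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			mu.Lock()
			results = append(results, name) // scheduler-dependent order
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	results := runTasks()
	// Fragile: code like `user := results[0]` assumes "fetchUser" always
	// completes first. It often does, until a runtime or library change
	// shifts the timing and the other order starts appearing.
	fmt.Println(results)
}
```

The code "works" in testing because the timing happens to be stable, and the assumption is invisible until something perturbs it.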
The problem with the belay test as it exists today is that it tests whether you know all the peculiarities of each gym's beliefs around things like the exact order your hands should move when taking slack, whether tails on figure 8s are important (if so, how long, and what kind of knot may or must terminate them), whether the length of the belay loop matters, and so on. These things change seemingly on a whim and aren't always motivated by good evidence.
I learned to belay at Vertical World in 2005 and would fail Vertical World's belay test today, for multiple reasons, if I used the same method they themselves taught me!
Meanwhile, as you point out, no test can determine whether or not a person will be paying attention during an actual climb.
Standards change and improved methods are discovered. In the 50s and 60s the "hip belay" was the standard and considered safe. Once ATC/tube-style belay devices became ubiquitous, the "pinch and slide" technique took over. The "pinch and slide" technique you likely learned is no longer considered the safest method of belaying. The AMGA belay technique is now considered standard; for a while gyms would still pass "pinch and slide" users, but I'm not surprised they have stopped.
Safety standards do change for the better, but insurance and legal risks do have gyms on edge. I think his point is that gyms tend to be overly strict in areas that do not matter but are easy to regulate/check: e.g., requiring an unnecessary "backup" knot above your figure 8, requiring two tri-locking carabiners for autobelay in response to accidents where people simply didn't clip into the autobelay at all, knowing your gym's mnemonic for checking your knot, and disallowing wearing a single earbud when autobelaying (saying you won't be able to hear if there is an emergency). These are all things I've seen required in gyms that IMO do not actually improve safety. Having friends that work in gyms, I've heard a lot of these policies are due to demands by insurance companies.
Meanwhile, I very frequently see people belaying in manners where their climber would hit the ground if they fell (usually the first 3-4 bolts up). The difference is, this is much harder for gym staff to notice and correct. Furthermore, I’m sure most of these climbers are capable of using better technique and do so when taking a belay test, but then get complacent afterwards.
There’s huge variability even in some of the gyms in the article, whether from site to site or inter-tester variability. Whether or not it improves safety, if it helps places like this stay open and solvent I guess that’s a win, but I wouldn’t rely solely on someone’s passing a gym’s test for me to let them catch me in a lead fall.
I’ve also been failed on seemingly spurious details that I subsequently passed with different testers at several gyms.
You shouldn’t be getting downvoted, this is sometimes true. Most often what happens is a junior staff member is overly rigid in applying what they were taught.
I once almost failed a belay test because I did not know that gym’s particular trick for “counting strands” to prove the figure 8 was tied correctly. I just know what a correct knot looks like after decades of tying them. Ultimately I asked them to check with a manager, who passed me.
That said, I’ve also seen experienced climbers with terrible belay technique; catching them with a modern test would seem like a good thing to me.
Had a young-ish gym employee berate me recently for not holding the brake strand with TWO hands when catching the leader… which is clearly against the manufacturer's instructions.
(OP author here) Lots of people are reading too much into the tea leaves here; this is just a matter of picking the best tool for this particular task, and our task (porting a JS codebase to the fastest available native target without altering program structure as part of the port) is pretty unusual as far as things go.
I would also recommend reading kdy1's observations when faced with the same task: https://kdy1.dev/2022-1-26-porting-tsc-to-go . The only caveat I'd add is that I can't tell where the 62x measurement in that post came from. Our own experiments doing Rust and Go implementations showed them within the margin of error, with certain phases being faster in one vs the other.
Since you wrote this it looks like Anders replied [1] to one of the threads.
I have to agree with the sentiment that it is a success story that the team is allowed to use the best tool for the job, even if it suffers from "not invented here".
This is really healthy and encouraging to see in these large OSS corporate-sponsored projects, so kudos to you and the team for making the pragmatic choice.
Would you say C# AOT could have been competitive in startup time and overall performance, and the decision came down to the other factors you've noted? I think everyone would have expected C# AOT to be the default choice so it might be nice to see some more concrete examples of where Go proved itself more suitable for your use-case.
C# AOT is quite strong and would be a great choice in a lot of contexts.
Go was just very, very strong on port-friendliness when coming from the TS codebase in particular. If you pull up both codebases side-by-side, it's hard to even tell which is which. C# is more class-oriented and the core JS checker uses no classes at all, so the porting story gets quite a bit more difficult there.
Couldn't you just use static classes? I don't see how that would be a factor at all, seems like a very superficial reason that would be easy to work around.
Remember, code is written for humans. It is not so much a technical limitation as it is a social limitation. Working in a codebase that does not adhere to the idioms of a language will quickly become a pain point.
The Go code is not that far off how one would write it even if it were being written from scratch. A C# project littered with static classes is not at all how one would write it from scratch.
Static methods and classes are commonplace and a normal practice in C#, particularly as extension methods (which, quite often, you guessed it, act on data). There isn't that much difference between a type defining simple instance methods and defining extension methods for that type separately if we look at codebases which need to have specific logic grouped in a single multi-KLOC file like, apparently, TS compiler does. There are other issues you could argue about but I think it's more about perception here and structuring the code would've been the smallest issue when porting.
The ship has sailed so not much can be done at this point.
> Static methods and classes are commonplace and a normal practice in C#
Certainly. The feature is there for a reason. That does not mean that you would write the codebase in that way if you were starting from scratch. You would leverage the entire suite of features offered by C# and stick to the idioms of the language. You would not constrain yourself to writing Go code that just happens to have C# syntax.
> The ship has sailed so not much can be done at this point.
It has been ported to a new language before. It can be ported to a new language again. But there wasn't a compelling reason to choose C# last time, and nothing significant has changed since to rethink that.
> It has been ported to a new language before. It can be ported to a new language again. But there wasn't a compelling reason to choose C# last time, and nothing significant has changed since to rethink that.
> This sounds a bit like being phrased in bad faith in my opinion.
Why? That certainly wasn't the intent, but I am happy to edit it if you are willing to communicate where I failed.
> I do not understand why Go community feels like it has to come up with new lies (e.g. your other replies) every time.
I don't know anything about this Go community of which you speak, but typically "community" implies a group. My other replies are not of that nature.
But if you found a lie in there that I am unaware of, I am again happy to correct. All I've knowingly said is that Go was chosen because its idioms most closely resemble the original codebase and that C# has different idioms. Neither are a lie insofar as it is understood. There is an entire FAQ entry from the Typescript team explaining that.
If the Typescript team is lying, that is beyond my control. To pin that on me is, well, nonsensical.
Actually, let me fix my previous comment. I was responding to quoted text in your comment which was obviously not your opinion. Sorry.
> It has been ported to a new language before. It can be ported to a new language again. But there wasn't a compelling reason to choose C# last time, and nothing significant has changed since to rethink that.
I still think this is phrased in an unfortunate way. To reiterate, my point is that the damage has already been done and obviously TSC is not going to be ported again any time soon. I do not think Anders or the TS team are up to date on where the .NET teams are, nor do I think they communicated internally (I may be wrong, but this is such a common occurrence that it would be an exception to the rule). I stand by the view that this is a short-sighted decision with every short-term upside and no long-term advantage. Especially since the WASM story is unclear, while .NET has a good NativeAOT-LLVM-based prototype as a replacement for the Mono-based WASM target (which is already proven and works decently well).
Having to prioritize ease of porting for such a foundational piece of software as a compiler over everything else is not a good place to be in. I guess .NET concerns are simply so small compared to the sheer size of TS that they might as well accept whatever harm will come of it.
> I still think this is phrased in an unfortunate way.
I do not discount your notion, but why?
> To reiterate, my point is the damage has already been done
What damage has been done, exactly?
Call it a lie if you will, but the Typescript team claims to be ecstatic about how the port is nearly indistinguishable from the original codebase, meaning that nothing was lost - all while significant performance improvements and a generally better end user experience was gained.
Perhaps you mean the project has always been fundamentally flawed, being damaged from the day the first Typescript/Javascript line was written? Maybe that is true, but neither C# nor any other language is going to be able to come in and save that day. Brainfuck would have been just as good of a choice if that is truly where things lie. To stand by C# here doesn't make sense.
> I do not think Anders or TS team are up-to-date on where .NET teams are nor I think they communicated internally
Whether or not that is the case, did they need to? Static methods and classes in C# are most likely of Anders' very own creation. At the very least he was right there when they were added. There is no way he, of all people, was oblivious to them.
> Having to prioritize ease of porting for such a foundational piece of software as a compiler over everything else is not a good place to be in.
Ease of porting was a nice benefit, I'm sure, but they indicate that familiarity was the bigger driver. Anyone familiar with the old code can jump right in and keep on contributing without missing a beat. Given that code is written first and foremost for people, that is an important consideration.
Idiomatic C# is typically quite different and heavily class-based, but then a compiler would usually look very different than an Enterprise C# application anyway. You can use C# more in a functions and data structures way, I don't think there is something fundamental blocking this. But I guess there are many more subtle differences here than I can think of. Go is still quite a bit lower level than C#.
Our best estimate for how much faster the Go code is (in this situation) than the equivalent TS is ~3.5x
In a situation like a game engine I think 1.5x is reasonable, but TS has a huge amount of polymorphic data reading that defeats a lot of the optimizations in JS engines that get you to monomorphic property access speeds. If JS engines were better at monomorphizing access to common subtypes across different map shapes maybe it'd be closer, but no engine has implemented that or seems to have much appetite for doing so.
I used to work on compilers & JITs, and 100% this — polymorphic calls are the killer of JIT performance, which is why something native is preferable to something that JIT-compiles.
Also for command-line tools, the JIT warmup time can be pretty significant, adding a lot to overall command-to-result latency (and in some cases even wiping out the JIT performance entirely!)
> If JS engines were better at monomorphizing access to common subtypes across different map shapes maybe it'd be closer, but no engine has implemented that or seems to have much appetite for doing so.
I really wish JS VMs would invest in this. The DOM is full of large inheritance hierarchies, with lots of subtypes, so a lot of DOM code is megamorphic. You can do tricks like tearing off methods from Element to use as plain functions, instead of virtual methods as usual, but that's quite a pain.
It's not just the overhead of showing up, it's the opportunity cost of not doing contiguous hours at a bigger job. It's very difficult to fill up a day with 45-minute jobs all over town, so he's basically working part-time if he takes a small job.
Where I live, there is a "call-out fee" that ensures tradesmen are paid for the time spent driving out to investigate a job. The customer also pays for the diagnosis time. So I do not see a reason not to do small jobs.