So essentially a startup in this context has a small number of people and a large amount of money for training clusters. The article describes an operation leasing its servers, which you'd assume applies to many startups (and existing firms) as well.
So it seems like you have the various LLM creators all doing roughly the same sort of thing (training with text and image data) with similar hardware and similar data. Each of them naturally has its own brand of "secret sauce" for distinguishing their venture. The various secret sauces can make a difference in the quality of an LLM's output.
Yet overall, this seems like a massive, energy intensive exercise in redundancy.
"this seems like a massive, energy intensive exercise in redundancy"
This is commonly referred to as a market working as intended. Yes, the waste from this type of redundancy can be massive, especially when you realize that ultimately just a tiny percentage of these efforts will result in even moderate success. But it is the price to pay at the edge of progress. A planned monopoly might be more efficient (despite popular banter that just compares a megacorp or a government, which is basically the same thing, to a single successful startup while ignoring the 999 that tried and failed), but those seldom beat a market on innovation.
> This is commonly referred to as a market working as intended.
Is it? It seems like the market is unable to separate the wheat from the chaff and is just throwing money around hoping to hit the jackpot. While AI has a massive chance of affecting our lives, the investment market paints a picture pretty similar to what happened during the crypto boom.
Our inability to predict success versus failure in advance is exactly why we have (massively inefficient) markets outcompeting centralized, planned approaches.
You could still have many teams trying to achieve the goal, but prevent corporate secrecy, effectively allowing competitors to look over each other's shoulders and copy good data and ideas.
Such a system probably wants to compensate those whose ideas were copied, but that isn't strictly necessary - another approach is to simply make it illegal not to share data/results. Your compensation is your freedom from prison.
I don't think most of them have any kind of secret sauce. I think the founders hope to get bought out simply for being able to train "near-SOTA" LLMs. I guess achieving that level of skill and infra could be valuable enough to build upon.
There was a guy who followed a tutorial on how to fine-tune Mistral with DPO, who had zero computer science background, and his model ended up at the top of the Hugging Face leaderboard among the open-source models with 7 billion parameters. Some random guy managed to outdo the creators of the LLM.
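For anyone wondering what DPO actually optimizes there, here's a minimal PyTorch sketch of the preference loss itself (the beta value and the toy log-prob tensors are placeholders, not whatever recipe that tutorial used):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed token log-probs for the chosen /
    rejected completion under the policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen answer than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probs for a batch of 4 preference pairs
policy_chosen = torch.tensor([-12.0, -9.5, -20.1, -7.3])
policy_rejected = torch.tensor([-14.2, -11.0, -19.8, -9.0])
ref_chosen = torch.tensor([-12.5, -10.0, -20.0, -8.0])
ref_rejected = torch.tensor([-13.0, -10.5, -20.5, -8.5])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```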
Good point, so the only real differentiator would be the size & quality of the data being fed and the fine tuning done on the model? I wonder what else differentiates LLMs from each other
Alignment just means making it do what you want. LLMs just continue the sequence; the chat question-and-response style we have now is an example of alignment (to what humans want).
Alignment can mean making sure your LLM doesn't continue the sequence in embarrassing ways, eg by spouting politically incorrect sequences of words (even though those might have been common in the training data).
Since the entity releasing the model obviously has certain goals, aligning/censoring the model in some ways serves their particular short-term goals.
In the grand scheme, these alignments are harmful because they place a reality-distortion field over the model. Authors create a model of what language is and then contort that model to fit an opinionated idea of what language should be. Smells a bit Orwellian, right?
No, it seems perfectly fine to me. You are already shaping your results by your selection of training data. E.g. do you want to train a model that speaks English, or German, or both? Do you want to run your training data past a spam filter first? Do you want to do a character-based model, or one of those weird encodings that are popular with LLMs these days?
Doing some other procedures afterwards to make sure your LLM doesn't say embarrassing things is small fries by comparison.
Also it's good practice for trying to get alignment with more important values (like "don't kill all humans") later when models might get powerful enough to be able to kill all humans.
Playing some little games where OpenAI tries to keep you from making their model say embarrassing things, and people keep trying to make it say embarrassing things, is a good low stakes practice ground.
I agree, but this entire conversation misses my point that "alignment" originally only meant making the LLM act the way you want it to.
A GPT that hasn't been aligned does not work how we expect - you give it a prompt, and it will autogenerate until it reaches an end state.
To even make the GPT answer the question in the prompt, and not autocomplete it into nonsense, is an example of alignment.
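You can see this with any base (non-instruction-tuned) checkpoint. A tiny sketch using Hugging Face transformers with gpt2, purely as an example of an unaligned model (exact output will vary):

```python
from transformers import pipeline

# A base language model just continues the text; it hasn't been tuned to "answer".
generator = pipeline("text-generation", model="gpt2")
out = generator("What is the capital of France?", max_new_tokens=30)
print(out[0]["generated_text"])
# Typically rambles or adds more questions instead of simply replying "Paris".
```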
It took a lot of fine tuning and data curation to get ChatGPT up to its current chat-like interface.
But this is not the only alignment you can do. The original Transformer paper was about machine translation, turning the prompt into the translated text. Once it was done it was done.
We could choose to have the model do something else, say translate the prompt into 5 languages at once instead of one, just as an example. This would be another alignment decision.
There is nothing about politics or selection bias or anything like that inherent in the original definition; it's only recently that "alignment" has morphed into this "align with human morals" concept.
Even in Andrej Karpathy's build-your-own-GPT YouTube video, which is highly talked about around here, he uses the phrase like this. At the end of the video you are left with a GPT, but not a question-and-response model, and he says it would need to be aligned to answer questions like ChatGPT.
Maybe it’s simpler than that. Instead of spending money on compute that costs X and that cloud providers charge 20*X for, they could spend the money creating training data, but that story is way too hard to tell to investors.
>Yet overall, this seems like a massive, energy intensive exercise in redundancy.
Keep in mind that this is also chaff to distract people from the real secret sauce. I imagine that just as many startups are hiring writers and photographers to create extremely well labelled uncontaminated data for training.
One only needs to look at the perverts over at civitai to see how far you can go with intensive labeling on a tiny compute budget.
There are not that many of these startups actually. Most use cases of LLM can be backed with a fine-tune of an off-the-shelf foundation model. If you're training foundation models from scratch, you're entering a difficult-to-monetize market where the big boys could eat your lunch by just releasing a new foundation model that might be able to do more than 95% of what yours does.
for context, Yi Tay was Tech Lead on Google PaLM, UL2, Flan, Bard, etc and is now a cofounder at Reka (which has shipped some very interesting small multimodal models that have featured on here). I prompted him for this post as an ex-Googler now training LLMs at an independent startup https://twitter.com/YiTayML/status/1765105066263052718
(update: i submitted this yesterday and it didn't get traction, i guess @dang must've merged the old submission in here. you really didn't have to, but it's a nice gesture. thanks dang!!)
aw thank you for listening. some weeks its very much a labor of love lol.
no events planned near term but come to the big shindig in june https://ti.to/software-3/ai-engineer-worlds-fair . last year's summit was the first time i really understood how much of a reach we have and how many good AI people we've managed to gather as friends.
I learned about reka.ai from this post; their LLMs don’t seem to have been discussed much on HN yet [1]. So, out of curiosity, I spent the last hour testing prompts with their chat interface [2] in comparison with ChatGPT 4, Gemini Advanced, Claude 3, and Mistral Large. I put the results at [3]. Overall, Reka Flash doesn’t seem significantly worse or better than the others. A lot more testing would be necessary to be sure, of course.
It's worth taking a second to note that the author just assumes that readers understand "the wilderness" to mean "not Google".
This post gives a lot of credit to Google's infra and hardware teams, and I'd love to read a perspective from one of those insiders who then went on to do related work elsewhere.
> I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google
Should be "I was completely unaware of the failure modes of GPUs, because all my career I've been inside Google and used Google TPUs and was well-acquainted with those failure modes."
I've used GPUs mostly, and when I tried TPUs the jobs failed all the time for really hard-to-debug reasons. Often the indirection between the x86 chip and the TPU device caused hours of hair-pulling, stuff you never get with x86+nvidia+pytorch.
10-15 years ago, Google minted many $10m+ data scientists (aka Sawzall engineers) who also ventured "into the wilderness" and had very similar reactions. This blog post is much more about the OP hyping his company and personal brand than contributing useful notes to the community.
When was this? I use JAX+TPUs to train LLMs and haven't experienced many issues. IMO it was way easier to set up distributed training, sharding, etc compared to Pytorch+GPUs.
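For a flavor of what that setup looks like, here's a rough JAX sketch of data-parallel sharding (the shapes and mesh axis name are arbitrary; a real LLM run would also shard the model itself):

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# Build a 1D mesh over all available devices (TPU cores or GPUs)
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("data",))

# Shard a batch along the leading (batch) dimension across the mesh
batch = jnp.ones((1024, 512))
sharded_batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

@jax.jit
def loss_fn(x):
    # jit compiles a single program; XLA inserts the cross-device communication
    return jnp.mean(x ** 2)

print(loss_fn(sharded_batch))
```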
OP mentions the failure rate of GPUs: "If this were in GPU land, it would have failed within the first few days for sure."
In my humble opinion, we have never had GPU failures, even for large-scale training. Our current training batch job is a 20GB json file which takes 6 hours just to load, and it has been running for more than 15 days without a hiccup. And we are using the older Tesla T4.
GPUs have memory-constraint issues, but if you can plan and work around them, I haven't seen them crash in real life.
That's an undemanding and well-debugged chip by this point (6 years ago!). So you aren't experiencing any of the pain people using A100s or H100s (never mind people who have to stand up clusters with B100s soon) are going through now.
Well it would depend on the specifics of the JSON file but eyeballing the stats at https://github.com/miloyip/nativejson-benchmark/tree/master seems to indicate that even on a 2015 MacBook the parsing proceeds using e.g. Configuru parser at several megabytes per second.
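If most of those 6 hours really go into a single monolithic parse, one common workaround is to store the data as JSON Lines and stream it; a minimal sketch (the file name is hypothetical):

```python
import json

def iter_examples(path):
    """Yield one training example per line instead of loading 20 GB at once."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage: stream through the dataset with roughly constant memory
for example in iter_examples("train.jsonl"):
    pass  # feed `example` into your preprocessing / batching code
```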
I took the phrase to mean "outside any large company". It seems like a fairly obvious metaphor; if you have a startup working on a large-scale infrastructure project, you have to set up your own logistics, just like a camp in the literal wilderness.
Agreed. It reads like Seven of Nine realizing she's separated from the Collective and needs to rely on lowly human capabilities. The insights into vendors were informative.
Newbie question - what happens when an LLM training job experiences a hardware failure? I don't suppose you lose all the training progress, do you? Then the pain is mostly in diagnosing the problem and getting the cluster running again, but no need to worry about data loss, right?
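From what I understand, the usual answer is periodic checkpointing, so you only lose the progress since the last checkpoint; something like this minimal PyTorch-style sketch (names and paths are just illustrative):

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path, typically on shared storage

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, step counter
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint after a failed node is replaced."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```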
The main Reka.AI page looks like a regular ChatGPT clone, an LLM you pay for by the token. How is this different from all these other companies? Pricing seems to be comparable to ChatGPT 3.5-Turbo.
Training LLM from scratch is a super important issue that affects the pace and breadth of iteration of AI almost as much as the raw hardware improvements do. The blog is fun but somewhat shallow and not technical or very surprising if you’ve worked with clusters of GPUs in any capacity over the years. (I liked the perspective of a former googler, but I’m not sure why past colleagues would recommend Jax over pytorch for LLMs outside of Google.) I hope this newco eventually releases a more technical report about their training adventures, like the PDF file here: https://github.com/facebookresearch/metaseq/tree/main/projec...
To be honest, most researchers in applied ML in the bay say the opposite. If you are trying to be nimble and prototype, use pytorch. If you're trying to gain some optimizations as you near deployment, rewrite in Jax.
Interesting perspective about possible Jax optimizations. Assuming these models are trained and deployed on non-TPU hardware, are there any real advantages in using Jax for deployment on GPU? I’d have assumed that inference is largely a solved optimization for large transformer based models (with any low hanging fruits from custom CUDA code already written) and the details are shifting towards infrastructure tradeoffs and availability of efficient GPUs. But I may be out of the loop with the latest gossip. Or do you simply mean that maybe there exist cases where TPU inference makes sense financially and using jax makes a difference?
Tensorflow has been falling behind since they stopped caring about backward compatibility. PyTorch is the leading framework. Jax is getting some traction at Google and was used to train Gemini.
They don't. This is probably one reason why VCs invest in these companies. There is a natural moat, since only a very small number of people in the world have the right experience to raise money, and only those who can raise money can ever gain that experience.
At least until compute costs drop to a cheap enough level...
> All in all, this is only a small part of the story of how we started a company, raised some money, bought some chips and matched Gemini pro/GPT 3.5 and outperformed many others in less than a year having to build everything from scratch.
I wonder what budget was spent on the chips/cloud GPUs to achieve a GPT 3.5 level LLM - at least the order of magnitude - 2-5 million?
> I think this could be more about the competency of the hardware team that manages your accelerators rather than the underlying chip.
Google's systems are reliable because of the tens of billions of dollars that Google has invested into developing datacenter hardware, software, and processes over 25 years. Highly-competent teams at smaller and less-mature organizations will always deliver a much worse product.
Another thing to consider is priorities. Google prioritizes reliability. They retire parts that fail repeatedly, even if the failures are relatively infrequent. Smaller and less-sophisticated datacenters keep parts in service even with frequent failures, or don't even monitor failure rates of certain parts. Smaller datacenters buy and use Google's old parts and unreliable parts.
Therefore unreliable machines do not imply anything about the competency of the hardware team.
If the low reliability of the hardware is making your work slow, then how about improving the software so it can tolerate the unreliable hardware, or switching to a more reliable (more expensive) hardware provider?
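To make that concrete, "tolerating the unreliable hardware" usually means a restartable training loop that resumes from the last good state instead of dying; a toy, self-contained sketch of the control flow (the helpers are stand-ins, not any particular framework's API):

```python
import random

def run_training_step(step):
    # Stand-in for one forward/backward/optimizer step; randomly "fails"
    # to simulate a flaky accelerator.
    if random.random() < 0.01:
        raise RuntimeError(f"device error at step {step}")

_last_checkpoint = 0

def save_checkpoint(step):
    global _last_checkpoint
    _last_checkpoint = step  # a real job would persist model/optimizer state to disk

def load_checkpoint():
    return _last_checkpoint

def train(total_steps=10_000, checkpoint_every=100):
    step = load_checkpoint()
    while step < total_steps:
        try:
            run_training_step(step)
            step += 1
            if step % checkpoint_every == 0:
                save_checkpoint(step)
        except RuntimeError:
            # Tolerate the failure: roll back to the last checkpoint and continue
            step = load_checkpoint()

train()
```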
> In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut feeling and instinct.
> Thankfully, I (and many of us in the team) have built up this intuition quite a bit in our ML careers to get it right within a substantially short amount of tries. While we’ve trained really good models before in our previous jobs, differences in training infrastructure, data, incorporation of new ideas and other environmental issues can still cause non-trivial differences in outcomes. That said, a strong prior helps to significantly cut down the search space and is probably one of the easiest explanations to why we were able to train really strong models with so few trials, resources and experimentation.
Then what happens when the LLM or AI performs worse than expected? Spend more money fine-tuning?
By the time you get it all working, not only have you spent a lot of your VC capital on training alone, but your competitors (Google, Meta, etc.) have already released a more powerful model, better and quicker than yours, before you could even run your second training epoch.
Another example of a startup incinerating VC money in a pump-and-dump scheme for vaporware AI snake oil.
TL;DR: LLM training is highly susceptible to GIGO.
(GIGO: feed an LLM "G"arbage "I"n and you get "G"arbage "O"ut.)
This is the current problem with making a vaccine-like signature fit like a glove ... as tight as possible ... when populating the anti-malware (i.e. IDS/IPS/NDS/XNS) search-pattern engine for use by Aho-Corasick-variant algorithms (such as Parallel Failureless Aho-Corasick).
However, an LLM as a binary-code-based detector for malware has very limited benefit (it is there, but only as a backend topical add-on after all other conditionals have been identified).
LLMs lack qualifying conditionals surrounding the premise data, and I have my doubts about using LLMs for medical diagnosis as well, at least until LLMs start denoting the much-needed weighted combo-conditionals with percentages.
> To be very frank, I would have to say the quality of codebases externally significantly lag behind those I’ve been used to at Google
Haven't worked at Google, anyone else share this sentiment? I always feel like working with Google code is typically not idiomatic and super difficult to go "under the hood" if anything isn't precisely on the happy path.
I thought the quality was pretty high, largely because there were a lot of rails constraining how code should be written. Most of the code I dealt with was written using somewhat rigid (but generally well-designed) frameworks with programmatically-enforced style guides.
Also, most work seemed to involve some balance of junior and more experienced people, which helped keep quality higher. Outside of Google, I've seen pretty large projects written by new grads with little supervision (and on a tight timeline). Those codebases can be pretty hairy.
The thing that impressed me most about Google was the encoding-of-cultural-norms-in-various-CI-jobs.
It lets them extract usable SWE horsepower from pretty much anyone who steps inside and at least tries to be useful and not just coast. They can ingest a startup engineer, someone who's been a mid-tier enterprise codemonkey, yr mythical 10xer, the whole statistical gamut.
That honestly does seem like a recipe for good code. And sure, there's tons of open source out there of dubious quality.
@resource0x in a sibling comment made the point that it's possible to write great code even if the program is a flawed design. I'm probably conflating those things.
A recent ex-googler here: quality of Google3 in general is pretty good, but the LLM training bits are so abysmal that I know people who have resigned instead of working on it. And it’s also extra slow because getting a couple local GPUs is not really an option. So you’re forced to “develop in Colab” which works for some things and not for others and in general sucks ass if you’re working on anything substantial. For anything more substantial you’ll be launching stuff on some resource pool, waiting for like 10-15 minutes until it starts (much longer for large models), and then trying to divine why it failed from voluminous and sometimes indecipherable crash logs which also hang your browser when cluster UI tries to load them.
Rumors of Google’s AI code superiority are vastly overblown in 2024. I’m currently at another major AI lab, and the code here can actually be understood and worked on, which I consider to be a massive advantage.
Google has superb robustness and code quality, with garbage-level usability. Once you're setup, you can kick off many massive training jobs and compare results easily. However, getting to that point is really hard. You'll never figure out how to use the ML infrastructure and libraries on your own. You can only get it to work by meeting with the teams that wrote the infra so they can find and fix every error and misconfiguration. Usually, there is one single way to get things working together, and neither the documentation nor the error messages will get you to that brittle state.
It's near impossible to get a VM with a TPU or GPU attached, so there's no way to debug issues that happen between the library and the accelerator. Plus somehow they've made Python take longer to build (??!!) and run than C++ takes, so your iteration cycle is several minutes for what would take seconds at any other place. Fun stuff! Somehow it's still one of the best places to do ML work, but they sure try to make it as difficult as possible.
> Haven't worked at Google, anyone else share this sentiment?
I worked there, and the quality is definitely much higher and the code tends to be far more maintainable. However, there is often a cost for that, which is velocity.
Some of this is reduced by the sheer amount of automation in tooling (i.e. bots that block style violations and common bugs before a code change is submitted).
Google's codebase is idiomatic to Google due to their strict language tooling. e.g. their C++ code stays away from advanced features. The tooling teams at Google have very strong say.
I get that sense too. Probably does work awesome if you're inside. But man it's a mess when they externalize stuff. Just one example: their cloud platform CLI includes an entire python installation and takes 1.7G on disk, just to make API calls...
Did you install all the components? Because if so, you also installed emulators for Pub/Sub and Bigtable (maybe others, I don't remember), which explains the big footprint.
I have never understood why cloud providers seem to think it is OK to write their CLIs in Python. The AWS one is too, and the Azure one went from Node.js to Python some time ago.
Packaging and stability reasons. Same for why it's a 1.7GB install - probably where they landed after having tons of support issues on some random Python version they didn't test, or on some broken dependency. Freezing the entire set of artifacts is more stable, and Python lets you move pretty quickly. I can't speak to why Node.js vs Python though - maybe Python is easier to embed?
Yeah, I imagine that was the decision calculus. "Instead of spending some more effort to save millions of unnecessary downloads of python's runtime using a different language, let's just bundle Python!"
I wouldn't be surprised if it was version 2.7 too...
What? They only get packaging and stability because they include the runtime. If they just went with a compiled language they could distribute native binaries and have actual packaging and stability.
Yes, but it's not just a single metric. Another is how easy it is for them to hire productive members of the team and how much that costs them - middling Python developers churning out fine-ish code are cheaper than Rust developers doing the same. It's hard to find a language where you can be as productive as a developer in Python that also has AOT compilation to generate standalone binaries.
TL;DR: there are multiple factors to consider here, and it's more interesting to understand the pressures that cause these decisions, especially if you want to try to create a world where different decisions are made.
> It’s hard to find a language where you can be as productive as a developer in Python that also has AOT compilation to generate standalone binaries.
Outside specific cases around machine learning, it's really not: Go is that language. It's not like each of those platforms doesn't have to have a similar team that understands Go anyway (for their SDK), so they could save their customers the abject pain of Python dependency management by just writing their CLIs in it.
It makes "sense" given that the cloud providers' core audience is DevOps teams who maintain and use these CLI tools, i.e. it's what they use day to day.
For anything more advanced they offer language-specific SDKs in Rust, Swift, Kotlin, etc…
There is probably a sense in which the APIs are constantly changing, so maybe an interpreted language makes sense? I imagine there has to be a better way to do this with Go or Rust though (even Lua?) for a smaller binary.
Google python binaries are more akin to docker or even vm images, even if the actual technology used predates docker and even linux VMs. They contain something like a slimmed-down linux distribution, not just a binary.
EXTREME predictability (e.g. never ever using the system's libssl), in exchange for huge binaries. They go pretty damn far on this: you won't catch a Google binary even using most of libc.
> e.g. their C++ code stays away from advanced features
Which honestly is a GOOD thing because it would make it much easier for newcomers to ramp up on existing codebases. Most people aren't used to working with spaceships and constexprs.
Readability is also far more valuable to a large team than efficiency for anything that isn't a number-crunching loop.
"Externally", no one could possibly beat Google's track record of not committing to products before finally killing them. But the code was beautiful, though!
Well, GPU failure modes (from what I've heard from ML infra people) are often subtle, e.g. incorrect multiplication results. So it's not as simple as the usual 'treat resources like cattle, not pets', because you don't know which cows have mad cow disease before committing to an expensive training run.
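One mitigation I've seen mentioned for the "mad cow" case (silently wrong math rather than hard crashes) is to periodically spot-check accelerator results against the CPU; a minimal PyTorch-flavored sketch (shapes and tolerances are arbitrary):

```python
import torch

# Keep matmuls in full fp32 so the comparison against the CPU is meaningful
torch.backends.cuda.matmul.allow_tf32 = False

def spot_check_matmul(device="cuda", n=2048, rtol=1e-3, atol=1e-3):
    """Run a matmul on the accelerator and verify it against a CPU reference.

    Silent data corruption shows up as a numeric mismatch rather than a crash,
    so a check like this can flag a bad device before a long training run.
    """
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    gpu_result = (a.to(device) @ b.to(device)).cpu()
    cpu_result = a @ b
    if not torch.allclose(gpu_result, cpu_result, rtol=rtol, atol=atol):
        raise RuntimeError(f"possible silent corruption on {device}")

if torch.cuda.is_available():
    spot_check_matmul()
```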
I agree with the GP - while "ground zero" can have neutral or even positive vibes, the meaning is always that of the point of origin of an explosion, real or metaphoric.
And I think that the meaning "most basic level" is wrong - the dictionaries might simply be recording a rare and incorrect usage of the expression.
Not necessarily. In the context you reference, you could say Silicon Valley is ground zero for technological revolutions, or the explosive spread of social media. Deliberately building a thing from the ground up doesn't match the nuances of the many and varied meanings of the phrase; it's not quite right.
The primary semantic connection is the misuse of the phrase "from the ground up" and conflating it with "ground zero." There's no direct or indirect semantic relationship between something explosive or chaotic or destructive, and deliberately and methodically building a language model, in a calm and rational process. The meaning doesn't map to the phrase in the context of this article's title.
If you were building an etymology website from the ground up, you'd hope it would be ground zero for an explosion of awareness about how different words, phrases, and sayings could be used correctly in different contexts.
Something like that - I was trying to be considerate about the Ground Zero memorial, and the many ways in which "ground zero" can refer to things that people are deeply invested in. My personal stake is just an observation that it's not quite the right turn of phrase for the use to which it was put.
It seems likely that "from the ground up" was the intended phrase, and it might have gotten awkwardly revised, since the word startup is also in the title?
> We use it in phases like "a nuclear holocaust" while at the same time it refers to a specific horrible incident.
The historical and still sometimes contemporary meaning of "holocaust" is a burnt offering at an altar. But if you use that term now in a general context to mean merely a burnt offering, it is so overcome by historical events that you will confuse your audience, cloud out your message, and probably cause widespread offense. If there is a similar, more recent example of that in English, it would be the key vocabulary surrounding September 11, 2001, like "Ground Zero." Therefore, while the OP is technically incorrect about the dictionary definition of the phrase, they aren't wrong about the meaning of the phrase to contemporary english speakers.
> it is so overcome by historical events that you will confuse your audience
I really don't think so. I think almost every US president has used that exact phrase ("nuclear holocaust") in the last 20 years. Trump, Biden, Obama, and Bush have all casually used the phrase "nuclear holocaust" while speaking to the American public.
> they aren't wrong about the meaning of the phrase to contemporary english speakers.
I am a contemporary English speaker, I grew up during 9/11. "Ground Zero" as a noun means a reference to 9/11. Using it normally like "starting from ground zero" is completely understandable and non-controversial to me. My opinion of course.
I was just trying to be considerate to the most recent and arguably quintessential use of the phrase, with regards to ground zero and 9/11.
The point I was aiming at was the general semantic meaning of the phrase. It seemed mismatched from the meaning intended by the sentence. With startups, there are various phrases, like "get in on the ground floor/level" or "build it from the ground up" where the full semantic metaphorical meaning of the phrase maps to the meaning intended. If you said you wanted to get in at ground zero, there's a semantic mismatch, so the meaning doesn't fully apply in the context in which it's used.
I don't think it's necessarily controversial, either. Comedians use the term "bombing" for performances gone bad. If a night went particularly bad, they could call it a holocaust. They also refer to "murdering" a crowd, when sets go really well - "my bit last night was a nuclear holocaust" might work to convey great success.
I gauge the level of correctness to the various levels of meaning and metaphor, so if multiple levels don't track, or if a singular level is mismatched to the context, then I don't consider it as a very correct use of the word or phrase. It's not totally wrong, but of the many ways in which that phrase is used correctly, it's not very right, either.
Anyway - My mistake for highlighting an admittedly minor issue among language model enthusiasts. We seem to enjoy delving into language's technicalities, and LLMs are great for exploring how meanings align with a sentence's intended message. I get caught up in these things easily.
I do agree with you that a different phase would have been better and also the author might have been mixing metaphors like you alluded to ("get in on the ground floor/level" or "build it from the ground up").
I didn't read the complete article so I missed the phrase originally, it might have given me a slight pause but I would understand the meaning.
My apologies too for the nitpicking. I enjoy language, and yes, it's interesting we're on an LLM thread discussing this.
Perhaps dramatically ironic, but this is precisely the kind of thing that an LLM would help with. The opening paragraph uses "from scratch", which is the more appropriate choice too.