foundval's comments | Hacker News

You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.

If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....
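
If you'd rather poke at it from code than the playground, something like the following should work against our OpenAI-compatible endpoint once you have a key from the console. Treat the model id as a placeholder and check the docs for the current list:

    # Minimal sketch of a chat completion request against Groq's
    # OpenAI-compatible endpoint. The model id below is an assumption;
    # use whatever the console lists for Llama 3.1.
    import os
    import requests

    resp = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        json={
            "model": "llama-3.1-70b-versatile",
            "messages": [{"role": "user", "content": "Hello, Llama 3.1!"}],
        },
        timeout=30,
    )
    print(resp.json()["choices"][0]["message"]["content"])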

(disclaimer, I am a Groq employee)


We also added Llama 3.1 405B to our VSCode copilot extension for anyone to try coding with it.

Free trial gets you 50 messages, no credit card required - https://double.bot

(disclaimer, I am the co-founder)


Would be great if there were a page showing benchmarks compared to other auto-completion tools.


Groq's TSP architecture is one of the weirder and more wonderful ISAs I've seen lately. The choice of SRAM is fascinating. Are you guys planning on publishing anything about how you bridged the gap between your order-hundreds-of-megabytes SRAM TSP main memory and multi-TB model sizes?


There is a lot out there.

I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.

This two-part AMA has a lot more detail if you're already familiar with what we do:

https://www.youtube.com/watch?v=UztfweS-7MU

https://www.youtube.com/watch?v=GOGuSJe2C6U


Thanks!


You can chat with all these models for free, at ultra-low latency, at https://nat.dev/chat, a hosted site by the former GitHub CEO.


Just checked it out. Is pay-as-you-go API access available at all? It says 'Coming Soon'

https://console.groq.com/settings/billing


I've found Bedrock to be nice with pay-as-you-go, but they take a long time to adopt new models.


And twice as expensive in comparison to the source providers’ APIs


I think you answered it yourself? It’s coming soon, so it is not available now, but soon.


It's been coming soon for a couple of months now; meanwhile Groq churns out a lot of other improvements, so to an outsider like me it looks like it's not terribly high on their list of priorities.

I'm really impressed by what (&how) they're doing and would like to pay for a higher rate limit, or failing that at least know if "soon" means "weeks" or "months" or "eventually".

I remember TravisCI did something similar back in the day, and then Circle and GitHub ate their lunch.


405B is already being served on WhatsApp!

https://ibb.co/kQ2tKX5


How do you get that option?



At what quantisation are you running these?


If you're interested in this sort of stuff, you might like this diff-based CLI tool I wrote:

https://github.com/freuk/iter

It runs on Groq (the company I work for), so it's super snappy.


(Groq Employee) As I'm sure you're aware, XTX takes its name from a particular linear algebra operation (XᵀX) that happens to be used a lot in finance.

Groq happens to be excellent at doing huge linear algebra operations extremely fast. If they are latency sensitive, even better. If they are meant to run in a loop, better still: that reduces the bandwidth cost of shipping data into and out of the system. So think linear-algebra-driven search algorithms. ML training isn't in this category because of the bandwidth requirements. But using ML inference to intelligently explore a search space? Bingo.

If you dig around https://wow.groq.com/press, you'll find multiple such applications where we exceeded existing solutions by orders of magnitude.
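
For anyone who hasn't run into it: the operation in question is just XᵀX, the Gram matrix that shows up in least-squares fits. A toy numpy sketch, purely illustrative and nothing to do with how our hardware actually schedules it:

    # Toy illustration of the X^T X (Gram matrix) operation behind the name,
    # used here to solve an ordinary least-squares fit via the normal equations.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))                 # 1000 observations, 8 features
    y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1000)

    gram = X.T @ X                                 # the "XTX" part
    beta = np.linalg.solve(gram, X.T @ y)          # (X^T X) beta = X^T y
    print(beta)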


(Groq Employee) Thanks for the feedback :) We're always improving that demo.


(Groq Employee) Yes! Determinism + Simplicity are superpowers for ALU and interconnect utilization rates. This system is powered by 14nm chips, and even the interconnects aren't best in class.

We're just that much better than GPUs at squeezing tokens out of transistors and optical cables - and you can imagine the implications for Watts per token.

Anyways... wait until you see our 4nm. :)


(Groq Employee) Agreed, one should care, especially since this particular service is very differentiated by its speed and has no competitors.

That being said, until there's another option at anywhere near that speed... that point is moot, isn't it? :)

For now, Groq is the only option that lets you build a UX with near-instant response times, or live agents that help with human-to-human interactions. I could go on and on about the product categories this opens up.


Why go so fast? Aren't Nvidia's products fast enough from a TPS perspective?


OpenAI has a voice-powered chat mode in its app, and there's a noticeable delay of a few seconds between finishing your sentence and the bot starting to speak.

I think the problem is that for realistic TTS you need quite a few tokens, because the prosody can be affected by tokens that come a fair bit further down the sentence. Consider the difference in pitch between:

"The war will be long and bloody"

vs

"The war will be long and bloody?"

So to begin TTS you need quite a lot of tokens, which in turn means you have to digest the prompt and run a whole bunch of forward passes before you can start rendering. And of course you have to keep up with the speed of regular speech, which OpenAI sometimes struggles with.

That said, the gap isn't huge. Many apps won't need it. Some use cases where low latency might matter:

- Phone support.

- Trading. Think digesting a press release into an action a few seconds faster than your competitors.

- Agents that listen in to conversations and "butt in" when they have something useful to say.

- RPGs where you can talk to NPCs in realtime.

- Real-time analysis of whatever's on screen on your computing device.

- Auto-completion.

- Using AI as a general command prompt. Think AI bash.

Undoubtedly there will be a lot more, though. When you give people performance, they find ways to use it.
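
To make the buffering point concrete, here's roughly the shape of pipeline I have in mind. llm_stream and tts_speak below are made-up placeholders, not any particular API:

    # Rough sketch: buffer streamed LLM tokens until a clause/sentence boundary,
    # then hand the chunk to TTS. llm_stream() and tts_speak() are hypothetical.
    from typing import Iterable

    def speak_streaming(tokens: Iterable[str], tts_speak) -> None:
        buffer = []
        for tok in tokens:
            buffer.append(tok)
            # Prosody depends on how the clause ends, so wait for punctuation
            # before committing to audio.
            if tok.rstrip().endswith((".", "?", "!", ",")):
                tts_speak("".join(buffer))
                buffer.clear()
        if buffer:                       # flush whatever is left at the end
            tts_speak("".join(buffer))

    # Usage, with the made-up helpers:
    # speak_streaming(llm_stream("Tell me about the war"), tts_speak)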


You've got good ideas. What I personally like to say is that Groq makes the "Copilot" metaphor real. A copilot is supposed to be fast enough to keep up with reality and react live :)


Hi foundval, can we connect on LinkedIn please?


(Groq Employee) It's hard to discuss Tok/sec/$ outside of the context of a hardware sales engagement.

This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!

As far as batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch sizes still give you benefits. But batches in the hundreds generally give no advantage. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.
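
If it helps, here's a toy model of the tradeoff with completely made-up numbers (not Groq's, and not any particular GPU's): on a bandwidth-bound design, batching raises system throughput without helping any single user, which is exactly what a synchronous Batch=1 design is built to sidestep.

    # Toy model, illustrative numbers only.
    # Decoding one token requires streaming all weights through the ALUs,
    # so per-step time is roughly weight_bytes / memory_bandwidth.
    weight_bytes = 140e9          # ~70B params at fp16 (assumption)
    mem_bw       = 3e12           # hypothetical 3 TB/s memory bandwidth

    step_time = weight_bytes / mem_bw              # seconds per decode step
    for batch in (1, 8, 64):
        tok_s_user   = 1 / step_time               # each user still waits a full step
        tok_s_system = batch / step_time           # weight reads amortized over the batch
        print(batch, round(tok_s_user), round(tok_s_system))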


(Groq Employee) You're right - we are comparing to independently-clocked logic.

I wonder whether async logic would be feasible for reconfigurable "Spatial Processor" type architectures [1]. As far as LPU architectures go, they fall into the "Matrix of Processing Engines" [1] family, which I would naively guess is not the best suited to leverage async logic.

1: I'm using the "Spatial Processor" (7:14) and "Matrix of Processing Engines" (8:57) terms as defined in https://www.youtube.com/watch?v=LUPWZ-LC0XE. Sorry for the video link; I just can't think of another single reference that explains the two approaches.


Curiously, almost all of this video is covered by the computer architecture literature of the late '90s and early '00s. At the time, I recall Tom Knight had done most of the analysis in this video, but I don't know if he ever published it. It was extrapolating into the distant future.

To answer your questions:

- Spatial processors are an insanely good fit for async logic

- Matrix of processing engines are a moderately good fit -- definitely could be done, but I have no clue if it'd be a good idea.

In SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation will take as much time as it needs, and no more.

Footnote: Personally, I think there are a lot of good ideas in 80's-era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we could have *many* more of them than we can modern multiply units. Critically, they could be positioned closer to the data, so there would be less data moving around. Especially the early pipelined / scalar / RISC cores seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz. It has the same number of transistors as 32,000 SA-110 processors (running at 200MHz on a 350 nanometer process in 1994).
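
For the curious, the 32,000 figure is just transistor-count division with rough public numbers:

    # Back-of-envelope for the comparison above (rough public figures,
    # so treat this as order-of-magnitude):
    ad102_transistors = 76.3e9    # RTX 4090 (AD102), ~76 billion transistors
    sa110_transistors = 2.5e6     # StrongARM SA-110, ~2.5 million transistors
    print(ad102_transistors / sa110_transistors)   # ~30,500, i.e. "about 32,000"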

TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.


This was sort of the dream of KNL, but today I noticed:

    Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are marked as deprecated. GCC will emit a warning when using the -mavx5124fmaps, -mavx5124vnniw, -mavx512er, -mavx512pf, -mprefetchwt1, -march=knl, -march=knm, -mtune=knl or -mtune=knm compiler switches. Support will be removed in GCC 15.

The issue was that coordinating across this kind of hierarchy wasted a bunch of time. If you already knew how to coordinate, you could mostly get better performance instead.

You might be surprised, but we're getting to the point where communicating across a supercomputer is on the same order of magnitude as talking across a NUMA node.


I actually wasn't so much talking from that perspective, as simply from the perspective of the design of individual pieces. There were rather clever things done in e.g. older multipliers or adders or similar which, I think, could apply to most modern parallel architectures, be that GPGPU, SP, MPE, FPGA, or whatever, in order to significantly increase density at a cost of slightly reduced serial performance.

For machine learning, that's a good tradeoff.

Indeed, with some of the simpler architectures, I think computation could be moved into the memory itself, as long dreamed of.

(Simply sticking 32,000 SA-110 processors on a die would be very, very limited by interconnect; there's a good reason the architectures we're seeing aren't that.)


Truth is, there is another startup called Graphcore that is doing exactly that, and also making a really big chip.


I assume no one will read this, but good places to look for super-clever ways to reduce transistor count while maintaining good performance:

- Early mainframes / room-sized computers (era of vacuum tubes and discrete transistors), especially at the upper end, where there was enough budget to have modern pipelined and scalar architectures.

- Cray X-MP and successors

- DEC Alpha / StrongARM (referenced SA-110)

Bad places to look are all the microcode architectures. These optimized transistor count, often sacrificing massive amounts of performance in order to save on cost. Ditto for some of the minicomputers, where the goal was to make an "affordable" computer. Something like the PDP was super-clever about cost-cutting, which made sense at the time, but does much less to maintain performance.

There's a ton of long-forgotten cleverness.


They do what you were talking about, not what I was.

They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."

There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.

The architecture is reminiscent of some circa-2000 ideas which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; the company was acquihired by EZchip for a fraction of the investment put into it, then went to Mellanox, and then to Nvidia).


Sweet, thanks! It seems like this research ecosystem was incredibly rich, but Moore's law was in full swing, and statically known workloads weren't useful at the compute scales of the time.

So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.


Lots of things were useful to compute.

The problem was

1) If you took 3 years longer to build a SIMD architecture than Intel took to make a CPU, Intel would be 4x faster by the time you shipped.

2) If, as a customer, I were to code to your architecture, and it took me 3 more years to do that, by that point Intel would be 16x faster.

And any edge would be lost. The world was really fast-paced. Groq was founded in 2016. It's 2024. If it were still the heyday of Moore's Law, you'd be competing with CPUs running 40x as fast as today's.

I'm not sure you'd be so competitive against a 160GHz processor, and I'm not sure I'd be interested knowing a 300+GHz part was just around the corner.
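
For what it's worth, the 40x is just the classic ~18-month doubling applied to the 2016-2024 gap:

    # Where the rough 40x comes from, assuming the classic ~18-month doubling:
    years = 2024 - 2016           # Groq founded in 2016, it's now 2024
    doublings = years / 1.5       # ~5.3 doublings
    print(2 ** doublings)         # ~40x; 4 GHz * 40 is roughly 160 GHz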

Good ideas -- lots of them -- lived in academia, where people could prototype neat architectures on ancient processes and benchmark themselves against CPUs of yesteryear built on those same processes.

