Transformers on Chips (etched.ai)
94 points by vasinov on Dec 16, 2023 | 62 comments


Founder here!

We're still in stealth, but I'll be able to share details and performance figures soon.

Our first product is a bet on transformers. If we're right, there's enormous upside - being transformer-specific lets you get an order of magnitude more compute than more flexible accelerators (GPUs, TPUs).

We're hiring - if the EV makes sense for you, reach out at gavin @ etched.ai


You’re still in stealth but you’re asking us to meet your supercomputer and are sharing benchmarks that are unfalsifiable. Better show and tell a lot more real soon.


You seem to be downvoted because of the lack of details. Upvoted and thanks for commenting.

1. Do you have a working prototype?

2. Are the pictures real (or close), or entirely CGI?

That would win over a lot of people here on HN.


Hey Gavin, I regret how much skepticism you faced here, especially from me.

I believe honest feedback is important but that it should be given in the most productive way possible.

I always want to see fellow entrepreneurs succeed here, and will definitely keep an open mind as you release more details. Best of luck!


Very promising, excited to learn more!

Any thoughts on State Space Models?

Eg:

https://github.com/havenhq/mamba-chat

https://arxiv.org/abs/2311.18257


Curious what approach you’re using. I did some work replicating this paper on an arty7 fpga: https://arxiv.org/abs/2210.08277 - any similarities?


What’s your projected “model to chip” turnaround?


I am not buying this at all. But I’m not a hardware guy so maybe someone can help with why this is not true:

- Crypto hardware needed SHA256 which is basically tons of bitwise operations. That’s way simpler than the tons of matrix ops transformers need.

- Nvidia wasn't focused on crypto acceleration as a core competency. They are focused on this, and are already years down the path.

- One of the biggest bottlenecks is memory bandwidth. That is also not cheap or simple to do.

- Say they do have a great design. What process are they going to build it on? There are some big customers out there waiting for TSMC capacity already.

Maybe they have IP and it’s more of a patent play.

(I mention crypto only as an example of custom hardware competing with a GPU)


> One of the biggest bottlenecks is memory bandwidth. That is also not cheap or simple to do.

This is precisely why people are trying to put logic into memory instead of just making the logic chips simpler. Compute being 10x faster doesn't mean much when you want real-time, near-zero latency in current (and likely future) ML workloads. Memory bandwidth at low batch sizes is much more important, and even though this chip comes with HBM3E (which is cutting edge), that by itself won't make it faster than an H200/MI300X.
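A rough back-of-the-envelope sketch of why single-stream decoding is bandwidth bound (all numbers below are illustrative assumptions, not measured figures):

    # Toy roofline estimate for single-stream decoding: every generated token
    # has to stream essentially all the weights from HBM once, so tokens/sec
    # is capped by bandwidth / model size no matter how much compute you have.
    # Numbers are illustrative assumptions, not vendor specs.
    model_params = 70e9        # assume a 70B-parameter model
    bytes_per_param = 2        # fp16/bf16 weights
    weight_bytes = model_params * bytes_per_param

    hbm_bandwidth = 3.35e12    # assume ~3.35 TB/s of HBM bandwidth

    ceiling = hbm_bandwidth / weight_bytes
    print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s per stream")
    # ~24 tokens/s at batch size 1; a 10x faster compute array doesn't move this.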


IIRC Ethereum ASICs were also memory bandwidth bound. With KV caching, transformers are just lots and lots of matrix-vector multiplication and are bound by loading the huge weight matrices onto the cores.
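A minimal numpy sketch of that point, with made-up dimensions: once the KV cache holds the past tokens, each decode step applies every weight matrix to a single vector.

    import numpy as np

    # With a KV cache, each decode step processes one new token, so every weight
    # matrix is applied to a single vector: a matvec, not a matmul.
    # Dimensions are made up for illustration.
    d_model, d_ff = 4096, 11008
    W_up = np.random.randn(d_ff, d_model).astype(np.float32)
    x = np.random.randn(d_model).astype(np.float32)  # hidden state of the newest token

    h = W_up @ x  # ~d_ff * d_model weights read for ~2 * d_ff * d_model FLOPs
    # Roughly 2 FLOPs per weight loaded, so the step is limited by how fast the
    # weights stream out of memory, not by how many FLOPs the chip can do.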


https://www.eetimes.com/harvard-dropouts-raise-5-million-for...

“Uberti cites bitcoin mining chips as an example of a successful specialized ASIC offering.“

The founder also references crypto, so your comparison is an apt rebuttal to an argument you didn’t know they were making.

Overall, the article gives a small bit of detail, which is infinitely more than can be gleaned from the website.


You are not the only one who is skeptical.

Nvidia has devoted an astronomical amount of effort to supporting AI as their “next big thing”.

…and here is an information-free landing page showing perf an order of magnitude above what Nvidia is offering.

…but no numbers. You can get called out for numbers.

A vague infographic is much safer.

When things seem too good to be true, they usually are.

I guess it's some custom hardware with a cherry-picked metric here, but frankly the whole thing screams scam.

If it was that easy, Amazon, Google, etc would have already done it with their proven ability to make new silicon.


Title was a bit of a letdown. I was hoping for a discussion of silicon planar transformers (like, the electrical component), which are of increasing interest in RF ICs. :)


Yeah me too, they really ought to explain themselves better


There is a lot going on in the LLM / AI chip space. Most of the big players are focusing on general purpose AI chips, like Cerebras and Untether. This, which I understand to be more like an ASIC, is an interesting market. They give up flexibility but presumably can make the chips more cheaply. There is also Positron AI in this space, mentioned here: https://news.ycombinator.com/item?id=38601761

I'm only peripherally aware of ASICs for bitcoin mining; I have no idea about the economics or cycle times. It would be interesting to see a comparison between bitcoin mining chips and AI.

One thing I wonder about is that all of AI is very forward looking, i.e. anticipating there will be applications that warrant building more infrastructure. It may be a tougher sell to convince someone they need to buy a transformer inference chip now as opposed to something more flexible they'll use in an imagined future.


Only one certainty: HBM memory makers will be doing nicely in the current climate, as all these AI processing options are using it in larger and larger volumes. Those will be the unnoticed winners in this rush.


In the cloud, these chips will compete head to head with GPUs. If they are able to pull off a 10x price/performance win without excessive porting work… it’ll take off in a heartbeat.


Like ASIC Bitcoin miners did. There are parallels here in how it might just pan out.


Interesting point. That said, the AI model space is rapidly evolving, while bitcoin's hashing problem is static. This makes it significantly more risky to make a large capital investment in dedicated HW when it's unclear whether it will be able to run the next big model architecture. For instance, if this had been built and released a year ago, before SOTA models used MoE, it would rapidly have become obsolete.


Outside of hardware/implementation optimizations, and position embedding choice - has the SOTA transformer architecture evolved that much?

Llama-2 code appears to be about the same as gpt-2.


You can look at https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp... for examples of the different layers in a number of different models, and further down in the code for their implementations. tldr, yes they are very similar. I can see lots of value in something that can just run these models. Even if you just supported llama2 there are tons of options available.


Oh man, all those years back I made a choice between antminer and butterfly labs. I backed the wrong horse.

BFL mined with customer hardware and basically didn't ship units to customers until there was no profit in running one.

Crypto ASICs are a super weird edge case IMHO in chips, strictly speaking it's not rational to sell them if they are very profitable. It only makes sense if the customer has a different risk profile than you; or the customer can somehow get power more cheaply than you; or you have some kind of scam going on; or you couldn't get capital except by presales and are unusually honest.

Note that an additional profit-making option for crypto ASIC producers is to secretly over-produce and compete with your customers and you are unlikely to get caught doing this.


That isn't correct. Say the ASIC manufacturer can produce a unit for X, it costs Y to operate, and it earns Z from mining over its life. They can price the unit at roughly Z - Y minus a margin left for the buyer, and as long as that price is above X, selling is profitable. The lower the operating cost, the higher your profit margin per chip if you sell it. Selling the chip might have a 50% profit margin while mining has a 10% profit margin. If you wanted to get into mining, you would build a holding company owning both types of companies.
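A toy version of that arithmetic, with invented numbers just to show the trade-off:

    # Invented numbers for one mining unit over its useful life.
    build_cost   = 1_000   # X: cost to manufacture
    operate_cost = 5_000   # Y: power etc. to run it yourself
    mining_yield = 6_600   # Z: expected mining revenue

    buyer_margin = 100                                         # profit left for the buyer
    sale_price = mining_yield - operate_cost - buyer_margin    # 1,500

    profit_selling = sale_price - build_cost                   # 500 -> 50% margin on build cost
    profit_mining  = mining_yield - operate_cost - build_cost  # 600 -> 10% margin on total cost
    # Selling is rational whenever sale_price > build_cost; with numbers like
    # these the seller gets a fatter margin per unit of capital and carries none
    # of the operating risk, which is the point being made above.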


Where did this come from? There is absolutely nothing clickable except 'contact us' which just reloads the same page? There's almost zero information here?


Maybe you have JS disabled? It's one of those fancy animate-as-you-scroll websites.


No, I see the animation as I scroll. Very little information though, and no links as far as I can tell to more anywhere. The one clickable element to contact them seems broken.


My comment is about the general idea (LLM transformers on a chip), not the particular company, as I have no insight into the latter.

Such a chip (with support for LoRA finetuning) would likely be the enabler for next-gen robotics.

Right now, there is a growing corpus of papers and demos that show what's possible, but these demos are often a talk-to-a-datacenter ordeal, which is not suitable for any serious production use: too high latency, too much dependency on the Internet.

With a low-latency, cost- and energy-efficient way to run finetuned LLMs locally (and keep finetuning based on the specific robot experience), we can actually make something useful in the real world.


Product page like this... they haven't even designed the chip. Complete vaporware.


Oh. Not that you can tell from the web site.


This only tells me we are at peak AI hype, given that products like this have to dress up ASICs as 'Transformers on Chips' or 'Transformer Supercomputer'.

As always, there are no technical reports or in-depth benchmarks, just an unlabelled chart comparing against Nvidia H100s with little context, plus marketing jargon aimed at the untrained eye.

It seems that this would tie you into a specific neural net implementation (i.e. llama.cpp as an ASIC) and would require a hardware design change to support another.


Isn't this kinda pigeonholing yourself to one neural network architecture? Are we sure that transformers will take us to the promised land? Chip design is a pretty expensive and time consuming process, so if a new architecture comes out that is sufficiently different from the current transformer model wouldn't they have to design a completely new chip? The compute unit design is probably similar from architecture to architecture, so maybe I am misunderstanding...


It's a bet. Probably a good one to make. The upside of being the ones who have an AI chip (not a graphics chip larping as an AI chip) is huge. It will run faster and more cheaply. You get to step all over OpenAI, or get a multi-billion-dollar deal to supply Microsoft data centres. Or these ship on every new laptop, etc. You get to be the next unicorn ($1tn company). So that is a decent bet for investors assuming the team can deliver. Yes, the danger is there is a new architecture that runs on a CPU that for practical purposes whoops Attention's ass. In which case investors can throw some money at ASICifying that.


Yep, transformers showed up in 2017, nearly 7 years ago, and they still wear the crown. Maybe some new architecture will come to dominate eventually, but I would love a low cost PCIe board that could run 80B transformer models today.


If the fairy tale numbers are correct, they could price it at a million dollars and it would still be cheap.


Well, GPT-4 runs on a transformer architecture, and even if for unknown reasons GPT-4 is the upper limit of what you can achieve with transformer models, having hardware specialized to run the architecture extremely fast would always be very useful for many tasks (the tasks GPT-4 can already handle at least).


This was my first thought too. Even if transformers turn out to be the holy grail for LLMs, people are still interested in diffusion models for image generation.

I think we’re about to see a lot of interesting specialized silicon for neural nets in the coming years, but locking yourself into a specific kind of model seems a little too specialized right now.


Diffusion models could actually be implemented with transformers, hypothetically. Their training and inference procedures are what make diffusion models unique, not the model architecture.


Could probably go even faster burning GPT-4's weights right into the silicon. No need to even load weights into memory.

Granted, that eliminates the ability to update the model. But if you already have a model you like that's not a problem.


Yeah, I call BS on this. This does nothing to address the main issue with autoregressive transformer models: memory bandwidth.

GPU compute units are mostly sitting idle these days, waiting for chip cache to receive data from VRAM.

This does nothing to solve that.


You can amortize memory loading with large continuous batching. I imagine more compute would help for certain workloads like speculative decoding.
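A rough sketch of the amortization, with illustrative numbers (not measured figures):

    # Illustrative only: one pass over the weights can serve a whole batch of
    # streams, so the bandwidth-bound token rate scales with batch size (until
    # compute or KV-cache traffic becomes the new limit).
    weight_bytes = 140e9     # assumed 70B fp16 model
    bandwidth = 3.35e12      # assumed ~3.35 TB/s HBM

    for batch in (1, 8, 64):
        aggregate_tok_s = batch * bandwidth / weight_bytes
        per_token_ms = 1000 * weight_bytes / bandwidth  # one full weight sweep
        print(batch, round(aggregate_tok_s), round(per_token_ms, 1))
    # Aggregate throughput grows ~linearly with the batch, but each stream still
    # waits at least one weight sweep per token (and in practice a bit more, as
    # compute and KV-cache reads grow), so batching buys throughput, not latency.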


Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.


Not exactly idle, but only at around 30% utilization on average (measured on a ~900 GPU cluster over ~25 days).


If it's at 30% utilization then it's "mostly idle".


I agree. I am surprised that many folks, not you of course, think that is okay.


Wow. I wish I could get a computer or VM/VPS with this. Or rent part of one. Use it with quantized models and llama.cpp.

Seems like a big part of using these systems effectively is thinking of ways to take advantage of batching. I guess the normal thing is just to handle multiple users' requests simultaneously. But maybe another one could be moving from working with agents to agent swarms.


I don't see them doing direct sales, and it looks like a cloud offering.

For training, the big part of using these things isn't batching; it's mainly designing the network, cleaning the data, and then training it to get results. Training involves batching, but that's already baked into the libraries.

For inference you take the trained model, which is huge, load it into memory, and have it predict output. The design of this architecture is to not use quantization, because lower precision is for when you want to use less memory, while this has a huge amount of memory. To handle multiple users' requests you don't need batching; a message queue with multiple receivers, each holding a copy of the latest trained model, would work.


Interesting how MCTS decoding is called out. That seems entirely like a software aspect, which doesn't depend on a particular chip design?

And on the topic of MCTS decoding, I've heard lots of smart people suggest it, but I've yet to see any serious implementation of it. It seems like such an obviously good way to select tokens, you'd think it would be standard in vLLM, TGI, llama.cpp, etc. But none of them seem to use it. Perhaps people have tried it and it just doesn't work as well as you would think?


It’s very difficult to implement, and requires training the network to use it.

I worked at DeepMind on projects that used MCTS. Even with access to the AlphaZero source code, it was very difficult to write another implementation that got the same results as the original.


I'm really curious about this part:

> and requires training the network to use it.

I thought one of the benefits of MCTS was that, if you already have your value network, then a general MCTS implementation can walk the tree of values created by that network. And so no special update to the model is necessary. But I'm probably wrong about this.

(also, it boosts my confidence to hear that even folks at DeepMind find MCTS difficult to implement :D Because I tried to implement a simple MCTS a few years back for a very small toy project. I was following a step-by-step explanation of how it worked, and even still, it was super difficult, and very prone to subtle bugs)


Ah, well you could use a standard value network, but it'd end up really slow, so you probably want to train a smaller one and rely on the implicit ensembling that MCTS does to make it better.

In my experience, PUCT does a lot better than UCT, so you want to also have a prior network.

You don’t have to train a new network, but in my experience, it works much better. I haven’t spent a ton of time using off the shelf networks with MCTS though. Maybe it works great.

Very subtle bugs are the MCTS experience, particularly once parallelism is involved.
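For reference, a minimal sketch of the two selection rules mentioned above (the standard UCT formula and the AlphaZero-style PUCT formula, simplified; not any particular codebase):

    import math

    # UCT: explore children with few visits, using only observed values.
    def uct_score(total_value, child_visits, parent_visits, c=1.4):
        if child_visits == 0:
            return float("inf")
        return total_value / child_visits + c * math.sqrt(math.log(parent_visits) / child_visits)

    # PUCT (AlphaZero-style): exploration is additionally weighted by a prior
    # probability from a policy network, which is why a prior network helps.
    def puct_score(mean_value, prior, child_visits, parent_visits, c=1.5):
        return mean_value + c * prior * math.sqrt(parent_visits) / (1 + child_visits)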


really interesting! thanks for the info!


Doesn't MCTS imply that you'd have to generate a whole tree of tokens? Instead of maybe a 200 token response, you'd have to generate several thousand tokens as you explore the tree?


I don't have a good scrollwheel, not easy to browse the site. :(


Spacebar worked pretty well for me.


How expensive will this be?

100T models on one chip with MCTS search.

That is some impressive marketing.

I’ll believe it when I see it.

Great to see so many hardware startups.

Future is deffo accelerated neural nets on hardware.


Given that you believe the transformer is the future, this could flip the state of latency & cost to run these models overnight.


Nonfunctional requirement: Decepticon logo in the chip art. It can't hurt and always adds 10 HP.


Wake me up when I can buy GPT-4 as a dedicated chip etch to use as a realtime personal copilot.


What about Transformers on FPGAs?


AI models on FPGAs have been tried before, for instance: https://www.microsoft.com/en-us/research/project/project-cat....

They haven't been able to compete with GPUs on perf/watt. In general you end up just designing some AI accelerator for the FPGA (because the models are too big to map onto a single device all at once), but it's hard to beat purpose-built tensor and vector HW on a GPU when you're running soft logic.


FPGAs are designed to fight latency as much as possible. To do this, they have networks of switches to shuttle bits across the chip and keep delays to the bare minimum, in order for synchronous logic to be able to run at the highest possible clock rates for signals that traverse the entire chip.

To meet this goal, there's a huge amount of effort required to compile a program written in Verilog, VHDL, etc. into a set of bits that can be used to program all of the switching logic and lookup tables in the chip. I'm led to believe it can sometimes take a day or more per compile.

The second factor optimized for in FPGAs is utilization, trying to use 100% of the available resources of the chip. This is never achieved in practice.

Because everything is optimized for speed, it's not very power efficient.

---

Generally, FPGAs aren't the right architecture for neural networks. If you could load all of the weights into the LUTs and leave them there, you'd get the type of speedups you want, but FPGAs at that scale just don't exist.


> I'm led to believe it can sometimes take a day or more per compile.

This is true and misleading at the same time. Filling a large FPGA takes time, but if you are working with a small FPGA the turnaround time can be 15 minutes.



