sapphire42's comments | Hacker News


The comment you're replying to is 100% AI-generated. How does obviously LLM-generated content continually make it to the front of HN, and why in God's name are you being downvoted for calling this out??

"...a fascinating approach..." (LLMs think everything is fascinating)

"...they're essentially having a generalist learn from a committee of specialists..." (analogies, analogies)

"...where APIs are undocumented, partial failures are common, and user input is full of ambiguity..." (typical AI rule of three template with semantically similar parameters that contribute nothing to the overall meaning)


It does worry me how defensive people can become over really obvious slop - I don't think I'm even particularly attuned to the style of LLM writing but it is incredibly obvious every time I see it. It's only going to get worse I think.


> and why in God's name are you being downvoted for calling this out??

Tinfoil hat time, but perhaps the bots don't like being called out? I don't actually take that statement seriously, but it seems like an eventual avenue. They've long been seeding threads on Reddit to shape the initial hive mind, and I imagine that's going to get more advanced and widespread.


Worshipping the elite won't make you become one of them...


Very thoughtful contribution, thank you.

I'll just say this: make the algorithm work for you.


I've been using a Sonim XP3 flip phone since February 2025 and I love it. It's so freeing not to have social media accessible at all times. I've downloaded all 850 songs on my Spotify playlist as .mp3 files and play them using the built-in music app. When I need to navigate somewhere, I drive there from memory or consult the atlas of my city that I keep in the passenger door of my car. I've gotten pretty good at T9 predictive text typing and can text people at about half the speed that I would on a smartphone.

I don't like modern smartphones precisely because of their so-called conveniences. Because they're so easy to access, we're pushed into delegating to them as if they're a part of ourselves. If you have a smartphone, you'll never learn the streets of your city, because it's easier to use GPS all the time. You'll never get good at mental math, because you can just use your calculator. You'll remember fewer things, because if you ever need to know something you can just take out your phone and Google it (this is an actual psychological phenomenon). And because social media is just a couple taps away, you'll spend hours every day trapped in an addictive algorithmic hell that leaves you bored and dissatisfied. Smartphones turn us into shells of ourselves, no longer living our own lives because it's easier not to.

Getting a flip phone doesn't make doing the things you used to do impossible. If you really want to do something that requires a smartphone, you can get a friend to do it for you, or take out your laptop. Everything is still possible, it's just a little bit more inconvenient, and that feeling of inconvenience, that tiny barrier to entry that smartphones do everything to eliminate, is what pushes your brain to be human, to learn how to do things so you don't have to rely on a device, to spend less time on social media.


This is a great story, but why does content that is clearly LLM-generated continually make it to the HN front page?


Sir this is a subReddit


This should not be downvoted :) lighten up you guys!


If you read on the news that a sealed cave with ancient symbols of death and destruction had been discovered in the New Mexico desert, what's the first thing you'd expect us to do?

There is no defense against human curiosity :)


https://www.smbc-comics.com/comic/rite-on

It seems like a related phenomenon to "there's an XKCD about that"


And of course there is an XKCD:

https://xkcd.com/3003/


Democracy dies when voters elect a candidate who tried to overthrow the democratic system before, and promises to do it again.


At least those folks went after the government instead of smashing the windows of every business in my city. But that event, despite happening during the "worst global pandemic in 100 years", somehow got a free pass. It was labeled "the summer of love".

This country was founded on distrust of government and rebellion. It was not founded on bashing your neighbors' windows.

Those people who stormed the capitol put the fear of god into a bunch of politicians. Good for them.

…the people who set fire to neighborhood buildings… not so sure about that one.


Exactly.


Based on voting patterns, I think that to many Americans today, the main claim of the J6ers (that there were some fraudulent ballots in the 2020 election) is looking more likely, not less. If anything were to come out, the J6ers would become freedom fighters, just as they apparently already are in the hearts of many Americans. Like it or not, perception is how you win an election.

On the other hand, the Democrats have tried politically inspired prosecutions, and selected a nominee while ignoring the party writ large.

Anyway, the simple truth is that Americans worried about democracy went to Trump by large margins. Consider that.


Democracy always dies by democracy, though. That's the fundamental flaw.


Does this not get exhausting?


You call a tax fraud investigation "needlessly harassing the business of sovereign individuals"?

You're right, the U.S. is very business and industry friendly, which explains why the American people have been getting poorer and poorer while corporations and their shareholders get richer and richer.


The average American is getting richer and richer, especially compared to European wage stagnation. I don't think non-Americans understand how rich middle-class Americans are.


Only 50% of Americans have passports. 80% in Europe.

We understand it, because we travel there on our (paid) vacations.

The last time I was in America I went to San Francisco: lots of money, but I have not seen a worse place to live in my life.

Americans don't understand the quality of life other places have.


And other places don't understand the sheer amount of wealth and freedom Americans have. "It costs a lot of money to look this cheap", and San Francisco spends billions to look that cheap, btw.

Especially Europe. Europeans don't even seem to understand how large America is, which is why so few Americans have passports.


America, the land of the free... the place where I could not even walk at night because the streets were full of homeless people.


Just carry a gun.

Self-defense is an absolute right, unlike in other places.


Carrying weapons for self-defense was a thing people had to do around here in the Middle Ages.


And in a free society the people themselves are the enforcers and protectors. That's why we even have a jury system in place.

In America the people are responsible, not the government or their enlightened bureaucrats.




Take a walk around San Francisco and then say poverty isn't a problem.


Americans have been getting richer and richer according to the most recent studies. At the same time, other socialist countries are getting, on average, poorer. I think you may be confusing social inequality with overall wealth.


As someone who has worked in this space, this paper is unfortunately total BS.

Their claimed theoretical advancement is as follows. If you want to transform an input vector X to another vector Y of different dimension, "normal" people suggest using a linear projection: create an appropriately sized matrix W and simply multiply it by your input:

Given X ∈ ℝ^(d_in) and W ∈ ℝ^(d_in × d_out), then Y = X @ W ∈ ℝ^(d_out).

In the attention layer, where the input X is converted into queries Q, keys K, and values V, this is the simple strategy employed: Q = X @ W_q, K = X @ W_k, V = X @ W_v, and it has shown itself to be effective.
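
To make the baseline concrete, here is a minimal PyTorch sketch of that vanilla projection (the dimensions and variable names are placeholders of mine, not the paper's notation):

    import torch

    d_model, d_head, n_ctx = 768, 64, 128
    X = torch.randn(n_ctx, d_model)            # token representations
    W_q = torch.randn(d_model, d_head) * 0.02
    W_k = torch.randn(d_model, d_head) * 0.02
    W_v = torch.randn(d_model, d_head) * 0.02

    Q = X @ W_q   # one matmul per projection, no non-linearity
    K = X @ W_k
    V = X @ W_v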

This is too simple for the authors of this paper. They propose another approach. Instead of converting directly to the desired dimension, we will increase computation by creating an intermediate dimension, and introducing a non-linearity between them.

Given X ∈ ℝ^(d_in), W_1 ∈ ℝ^(d_in × d_tmp), and W_2 ∈ ℝ^(d_tmp × d_out), then Y = f(X @ W_1) @ W_2 ∈ ℝ^(d_out).

Here, f can be any non-linearity. The authors choose softmax; it allows them to claim a superficial resemblance to attention. Later in the paper, they reveal it is not actually softmax, but a modified version to avoid gradient vanishing (softmax is not a very good general-purpose non-linearity).

So, they replace all projections in the attention layer with this new strategy: Q = f(X @ W_q1) @ W_q2, K = f(X @ W_k1) @ W_k2, and V = f(X @ W_v1) @ W_v2.
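
A rough sketch of that replacement as I read it, again in PyTorch, using a plain softmax as f for illustration (the paper's actual non-linearity is the modified softmax mentioned above, and the dimensions are placeholders of mine):

    import torch
    import torch.nn.functional as F

    d_model, d_tmp, d_head, n_ctx = 768, 512, 64, 128
    X = torch.randn(n_ctx, d_model)

    W_q1 = torch.randn(d_model, d_tmp) * 0.02
    W_q2 = torch.randn(d_tmp, d_head) * 0.02

    # Two matmuls plus a non-linearity where the vanilla Transformer uses one matmul.
    Q = F.softmax(X @ W_q1, dim=-1) @ W_q2
    # K and V are produced the same way with their own W_*1 / W_*2 pairs.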

The problem with this is not theoretical: this does increase the model's expressiveness and computational power. It is practical: we are adding parameters where we need them the least, in the attention layer. It is generally understood that LLMs do not need extra parameters in the attention layer. In fact, advancements like Grouped-Query Attention hinge on the idea that you can halve or even quarter the number of parameters in the attention layer without harming performance. The experience of the LLM community so far suggests that the authors' idea of adding even more parameters to the self-attention layer should degrade their models' performance while adding no tangible gain.
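
To illustrate the GQA point with a sketch (placeholder dimensions of mine, not any particular model's): several query heads share one key/value head, so the K/V projection matrices shrink by the group factor.

    import torch

    d_model, d_head, n_ctx = 768, 64, 128
    n_q_heads, n_kv_heads = 12, 3               # 4 query heads share each K/V head

    X = torch.randn(n_ctx, d_model)
    W_q = torch.randn(d_model, n_q_heads * d_head) * 0.02    # full set of query heads
    W_k = torch.randn(d_model, n_kv_heads * d_head) * 0.02   # 4x fewer K parameters
    W_v = torch.randn(d_model, n_kv_heads * d_head) * 0.02   # 4x fewer V parameters

    Q = (X @ W_q).view(n_ctx, n_q_heads, d_head)
    K = (X @ W_k).view(n_ctx, n_kv_heads, d_head)
    V = (X @ W_v).view(n_ctx, n_kv_heads, d_head)
    # At attention time, each group of 4 query heads reads the same shared K/V head.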

The authors' numbers say otherwise. But it is hard to trust their numbers. When training a Transformer to compare against, they replicate the original GPT-2 proposed in 2019. In doing so they ignore years of architectural improvements, such as rotary positional embeddings, SwiGLU, and RMSNorm, which have culminated in Transformer++, the strong recipe used by Meta's Llama series. We've seen this time after time in the various "Transformer killers" that used to be popular about a year ago. A researcher would think up some novel variant of linear attention, furiously test it against a weak GPT-2 baseline, find that it blew the baseline out of the water, and declare victory. Somehow, these never caught on, because when tested against a newer baseline these models weren't actually that great. The authors are doing the same thing here.

In their tables they also include comparisons to other models. Actually, they exclusively select older open-source suites: GPT-Neo, OPT, and Pythia. These models were not trained with any modern architectural improvements beyond rotary embeddings, and so predictably TokenFormer crushes them. On the last page of the appendix the authors have included a full table with some fairer comparisons. Their TokenFormer-150M variant achieves a Pile ppl of 10.45 against Mamba-130M's 10.54. In the intermediate weight class, TokenFormer-450M matches Mamba-370M's 8.28 Pile ppl despite having 21% more parameters. And in the largest size, TokenFormer-1.5B loses to Mamba-1.4B, 6.91 to 6.80 ppl.

Overall, the architectural tweak proposed in this paper is impractical, and the few fair comparisons they include are unimpressive. TokenFormer is another in a long line of Transformer-killers that have nice graphs of cherry-picked data, and will similarly fade into obscurity.


Totally agree. It doesn't make any sense to use linear(softmax(linear(x))) to replace linear(x) while claiming to be more explainable and more scalable.


I feel like you fundamentally misunderstood the paper. It's not only the attention weights; the weights in the MLP layer that follows each attention layer are also generated based on the methodology they describe.


Yes, and this results in the MLP layer being functionally unchanged. In the vanilla GPT-2 Transformer, the MLP layer is defined as a 4x up-projection, then a non-linearity, followed by a 4x down-projection. This can be understood as a specific case of their method, as they describe here:

> The number of key-value parameter pairs in both the query-key-value and output projections corresponds directly to the hidden dimension. In contrast, the FFN module utilizes four times the number of parameter pairs relative to the hidden size.

Here is the original FFN as described in GPT-2:

y = GELU(x @ W_u) @ W_d

And here is their FFN, when understood as a special case of their "Attention":

y = modified_softmax(x @ W_k) @ W_v

You can name the matrices whatever you want, but the grand enhancement that the authors make to the FFN is just replacing the GELU with a different non-linearity. Shazeer already conducted extensive empirical tests of different non-linearities for the FFN layer in 2020. Among the best performers was SwiGLU, which is used in Llama today. Unsurprisingly, a modified softmax did not make the cut.
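
Side by side, as a sketch (the weight names and dimensions are mine, following the 4x convention described above; a plain softmax stands in for their modified version):

    import torch
    import torch.nn.functional as F

    d_model, n_ctx = 768, 128
    x = torch.randn(n_ctx, d_model)

    W_up = torch.randn(d_model, 4 * d_model) * 0.02     # 4x up-projection
    W_down = torch.randn(4 * d_model, d_model) * 0.02   # 4x down-projection

    # Vanilla GPT-2 FFN
    y_gpt2 = F.gelu(x @ W_up) @ W_down

    # The paper's FFN, viewed as the same computation with a different non-linearity
    y_tokenformer = F.softmax(x @ W_up, dim=-1) @ W_down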

Again, if the changes in this paper were truly a step forward instead of a mindless scrambling of architecture in an effort to achieve something publishable, it would show in the results. Instead, as you can see in their appendix, TokenFormer is on-par or loses in fair comparisons to other models.


Isn't their main claim the ability to gradually increase the number of weights and save on total training costs, rather than just the expressiveness/efficiency of the architecture?


They do actually make several claims as to the efficiency of the architecture compared to the Transformer, as you can see by the many graphs throughout the document. Their claim that their architecture is the only one that allows for gradually increasing the number of weights is a prominent one too, though, so I'll explain why I don't find that claim credible.

The idea of gradually increasing the size of a Transformer to save on training costs is not a novel one, and researchers have explored ideas to this effect almost since the Transformer's inception. There are many ways to do it. We can start with a small number of layers, and then add in more initialized to the identity. We can keep the number of layers constant, start with a small width, and then increase the width throughout training, initializing the extra weights to zero. We can reformulate all weight matrices as LoRAs and start with a tiny rank, then slowly increase the rank until we reach a full-rank equivalent. Or we can use two or three of these strategies and mix them any way we want.
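
As one concrete example of the width-growth variant (a sketch of the general idea, not any specific paper's recipe): zero-initializing the new rows and columns keeps the widened layer functionally identical on the old features, so training can simply continue at the larger size.

    import torch

    def grow_linear(W, new_in, new_out):
        # Widen a weight matrix, zero-initializing the new slice so the
        # enlarged layer initially behaves identically on the old features.
        old_in, old_out = W.shape
        W_new = torch.zeros(new_in, new_out)
        W_new[:old_in, :old_out] = W
        return W_new

    W = torch.randn(256, 256) * 0.02
    W = grow_linear(W, 384, 384)   # continue training at the larger width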

The performance of the resultant model is entirely dependent on what strategies you use, and how you mix them: whether you choose to increase width, depth, or rank all at once, one at a time, or somewhere in-between, and whether you increase those values linearly, exponentially, or by some new function you just thought of. Because there are so many ways to gradually increase the size of a Transformer, when you think of a new way, you've got to pick a strong baseline to compare against.

The authors choose the baseline Net2Net (2015). That paper, written two years before the invention of the Transformer, regrettably does not include pre-trained Transformer results for the authors to compare against. So, the authors train their own Net2Net model, and provide a couple of nice graphs (Figures 6 and 7) where the TokenFormer loss curve sits under the Net2Net Transformer's for the entirety of training. They provide no details of the training setup that produced these graphs: the model size, layer count, and width are all missing, as well as basic hyperparameters like the learning rate and batch size. They train on enwik8 (100MB) and seem to repeat data: near the end, the TokenFormer reaches sub-0.5 perplexity levels, an impossible result for English text that a language model has never seen before, given any reasonable entropy estimate.

Why choose this strange, home-grown baseline, reliant on a method developed in 2015, to compare against? Why not at least use a method tuned specifically for the Transformer (such as [1](https://arxiv.org/abs/2203.06211), [2](https://arxiv.org/abs/2401.02415), or [3](https://arxiv.org/abs/2309.03852), to name a few)? If their progressive scaling method is truly better, it would only benefit from comparison against a strong baseline.

The authors' progressive scaling method is an idea that has been explored many times by other ML researchers. Their method in particular is compared against a weak baseline with no concrete details other than the loss graphs. In my humble opinion, it's merely an effort to shoehorn a claim of novelty into a paper that has none.

