From experience in payments/spending forecasting, I've found that deep learning generally underperforms gradient-boosted tree models. Deep learning models tend to be good at learning seasonality but don't handle complex trends or shocks very well. Economic/financial data tends to have straightforward seasonality with complex trends, so deep learning tends to do quite poorly.
I do agree with this paper - all of the good deep learning time series architectures I've tried are simple extensions of MLPs or RNNs (e.g. DeepAR or N-BEATS). The transformer-based architectures I've used have been absolutely awful, especially the endless stream of transformer-based "foundational models" that are coming out these days.
Transformers are just MLPs with extra steps, so in theory they should be just as powerful. The problem with transformers is simultaneously their big advantage: they scale extremely well with larger networks and more training data - better than any other architecture out there. So if you had enormous datasets and an unlimited compute budget, you could probably do amazing things here as well. But if you're just a mortal data scientist without extra funding, you will be better off with more traditional approaches.
I think what you say is true when comparing transformers to CNNs/RNNs, but not to MLPs.
Transformers, RNNs, and CNNs are all techniques to reduce parameter count compared to a pure-MLP model. If you took a transformer and replaced each self-attention layer with a linear layer plus activation function, you'd have a pure-MLP model that can represent every relationship the transformer does, plus many it can't - at the cost of vastly more parameters. MLPs are more expressive, but transformers are more parameter-efficient.
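To make that concrete, here's a rough back-of-the-envelope sketch in PyTorch with made-up sizes (the numbers aren't from any real model), comparing a single self-attention layer against a linear layer that mixes the whole flattened sequence at once:

    import torch.nn as nn

    seq_len, d_model = 512, 256    # hypothetical sizes, just for illustration

    # Self-attention reuses the same Q/K/V/output projections at every position,
    # so its parameter count depends only on d_model, not on sequence length.
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)

    # A "mix everything with everything" linear layer over the flattened
    # sequence grows with (seq_len * d_model)^2.
    mlp_mix = nn.Linear(seq_len * d_model, seq_len * d_model)

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(f"self-attention: {count(attn):,}")   # ~4 * d_model^2, roughly 263k
    print(f"full MLP mix:   {count(mlp_mix):,}")  # ~(seq_len * d_model)^2, roughly 17B

Same idea as the comment above: the pure-MLP version can express strictly more, but the parameter bill is enormous.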
Compared to MLPs, transformers save on parameter count by skimping on the number of parameters devoted to modeling the relationships between tokens. This works in language modeling, where relationships between tokens aren't that important - you can jumble up the words in this sentence and it still mostly makes sense. It doesn't work in time series, where the relationships between tokens (timesteps) are the most important thing of all. The LTSF paper linked in the OP paper also mentions this same problem: https://arxiv.org/pdf/2205.13504 (see section 1)
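The simple linear baselines in that LTSF paper are a good illustration of why explicit timestep-to-timestep weights matter. A rough sketch in that spirit (not their actual code; the shapes and names are my own): the entire "model" is one linear map from lookback timesteps to forecast timesteps, so every (input step, output step) pair gets its own dedicated weight.

    import torch
    import torch.nn as nn

    class LinearForecaster(nn.Module):
        def __init__(self, lookback: int, horizon: int):
            super().__init__()
            # weight[i, j] directly encodes how input timestep j drives forecast step i
            self.proj = nn.Linear(lookback, horizon)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, lookback, n_series) -> (batch, horizon, n_series)
            return self.proj(x.transpose(1, 2)).transpose(1, 2)

    model = LinearForecaster(lookback=96, horizon=24)
    y_hat = model(torch.randn(8, 96, 3))   # forecasts with shape (8, 24, 3)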
Though I agree with the idea that MLPs are theoretically more "capable" than transformers, I think seeing transformers just as a parameter-reduction technique is excessively reductive.
Many people tried building deep, large MLPs for a long time, but at some point adding more parameters stopped improving performance.
In contrast, transformers became so popular because their modelling power just kept scaling with more data and more parameters. It seems like the "restriction" imposed on transformers (the attention structure) is a very good functional form for modelling language (and, increasingly, some tasks in vision and audio).
They did not become popular because they were modest with respect to the parameters used.
>Compared to MLPs, transformers save on parameter count by skimping on the number of parameters
That is only correct if you compare models with equal parameter counts from a purely theoretical perspective. In practice, it is possible to train transformers at orders-of-magnitude bigger scales than MLPs because they are so much more efficient. That's why I said a modern transformer will easily beat these puny modern MLPs, but only where data and compute budgets allow it. That is not even a question. If you look at recent time series forecasting leaderboards, you'll almost always see transformers at or near the top: https://github.com/thuml/Time-Series-Library
Transformers also reduce the number of relationships between tokens that must be learned. An MLP has to separately learn the relationship between every pair of positions - token 1 and 2, 2 and 3, 3 and 4, and so on. A transformer can learn relationships between specific values regardless of position.
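A toy sketch of that difference (PyTorch, hypothetical dimensions): the same two small projection matrices are reused at every timestep, and the pairwise "relationships" fall out of dot products between contents rather than being stored as per-position parameters.

    import torch
    import torch.nn as nn

    d = 16
    q_proj, k_proj = nn.Linear(d, d), nn.Linear(d, d)   # shared across all positions

    x = torch.randn(10, d)                       # 10 timesteps of features
    scores = q_proj(x) @ k_proj(x).T / d ** 0.5  # (10, 10) pairwise relevance
    attn = scores.softmax(dim=-1)                # no weight is tied to a specific (i, j) pair

An MLP over the flattened sequence would instead need a separate weight for every (position i, position j) pair to express the same thing.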