Why train when you can optimize?

steppi · on April 30, 2022

Neural networks can approximate any function, but that doesn’t mean they do so efficiently. Depending on the function, they can require incredible amounts of neurons and training. At their worst, they devolve into a lookup table. It’s not hard to find these examples either. Just try training a neural network to compute sin(x)!

This is possible! One of the cool things about neural networks is that you can try to encode prior understanding into either the structure of the network or choice of activation function. See the paper Neural Networks Fail to Learn Periodic Functions and How to Fix It by Ziwin, Hartweg, and Uweda. Where they propose the activation function f(x) = x + sin(x)^2 that can encode an understanding that the underlying function should be periodic.

[1] https://proceedings.neurips.cc/paper/2020/file/1160453108d3e...

PheonixPharts · on April 30, 2022

> Just try training a neural network to compute sin(x)!

This is also the classic demo case for LSTMs, I have a notebook open right now that as a LSTM learning the sine function quite well with 32 dimensional state vector.

However the authors point still stands. Neural Networks are not great at computing arbitrary functions where the output is an unbounded real number. The sine function is still limited to range that is bounded to [0,1].

A better example of where NNs really fail is when trying to learn tricky to implement the normal quantile function (inverse CDF).

This would be an excellent place for function approximation because

a.) generating training data is easy (just run random numbers into the CDF and reverse these arguments into a NN)

b.) manually writing the quantile function from scratch is a pain since it involves the inverse error function which is very annoying to implement from scratch.

You can learn the standard quantile function for mean=0, sd=1, however if you try to generalize this to taking not only the desired quantile but an arbitrary mean and standard deviation you will not learn anything useful.

It's a bit of a shame that neural networks are weak in this area because it would be incredible to have a good tool to approximate inverse functions in general. The fact that we almost never see neural networks being used as a tool for this type of work is evidence of this limitation.

In general if you're problem can't be modeled where the output is some vector of probabilities it's not a great fit for NNs.

steppi · on April 30, 2022

It's a bit of a shame that neural networks are weak in this area because it would be incredible to have a good tool to approximate inverse functions in general. The fact that we almost never see neural networks being used as a tool for this type of work is evidence of this limitation.

It's funny you say. I haven't actually used NNs much in my research but my background is in math and in my spare time I'm a maintainer for SciPy working on special and statistical functions, often the exact kind of stuff you mentioned.

I might take up your challenge and write a blog post about it or something if I get any success.

Anyway, I wasn’t trying to invalidate the author’s general point, just point out a fun fact about his example.

PheonixPharts · on April 30, 2022

> I might take up your challenge and write a blog post about it or something if I get any success.

Success or failure, I would really enjoy seeing that write up! I would be even more excited to be proven wrong.

The promise of "universal function approximator" is very temping. My personal dream would be to have it so one could essentially run scipy in reverse and learn the entire library with a NN. Even in the case I gave, the idea that you could learn an arbitrary quantile function means you could also arbitrarily learn a sampler for any distribution, since all you have to do compose a uniform sampler with whatever quantile function you learned.

Of course for this example I'm using "solved" using a similar approach with variational inference (pyro has a write up on it: https://pyro.ai/examples/svi_part_i.html, you might find David Blei's "Variational Inference: A Review for Statisticians" useful as well https://arxiv.org/abs/1601.00670)

steppi · on April 30, 2022

What a coincidence. Pyro has basically become my go to for Bayesian learning and I know those docs in and out. I’ve also skimmed that David Blei article before. Small world. I’m not very confident of success but it seems like an interesting path to explore.

mumblemumble · on April 30, 2022

> LSTM learning the sine function quite well with 32 dimensional state vector.

I'd say that, from "LSTM" and "32 dimensional state vector" alone, the author's point still stands. That may not be a huge network by contemporary deep learning standards, but it's still a pretty darned expensive way to compute sine. That's 100% in line with the original assertion that ANNs can learn any function, but they can't necessarily do so efficiently.

Der_Einzige · on April 30, 2022

Is it that neural networks are weak or is it that gradient descent/backprop doesn't do well for finding good local solutions for these kinds of problems?

steppi · on May 1, 2022

My understanding is that it's because extrapolation is difficult in general. Extrapolating the behavior of a periodic function like sine is one thing, but extrapolating the behavior of a 1d function that tends to infinity is another challenge altogether. I happen to be very familiar with the implementation of the inverse CDF of the normal distribution though. My plan is to cook a lot of prior knowledge into the NN. Even if I get it working, the end result will likely seem more like sleight of hand than something truly impressive.

zarzavat · on April 30, 2022

It’s hard to walk if you can’t feel your legs.

The problem is improperly evaluating the neural network as a function of time, instead of evaluating the network as a function of previous state.

When we humans approximate functions (let’s say you’re drawing it on a piece of paper, or waving your arm around) we do not simply look at a clock and feed forward that information directly into our motor neurons. Rather, we have sensory neurons that feed in the current state of the function we are approximating as an input, then approximating the next output of a periodic function becomes trivial.

It’s very easy to train a neural network to look at a piece of a sine wave and predict what the next value should be - the fact that the function is periodic actually helps you.

dahart · on April 30, 2022

> It’s very easy to train a neural network to look at a piece of a sine wave and predict what the next value should be - the fact that the function is periodic actually helps you.

I’m not sure what you mean. Are you talking about training a network to predict cos(x)? In this case nothing at all changes by training on the derivative.

Or do you mean train the net to take as input sin(t) and produce sin(t+0.01)? The problem with this is that any given value of sin(t) has 2 answers for sin(t+0.01), therefore your optimizer is going to spit out 0 as the answer. Plus, this is a different problem entirely, and you lose the ability to infer sin(t) based on t. It doesn’t answer the same question.

Your suggestion is further going to be seriously confounded if the periodic function is more complex, say, the sum of several sin waves. There can be an arbitrary number of values y that correctly match f(x).

You should try it before making assumptions that it’s easy, see what it takes to train a net to predict sin(x), without embedding knowledge of the fact that sin is periodic.

> The problem is improperly evaluating the neural network as a function of time, instead of evaluating the network as a function of state.

This also sounds like an assumption to me that somehow the entire world of research has failed to consider the most obvious of ideas. The point of both optimizers and neural networks is that they can be black boxes, right? It doesn’t matter at all whether the input is time based or position based or a function of money. The network, in theory, can learn any function, time or otherwise, and there’s nothing special about time.

But, neural networks function better with domain knowledge. When you know the function domain is time and that the output is periodic, you can do things to make a network easier to train, like using a periodic activation function.

As a side note, RNNs explicitly model an NN based on previous state. Also all layered NNs can be viewed as a series of smaller nets that feed state to the next net. In some sense, NNs always evaluate as a function of state.

omrjml · on April 30, 2022

Periodic function like sin(x) are not a dynamical system so its previous state does not determine the current state. So it should be approximated in that way.

doubleunplussed · on April 30, 2022

Periodic functions like sin(x) are the solutions to differential equations like dy/dx = -y that describe for example, oscillations of springs, to name but one of an extremely large number of dynamical systems that behave this way.

omrjml · on April 30, 2022

Of course sine can appear in the solution for dynamical systems but the function itself is not dynamical. When evaluating sin(x) you do not need to know about the previous state.

zmgsabst · on April 30, 2022

That is true, the problem is with the conclusion you drew from that fact:

> So it should not be approximated in that way.

We can’t conclude a dynamic approximation is a bad approach based purely on the fact the underlying function isn’t dynamic.

The function might nevertheless be easily approximated via dynamics — as in the case of predicting sine from seeing the recent history.

freemint · on April 30, 2022

sin(x) only arises as solution in second order systems. So d^2x/dt^2 = -x

doubleunplussed · on May 1, 2022

Sorry, you're absolutely correct. Brain fart. The equation I wrote actually is for an exponential, whoops.

em500 · on April 30, 2022

It's completely valid and equivalent to formulate trigonometric functions as a dynamic system (x,y-coordinates as you go around the circle using a rotation matrix).

vidarh · on May 1, 2022

It sounds like you're trying to say something different than the comment you replied to. sin(x) for a range of x is trivial to draw by looking at earlier parts of the curve to determine what to draw next. Back in the slow 1980's home computer days, on machines without floating point and multiply/divide, we quickly got used to approximate sine waves numerically by simply typing out sequences of integers with a rough idea of the wave it'd produce. If you gave me a subset of such a sequence and asked me to complete it, I wouldn't need much of it to assume you were wanting to approximate sin.

omrjml · on April 30, 2022

Edit: should not be approximated

em500 · on April 30, 2022

I have the creeping feeling that soon the field is going to reinvent all of Fourier analysis from scratch.

orbifold · on April 30, 2022

It will be called „Neural Harmonic Analysis“, it will cite one paper by a Russian and otherwise ignore any prior work which didn’t include the word neural network.

srean · on April 30, 2022

Its sad that this is not far from the truth

guidopallemans · on April 30, 2022

wavenet and transformer models already come dangerously close

SemanticStrengh · on April 30, 2022

And then your NN can't represent anything else than periodic functions.. If we had to build separate programs for each product requirements variations.. programming would not be viable. More generally neural networks can't even imitate a dumb calculator without throwing absurd errors despite the rules of calculus being trivial and well defined. And matching a calculator is a task order of magnitudes easier than the semantic Causal reasoning abilities of human NLU that is involved in argumentation, inferences and understanding. But people are pathetically fooled by the fallacy of it being a uNiVeRsAl aPpRoXiMaTor.

viraptor · on April 30, 2022

> And then your NN can't represent anything else than periodic functions

Significant parts of x+sin are close to linear. You don't need to use this activation as the result layer either. Why would we lose anything?

> If we had to build separate programs for each product requirements variations

We pretty much do? We use both extremely generic frameworks both in technical sense (.net) and organisational (sap). But we also have software written to specific requirements where needed (there's millions of very specific ways to invoice someone, companies get invoicing platforms written just for them from scratch). There's space for both approaches.

telchar · on April 30, 2022

And yet neural networks can solve symbolic integral and derivative problems and differential equations better than other computer algebra programs. Sure one network might fail to compute sin(x) numerically, but another could easily tell you its derivative is cos(x). Turns out they are pretty flexible. Do they need to do everything?

ABeeSea · on April 30, 2022

Source? I would be very surprised if there was a neural symbolic PDE solver better than what’s in wolfram mathematica.

SemanticStrengh · on April 30, 2022

Related https://arxiv.org/abs/1806.07366

mhh__ · on April 30, 2022

At the cost of how many parameters?

PartiallyTyped · on April 30, 2022

You are wrong.

https://www.vincentsitzmann.com/siren

SemanticStrengh · on April 30, 2022

? I did not say that NNs can't represent periodic functions. Show me a NN calculator with >99% accuracy if you wanna refute me.

PartiallyTyped · on April 30, 2022

> and then your NN can't represent anything else than periodic functions..

This is a quote from the parent comment where you state that NNs with periodic activations can't represent non periodic functions.

Siren (what I linked) uses periodic activations and is able to represent non periodic functions.

SemanticStrengh · on April 30, 2022

Oh OK, thanks yes that's what I was looking for! However if this NN can represent non Periodic functions, what cost does the Periodic function support incur on non Periodic function accuracy, if any?

PartiallyTyped · on April 30, 2022

It actually behaves a lot better than non periodic functions in some scenarios. It seems that period functions enable a network to encode positional information within [-1,1]. This is supported by the fact that positional encoding is needed in transformers to make them work.

lukaszwojtow · on April 30, 2022

Shameless plug as I'm the author: Primeclue uses math functions to express models and one of its function is sine. Like here: https://github.com/lukaszwojtow/primeclue/blob/68e3b4c8e9f1a...

whatshisface · on April 30, 2022

If we're allowed to do that, why can't my activation function be sin(x)?

yobbo · on April 30, 2022

It can, but sin(x) has infinite number of extremes, and the gradients will vanish at those points. Activations will get stuck at 1 and -1 (x=π/2, 3π/2, ...). They set x+(1/a)*sin²(x) to be monotonic, which fixes this.

Or you need to optimize without using gradients.

blackcat201 · on April 30, 2022

You can use sin as activation function, but that would require careful initialization to avoid gradient explosion as you would ended up with a lot of points where gradient is simply zero. You can refer to Implicit Neural Representations with Periodic Activation Functions for more details.

zarzavat · on April 30, 2022

It works perfectly if you don’t have any parameters.

PartiallyTyped · on April 30, 2022

You can actually do that.

https://www.vincentsitzmann.com/siren/

sicp-enjoyer · on April 30, 2022

I'm surprised to see almost no discussion of fourier series in that paper, considering fourier series is all about representing signals as linear combinations of sinusoidal functions.

PartiallyTyped · on April 30, 2022

You may be interested in [1] where they go to a great extend to show that the convolution operation that we consider in DL is the dual of fourier series [2].

[1] https://geometricdeeplearning.com

[2] https://arxiv.org/pdf/2104.13478.pdf page 27 (23 if you count book pages).

amelius · on April 30, 2022

Is convolution in DL not implemented with the FFT as the underlying workhorse?

PartiallyTyped · on April 30, 2022

Probably no since FFT is slower and less parallelizable than products.

anoy8888 · on April 30, 2022

I like your article and went to your home page to find more good articles and I like what I saw. Thank you for sharing. The only thing is that I am reading from my phone and the site is not very mobile friendly

thomasahle · on May 2, 2022

While `x + sin(x)^2` may be monotonic itself, it only takes a simple linear combination of two neurons like `x+sin(x)^2 - (x/2 + sin(x/2)^2)` before you have a completely crazy loss landscape. I have a feeling this is why such activation functions haven't become standard.

taylorius · on April 30, 2022

A small shout-out to Differential Evolution, which is my go-to (derivative free) optimization algorithm. Think of a Genetic Algorithm, but the crossover operator is linear interpolation (DE's natural domain is real-valued vectors). It's simple, and in my experience, it works pretty well.

fho · on April 30, 2022

Scipy has a good implementation of that. I had some fun with it over the last couple of weeks. Just be aware that it runs a "polishing" optimization at the end which might improve the results, or just take forever :-)

Buttons840 · on April 30, 2022

I've always been interested in reinforcement learning, believing it will magically solve anything I throw at it. Unfortunately, I haven't got it to work everywhere yet. I hoped to learn a couple good RL algorithms and then never have to actually learn the concrete details of optimization, because RL can do almost as good. I don't truly believe this, but I think it is an underlying psychological reason for my love of RL. It's scary to think I might have to learn a lot of new algorithms to solve a new problem, and maybe I won't be able to solve it at all.

Any suggestions on the best resources for branching out from RL to more classical control and optimization?

cweill · on April 30, 2022

The Deepmind lead behind AlphaStar, there StarCraft 2 bot that beat several world champs, once said, "the best way to solve a reinforcement learning problem is with supervised learning". That's because supervised learning is relatively extremely efficient, and just RL with more constraints.

Just start learning the basics of supervised learning for classification and regression on common benchmark tasks like CIFAR and UCI. Apply a mix of linear models, neural networks, and trees like random forests and GBDTs. Next try convolutional networks for vision and transformers for NLP. You'll be all set to solve most real world problems.

venividivini · on April 30, 2022

I've been working on imitation learning recently and can attest that reducing RL to supervised learning gives remarkable results, even when using naive algorithms (like Behavioral Cloning).

That's an excellent quote by the way, do you perhaps have any source? Sounds like a good opener slide :)

stncls · on April 30, 2022

> I hoped to learn a couple good RL algorithms and then never have to actually learn the concrete details of optimization

I think that you can use optimization without having to learn anything about its algorithms (learning the basics is always advised of course, but that takes nowhere near as much effort). Nowadays, an off-the-shelf black-box solver will perform better than a custom implementation in most cases: what you lose in fine-tuning is more than compensated by the implementation quality and the sheer number of algorithms available.

> best resources for branching out from RL to more classical control and optimization

It depends on your problem. For general unstructured (nonconvex continuous) optimization, like the OP, you can use NLopt's documentation as a pragmatic starting point [1].

Optimal control is a whole different thing though, and you may have to start with some academic papers/books. I have been recommended this one [2], but I have not read it.

[1] https://nlopt.readthedocs.io/en/latest/

[2] https://link.springer.com/book/10.1007/978-1-4471-0967-9

AlexanderTheGr8 · on April 30, 2022

I feel the exact same way!! Throw RL at the wall for any problem, and it will magically solve it. In the back of my mind, I know that it's not true, but I always want to throw RL at any problem. Right now, I am considering throwing it to make TCP/other-networking-stuff more efficient...

taeric · on April 30, 2022

Isn't training literally an optimization of a loss function over training data?

srean · on April 30, 2022

It is and that is why I find the title a clickbait. The article isnt bad, but the title is certainly being coyly disingenuous.

In problems like this there are two aspects, (i) designing or specifying the search space of functions (ii) choosing the best function within the search space.

The opposite extremes are a) the search space contains only one function, the right function. In this case the training/optimization is moot. The other extreme is to have a very wide search space, say all smooth functions. In that case searching/training/optimizing is more challenging. The more reasonable example is one uses domain knowledge to design a much more restricted search space (for example, one may encode that the function is periodic with a known period) making the next step easier.

diffeomorphism · on April 30, 2022

Technically yes, but practically no. This is discussed in depth in section 11.

DeathArrow · on April 30, 2022

The author claims ML is overused and many problems that could be solved more effectively with optimizations are solved using ML.

If wonder what classes of problems fit this? For sure, I can't imagine how can you tackle sentiment analysis or text classifiers using optimization.

srean · on April 30, 2022

Supervised learning, which is one of the main type of ML is about learning from examples of input and desired output. In some cases it is easier to define the behavior of a function using such example pairs as opposed to specifying exactly what the learned function ought to do. This usually happens when we aren't certain what the right function value should be at every point in its input space . This is where ML should be used.

There are other functions, for example, searching, sorting etc where its easier to specify accurately what the desired function does compared to giving a list of pairs of examples. In such cases ML may not be the best choice. Note, reasonably accurate sorting functions can be learned from examples, but that's not the most efficient way to design a sorting function.

teruakohatu · on April 30, 2022

> tackle sentiment analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a large lookup table mapping words to a sentiment score, calculated by surveying people. It is simple to use and involves no ML.

PeterisP · on April 30, 2022

I don't consider VADER as suitable for tackling sentiment analysis but perhaps we have a different understanding of 'tackling sentiment analysis'. In my view, 'tackling sentiment analysis' requires doing sentiment analysis well, which no systems currently do.

Decent sentiment analysis is a yet unsolved problem and current state of art solutions (mostly based on large pretrained langugage models like BERT, leveraging transfer learning from very large unlabeled corpora) are not sufficiently accurate, and of course neither is VADER. As soon as you go beyond simple binary polarity (blatantly positive/negative sentiment) and something more informative like various five-way sentiment tasks, or aspect-based sentiment or determining sentiment targets, sentiment analysis still has a long, long way to go.

Current ML solutions are a possible way to tackle sentiment analysis because while the current models are not good enough, we have evidence showing that with increased model and data size they improve and perhaps might result in sufficiently good sentiment analysis someday. Perhaps not, and we'll need something else; but at least they have some potential.

VADER and similar methods are not a way to tackle sentiment analysis because not only the current VADER model is not nearly good enough, it's clear that they can't scale to anything much better than that. Adding extra surveys to increase the lookup table quickly hits diminishing returns (unlike ML systems where we consistently see that ever larger models continue to get improvements), and if they can't even reach the current bar of ML models (which still are not good enough) then they can't bring us to the level of sentiment analysis where we want to be, they are a dead end.

Tackling sentiment analysis also requires handling a wide variety of languages, not only English, which is yet another aspect where data-driven models have an advantage over systems that require extensive human labor for each new language.

I mean, we all (me included!) want to believe that human knowledge encoded in rules can work, intuitively it's a very appealing concept, however, it does not work out in practice and IMHO in the end we all have to learn to accept the Bitter Lesson (http://incompleteideas.net/IncIdeas/BitterLesson.html).

srean · on April 30, 2022

The article itself is good but the title is clickbait. The process of training is minimization (in other words, optimization) of generalization error.

d4rkp4ttern · on April 30, 2022

True, but to be a bit pedantic, I wouldn’t call that generalization error (which you don’t have access to while training). Training minimizes some loss function which is a proxy for generalization error.

srean · on April 30, 2022

Spot on.

deterministic · on May 2, 2022

Optimisers are perfect for solving problems where you don’t have access to a large training set. I actually recommend starting with an optimisation algorithm first and only then try NN if using an optimisation algorithm isn’t enough to solve the problem.

tehsauce · on April 30, 2022

Very cool article, but if you want to draw straight lines just use the start and end points :/

taylorius · on April 30, 2022

You might accidentally veer off just at the end of the line - better to take all points into account.

nurettin · on April 30, 2022

Yes but can you trust a human to draw from start to end?

DeathArrow · on April 30, 2022

NN failed to model sine function: https://towardsdatascience.com/can-machine-learn-the-concept...