Hacker News | skierscott's comments

> a low confidence score

Neural nets should return a low confidence score, but the popular approach (described below) ignores that. Neural nets ignore confidence because of a technique called softmax [1].

This happens as the final operation of a neural net, and is required for training.

Softmax is a tool to make an array of positive numbers look like a probability distribution:

    out = x / x.sum()
x[i] is a class prediction, but x.sum() != 1. Say the network was uncertain: x[cat, dog] = [0.03, 0.01]. These are small values that do not imply great confidence (the network was trained on vectors with out.sum() = 1). The network would still predict “cat” using softmax, because out[cat] = 0.75 > 0.25 = out[dog].

But then in inference/prediction, the confidence is ignored. What if x.sum() is small? That would imply that the network is uncertain.

[1]: https://en.m.wikipedia.org/wiki/Softmax_function


No. Regardless of whether the outputs for cat/dog are [0.03, 0.01] or [0.75, 0.25], the network is still three times more confident it's a cat. The uncertainty (entropy) of the normalized outputs is exactly the same in both cases.

In other words, if you only have two object classes, the magnitude of the outputs does not matter, the uncertainty is measured by the relative difference of the outputs.

The only way to measure the model's confidence that the output is "cat OR dog" is to have another class (e.g. "chair"); only then, looking at all three outputs, can you estimate the model's confidence in "cat OR dog" predictions (vs. "NOT (cat OR dog)"). For example, if the [cat, dog, chair] outputs are [0.03, 0.01, 0.05], then we know the model is not confident that it's either a cat or a dog, but if the outputs are [0.75, 0.25, 0.05], then it clearly is.
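
Concretely (a throwaway NumPy sketch, reusing the two-class numbers above): both vectors normalize to the same distribution, so the entropy is identical.

    import numpy as np

    def normalize(x):
        # divide by the sum so the components sum to 1
        x = np.asarray(x, dtype=float)
        return x / x.sum()

    def entropy(p):
        # Shannon entropy in bits
        return -np.sum(p * np.log2(p))

    for x in ([0.03, 0.01], [0.75, 0.25]):
        p = normalize(x)
        print(p, entropy(p))  # both print [0.75 0.25] and ~0.811 bits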


Is this what softmax is? Simply dividing a vector by the sum of its components? If so, then how does it deserve a name, not to mention a long Wikipedia page full of formulas?


Softmax has two components:

1. Transform the components to e^x. This allows the neural network to work with logarithmic probabilities instead of ordinary probabilities, turning the common operation of multiplying probabilities into addition, which is far more natural for the linear-algebra-based structure of neural networks.

2. Normalize their sum to 1, since that's the total probability we need.

One important consequence of this is that Bayes' theorem is very natural for such a network, since it's just multiplication of probabilities normalized by the denominator.

The trivial case of a single layer network with softmax activation is equivalent to logistic regression.

The special case of a two-component softmax is equivalent to sigmoid activation, which is thus popular when there are only two classes. In multi-class classification, softmax is used if the classes are mutually exclusive, and component-wise sigmoid is used if they are independent.
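
A minimal NumPy sketch of those two steps, plus the two-class/sigmoid equivalence (function names are mine, not from any particular library):

    import numpy as np

    def softmax(x):
        # 1. exponentiate (shifting by the max first for numerical stability)
        e = np.exp(x - np.max(x))
        # 2. normalize so the outputs sum to 1
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    logits = np.array([2.0, 0.5])
    print(softmax(logits))                 # [0.818 0.182]
    print(sigmoid(logits[0] - logits[1]))  # 0.818: two-class softmax == sigmoid of the difference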


Thanks for the detailed explanation!


It also includes the exponentiation step before the normalization. There are connections to statistical mechanics here, where the relative level populations are proportional to the softmax of the (negated) energy levels divided by temperature (so as temperature goes up, the energy differences matter less and the states become more equally populated). That idea has been ported over as "softmax temperature" in some places.
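
A small sketch of that temperature knob (made-up logits, plain NumPy): dividing by a larger T flattens the distribution, a smaller T sharpens it.

    import numpy as np

    def softmax_with_temperature(x, T=1.0):
        # scale the inputs by 1/T before the usual exponentiate-and-normalize
        z = np.asarray(x, dtype=float) / T
        e = np.exp(z - z.max())
        return e / e.sum()

    logits = [2.0, 1.0, 0.1]
    print(softmax_with_temperature(logits, T=0.5))  # sharper, closer to one-hot
    print(softmax_with_temperature(logits, T=5.0))  # flatter, closer to uniform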


No, it's not. It's actually e ^ x_i / sum(e ^ x_j for x_j in x), which is in fact different. Simply dividing by the sum wouldn't work for "squashing to a probability distribution" in a large number of cases.
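
For example (made-up scores): raw scores can be negative, and plain division falls apart where the exponentiated version doesn't.

    import numpy as np

    x = np.array([2.0, -1.0])   # raw scores can be negative
    print(x / x.sum())          # [ 2. -1.] -- not a probability distribution

    e = np.exp(x - x.max())
    print(e / e.sum())          # [0.953 0.047] -- a valid distribution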


So pointwise exponentiation composed with dividing by the sum. Still don't need a new word.


Coming from pure math, I often feel this way now learning statistics and ML. In pure math, it feels like the threshold for how novel a concept should be before it gets its own word is much higher.

E.g, we have "regression" and "classification" instead of "supervised continuous prediction" and "supervised discrete prediction".


If you don't understand where the name "softmax" came from, you don't really understand what it is. Softmax is a differentiable approximation of the max function.

Plot max(0, x) and softmax(0, x) functions, and it should become clear.


Nit: it seems it's more like a smooth approximation to argmax than to max.

Yeah, it makes sense that this is a super important function, but I still feel like one could just remember the principle that "exponentiation followed by normalization is a smooth approximation to argmax."
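
One way to see it numerically (a throwaway sketch): as the gap between two inputs grows, softmax approaches the one-hot vector that argmax would pick out.

    import numpy as np

    def softmax(x):
        x = np.asarray(x, dtype=float)
        e = np.exp(x - x.max())
        return e / e.sum()

    for t in (1, 2, 5, 10):
        print(t, softmax([0.0, t]))
    # t=1  -> [0.269 0.731]
    # t=10 -> [0.0000454 0.9999546]  (essentially the argmax indicator)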


The basic building blocks of most deep learning models are the convolutional layer, pooling layer, fully connected layer, and softmax layer. What do you propose we call the "softmax layer" instead?


Normalization layer?

This opens up the possibility of using something other than softmax there.


Well, there are other building blocks, such as the batch normalization layer or the local contrast normalization layer (not to mention a dozen batchnorm alternatives, e.g. group normalization, weight normalization, layer normalization, instance normalization, etc.).

If you just say "normalization layer" how am I supposed to know which normalization you're talking about?


Because grad students need to publish papers.


Note that you can get a form of confidence by just not applying softmax to the output during inference. Softmax is primarily to aid in training.
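
For instance (made-up logits): softmax is invariant to adding a constant to every logit, so the raw pre-softmax outputs carry magnitude information that the normalized outputs throw away.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    a = np.array([5.1, 3.1])
    b = np.array([0.1, -1.9])      # the same vector shifted down by 5
    print(softmax(a), softmax(b))  # identical outputs; only the raw logits differ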


How well do neural networks train with no normalization at all, compared with softmax?


You need to perform some kind of normalization, since a probability must be between 0 and 1 (and being wrong on a confident prediction gives huge penalties under the popular maximum likelihood loss functions).

But you can use component-wise normalization (sigmoid) instead of combined normalization (softmax). These correspond to the assumption that the classes are independent (component-wise sigmoid) or mutually exclusive (softmax).
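
A short sketch of the two choices on the same made-up logits:

    import numpy as np

    logits = np.array([2.0, 1.0, -0.5])

    # mutually exclusive classes: softmax, outputs sum to 1
    e = np.exp(logits - logits.max())
    print(e / e.sum())                    # sums to 1.0

    # independent classes (multi-label): component-wise sigmoid, each output in (0, 1)
    print(1.0 / (1.0 + np.exp(-logits)))  # need not sum to 1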


"probability must be between 0 and 1" - why? (I get it's used in mathematics, but I see no reason why a NN would have to output probability that way.)

"and being wrong on a confident prediction gives huge penalties using the popular maximum likelyhood loss functions" - It should.


> I see no reason why a NN would have to output probability that way

For classification tasks, the labels are usually encoded as a one-hot vector (one in the position of the correct class, zeros everywhere else). If you don't normalize the outputs to be between zero and one, it becomes a regression task: you are essentially asking the model to fit your one-hot encoded label. That's not desirable, because we don't care about the actual value of the output for the correct class. Whether it is 0.1, 1.1 or 1001, it is the correct output as long as it's larger than the outputs for the other classes. That's why we want to scale the outputs so that each is always less than one; how close the largest one gets to one depends on how much larger it is than the others (the confidence of the model in this prediction).

Without normalization, a model that outputs 1000 for the correct class and tiny values for all other classes would get severely penalized, because the label says it should be 1 in that position (so the error is 1000 - 1 = 999), even though the model made the correct prediction.
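
Roughly, in numbers (made up, with three classes):

    import numpy as np

    label = np.array([1.0, 0.0, 0.0])   # one-hot: class 0 is correct
    raw = np.array([1000.0, 0.1, 0.1])  # correct prediction, just not normalized

    # MSE against the one-hot label: huge penalty despite the correct argmax
    print(np.mean((raw - label) ** 2))  # ~332667

    # softmax + cross-entropy: essentially zero loss
    e = np.exp(raw - raw.max())
    p = e / e.sum()
    print(-np.log(p[0]))                # ~0.0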

There's some confusion about this (e.g. https://news.ycombinator.com/item?id=18054447 ), so hopefully my explanation makes sense.


It's not about normalization, it's about the loss function. Softmax is required by cross-entropy minimization (negative log-likelihood, to be precise), which works somewhat better in practice than mean squared error (MSE) minimization (which needs no normalization of the outputs).


mpi4py is a solid implementation, but the docs aren’t the greatest, especially if you’re not familiar with MPI.


And there are different versions with different APIs. Best to just read the source code for whatever you're running.


On iOS this isn’t functional. I can’t advance past the first slide.


Works for me, swipe left/right


>> Going forward, GitHub will remain an open platform, which any developer can plug into and extend.

Does an “open platform” mean “open source” or “a plugin system”? I think making it open source would alleviate a lot of dev concerns.


If they meant "open source" I have to imagine they would have said so. "Remain" is the key word there. GitHub is already an open platform with an API and plugin system. GitHub is not already open source, so it can't remain so.


> Some states destroy the blood spots after a year, 12 states store them for at least 21 years.

What are the other 12 states?


Presumably those that store the spots for between 1 and 21 years.


My favorite is Alan Turing's quote:

> Mathpix's AI definitely passes THIS Turing test!


Isn’t this what JITs do?


There's a (very exciting) PyTorch JIT incoming sometime soon!


> “If you understand and agree, Apple and GCBD have the right to access your data stored on its servers. This includes permission sharing, exchange, and disclosure of all user data (including content) according to the application of the law.”

> In other words, once the agreement is signed, GCBD — a company solely owned by the state — would get a key that can access all iCloud user data in China, legally.

What user data will this decrypt? Are iMessage and FaceTime still safe?


There’s also a closed form for the Fibonacci numbers. It runs in O(1).

http://stsievert.com/blog/2015/01/31/the-mysterious-eigenval...
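
For reference, a minimal sketch of that closed form (Binet's formula; exact only up to floating-point precision, so it drifts for large n):

    import math

    def fib(n):
        # closed form: round(phi**n / sqrt(5)), with phi the golden ratio
        phi = (1 + math.sqrt(5)) / 2
        return round(phi ** n / math.sqrt(5))

    print([fib(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]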


Not true. Infinite sums are not defined for series that don’t converge, and sum(1..n) does not converge as n => infinity.

Yes, you can play a nice trick with that sum if you ignore the fact that inf-inf is meaningless.


It depends on how you define the summation of infinite series. With the standard convergence-type definitions you are correct. But it is possible to define this sum in a consistent way (e.g. Ramanujan summation or analytic continuation of the Riemann zeta function), as shown in Hardy's "Divergent Series". The cost is that such summations have surprising properties, e.g. rearranging the order of the terms can give a different result. Apparently this sum comes up in quantum field theory when calculating the vacuum force between two conducting plates (the Casimir effect).
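
In symbols, the regularized assignment being referred to (a definition via the zeta function's analytic continuation, not a convergent limit):

    \sum_{n=1}^{\infty} n \;\text{``=''}\; \zeta(-1) = -\frac{1}{12}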

