
The article discusses Boltzmann's formula exp(-E/kT). I was recently looking at the same formula in the context of semiconductors and I realized that Boltzmann's constant k is only needed because temperature uses bad units. If we measured temperature in energy instead of degrees, then Boltzmann's constant drops out. For instance, you could express room temperature as 25 meV (milli electron volts) or 2444 joules/mole and the constant disappears. Likewise, the constant in the ideal gas law disappears if you measure temperature as energy rather than degrees Kelvin. In other words, degrees Kelvin is a made-up unit that should be abandoned. (I'm not sure I believe this, but I don't see a flaw.)
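
For a quick numeric check of those figures, here is a small Python sketch (SciPy's CODATA constants; 294 K is my stand-in for "room temperature"):

    from scipy.constants import k, e, N_A

    T = 294.0                      # kelvin, roughly room temperature
    kT = k * T                     # thermal energy per particle, in joules
    print(kT / e * 1000)           # ~25.3 meV per particle
    print(kT * N_A)                # ~2444 J/mol (this is just R*T)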

Off-topic, but a somewhat similar experience for me was learning about Alt+select. Allows you to select parts of a hyperlink without actually activating the link. (At least in Firefox.)

When you first hear about it, you think "What's the big deal?", but then you quickly find that you use it pretty often.


I started writing a long comment in response to this article (more to respond to the original catchy headline this topic first went live with "Mental Illness Is Not in Your Head").

I decided I'd turn it into a blog post and have posted it for discussion here: https://news.ycombinator.com/item?id=33201781

Here it is as a comment:

This is a response to an article currently being discussed on Hacker News [Two recent books by historians explore the crisis in biological psychiatry, originally titled Mental Illness Is Not in Your Head]. I believe that the, now replaced, catchy headline of the article is not correct. I believe mental health is in your head, but this does not mean that mental health is controllable through altering neurotransmitters (or in fact altering any specific biological process).

~All mental health issues use the same biological structures[1]. A structure which interprets the emotional dynamic of the situation you are currently in. Another structure which reactivates the emotional memory you have associated with that dynamic.

Most likely the same mechanism is used for both happy and unhappy paths:

# Happy paths:

If you grew up with a loving (but not overwhelmingly loving) and calm family, your unconscious association between the emotional dynamic of a situation you are in, and the emotional memory associated with it, is positive. These could range from: "Everyone is having fun right now, I can relax and have fun too!", to "That person did something that made me uncomfortable, I know it's safe to express my needs and feelings, so I can communicate calmly to the person who upset me how their behaviour made me feel".

# Unhappy paths:

If you grew up with caregivers who were stressed by certain situations, your unconscious association between the emotional dynamic of a situation you are in, and the emotional memory/requirements associated with it will contain protective responses. These could range from: "Everyone seems to be having fun right now... but everyone got so stressed out when I was anything other than calm and happy when I was young, I better keep all my stressed feelings hidden inside, and act like I'm happy and having fun too - even if something is going on for me which means deep down I'm not feeling good", to "That person did something that made me uncomfortable. Everyone got angry so quickly when I was little that I’m sure this person will get really angry too if I say anything to them. I will just pretend that I’m ok with what they did.” This list goes on and on, and will depend on the subtle dynamics of the relationships you were raised in.

You will notice that in the happy paths there is not a separation between your external world and your internal worlds, whereas in the unhappy paths there is this split. This split is uncomfortable and it is lonely. It requires a tense form of control that the person on the happy path doesn’t need to apply to themselves.

# Things get worse […before they get better?]

I’m sure a bit of you related to the unhappy paths that I described. That is because we all have them. One of the biological survival mechanisms we have as highly dependent infants is to bend our emotional responses into ones which mean we get what we need from our caregivers.

This is such a common requirement for making it through infancy that the human is built to shed these learnt emotional shackles. I am in a controversial minority within psychotherapy that believes that the precise diagnosis of these emotional shackles is the function of dreaming (https://psyarxiv.com/k6trz).

Getting rid of an emotional shackle is not complicated when it is clearly visible. It is not particularly pleasant, but you simply have to unlearn the fear by facing up to it. If you notice you keep your stressed feelings inside, you’ll need to find the courage to start opening up. If you are not setting boundaries when you feel yours are getting trodden on, you need to find the courage to start having those (initially) awkward conversations. The same is true for whatever unuseful emotional conditioning you are trying to get free from.

The mechanism behind this approach is very simple. We are extremely scared of facing these learnt fears (the type and level of fear we typically[2] only know in infancy). When we repeatedly face these fears and survive they are very quickly unlearned from the brain. It is highly inefficient for the brain to keep a fear in place that we now know (at an experiential, not only cognitive, level) to be superfluous, and the brain does not seem to want to do this.

But what happens if no one is there to help you work out your emotional shackles and you are left to suffer their isolating consequences on your own? Again, I am in a bit of a controversial minority of the mental health community, but I believe it is a useful response for mental health symptoms to worsen.

If things worsen both you and others begin to notice that something is wrong. If they notice something is wrong, there is an increased likelihood that you will get the emotional care that might lead you to successfully removing your emotional shackles; reducing your stress and isolation. Many people start treating their mental health because things have gotten bad, but the treatment (the process of discovering and facing up to unconscious fears) doesn’t need to stop when you return to your base level.

In summary, I think there is a strong component of mental illness that is very much within our own heads. Because the happy and unhappy paths of mental illness use the same structural processes we cannot force a change at the biological level. Instead we have to explore, challenge and ultimately change the underlying emotional memories that are elicited in the structural processes. From my personal experience, this causes the greatest improvement to our mental health/reduces our “mental illness”.

[1] I'm aware that I am talking with one or two layers of abstraction. I'm not talking about the specific parts of the brain, but these processes are consistent in all of us.

[2] Stressful situations we experience as adults that cause PTSD are ones where our emotional processing of the situation we are going through mimics our childlike experience. The experience is overwhelming.


These definitions don't really give you the idea; often they're just code examples.

"The ideas", in my view:

Monoid = units that can be joined together

Functor = context for running a single-input function

Applicative = context for multi-input functions

Monad = context for sequence-dependent operations

Lifting = converting from one context to another

Sum type = something is either A or B or C..

Product type = a record = something is both A and B and C

Partial application = defaulting an argument to a function

Currying = passing some arguments later = rephrasing a function to return a function of n-1 arguments when given 1, such that the final function will compute the desired result (see the sketch after this list)

EDIT: Context = compiler information that changes how the program will be interpreted (executed, compiled, ...)

Eg., context = run in the future, run across a list, redirect the i/o, ...
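
For the partial application and currying entries, a small Python sketch (illustrative function names of my own, not from the definitions above):

    from functools import partial

    def power(base, exponent):
        return base ** exponent

    # Partial application: default one argument of an existing function.
    square = partial(power, exponent=2)
    print(square(5))               # 25

    # Currying: rephrase the function so each call takes one argument and
    # returns a function expecting the rest.
    def curried_power(base):
        def with_exponent(exponent):
            return base ** exponent
        return with_exponent

    print(curried_power(5)(2))     # 25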


A lot of this can be simplified to three questions:

1. What problem is your company solving?

If you don't get an answer, beware. If the answer sounds vague, beware. If the answer makes no sense, beware. If the answer is multifaceted, beware. This suggests that the company will not even begin the process of becoming profitable.

2. Who has this problem?

You should get a clear picture of an actual person. If not, beware. If that person has no money, beware. If that person has no pull within an organization, beware. If that person is high maintenance or fickle, beware. This suggests that the company will never find the revenue they seek.

3. What's your solution?

If the solution doesn't actually address the problem, beware. If the solution is too expensive for the customer, beware. If the solution can't be differentiated from its competitors, beware. If the solution has no competitors, beware. If there are a dozen solutions, beware. This suggests that no matter how amazing the technology or technical team, the company will not be able to execute on its business plan.


This is a pretty good write up. I've taken to using a very similar scenario as part of my interview process for candidates. It's shocking how few SRE candidates I interview can walk me through a simple scenario such as troubleshooting this.

I don't even require the correct incantation of curl/openssl/dig/nslookup/nc/route/etc, I just want to see their process for breaking down the problem and searching for the fault. The most common thing that I see is changing the local DNS resolver, flushing the DNS cache, rebooting the local computer, disabling the local firewall, and then giving up.

I start with NXDOMAIN because the subdomain is not configured on Route53, then it gets progressively more weird from there. NACL blocking return traffic, apache/nginx service is just dead and systemd didn't restart it, apache/nginx is bound to 127.0.0.1:80/443, self signed cert with an invalid CommonName.

Each problem presents a jumping off point to dive deeper into different areas.
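
For a flavour of that breakdown process, here is a rough Python sketch of the first few client-side checks (hypothetical hostname; in the real exercise it's curl/dig/openssl and friends, and the later stages need AWS-side inspection rather than client-side probing):

    import socket, ssl

    host, port = "app.example.com", 443    # hypothetical target

    # 1. Does the name resolve at all? An NXDOMAIN-style failure shows up here.
    try:
        addr = socket.gethostbyname(host)
        print("resolves to", addr)
    except socket.gaierror as err:
        raise SystemExit(f"DNS failure: {err}")

    # 2. Can we open a TCP connection? A NACL dropping return traffic or a
    #    service bound only to 127.0.0.1 surfaces as a timeout or refusal.
    with socket.create_connection((addr, port), timeout=5) as sock:
        # 3. Does the TLS handshake succeed, and does the cert match the name?
        #    A self-signed cert or bad CommonName raises a verification error.
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            print("negotiated", tls.version())
            print("cert subject:", tls.getpeercert()["subject"])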


First thing: all claims to have proved a causal relationship are relative to a set of assumptions which may or may not be satisfied (even in experimental sciences).

Taking the experiment as an ideal we cannot reach, there are many settings where there is useful variation in the data which can approximate the random assignment to treatment which is the essence of experiments.

Some canonical research designs which can be used here include differences in differences, regression discontinuity, synthetic control methods, or instrumental variables methods. They are differently appropriate to different settings but permit causal claims to be made. These methods are widely used in contemporary empirical economics (indeed this is a nearly exhaustive list of methods used in applied micro!).
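
As a concrete illustration of the first of those designs, a minimal difference-in-differences sketch on simulated panel data (statsmodels formula API; the variable names and the 2.0 effect size are mine):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 2000
    treated = rng.integers(0, 2, n)      # treatment-group indicator
    post = rng.integers(0, 2, n)         # after-policy period indicator
    y = 1.0 + 0.5 * treated + 0.3 * post + 2.0 * treated * post \
        + rng.normal(0, 1, n)

    df = pd.DataFrame({"y": y, "treated": treated, "post": post})
    # Under the parallel-trends assumption, the interaction coefficient
    # is the causal effect of treatment.
    fit = smf.ols("y ~ treated * post", data=df).fit()
    print(fit.params["treated:post"])    # estimate of the 2.0 effect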

Useful, accessible references here include Angrist and Pischke, “Mostly Harmless Econometrics,” or “Causal Inference: The Mixtape” by a guy at Baylor whose name I’m forgetting, and the very recent “The Effect” by Huntington-Klein, which I have not yet read.

Other slightly more exotic models are used in (for example) industrial organization which still permit causal claims to be made about the effects of increasing prices or changing product features on demand for a product or set of products.


The comment thread that is soon to follow is predictable.

Someone will reply: Medical costs are responsible for X% of personal bankruptcies in America.

Someone will reply: Most Americans don't have to worry about medical bills because they have coverage through their employer.

Someone will concur: Yeah, I have fantastic benefits with my company.

Someone will provide a counter anecdote: I also have great benefits with my employer. But I still have to pay a $5000 deductible before coverage kicks in.

Someone will reply to the parent with: The US has the best health care in the world. Why do you think people fly from overseas to receive treatment in the US?

Then someone will reply: The US has the best health care ... if you can afford it. The last time I saw my doctor, he looked at me for five minutes and prescribed me a bottle of aspirin.

Someone will reply: You need a new doctor. I once had an experience like yours. I found a new doctor, who truly cares about my health and well being.

Someone else will reply to the comment two spots above:

> The US has the best health care ... if you can afford it.

"No, the US has the best health care in the world, including for middle class people and poor people. I once injured myself in [insert first world country] and their health care system was practically third world."

_______


They say that scientists rarely get any real work done after they win a Nobel prize, because the prestige warps their self-expectations in a way that guides them away from things they could actually make progress on and towards heights that are too difficult to climb. Avoiding this has guided my career up until now and I can safely say I have been totally successful at eliminating even the slightest risk.

I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so thought I'd put some random thoughts together:

1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.

2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.

3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal". When your model disagrees, do at least consider that the model may be correct though (see the sketch after this list).

4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.

5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.

6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.

7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.

8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.

9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.

10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
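
A sketch of the invariant-style test from point 3, assuming a hypothetical `price_model` wrapper with a `predict` method (pytest-style fixture):

    def test_bigger_house_scores_higher(price_model):
        base = {"sqft": 120, "bedrooms": 3, "age_years": 20}
        bigger = {**base, "sqft": 180}   # only the floor area changes
        assert price_model.predict(bigger) > price_model.predict(base), (
            "monotonicity invariant violated; investigate, but remember "
            "the model may be right and the invariant wrong"
        )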

Probably not earth shattering stuff, I grant you.


As luck would have it, my blog on Github pages is down. So here's the post describing the four in markdown.

Tldr; Designing Data Intensive Applications, Effective Python, The Google SRE book, and High Performance Browser Networking.

https://github.com/eatonphil/notes.eatonphil.com/blob/master...


Also, know the difference between honing and sharpening.

The edge of a knife is very fragile, because it is very very thin. As such, when you use the knife's edge, it will eventually "bend over".

A honing rod straightens the edge again. That's not "sharpening"; it's just pushing a stick against the knife edge to realign everything. It will cut better once you hone it.

https://www.globeequipment.com/learning-center/wp-content/up...

-----------

"Sharpening" is when you remove metal from the edge of the knife.

https://cdn.shopify.com/s/files/1/0001/8425/4516/files/messe...

Honing should be done every time you use the knife. Sharpening should only be done every month (if you use the knife a lot) or once every year (if it's a rarely used knife).


As someone who owns more cast iron than is reasonable, I just wanted to go a step beyond upvoting this and acknowledge in writing that this is the correct view of things.

You want oil soaked into the metal. This idea that you are trying to build a non-stick surface on top of the metal is just adding extra work that isn't needed.

Wash with soap, dry it on the stove top, put in some oil (I use peanut), then take a paper towel and rub the oil around over everything; at the end the metal should look shiny but there shouldn't be any pooled oil left anywhere.

If you are cooking something that doesn't leave a residue or strong flavor, you can skip the washing altogether and just leave it there to cook with next time.

If you try to go the whole "never wash this" route, you end up with unhappy results going from cooking something with onions and garlic to cooking something more neutral flavored. No one wants onion flavored pancakes.


Both Bayesian inference and deep learning can do function fitting, i.e. given a number of observations y and explanatory variables x, you try to find a function so that y ~ f(x). The function f can have few parameters (e.g. f(x)= ax+b for linear regression) or millions of parameters (the usual case for deep learning). You can try to find the best value for each of these parameters, or admit that each parameter has some uncertainty and try to infer a distribution for it. The first approach uses optimization, and in the last decade, that's done via various flavors of gradient descent. The second uses Monte Carlo. When you have few parameters, gradient descent is smoking fast. Above a number of parameters (which is surprisingly small, let's say about 100), gradient descent fails to converge to the optimum, but in many cases gets to a place that is "good enough". Good enough to make the practical applications useful. In pretty much all cases though, Bayesian inference via MCMC is painfully slow compared to gradient descent.

But there is a case where it makes sense: when you have reasonably few parameters, and you can understand their meaning. And this is exactly the case of what's called "statistical models". That's why STAN is called a statistical modeling language.

How is that? Gradient descent for these small'ish models is just MLE (maximum likelihood estimation). People have been doing MLE for 100 years, and they understand the ins and outs of MLE. There are some models that are simply unsuited for MLE; their likelihood function is called "singular"; there are places where the likelihood becomes infinite despite the fit being quite poor. One way to fix that is to "regularize" the problem, i.e. to add some artificial penalty that does not allow the reward function to become infinite. But this regularization is often subjective. You never know when the penalty you add is small enough to not alter the final fit. Another way is to do Bayesian inference. It's very slow, but you don't get pulled towards the singular parameters.
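
A toy illustration of that trade-off on a two-parameter model y = a*x + b (the data, learning rate, proposal width and flat prior are all my own illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    y = 2.0 * x + 0.5 + rng.normal(0, 0.3, 200)
    sigma = 0.3                                   # known noise scale

    def log_lik(a, b):
        r = y - (a * x + b)
        return -0.5 * np.sum(r ** 2) / sigma ** 2

    # Maximum likelihood via gradient ascent: fast, returns a single (a, b).
    a, b, lr = 0.0, 0.0, 0.05
    for _ in range(2000):
        r = y - (a * x + b)
        a += lr * np.mean(r * x) / sigma ** 2     # gradient scaled by 1/n
        b += lr * np.mean(r) / sigma ** 2
    print("MLE:", round(a, 3), round(b, 3))

    # Bayesian inference via random-walk Metropolis: many likelihood
    # evaluations, but it returns a whole posterior distribution.
    theta, logp, samples = np.array([0.0, 0.0]), log_lik(0.0, 0.0), []
    for _ in range(20000):
        prop = theta + rng.normal(0, 0.05, 2)
        logp_prop = log_lik(*prop)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        samples.append(theta.copy())
    post = np.array(samples[5000:])               # discard burn-in
    print("posterior mean:", post.mean(axis=0), "std:", post.std(axis=0))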


Based on your description, it sounds like:

    <file.json jq '.. | .fileURL? | select(startswith("http://"))' -r
... would've done the job?

Or, if you can't remember `startswith`:

    <file.json jq '.. | .fileURL?' -r | grep '^http://'

All sensor data, the closer you get to the analog side of things, is bullshit. It's just about smoothing over the bullshit enough to make the tolerances workable for real world applications.

We call this bullshit smoothing "calibration". If you're doing work on sensor data and don't have every calibration parameter, whether from configuration or magic factory numbers and statistical tolerances, someone, somewhere is pulling the wool over the eyes of the software guy downstream that works with the final data.

Ever looked at weather data from two separate apps and seen values vary by multiple degrees? Two different pipelines just sprinkled their own interpretations on top of raw sewage data.


In late 90s/early 2000s the mainstream thought around numerical optimization was that it was easy-ish when it was a linear problem, and if you had to rely on nonlinear optimization you were basically lost. People did EM (an earlier subgenre of what is now called Bayesian learning) but knew that it was sensitive to initialization and that they probably didn't hit a good enough maximum. Late 90s neural networks were basically a parlor trick - you could make it do little tricks but almost everything we have now including lots of compute, good initialization, regularization techniques, and pretraining, was absent in the late 90s.

Then in the mid and later 2000s the mainstream method was convex optimization and you had a proof that there was one global optimum and a wide range of optimization methods were guaranteed to reach it from most initialization points. Simultaneously, the theory underlying SVMs and CRFs was developed - that you could actually do a large variety of things and still use these easy, dependable optimization techniques. And people hammered home the need for regularization techniques.

In the late 2000s to early 2010s, several things again came together - one being the discovery of DropOut as a regularization technique - and the understanding that it was one, the other being the development of good initializers that made it possible to use deeper networks. Add to that improved compute power - including the development of CUDA which started out as a way to speed up texture computation but then led to general purpose GPU computing as we know it today. All this enabled a rediscovery of NN learning which could take off where linear learning methods (SVMs, CRFs) had plateaued before. And often you had a DNN that did what the linear classifier before did but could learn features in addition to that - and could be seen as finding a solution that was strictly better.

But the lack of global optimum means that - even with good initializers and regularization packaged into the NN modules we use in modern DNN software implementations - the whole thing is way more finicky than CRFs ever were. (It would be wrong to say that CRFs are trivial to implement or never finicky at all, just as many well-understood NN architectures have a good out-of-the-box experience with TF/PyTorch etc. - so take this as a general statement that may not hold for all cases).


The two-language problem is well-known. People wanted performance, which is reserved to languages like C, C++ or Java, but they didn't want to use these languages, since they are objectively ugly and a pain to write. Thus, languages like Python were born, but we were warned that they were going to be slow because something something dynamic typing something something the compiler can't optimize blah blah blah. And so we were told to avoid doing too many loops, or load too many objects in memory, or indeed even attempt push the language to match one's actual use cases, because Python wasn't well-built for it.

But in the meantime, languages like R or Matlab had figured a solution: write all the heavy-lifting ultra-optimized algorithms in C or Fortran or some equally ugly language that no one but really smart nerds wants to touch, and wrap it in a semantic that makes loops and loading many objects unnecessary, called 'vectorized operations'. In R, for instance, you think you're manipulating mere strings or logicals, but you're in fact manipulating vectors of length 1 and of type 'string', 'logical', etc. But doing operations on vectors or arrays became as seamless as doing them with mere scalars, with hardly any loss in performance. And so the R world thrived, although we were still cautioned to use weird lapply/sapply/rapply magic instead of doing proper loops because something something compiler something something slow blah blah blah.

And so the Python world saw that the R and Matlab world thrived, and wondered if they could do the same. A bunch of really smart nerds sat down with their laptops and wrote a bunch of ultra-optimized algorithms in one of those ugly languages no one else wants to touch, and lo, in the mid-2010s Python had finally achieved feature parity with the R and Matlab of twenty years earlier. Yet the trend showed no sign of slowing, as Python was not only useful for scientific computing, but many other use cases as well (you ever tried to write an interface or webserver in R?), and sometimes researchers have the audacity to want to do several things at once with the computer. And so Python achieved its present ubiquity in data science.

There's trouble in paradise, however. As with R, we were cautioned to avoid doing too many loops because something something you know what I mean, and instead use vectorized operations. And little by little, we had to learn every day a little more of numpy's arcane API, the right magical formulas to invoke in order to avoid losing performance. We had to learn which operations are in-place and which ones create a new array (knowing this could change over multiple versions), which appropriate slicing and indexing to use, which specific functions to call. And the more our use cases deviated from the documentation, the more magic we had to learn. At some point we had to learn obscure methods beginning with an underscore, or even (the horror!) mind whether arrays were ordered C-style and Fortran-style, or even told to use Cython (!), nevermind your desire to absolutely avoid touching these languages in any way. May Allah be with you should you ever want to manipulate sparse data.

Aware that the community had to learn magic whose complexity was on par with the ugly languages they'd sworn off, really smart nerds took it upon themselves to... write more magic in order to avoid writing the older magic. And so we got dask, which is as powerful as it is painful to use. We got numba, which seems to work automagically in the official demo snippets and does zilch in your own. 'That's because you're using them wrong', the smart people tell you on stackoverflow. 'Teach me how to use them right', you beg. And so your mental spellbook thickens with no end in sight...

Enter Julia. Julia doesn't have any of the above dilemmas, because Julia is fast. Julia doesn't care whether you vectorize or write loops, but you can do either. Julia doesn't force you to declare types, but you can if you really want to. Julia doesn't require you to write advanced magic to do JIT compilation. Julia doesn't see itself as an R or Python competitor: why, Julia loves Python and R, and in fact you can just call one from the other if you feel like it! Go on, just RCall ggplot on an array created with PyCall("numpy"), it just works! Julia was built with parallel computing and HPCs in mind, so no need to fiddle with dask boilerplate when it just works with @macros. Julia knows programmers are afraid of change, so its syntax is really, really close to Python's. Julia has a builtin package manager. Julia lets you use the GPU without having to sacrifice a rooster to Baal every time you want to install CUDA bindings.

Of course Python isn't going anywhere, just like R is still going strong even after Python 'displaced' it. And of course, Julia's ecosystem is smaller (but growing), its documentation is lacking, it doesn't have millions of already answered questions on Stackoverflow...but if you know where the wind blows, you know where the future is headed, and its name rhymes with Java.


I consider shellcheck absolutely essential if you're writing even a single line of Bash. I also start all my scripts with this "unofficial bash strict mode" and DIR= shortcut:

    #!/usr/bin/env bash
    
    ### Bash Environment Setup
    # http://redsymbol.net/articles/unofficial-bash-strict-mode/
    # https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
    # set -o xtrace
    set -o errexit
    set -o errtrace
    set -o nounset
    set -o pipefail
    IFS=$'\n'

    DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
I have more tips/tricks here: https://github.com/pirate/bash-utils/blob/master/util/base.s...

To OP's point of "features are more complicated than they appear" I always like to point to the flowchart of Slack's notification logic.

https://pbs.twimg.com/media/C6ROe0mU0AEmpzz?format=jpg


The portrait modes on these are getting really good. The blur is pretty convincing looking. The only open-source software I know that does similar stuff is body-pix which does matting, but I don't think it generates a smooth depth map like this thing. It would be cool because then you can do a clever background blur for your Zoom backgrounds with v4l2-loopback webcam.

By the way, I decided to also quick summarize the usual HN threads that have the trigger word iPhone in it:

- No headphone jack

--- Actually this is good because ecosystem built for it

----- Don't think ecosystem is good. Audio drops out

------- Doesn't happen to me. Maybe bad device.

----- Don't want to be locked in. Want to use own device.

------- That's not Apple philosophy. Don't know why surprised.

--------- I have right to my device

----------- cf. Right to Repair laws

------- Can use own device with dongle.

--------- Don't want dongle. Have to get dongle for everything. Annoying.

----------- Only need one dongle.

------------- If only audio, but now can't charge.

----------- Use dongle purse.

--- Apple quality have drop continuous. Last good Macbook was 2012.

----- Yes. Keyboard is useless now. Have fail. Recalled.

------- I have no problem with keyboard.

--------- Lucky.

------- Also touchpad have fail. Think because Foxconn.

------- Yes. Butterfly? More like butterfly effect. Press key, hurricane form on screen.

----- Yes. Yes. All Tim Cook. Bean Counter.

----- Yes. Many root security violation these days.

------- All programmers who make security violate must be fired.

--------- Need union so not fired if manager make security violation.

----------- Don't understand why no union.

------------- Because Apple and Google have collude to not poach. See case.

------- Yes. Security violation is evidence of lack of certification in industry.

--------- Also UIKit no longer correctly propagate event.

--- Phone too big anyway. No one make any small phone anymore.

----- See here, small phone.

------- Too old. Want new small phone. Had iPhone 8. Pinnacle of small beauty.

------- That's Android. No support more than 2 months.

--------- Actually, support 4 months.

----------- Doesn't matter. iPhone support 24 centuries and still going. Queen have original.

--------- Yes, and battery on Android small.

--- Will buy this phone anyway. Support small phone.

----- No. This phone is also big. No one care about small hand.

------- Realistically, phone with no SSH shell dumb. I use N900 on Maemo.

--- Who care? This press release. Just advertisement.

----- Can dang remove clickbait. What is one-eye anyway? Meaningless. Phone no have eye.

--- Also, phone not available in Bielefeld.

--- Phone only have 128 GB? Not enough. Need 129 GB.

----- 64 GB enough for everyone.

------- "640 KB enough for everyone" - Bill Fence, 1923


You can also do a per-directory _global_ git configuration, e.g.

in .gitconfig, you say:

    [user]
        name = Me Myself
        email = personal@example.com
        signingkey = D34DB44F

    [includeIf "gitdir:~/src/github.com/work_org/"]
        path = ~/.gitconfig_work
Then in ~/.gitconfig_work:

    [user]
        name = Me Myself
        email = work@example.com
        signingkey = D34DC0D4
    
    [core]
        sshCommand = ssh -i ~/.ssh/work_ed25519
I like this way better, because I don't need to remember to specify per-project config, as long as I put them in the right directory :-)

I was in the same boat in 2014. I went a more traditional route by getting a degree in statistics and doing as much machine learning as my professors could stand (they went from groaning about machine learning to downright giddy over those two years). I worked as a data scientist for an oil-and-gas firm, and now work as a machine learning engineer (same thing, basically) for a defense contractor.

I’ve seen some really bad machine learning work in my short career. Don’t listen to the people saying “ignore the theory,” because the worst machine learning people say that and they know enough deep learning to build a model but can’t get good results. I’m also unimpressed with Fast AI for the reasons some other people mentioned, they just wrapped PyTorch. But also don’t read a theory book cover-to-cover before you write some code, that won’t help either. You won’t remember the bias-variance trade-off or Gini impurity or batch-norm or skip connections by the time you go to use them. Learn the software and the theory in tandem. I like to read about a new technique, get as much understanding as I think I can from reading, then try it out.

If I would do it all-over again I would:

1. Get a solid foundation in linear algebra. A lot of machine learning can be formulated in terms of a series of matrix operations, and sometimes it makes more sense to. I thought Coding the Matrix was pretty good, especially the first few chapters.

2. Read up on some basic optimization. Most of the time it makes the most sense to formulate the algorithm in terms of optimization. Usually, you want to minimize some loss function and thats simple, but regularization terms make things tricky. It’s also helpful to learn why you would regularize.

3. Learn a little bit of probability. The further you go the more helpful it will be when you want to run simulations or something like that. Jaynes has a good book but I wouldn’t say it’s elementary.

4. Learn statistical distributions: Gaussian, Poisson, Exponential, and beta are the big ones that I see a lot. You don’t have to memorize the formulas (I also look them up) but know when to use them.

While you’re learning this, play with linear regression and its variants: polynomial, lasso, logistic, etc. For tabular data, I always reach for the appropriate regression before I do anything more complicated. It’s straightforward, fast, you get to see what’s happening with the data (like what transformations you should perform or where you’re missing data), and it’s interpretable. It’s nice having some preliminary results to show and discuss while everyone else is struggling to get not-awful results from their neural networks.

Then you can really get into the meat with machine learning. I’d start with tree-based models first. They’re more straightforward and forgiving than neural networks. You can explore how the complexity of your models affects the predictions and start to get a feel for hyper-parameter optimization. Start with basic trees and then get into random forests in scikit-learn. Then explore gradient boosted trees with XGBoost. And you can get some really good results with trees. In my group, we rarely see neural networks outperform models built in XGBoost on tabular data.
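
A minimal sketch of that regression-baseline-then-trees workflow, assuming scikit-learn and xgboost are installed (dataset and hyperparameters are illustrative only):

    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    X, y = fetch_california_housing(return_X_y=True)

    # 1. Fast, interpretable baseline: regularized linear regression.
    print("ridge R^2:", cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())

    # 2. Gradient boosted trees, usually the strongest tabular model.
    gbt = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
    print("xgboost R^2:", cross_val_score(gbt, X, y, cv=5).mean())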

Most blog posts suck. Most papers are useless. I recommend Geron’s Hands-On Machine Learning.

Then I’d explore the wide world of neural networks. Start with Keras, which really emphasizes the model building in a friendly way, and then get going with PyTorch as you get comfortable debugging Keras. Attack some object classification problems with-and-without pretrained backends, then get into detection and NLP. Play with weight regularization, batch norm and group norm, different learning rates, etc. If you really want to get deep into things, learn some CUDA programming too.

I really like Chollet’s Deep Learning with Python.

After that, do what you want to do. Time series, graphical models, reinforcement learning— the field’s exploded beyond simple image classification. Good luck!


Be sure to check out 3Blue1Brown's linear algebra series as well. (Maybe after you've built your own MNIST network) Blew my mind when I made the connection that each layer in a dense NN is learning how to do a linear transformation + a non-linear "activation" function.
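
That connection fits in a few lines of numpy (shapes picked arbitrarily for a flattened MNIST image):

    import numpy as np

    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(16, 784)), np.zeros(16)   # learned parameters
    x = rng.normal(size=784)                          # a flattened 28x28 input

    hidden = np.maximum(0, W @ x + b)                 # linear map + ReLU
    print(hidden.shape)                               # (16,)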

A relevant Twitter thread: https://twitter.com/NeuroStats/status/1192679554306887681

At the risk of projecting, this has the hallmark of bad experimental design. The best experiments are designed to determine which of many theories better account for what we observe.

(When I write "you" or "your" below, I don't mean YOU specifically, but anyone designing the kind of experiment you describe.)

One model of gravity says the position/time curve of a ball dropped from a height should look like X. Another model of gravity says it should look like Y.

You drop many balls, plot their position/time, and see which of the two models' curves match what you observe. The goal isn't to get the curve; the goal is to decide which model is a better picture of our universe. If the plotted curve looks kinda-sorta like X but NOTHING like Y, you've at least learned that Y is not a good model.

What models/theories of customer behavior were your experiments designed to distinguish between? My guess is "none" because someone thinking about the problem scientifically would start with a single experiment whose results are maximally dispositive and go from there. They wouldn't spend a bunch of time up-front designing 12 distinct experiments.

So it wasn't really an experiment in the scientific sense, but rather a kind of random optimization exercise: do 12 somewhat-less-than-random things and see which, if any, improve the metrics we care about.

Random observations aren't bad, but you'd do them when you're trying to build a model, not when you're trying to determine to what extent a model corresponds with reality.

For example, are there any dimensions along which the 12 variants ARE distinguishable from one another? That might point the way to learning something interesting and actionable about your customers.

Did the team treat the random algorithm as the control? Well, if you believe some of your customers are engaged by novelty then maybe random is maximally novel (or at least equivalently novel), and so it's not really a control.

What about negative experiments, i.e., recommendations your current model would predict have a NEGATIVE impact on your KPIs? If those experiments DON'T produce a negative impact then you've learned that some combination of the following is the case:

   1. The current customer model is inaccurate
   2. The model is accurate but the KPIs don't measure what you believe they do (test validity)
   3. The KPIs measure what you believe they do but the instrumentation is broken
Some examples of NEGATIVE experiments:

What if you always recommend a video that consists of nothing but 90 minutes of static?

What if you always recommend the video a user just watched?

What if you recommend the Nth prior video a user watched, creating a recommendation cycle?

Imagine if THOSE experiments didn't impact the KPIs, either. In that universe, you'd expect the outcome you observed with your 12 ML experiments.

In fact, after observing 12 distinct ML models give indistinguishable results, I'd be seriously wondering if my analytics infrastructure was broken and/or whether KPIs measured what we thought they did.


> what are the killer resources for getting up to speed with Statistics & Probability?

(1) Linear algebra. Much of applied statistics is multi-variate, and nearly all of that is done with matrix notation and some of the main results of linear algebra.

The central linear algebra result you need is the polar decomposition that every square matrix A can be written as UH where U is unitary and H is Hermitian.

So, for any vector x, the lengths of Ux and x are the same. So, intuitively unitary is a rigid motion, essentially rotation and/or reflection.

Hermitian takes a sphere and turns it into an ellipsoid. The axes of the ellipsoid are mutually perpendicular (orthogonal) and are eigenvectors. The sphere stretches/shrinks along each eigenvector axis according to its eigenvalue.

U and H can be complex valued (have complex numbers), but the real valued case is similar, A = QS where Q is orthogonal and S symmetric non-negative definite. The normal equations of regression have a symmetric non-negative definite matrix. Factor analysis and singular values are based on the polar decomposition.

You get more than you paid for: A relatively good treatment of the polar decomposition has long been regarded as a good, baby introduction to Hilbert space theory for quantum mechanics. The unitary matrices are important in group representations used in quantum mechanics for molecular spectroscopy.

Matrix theory is at first just some notation, how to write systems of linear equations. But the notation is so good and has so many nice properties (e.g., matrix multiplication is associative, a bit amazing) that with the basic properties you get matrix theory. Then linear algebra is, first, just about systems of linear equations, and then moves on with matrix theory. E.g., you could write out the polar decomposition without matrix theory, but it would be a mess.
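
A quick numerical check of the polar decomposition claim, using SciPy's built-in routine (real-valued case, so "unitary" becomes orthogonal):

    import numpy as np
    from scipy.linalg import polar

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))

    U, H = polar(A)                          # right polar decomposition: A = U @ H
    print(np.allclose(A, U @ H))             # True
    print(np.allclose(U.T @ U, np.eye(4)))   # U is orthogonal
    print(np.allclose(H, H.T))               # H is symmetric ...
    print(np.all(np.linalg.eigvalsh(H) >= -1e-10))  # ... and non-negative definite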

(2) Need some basic probability theory. So, have events -- e.g., let H be the event that the coin came up heads. Can have the probability of H, written P(H), which is a number in the interval [0,1]. P(H) = 0 means that essentially never get heads; P(H) = 1 means that essentially always get heads.

Can have event W -- it's winter outside. Then can ask for P(H and W). That works like set intersection in Venn diagrams. Or can have P(H or W), and that works like set union in Venn diagrams.

Go measure a number. Call that the value of real valued random variable X. We don't say what random means; it does not necessarily mean unpredictable, truly random, unknowable, etc. The intuitive "truly random" is essentially just independence as below, and we certainly do not always assume independence. For real number x and cumulative distribution F_X, can ask for

F_X(x) = P(X <= x)

Often in practice F_X has a derivative, as in calculus, and in that case can ask for the probability density f_X of X

f_X(x) = d/dx F_X(x)

Popular densities include uniform, Gaussian, and exponential.

We can extend these definitions of (cumulative) distribution and density to the case of the joint distribution of several random variables, e.g., X, Y, and Z. We can visualize the joint distribution or density of X and Y on a 3D graph. The density can look like some wrinkled blanket or pizza crust with bubbles.

In statistics, we work with random variables, from our data and from the results of manipulating random variables. E.g., for random variables X and Y and real numbers a and b, we can ask for the random variable

Z = aX + bY

That is, we can manipulate random variables.

Events A and B are independent provided

P(A and B) = P(A)P(B)

In practice, we look for independence because it is a powerful case of decomposition, with powerful, even shockingly surprising, consequences. A lot of statistics is from an assumption of independence.

Can also extend independence to the case of random variables X and Y being independent. For any index set I and random variable X_i for i in I, can extend to the case of the set of all X_i, i in I, being independent. The big deal here is that the index I can be uncountably infinite -- a bit amazing.

For the events heads H and winter W above, we just believe that they are independent just from common sense. To be blunt, in practice the main way we justify an independence assumption is just common sense -- sorry 'bout that.

Can define some really nice cases of an infinite sequence of random variables converging (more than one way) to another random variable. An important fraction of statistics looks for such sequences as approximations to what we really want.

Since statistics books commonly describe 1-2 dozen popular densities, a guess is that in practice we look for the (distribution or) density of our random variables. Sadly or not, mostly no: Usually in practice we won't have enough data to know the density.

When we do know the density, then it is usually from more information, e.g., the renewal theorem that justifies an assumption of a Poisson process from which we can argue results in an exponential density or the central limit theorem where we can justify a Gaussian density.

(3) A statistic is, give me the values of some random variables; I manipulate them and get a number, and that is a random variable and a statistic. Some such statistics are powerful.

(4) Statistical Hypothesis Testing. Here is one of the central topics in statistics, e.g., is the source of the much debated "p-values".

Suppose we believe that in an experiment on average the true value of the results is 0. We do the experiment and want to do a statistical hypothesis test that the true value is 0.

To do this test, we need to be able to calculate some probabilities. So, we make an assumption that gives us enough to make this calculation. This assumption is the null hypothesis, null as in no effect, that is, we still got 0 and not something else, say, from a thunderstorm outside.

So, we ASSUME the true value is 0 and with that assumption and our data calculate a statistic X. We find the (distribution or) density of X with our assumption of true value 0. Then we evaluate the value of X we did find.

Typically X won't be exactly 0. So we have two cases:

(I) The value of X is so close to zero that, from our density calculation, X is as close as 99% of the cases. So, we fail to reject, really in practice essentially accept, that the true value is 0. Accept is a bit fishy since we can accept easily enough, that is, fail to notice the pimple, simply by using a fuzzy photograph, i.e., a really weak test! Watch out for that in the news or "how to lie with statistics".

(II) X is so far from 0 that either (i) the true value is not zero and we reject that the true value is 0 or (ii) the true value really is 0 and we have observed something rare, say, 1% rare. Since the 1% is so small, maybe we reject (ii) and conclude (i).

So, for our methodology we can test and test and test and what we are left with is what didn't get rejected. Some people call this all science, others, fishy science, others just fishy. IMHO, done honestly or at least with full documentation, it's from okay to often quite good.
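
In code, the simplest version of that procedure looks something like this (simulated data; the t statistic plays the role of the statistic X above):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.normal(loc=0.2, scale=1.0, size=50)   # the experiment's results

    # Null hypothesis: the true mean is 0.
    t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # Case (I): p is large, so the data is about as close to 0 as expected
    #           under the null, and we fail to reject it.
    # Case (II): p is tiny, so either the true value isn't 0, or we have
    #           observed something rare; we usually reject the null.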

I mentioned weak tests; intuitively that's okay, but there is a concept of power of a test. Basically we want more power.

Here's an example of power: Server farm anomaly detection is, or should be, essentially a statistical hypothesis test. So, then, the usual null hypothesis is that the farm is healthy. We do a test. There is a rate (over time, essentially probability) of false alarms (where we reject that the farm is healthy, conclude that it is sick, when it is healthy) and rate of missed detections of actual problems (fail to reject the null hypothesis and conclude that the farm is healthy when it is sick).

Typically in hypothesis testing we get to adjust the false alarm rate. Then for a given false alarm rate, a more powerful test has a higher detection rate of real problems.

It's easy to get a low rate of false alarm; just turn off the detectors. It's easy to get 100% detection rate; just sound the alarm all the time.

So, a question is, what is the most powerful test? The answer is in the classic (1933) Neyman-Pearson lemma. It's like investing in real estate: Allocate the first money to the highest ROI property; the next money to the next highest, etc. Yes, the discrete version is a knapsack problem and is NP-complete, but this is nearly never a consideration in practice. There have been claims that some high end US military target detection radar achieves the Neyman-Pearson most powerful detector.

There's more, but that is a start.


The interesting thing is that today's NLP has nothing to do with the manipulation of symbols, as such. Rather, NLP is performed (lately) by neural networks that learn to optimise the parameters of continuous functions. NLP is possible in this way because, by construction, the parameters of the functions to be optimised are associated with the elements of natural language (characters, mostly, but sometimes words). So an NLP "machine" nowadays means a machine that can predict the next character (or word) in a sequence.

Leibniz's ideas on the other hand, of combining symbols to generate and evaluate human language, that's the spirit that permeates Good, Old-Fashioned AI, symbolic, and logic-based artificial intelligence. Today, most NLP researchers would say that the logic-based branch of research, common as it was until very recently, was after all a dead end that did not lead anywhere. It would be interesting to be able to know what Leibniz would have made of that.

Perhaps one day (in the far distant future, when my bones are dust and my memories lost) we'll be able to reproduce the great philosopher's thoughts from his writings and reconstruct his personality, in part or in whole. And then we could pose to him the question: "Master, what do you think of the machine that now houses your intellect"?

