I poked at the github repo for a bit. The ugliness of the code doesn't bother me...

noelsusman · on May 18, 2020

You're correct to focus on the effect of parameter choices over code quality. It's been a little funny to watch a bunch of software engineers freak out about unit tests while ignoring everything else that has a much larger impact on the output of the model. I would bet large sums of money that this code is producing the correct output according to the model/parameter specifications.

All I can say is welcome to epidemiology. The spread of a disease is highly dependent on a host of factors that we have very little insight into. Even simple things like hospitalization rate or fatality rate can be difficult if not impossible to estimate accurately. Epidemiologists are open about this, but few people ever want to listen. Humans just aren't good at truly conceptualizing uncertainty.

The theory behind disease spread models is relatively sound, but they're highly dependent on accurate estimates of input parameters, and governments have not prioritized devoting resources toward improving those estimates. I sat in on discussions between epidemiologists and government officials about COVID models. The response to nearly every question was "we don't know, but here's our best guess". I listened to them beg officials for random testing of the population to improve their parameter estimates. That testing never happened.

thu2111 · on May 19, 2020

I would bet large sums of money that this code is producing the correct output according to the model/parameter specifications.

I'll take that money off you then.

The code has various memory safety bugs in it and originally had a typo in a random number generator constant. Amongst other problems.

There's really no reason to believe it produces correct outputs, in fact, we know it didn't and probably still doesn't given how it was written.

wbhart · on May 18, 2020

The problem is, unsophisticated models do not predict anything. You apply them in one country and they do ok, and apply them in another and they get it totally and completely wrong.

Unless all important factors are accounted for, they are going to result in incorrect information for someone. Public policy will then be based on incorrect predictions. People will grow tired of the predictions being wrong and they'll give up on data science entirely.

It's already quite bad that people think they can choose their reality by finding numbers that agree with them and ignoring the ones that don't.

I do understand the point you are making, which is like the epicycles argument. But in global warming and epidemics alike, more parameters are actually needed to model reality.

I do agree that those parameters should be based on actual data, not guesses though. But what value of R would you pick? Is that actually well-constrained?

datastoat · on May 18, 2020

I would pick a value of R that shows itself to have good predictive accuracy.

The way to test predictive models is always to look for their predictive accuracy on holdout data. Machine learning has this ingrained. Classic statistics does this too -- AIC is used to compare models, and it's (asymptotically) leave-one-out cross validation [1].

There's nothing intrinsically wrong with models that have millions of parameters; they might overfit in which case they will have poor predictive accuracy on holdout data, or they might predict well.

I agree with the original article that software engineer scrutiny isn't appropriate for this sort of code -- but I would argue instead that it needs a general-purpose statistician or data scientist or ML expert to evaluate its predictive accuracy. You can't possibly figure this out from a simulator codebase.

At the time the model was published, and acted on by the UK government, there was very little data on which to test predictive accuracy. That's fine -- all it means is that the predictions should have been presented with gigantic confidence intervals.

[1] http://www.stats.ox.ac.uk/~ripley/Nelder80.pdf

rainforest · on May 18, 2020

The model isn't predictive though - it's a simulator. If we'd waited until we had enough data to make predictions with it (which I doubt you could given the sheer number of parameters) it'd be too late to use any of the interventions.

How would you ethically collect training data for the interventions?

datastoat · on May 18, 2020

The outputs of the model _were_ being treated as predictions.

The Ferguson paper from 16 March used the language of prediction: "In the (unlikely) absence of any control measures [...] given an estimated R0 of 2.4, we predict 81% of the GB and US populations would be infected over the course of the epidemic." [1]. The news coverage also used that language: "Imperial researchers model likely impact of public health measures" [2]. And look at the rest of the comments in this discussion, and count how many types "predict" appears!

> If we'd waited until we had enough data to make predictions with it

This is like the drunk looking for their keys under a streetlight. "Did you lose the keys here?" "No, but the light is much better here." -- "How confident are you in your model's predictions?" "I have no idea, but it's the model I have."

Also -- the Ferguson model made predictions, based on the parameters they picked. You don't need to wait for data to make predictions; you only need data to validate your predictions.

> How would you ethically collect training data for the interventions?

You don't. You (as a scientist who influences public policy) should publish validated confidence intervals for your predictions. You (as a government) should understand that there is a huge margin of uncertainty in the predictions, and accept that sometimes you just have to make decisions in the absence of knowledge. You (both the scientist and the government) do not go around spouting "Our decisions are led by science".

[1] https://spiral.imperial.ac.uk:8443/bitstream/10044/1/77482/1...

[2] https://www.imperial.ac.uk/news/196234/covid19-imperial-rese...

rainforest · on May 19, 2020

How do you validate the predictions for the number of infected cases in May for scenarios that don't happen?

CharlesW · on May 18, 2020

> The problem is, unsophisticated models do not predict anything. You apply them in one country and they do ok, and apply them in another and they get it totally and completely wrong.

That's the nature of all models, "sophisticated" or not. Relatively simple models may or may not be useful for a particular case, just as relative complex models may be.

tkiley · on May 18, 2020

"But what value of R would you pick?"

I don't know -- and until we can agree on the answer to your simple question with a high degree of confidence, I think complex models based on specific assumed values of R obscure more than they reveal.

A little bit of modeling is useful because humans are intuitively bad at exponential math and we need scary graphs to jolt us awake sometimes. But when we don't even know the basic parameters (transmission/hospitalization/fatality) with a high degree of precision, complex models with myriad parameters create a false sense of confidence.

asddsfgdfsh · on May 18, 2020

We have a model, we can run some sensitivity analysis, then we can go out and collect data to better estimate the parameters to which we are sensitive to. Important but not glamorous work and hence underfunded.

jandrewrogers · on May 18, 2020

I was asked to look at the spatiotemporal parameters and modeling, separate from any code issues. That part of the model is astonishingly naive, apparently oblivious to existing research and science on the matter that strongly recommends a different and much more nuanced approach. Industry has invested inordinate amounts of money in understanding how to build effective real-world predictive models of this type and none of that knowledge is reflected here. That seems like a rather glaring oversight and alone voids any utility as a predictive model.

tripletao · on May 18, 2020

I partially agree with the comment above, but I also think it misunderstands how numerical models are often used. At least where I've built them (not epidemiology), the goal wasn't necessarily to gather the most accurate set of inputs and produce the most accurate prediction of the output. The goal was often to help a highly skilled operator explore the parameter space and guide their intuition on the problem, to help that person and simulation together reach some decision.

So code quality mattered less then usual. If there's a significant bug, then the operator will probably notice, and if there's an insignificant bug then no one cares. The large number of input parameters also doesn't matter. The operators are fully aware that they could artificially manipulate the output to wherever they wanted, but to do so would be cheating only themselves.

It feels to me like Ferguson's model was built with similar intent, and probably served that purpose well. The problem came only when the media portrayed the model as a source of authority apart from the people operating it, perhaps to create a feeling of objectivity behind the decisions driven from that. That created an expectation of rigor that either didn't exist (in the software engineering), or fundamentally can't exist given our current knowledge of the science (in the input assumptions).

tentboy · on May 18, 2020

This reminds me of the Drake equation. A sound formula for the probability extra terrestrial life..but half the parameters are wild guesses that can differentiate in orders of magnitude.

asddsfgdfsh · on May 18, 2020

The flip side to having lots of parameters is that you have lots of knobs to tune beyond a basic lockdown.