
This is what happens when devs are presented with a very complicated problem, an extremely short deadline, and an enormous amount of pressure.



>extremely short deadline

they had 6 months.


> they had 6 months.

You say that like 6 months is automatically a lot of time. "She had 6 months to give birth". Yeah, only it takes 9, so 6 is short.

Consider the scope and depth of the issue and the fact that they probably couldn't involve too many people on this effort.


Did she try having the pregnancy in parallel?

Doesn't sound like she was trying at all.


Twins - double the bandwidth but the latency stays the same.


0.22 vs 0.11 bpm is actually a big improvement despite the latency.


"Never underestimate the bandwidth of a station wagon full of babies hurtling down the highway"?


Truly a quote for the ages


It isn't if you need the baby in 6 months.


Just get one from any outsourcing firm then


Pre-caching.


She should have used Rust for a fearless pregnancy.


Other operating system maintainers had only days/weeks in Jan 2018.


We're all throwing darts in the dark here with regard to the resources they gave the problem and its difficulty. I think the real takeaway is just that it could be something other than stupidity through and through.


Unless you can be sure their response solved their problem without introducing others, it's not evidence of sufficient time.


So I agree with the principle that a lot of time is a lot of time.

Inversely, though, I'd argue that Meltdown is a relatively small problem! It's strictly around memory usage, cache, and calling patterns. There aren't a lot of systems at play, though there's the hard "figure out which order of instructions gets the state machine into a dangerous state" problem. There's a lot less coordination involved than, say, a system call bug that subtly returns the wrong answer half the time, where you know some programs rely on that behavior and others crash because of it.

Some things are hard, other things are hard but at least they're basically math, and math has a bit more determinism involved. Imagine if UX design or debugging strategies could always be broken down into state machines!
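
To make the "it's basically memory and cache" point concrete: this whole class of bugs bottoms out in one measurable fact, namely that a cached load is much faster than one that was flushed from the cache. A rough sketch of that timing primitive (assumes x86-64 with GCC or Clang; illustrative only, nothing exploit-shaped):

    /* Show the cache-timing side channel that Meltdown/Spectre build on:
     * a cached read is measurably faster than a flushed one.
     * Build with e.g.: gcc -O2 cache_timing.c  (filename is just an example) */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    static uint64_t time_read(volatile uint8_t *p) {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);   /* timestamp before the load */
        (void)*p;                          /* the load being timed */
        return __rdtscp(&aux) - start;
    }

    int main(void) {
        static uint8_t buf[4096];

        buf[0] = 1;                        /* touch it so the line is cached */
        uint64_t hot = time_read(buf);

        _mm_clflush(buf);                  /* evict the line from the cache hierarchy */
        uint64_t cold = time_read(buf);

        printf("cached: %llu cycles, flushed: %llu cycles\n",
               (unsigned long long)hot, (unsigned long long)cold);
        return 0;
    }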


Great analogy. As an ex-manager always said: "3 women don't deliver in 3 months".

[Edit: I see walrus01 already got that]


This is Scott Adams's (Dilbert) hilarious take on it: http://www.dilbert.com/strip/2007-09-03


corollary adage: nine women cannot gestate and give birth to a baby in one month.


See Brooks' little-known sequel: The Mythical Woman-Month


But 9 women can give birth to one baby per month on average


Not really for a period longer than 9 months. Of course, you can even have 9 women deliver 9 babies in one month, but none in the following 10-12 months.


Retort: That's where you're wrong! If we hooked up nine mothers to one single fetus, we could get the job done in 9 months.[0] In the same way, if we hooked up our dev teams to a lead that could delegate the work properly, we could pump out a Meltdown patch in around a month and a half.

http://www.pnas.org/content/early/2012/08/28/1205282109?sid=...


> Retort: That's where you're wrong! If we hooked up nine mothers to one single fetus, we could get the job done in 9 months.[0] In the same way, if we hooked up our dev teams to a lead that could delegate the work properly, we could pump out a Meltdown patch in around a month and a half.

> http://www.pnas.org/content/early/2012/08/28/1205282109?sid=....

Except this whole train of thought falls apart once you consider the difficulty of "hooking up 9 mothers to a single fetus". In the same way, you downplay the difficulty of coordinating multiple teams on a solution around breaking research. Show me a working solution to the former and I'll accept the corollary.


You're looking at the problem all wrong, maltalex. Just hire a developer who is already 3 months pregnant.


The Linux kernel developers came up with a decent solution. How can it be that they can do this and the Microsoft developers cannot?


Is Windows written the exact same way as Linux? Never underestimate the amount of technical debt that can be holding a team down.


Ding-ding-ding, this is the non-bs answer.


Is that an excuse though?


It appears not, if such a bad bug can get through all the way to release.


"Linux" had a bug in which you could log into a system by pressing backspace 28 times a few years ago. And by Linux, I meant GRUB[1], and in turn, (many) Linux systems.

We're comparing Linux and Windows, an operating system that contains 3.5 million files[2] (of course, not just the kernel in this case). That isn't really fair. Code is as perfect as humans can make it, and it certainly does not help that there's so much to take into account.

[1] http://hmarco.org/bugs/CVE-2015-8370-Grub2-authentication-by...

[2] https://arstechnica.com/gadgets/2018/03/building-windows-4-m...


This GRUB bug you are talking about is not a kernel problem though. On a side note, I'm going to read the links you provided, as I want to see if encrypted root partitions could also be compromised; I suspect not.


That's not quite on par with this Windows bug, but I take your point.


The Linux kernel developers hate their solution, and they only used it because they can't think of a better one. It causes enormous increases in complexity and kills performance in many cases.

They revived previous work on this as part of the KAISER work in November 2017, and still had major bugs with it in February 2018 (ie, 4 months later). That's pretty similar to the 6 month timeline mentioned here.

https://lwn.net/Articles/738975/

https://arstechnica.com/gadgets/2018/01/whats-behind-the-int...


Linux kernel developer here. I don’t know how MS’s Meltdown solution differs from Linux’s, let alone whether I should hate it.

MS (I think) uses IBRS to help with Spectre, and IBRS is not so great. Retpolines have a more fun name at the very least :)


I think he means Linux kernel developers hate (their own) solution for Meltdown.


The essential element of the solution for Meltdown is the same in every x86-64 OS: unmapping the kernel when in usermode. This is widely hated because it makes kernel entries and exits much slower, and blows away the TLB if your hardware doesn't have PCID support.
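
If you want to see that entry/exit cost for yourself, one crude way is to time a trivial syscall in a tight loop and compare runs with the mitigation enabled and disabled (on Linux, e.g. booting with pti=on vs pti=off). A minimal sketch, assuming Linux and a C compiler; not a rigorous benchmark:

    /* Time a trivial syscall round trip; compare with the mitigation on and off. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);   /* raw syscall: always enters the kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall\n", ns / iters);
        return 0;
    }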


Yes, this.

It sucks, but what else can one do?


The Linux developers had a head start in the form of the KAISER (later KPTI) patch set, development of which had AFAIK started before Meltdown was discovered and reported in private to Intel.


>AFAIK started before Meltdown was discovered

source?


According to https://googleprojectzero.blogspot.com.br/2018/01/reading-pr... Spectre was initially reported to Intel on 2017-06-01, and Meltdown a bit later.

After a quick web search, I found https://patchwork.kernel.org/patch/9712001/ which records the initial submission of the KAISER patch set at 2017-05-04. The repository at https://github.com/IAIK/KAISER has an older version of the patch set dated 2017-02-24, indicating that work on it had started even earlier.

Finally, the timeline at https://plus.google.com/+jwildeboer/posts/jj6a9JUaovP mentions a presentation from the authors of the patch set at the 33C3 in late 2016. Note that this page puts the submission of the KAISER patch set at 2017-06-24, but I believe that to be wrong; searching the web for "[RFC] x86_64: KAISER - do not map kernel in user mode" finds several mail archives with that message, and they all agree that the date was in May, not June.

That is, even if Microsoft had been immediately warned by Intel (or by Google), the Linux kernel developers would still have had a few extra months of head start, by basing their work on the KAISER patch set. Was it luck, or a side effect of the Linux kernel being used for academic research?


They said discovered AND reported, not just discovered. It is entirely possible someone discovered it much earlier and didn't report, but we won't ever know if evidence is never found.


KAISER was developed to mitigate another, less severe vulnerability (a KASLR bypass).

From the meltdown paper:

> We show that the KAISER defense mechanism for KASLR [8] has the important (but inadvertent) side effect of impeding Meltdown. We stress that KAISER must be deployed immediately to prevent large-scale exploitation of this severe information leakage.


Because Linux is a completely different OS with completely different code and a completely different set of problems.


That's entirely weak.


I think the WINE people are probably looking for your help, for some reason they still seem to think there's a few differences between the two. They'll be happy to know they've been wasting their time.


[flagged]


[flagged]


We ban accounts that attack other users like this. Please stop.

https://news.ycombinator.com/newsguidelines.html


This is not appropriate discourse here.


Another relevant factor I just thought of: the Windows kernel has more constraints, due to binary-only drivers which have to keep working. The Linux kernel could fix any incompatible driver at the same time, since they're all in the same git tree (out-of-tree drivers are not expected to be compatible with newer kernels).


I agree. The philosophy is different. Linux is focused on having the right thing working, (sometimes) at the cost of compatibility. Windows is (or at least was) focused on extreme compatibility, and the actual features of the operating system seem to be slapped onto the features of the previous version of the OS.

This seemed to work well for the Windows audience in the past, and also for the Linux audience, because the two have different uses and audiences.

People seem to have segregated into users who just want stuff to work and users who want a powerful operating system that lets them do whatever they want.

At least that was the case until Windows 10 came along...


> only it takes 9

What if we outsourced the QA to India?


that's a really bad analogy... Note: I think that those devs should be up for death row /s


A single person with a fixed 9-month biological timeframe is the analogy you chose for 6 months of a billion-dollar company's software development time, with potentially hundreds of developers (for better or worse), and importantly for an extremely critical class of bugs, and therefore that's "just how it is"?

Come on, software is hard, but when you fix a vulnerability and expose a far, far worse one, and you had months to plan, execute, and test it, then criticism is most certainly justified.

It's not like we're saying the code is shoddy and needs work, which is entirely excusable in a short timeframe. It's that they've left users far worse off in the end than where they started.


If MS had allocated 1000 devs to fix this issue quickly, the result would have been an utter disaster.


Of course 1000+ developers working on one single solution waterfall style in a short timeframe is a terrible idea. That's not how software works... and we all know that. You know that.

Jumping on the next worst thing does not excuse them either. Nor is taking another analogy to the other extreme helpful at all in this discussion.

A solid pool of talent with complete flexibility resource-wise and a strong critical-level mandate is nothing like a single person with a fixed biological timeframe, with relatively limited resources, no matter which way you'd like to spin it.


If they had allocated 1k devs across n teams to develop different approaches and to review and test each other's code and approaches, the result would've been a better patch, and probably not that piece of hot garbage.


I suggest reading The Mythical Man-Month sometime; you're not accounting for the complexity of running such an "n-teams" scheme.


I have read it, and yes, I have accounted for that. Assuming they had that many developers qualified to work on the problem, they'd likely already have been employed on other projects, so the management infrastructure would already be in place. The state would need to change, but yeah, the government would already be there and qualified.


My own personal dogma is that your CI/CD system hasn't achieved its goal until everyone on the team can spool up a given build of the code and try to reproduce an error for themselves without interrupting anyone else to do it.

The person who discovers the bug may not come up with the best repro case. The person best equipped to fix the bug may not be the best person to track it down. Being able to spool up new people on a problem cheaply keeps the whole experience lower stress and generally improves your consistency with regards to success.

If the cost of someone trying a crazy theory is linear in man-hours and O(1) or even O(log n) in wall clock hours, you're going to look like a bunch of professionals instead of a bunch of children with pointy sticks.

From what I understand, Microsoft has never gotten there. They got too big to fail a long time ago. And certainly wouldn't have for Windows 7.


Not only that but the teams would be working independently by design. 9 people can't make a human in one month, but 9 people can make 9 children in 9 months. You can then choose amongst them. So, yeah, I have no idea why you're bringing in mythical man month stuff here.


Anything sufficiently complex can be broken down into simpler pieces. This includes most developer generalists.


Anything sufficiently complex can be broken down into simpler pieces plus the glue holding those pieces together.

In human organizations, that glue itself gets incredibly complex and expensive, as number of pieces grow.


I disagree that glue is expensive and complex. When you build a plywood tower in school to see whose holds up to the most compression, you don't douse your entire structure in glue. You get points off, because it adds so much weight!

People are the plywood: fragile, finicky, and useless if left to their own devices. Management is the middle school kid who needs to take the wood he's been given and make something that will hold up to all the weight that'll be put on top of it. In order to do this, he's been given a hot glue gun and enough glue to mummify the entire thing if he so chooses. Most of the kids will rush bullheadedly (or should I say uncaringly) into gluing the sticks together into something that "looks like it should work." They use too much glue, the structure isn't optimized for load handling, and when the day of truth comes, it crumbles when the bucket that's supposed to hold the weight destroys it!

What is glue? Whatever management wants it to be. It can be a team leader or a hastily configured IRC channel. In my experience (this includes organizing, delegating, and making sure that 40 devs-et-al get what's needed done), if you choose your sticks right, taking the time to make sure they're not hiding any structural faults, you can make the job 65% easier. If you lament that choosing sticks is difficult, I reply with "it's just practice."

The main issue I've seen, has been the all too common "there are no good managers." Especially in technology. The remedies for this? There's no bandaid. Each manager has to realize his personal shortcomings and fix them. But, to throw up his hands and say "the more people working on a project, the slower it'll get done," is a nice way to say "I can't handle all these people, but I'll excuse that away by saying it's inevitable. It's even industry 'common sense!'"


All but a very small number of those teams would have spent quite a while reading manuals, reading code, and learning how the kernel entry code and pagetable handling code worked. Then they'd come up with something, but there would be a severe shortage of reviewers.

Not to mention that the whole problem would most likely leak once that many people knew about it.


There probably aren't 1000 good VM engineers out there in the world.


Yet it didn't take 1000 of them to fix it on other OSes. Why are we even debating 1000 devs anyway? That is hardly the point and throwing more and more bodies at a programming problem is hardly ever a solution, nor one I proposed in my original comment.

It's ultimately a matter of talent, resources, and proper management. Which is hardly an insurmountable problem for a major tech company with decades of experience solving world-is-ending bugs.


I put it down to lack of openness to a wider review than just Microsoft engineers.


https://en.wikipedia.org/wiki/Brooks%27s_law

The nine-month analogy is widely known in software development.


Software is all about capturing as many income streams as possible with as few people as possible.


While also generating the greatest number of jobs possible.


What kind of jobs? Architects developing biotecture, or janitors cleaning up vomit and firefighters putting out dog shit that's burning?


Are you sure that that particular team inside Microsoft had full six months?

Intel had 6 months.


I'm pretty sure they had around six months, give or take a day or so of "oh shit" at Intel. Of course, Intel may have actually simply broken the glass on a dusty old plan of action, "In the event of ..."


Disaster plans are funny things.

I have had the misfortune of having to pull them out twice in my career - in both cases they offered little in the way of guidance for the particular situation that came up.

The set of unknown unknowns that are typically missed makes most of them useless in all but the most narrow of cases, because many companies write them and then forget them. Especially if they are as large as Intel.


Very true, although I'm glad to say I have not had to break out one of my own yet for real. My first experience of a full on DR test was pretty humbling - NetWare servers backed up by the Unix troops via Legato. It turned out that the backups were good but restored at a pathetically slow speed (no reflection on the Unix systems but I suspect the Novell TSAs were a bit shag at the time). We updated "time to restore" estimations and moved on, after adding one or two other results of lessons learned.

Do test your plans (this is not aimed at you personally zer00eyz - you probably know better than most).

There are a lot of unknowns but the basic model of a real DR plan is pretty sound these days, if you can afford it or wing it in some way. An example:

Another site, a suitable distance away. On that site there is enough infra to run the basics - wifi, a few ethernet ports, telephony etc. There should also be enough hypervisor and storage capacity for that. Some backups are delivered there as well as on site. Hypervisor replicas are created from the backups (or directly), depending on RPO requirements and bandwidth available. The only thing that should be able to routinely access the backup files is the backup system (certainly not "Domain Admins" or other such nonsense). Ensure that what is written is verified.

Now test it 8)

.... regularly


Ok now I have to share a story...

The company in question had a rather large on-site server room (raised floor, fire suppression) and a massive generator to deal with any power issues, as well as redundant connectivity. This room was literally the backup in case their "real" data center went offline.

The problem is that the room was "convenient" so there were plenty of things that lived ONLY there (mistake one) -

When the substation for the office went and the generator started, everything looked fine. The problem was that no one had ever run the generator for that long... after a few hours it simply crapped out (overheated, problem two).

A quick trip to Home Depot got them generators and extension cords that let them get the few critical boxes back up - however, one box decided to not only fault, but to take its data with it.

This is when I got a rather frantic call "did I still have the code from the project I did?" - they offered to cut me a check for $2000 if I would go home right then and simply LOOK for it.

Lucky for them I had it - and the continuity portion of the DR plan got revisited.

In hindsight, after I said I had the code, I probably could have asked them to put another zero on the end of the check and they would have done it, just to be a functioning business come 6am.


I didn't even have to show some leg to get you to recount the dit.

Thank you - I'm happy to listen to (nearly) everything.

"I probably could have asked them to put another zero" - ahem that's not the IT Consultant's Way exactly. We have far more polite ways of extracting loot. We are not lawyers and should have morals.


Reminds me of that The Expanse quote:

"I have a file with 900 pages of analysis and contingency plans for war with Mars, including fourteen different scenarios about what to do if they develop an unexpected new technology. My file for what to do if an advanced alien species comes calling is three pages long, and it begins with 'Step 1: Find God'."


Microsoft had 2-3 months, tops.


Source, proof of assertion?


As far as I know he's right. The news was given first to Amazon and Microsoft sometime in August. Allowing one month for testing and preparing the release, that gives three months to build a solution for all supported operating systems. Two months to do it for the most recent version and one month for backporting to the older ones sounds about right. Maybe a few weeks more, but that's it.


Intel's own press releases. I'm not gonna go digging into old links just because some random dude on the Internet can't use Google.


If the problem is complex enough, six months may be a short deadline.


Especially since it comes as a surprise, and there’s already an existing train on its way to the next station with its own timetable.


that surprise could include "management says we don't have to do anything :/" <5.99 months pass>, management: "we have to patch this and it needs to be done yesterday."


That would be extreme, but I can entirely imagine it taking many weeks for the true importance of this problem to correctly propagate across all management levels.


I wonder how long the actual devs fixing it had? From what I hear from friends who work there, Microsoft is a sprawling bureaucracy with many layers of management, where decisions are far from quick. I'd imagine that after Intel/whoever let Microsoft know about the exploits, it went through many levels of prioritization, negotiation about which team would work on it, not being brought into sprints because of other features already being worked on, etc. Most likely there were people with minimal knowledge of the relevant tech making all these prioritization decisions.

Wouldn't shock me at all if there was very little actual dev work done for the first few months, and then it was all super rushed at the end. Quite possibly the devs with the required knowledge didn't even know this was in the pipeline for months. That's par for the course at every decently large company I've worked at (i.e. 100+ devs), and at a beast like Microsoft I imagine it'd be way worse.


I remember when Microsoft was able to deliver critical fixes practically overnight. That assumes that once you see the problem, the fix is pretty straightforward.

Unfortunately, Spectre and Meltdown aren't straightforward and go to the very heart of how the OS works. It's not at all easy to fix this when you have an enormous amount of software working on top of it, depending on every little quirk your solution provides.


Yea, it is probably the biggest change to the Windows kernel in a security update.


This is what happens when you don't have a QA department.


This is something that you find through code review, not testing. Apart from regression testing, but that presupposes that you encountered the issue before.


Are you implying that Microsoft doesn't do QA?


If they do, whatever issues they're occupied with finding would call for an exorcism.


These kinds of things should be part of an automated test suite. Specifically, the kind of tests that were written years ago.

Honestly, Microsoft is really big into automated testing. I'm surprised this slipped through.
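
For what it's worth, the core regression this particular bug calls for is easy to state as a test: from user mode, reading an address in the kernel half of the address space must fault. A POSIX-flavored sketch of that check (the address below is just an arbitrary kernel-space address for illustration, not any specific structure):

    /* Assert that user mode cannot read kernel memory: the access below
     * should raise SIGSEGV, which the test treats as a pass. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <setjmp.h>
    #include <stdint.h>

    static sigjmp_buf fault_jmp;

    static void on_segv(int sig) {
        (void)sig;
        siglongjmp(fault_jmp, 1);          /* the read faulted, as it should */
    }

    int main(void) {
        signal(SIGSEGV, on_segv);

        volatile uint8_t *kernel_addr = (volatile uint8_t *)0xffff800000000000ULL;

        if (sigsetjmp(fault_jmp, 1) == 0) {
            uint8_t value = *kernel_addr;  /* must not succeed */
            printf("FAIL: read kernel memory (0x%02x)\n", value);
            return EXIT_FAILURE;
        }

        printf("PASS: kernel address not readable from user mode\n");
        return EXIT_SUCCESS;
    }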


I don't think there are any OS kernels that practice test-driven development - most of them don't even have code coverage working. It's also very hard to test for a problem you haven't thought of yet.



