How is team-member-1 doing? (about.gitlab.com)
255 points by holman on March 17, 2017 | 79 comments



I knew a kid once. He was a 'junior operator' in a computer room .. you know, in the good ol' days, where the computers lived. (Before they escaped and attached themselves to your wrists.)

He thought he was smart. And sometimes, he was.

One day, he overheard a team leader talk to his programmers about the newly-minted database, sitting there in front of them on the table, on a brand new .. amazing .. 640Meg hard drive.

This database had consumed the disk. It had cost the company a cool million dollars to create. It was vital that we backed it up.

So, the new 640 Meg disk was on its way, onto which we'd back the database up. The first thing we'll do, the leader said, is copy the database, sector for sector.

"And only then, will we re-index the database!", he claimed. "Until then, the indexes will remain un-sorted!"

Well, the kid overheard all of this, but only heard "the indexes will remain un-sorted!".

Later that night, this kid thought he'd prove himself.

He re-indexed the database.

He didn't tell anyone.

The next day, a not-so-junior programmer came in, saw the database disk attached to the operator machine, and thought that the backup had been done. For reasons we shall not explain, he disconnected the disk from the operator machine.

The index had not been done.

The database was gone.

The new disk arrived, but nobody could mount the old database disk. Much panic ensued!

Operator logs were consulted. The computer room security cam tapes were spooled.

Oh shit!

Epilogue: I made a lot of money from those kids, writing a tool to recover a corrupted database, whose power had been removed mid re-indexing ..


> He was a 'junior operator' in a computer room .. you know, in the good ol' days, where the computers lived. (Before they escaped and attached themselves to your wrists.)

i hope one day i can tell a story like you. that intro is a work of art. :)


Thanks. :)


Hey, old timer! I used to be a junior operator in the computer room. I moved up from tape operator to print operator to console operator. Finally, I was the lead console operator during the Y2K rollover. What great memories! I didn't make any enormous mistakes like the one in your excellent lore, but I will say it was because I was never brave enough in that computer room to make a move that mattered. It got easier to be brave when the computers shrank, though. Thanks for the story!


Computers shrank, but those indexes .. they keep ticking!


Ah, the days before write-ahead logging was invented.


you write like a professional writer. I mean that in the best of ways


Ah, shucks, thanks! I'm totally not though.


What I think is particularly noteworthy is not that Gitlab recognized that anybody could have made that mistake, but rather how supportive Gitlab was about the whole thing.

When you make a big mistake, it is easy to place yourself in a mindset where you feel like a disaster even though everyone is accepting. I call it the "disappointing your parents" mindset, because it can feel a lot like people are just being supportive because they love you, and what you did was indeed inexcusable to a certain degree.

The feeling is made somewhat worse when you are an employee, because your livelihood and your future are dependent on how other people perceive you. To that point, I'm really impressed that Gitlab addressed the fact that this employee was still being promoted, and that the mistake hadn't affected that. In my mind that is at least as important as all of the rah-rah stuff.


They really had no choice. If they were jerks about it, they would have made an already completely ridiculous scenario ten times worse. Spinning this as some lighthearted commentary, spammed every week since then, is their PR move.


That's true, and I do find the semi-celebratory tone a bit self-congratulatory. That's why I think team-member-1 still being promoted is the most important factor at play.


It wasn't our intention to come across as self-congratulatory, but I can see how it might come across as such. We take this incident really seriously, and our production engineers especially are working very hard to improve our infrastructure.


I actually think the openness is very important as well. Many companies(/C levels/managers/employees/clients/etc) are still stuck in the "heads must roll" mindset, where a mistake is punished rather than used as a learning experience for the employee and for the company.

I think at the very least it'll turn the heads of a few employees that are living in fear of being fired for a typo by showing them there are still decent employers left in the field.


An ordinary company would not disclose so many details about the outage. GitLab's openness required them to do so, and they had no choice but to post these "lighthearted commentaries" about the people involved. The amazing thing about the event is their level of openness.


I really have to just start rolling my eyes at this point. I'm just waiting for the official meme, and the cycle will be complete. Get to work making your infrastructure resilient to a simple accidental deletion, and restoring some faith in your product.


>Get to work making your infrastructure resilient to a simple accidental deletion

I'm sure they've been working on that since the deletion


But where are the stories on this work? What improvements have been made?

Detailed posts on that are how you begin to restore confidence.

No one is just going to take their word that "stuff is in place now".


This is mostly spread across different issues in our issue tracker (https://gitlab.com/gitlab-com/infrastructure). I suspect we'll write up a blog post once all the moving parts are in place, have been tested/used for a while, etc.


There is a list of issues in https://about.gitlab.com/2017/02/10/postmortem-of-database-o... -- also in https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCx..., see Recovery, 3, l.

I think it's great that they are being completely transparent about this.

That said, it's true that it's been almost two months and it seems that some important issues there are still open and don't look especially active.


The follow-up was pretty extensive and we'll be working on it for months to come. Some issues that have already been addressed:

1. Update PS1 across all hosts to more clearly differentiate between hosts and environments (see the sketch after this list) https://gitlab.com/gitlab-com/infrastructure/issues/1094

2. Set PostgreSQL's max_connections to a sane value https://gitlab.com/gitlab-com/infrastructure/issues/1096

3. Move staging to the ARM environment https://gitlab.com/gitlab-com/infrastructure/issues/1100

4. Improve PostgreSQL replication documentation/runbooks https://gitlab.com/gitlab-com/infrastructure/issues/1103

5. Build Streaming Database Backup https://gitlab.com/gitlab-com/infrastructure/issues/1152

6. Assign an owner for data durability https://gitlab.com/gitlab-com/infrastructure/issues/1163
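For readers unfamiliar with item 1, here is a minimal sketch of the idea, assuming bash login shells; the hostname pattern, labels and colours below are invented for illustration and are not GitLab's actual configuration.

  # Hypothetical /etc/profile.d/prompt.sh -- everything here is illustrative only.
  if [[ "$(hostname)" == *prod* ]]; then
    # Loud red prompt on anything that looks like a production host
    PS1='\[\e[1;41m\][PRODUCTION]\[\e[0m\] \u@\h:\w\$ '
  else
    # Calmer green prompt everywhere else (staging, dev, ...)
    PS1='\[\e[1;32m\][non-prod]\[\e[0m\] \u@\h:\w\$ '
  fi

The point is simply that a shell on a production box should look unmistakably different before a destructive command is ever typed.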


Their reaction has been very unprofessional, to say the least. I wouldn't trust them with my data, especially when you have GitHub right in front of them.


My impression is the opposite of yours - I've moved personal projects from GitHub since their incident.


I respect that, but why? Not the incident itself, but the architecture they picked points to them having no idea what they are doing.


Their incident response gives me confidence that this sort of thing is much less likely to happen in the future.

It also brought to my attention how much they've progressed as a platform since I used them last.


>> Their incident response gives me confidence that this sort of thing is much less likely to happen in the future

That doesn't sound logical to me. The chances of such incidents happening aren't related to how they announce the incident.


Yes, it totally is.

A company culture of openly admitting mistakes, knowing that your team-mates will not play the blame game, means that problems will be reported quickly.

In a "conventional" culture, an engineer who makes a mistake (and who will be fired for making that mistake) is incentivised to cover up the mistake and hope that the blame falls elsewhere.

In an open culture, the engineer who makes a mistake is incentivised to immediately raise the alarm.

So while the chances of a mistake happening are the same (we're all human), the chances of it being caught quickly and dealt with quickly are better in an open culture.


>> So while the chances of a mistake happening are the same (we're all human)

...and that's the point I was making in my comment


I think you're missing his meaning. It should be:

>while the chances of a mistake happening are the same [at transparent companies like gitlab and at opaque companies like github]

>the chances of it being dealt with quickly are better in an open culture [like gitlab compared to a closed culture like github]


They lost 3h of customer data; that's not open culture. It's just gross incompetence.


To be honest, I have now more faith in their product after the incident :D


I can't say if you are serious (given the smiley), but if you are indeed serious, then it's really odd that something like this incident would increase your faith in the product.

Gitlab.com is a repository hosting product, i.e. it's in the data storage business. Of course there are other important aspects to the product, like the web interface and such, but they are primarily into data storage. They literally lost data of their users [1] (some of them might have been paid customers too), not because someone accidentally deleted the data (which happens and is understandable), but because they did not have the basic things that you expect of data storage products functional or tested.

They have been offering gitlab.com as a SaaS product for 5 years [2], and yet the most basic things, like backups, weren't tested or functional. If that's what increases faith in a data storage company, then I don't know what to make of it.

[1] https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...

[2] https://about.gitlab.com/about/


You're a better man than I! Backups aren't something that goes in the "oops" bucket for me.


Frankly, backups are something I do myself; it's folly to rely on a single provider to host both the original and the backup. That's why I don't use services that don't let you export your data somehow - at least, not for important stuff.


It's funny, I started reading this thinking "Ah, how nice they all are" quite sincerely (especially the gifts from Google and Codefresh - though I suppose you could interpret those a bit cynically if you put your mind to it...), and by the time I got to the t-shirts, I was pretty much in alignment with you.


Several years ago I was a junior dev at a 3PL (3rd Party Logistics) company troubleshooting an issue with some overnight data imports.

I had it all loaded up on the test environment, then I'd delete the import, make some changes, and re-run it. Wash, rinse, repeat, trying to track down the issue. As I'm sure you can guess, at one point I executed my script in the wrong window and deleted last night's import from Production.

I immediately told my boss, who was very understanding with a "everyone does this kind of thing at some point" kind of shrug and went over to our DBA's office to ask him to re-load the last snapshot. But it turned out the snapshots had been broken for two weeks and no one had noticed.

And it wasn't a simple issue of re-running the import. After all the orders had been imported, humans had manually assigned orders to trucks and dispatched them in the wee hours of the morning. Now there was no way to know what packages were on what trucks.

I think the DBA ended up buying ApexSQL Log out of pocket to roll back the deletion with the transaction log.

The result was that for several hours the delivery drivers for a national office supply company in a certain state were completely unable to use their handhelds or access their truck's inventory. That was my team-member-1 moment.


What were the consequences to you, the DBA, and the organization?


Nothing really happened in the end. I'd only worked there for about six months but my boss liked me and thought I did good work. He chalked it up as a learning experience (which it certainly was).

Our DBA was someone who'd been with the company 15 years. I think most of the blame fell on him, but he didn't lose his job or anything like that. I think the snapshot issue was pretty trivial to fix. The only real problem was that there were no notifications going to anyone when the snapshots failed.

The customer was pissed, and it was a new customer at that. They basically lost a day of deliveries out of it. But they kind of had vendor lock-in with us (the office supply company they were delivering for liked us and had basically told them to use us because we already knew how to process their data the way they wanted). Switching wouldn't have been trivial, but we nevertheless fell over ourselves to keep them happy for a few months until it blew over.


I'm interested to know as well.

I imagine that getting the snapshots working properly would be quite an important factor, otherwise the company would always be a typing error away from chaos.


Working snapshots/backups are a systems engineer's number 1 concern, always. They mitigate issues with every other system, of any scope. It's the first thing a new-hire sysadmin/eng should ask about coming through the door.


Why did you write an abbreviation if you just write what it stands for right after?


"Bob Hoover, a famous test pilot and frequent performer at air shows, was returning to his home in Los Angeles from an air show in San Diego. As described in the magazine Flight Operations, at three hundred feet in the air, both engines suddenly stopped. By deft maneuvering he managed to land the plane, but it was badly damaged although nobody was hurt. Hoover’s first act after the emergency landing was to inspect the airplane’s fuel. Just as he suspected, the World War II propeller plane he had been flying had been fueled with jet fuel rather than gasoline. Upon returning to the airport, he asked to see the mechanic who had serviced his airplane. The young man was sick with the agony of his mistake. Tears streamed down his face as Hoover approached. He had just caused the loss of a very expensive plane and could have caused the loss of three lives as well. You can imagine Hoover’s anger. One could anticipate the tongue-lashing that this proud and precise pilot would unleash for that carelessness. But Hoover didn’t scold the mechanic; he didn’t even criticize him. Instead, he put his big arm around the man’s shoulder and said, “To show you I’m sure that you’ll never do this again, I want you to service my F-51 tomorrow.”


I'm old enough to have made some serious mistakes on the job over the years. Thankfully, the worst ones were when I was much younger, but I know that I could still make a worse one before I retire.

Here's what impresses me about Gitlab: They not only say they're committed to honesty and transparency, they actually practice it.

It's easy to see this as some cynical PR move, but to me it's refreshing that they have addressed specifically what happened to the employee who made the error. It makes me believe that they are working very hard on fixing their practices to ensure this kind of failure won't happen again, and I trust they will share (as @syste said in this thread) what those changes are once they have it sorted out.

"Oh, how naive you are!" some may say, to which I respond, "Oh, how cynical and inexperienced you are!" Human failure is inevitable. Designing systems (whether in code or in management practices) that tolerate this inevitable failure is very difficult.

This sort of event can be the catalyst for tearing out what didn't work and creating a much stronger foundation for the future, but only if blame is set aside and honesty is allowed to prevail in the "after action" analysis. Call it PR if you like, but I see a healthy desire to deal with what actually happened and fix it rather than falling into the trap of pointlessly assigning blame.

Consider that it took congressional hearings and someone with the chutzpah of Richard Feynman for NASA to own up to the shuttle explosion. A far worse event, with far worse consequences, but the aftermath of those events and NASA's complete unwillingness to hold itself accountable and deal with reality cost it a lot of credibility.

Good on you, Gitlab.


They could support their team members by not continuing to make a circus out of them


The first part of the post was interesting and, I guess, funny, but creating and selling a T-shirt about the accident and stuff? IMO this would have been a fun joke inside the company, but to outsiders, eh. I don't want to sound grumpy though :)


Unless I'm misreading it, the t-shirt is internal only, and only for the team that specifically handled the issue.


The shirt is internal only, we're not selling it to the public.


Oohh sorry, I actually just skipped over that section going "what the... are they doing a tshirt for it now too?!" :/


For now. Your next blog post will be "Due to popular demand, GitLab Team Member 1 Shirts For Sale".


We won't do this because it would seem like we're not taking this incident seriously and are trying to monetize an outage that affected our users. We're also not giving them away.


I generally like the openness, but, Gitlab marketing team, if you're listening: stop spamming social media content on your blog, it seems cheesy and lazy.

A few tweets or comments are more than enough to prove your point.

They did something similar with the storage post, which was full of Hackernews opinions.

So, this, or at least post my comment on your blog, too :D


Looks like "spamming" completely lost its meaning.


Or at least it’s going back to its original meaning from Monty Python.


When I had my first IT job in ~1991, I caused millions of pounds' worth of loss to my employer, a well-known retailer in the UK, due to a bug I introduced.

My boss covered my arse. I love that man, and I've never made a serious mistake since, as it's made me risk averse.

Gitlab did the right thing here by owning the situation and making it public.


Now you'll just have to share that story or else I won't get any sleep tonight. Please.


I only posted this a few weeks ago (check my comment history) but here goes again:

I was a programmer in my first IT job in 1992 for a large retailer in the UK. I was working on some stock-related code for the branches, of which they had thousands. They sold a lot of local goods like books which were only sold in a couple of stores each - think autobiographies of local politicians, local charity calendars, that sort of thing. The problem with a lot of these items was that they were not in the central database. This caused a problem with books especially, as you don't pay VAT on books, but if you can't identify the book then the company had to pay it. This makes sense because some books or magazines you DID pay VAT on, because they came with other stuff - think computer magazines with a CD on the front. So my code looked at different databases and historical info to work out the actual VAT portion payable, which was usually nil.

I wrote the code (COBOL, kill me now), the testers tested it, and all went OK until they deployed, on a Friday night. The first I knew of it was when I came in Monday morning. All the ops had been working throughout the weekend, as the entire stock status for each branch had been wiped. They had to pull a previous week's backup from storage; this didn't work as they didn't have the space for both copies to merge, so IBM had to motorcycle-courier some hardware from Amsterdam, etc. etc. As this was an IBM mainframe with batch jobs, we also had to stop subsequent jobs in case it made the fuckup worse, so none of the stock/finance stuff could run at all.

The branches were royally fucked on Monday as, without any stock status to know what to order, they got nothing - no newspapers, books, anything. We even made it into the Daily Mail. I think it took at least 3 weeks before ordering was automatic again. It cost the company literally millions in overtime, lost sales, consultants and reputational damage - it was big news in the national newspapers.

The root cause? I processed data in a run per branch. I'd copy the branch data to a separate area, delete the main data, then stream it back. My SQL, however, deleted the main data for ALL branches. It didn't get picked up in QA as, like me, they only tested with a single branch's dataset at a time.


Wow.... Very interesting, thanks for sharing.


I'm curious to hear the story too. Thank you!


I replied to the other child post


The Gitlab PR team is certainly doing a much better job than their engineering team did.

I actually appreciate their attitude towards errors by employees.

Unfortunately, the appearance this spectacle creates is that the same sort of attitude should apply to them as a company, i.e. "Don't fire Gitlab! You've just invested 200MB of data into their education".

It's a very smart method to protect not just team-member-1, but also employee-1.


I was actually about to make a comment to this effect. They took a legitimate disaster and handled it perfectly; as the old saying goes: "No publicity is bad publicity." They've taken the time over the past few weeks to constantly release relevant blog posts, which is good for two reasons: 1.) they reassure customers that they're taking steps to prevent the problem in the future, and 2.) they're capitalizing on natural curiosity to boost brand awareness. (Although maybe a bit too much, according to the minor grumbling in this comment section. ;) )

I've been a bit wishy-washy on GitLab for a while, but honestly, I'm thinking I might give them a shot sometime soon.


His page says: "Database (removal) Specialist at GitLab"

Love it.


I agree with the mentality that this is a team effort, and when it fails, a team failure. And that when something goes wrong what's important is to understand it, and put in place a way for it to not happen again. Kudos to GitLab for their forward thinking way of working.


So who's team-member-1 for the Amazon S3 outage a few weeks back? I'm sure they feel the same way, and we'd love to send them gifts to support them just like the community supported GitLab.


Yet another gitlab post on how open and transparent they are and how they are being praised for that. It's now looking more and more like gitlab is known for being a transparent company rather than being recognized for their product or technical competency.

Like I said in another post a while back, it's fine being transparent, but gitlab has just taken this to an extreme. It's important to be private about certain details and just get real work done and be known for that.


It's the same crap that Buffer pulled.


Perfectly happy with Github, but seriously, can I hire you guys as a PR agency? :)


+1. Hats off to GitLab for being awesome!


Pedantic note: I'm sure under 1. Technical Skills, the "I think this is out of the question here" should read "I think there's no question about this" or something along those lines.


Yep, I think we won't update the blog post because we wanted to post it verbatim.


Probably could've designed a good backup and restore strategy with the time that was invested in this piece. A combo of full backups with append-only storage of changes going a certain amount of time into the past. Worked for me for a long, long time despite my many screwups. Even when I lost all my stuff to a triple storage failure, I still recovered a tiny bit stored on my cheap, write-once solution: DVD-Rs. There was some bit rot, but better than bit loss. I imagine their solution would be better done with a filesystem or backup software.
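For what it's worth, here is a minimal sketch of the full-plus-append-only-increments idea using GNU tar's incremental mode; the paths, naming and lack of rotation or verification are my own simplifications for illustration, not a complete setup.

  #!/bin/bash
  # Illustrative only: take a full backup when no snapshot state exists,
  # otherwise write an append-only increment containing just the files
  # changed since the last run.
  set -euo pipefail

  SRC="$HOME/data"
  DEST="/mnt/backup"
  SNAP="$DEST/snapshot.snar"              # GNU tar's incremental state file
  STAMP="$(date +%Y%m%d-%H%M%S)"

  if [ ! -f "$SNAP" ]; then
    tar --listed-incremental="$SNAP" -czf "$DEST/full-$STAMP.tar.gz" "$SRC"
  else
    tar --listed-incremental="$SNAP" -czf "$DEST/incr-$STAMP.tar.gz" "$SRC"
  fi

Restoring means extracting the full archive and then each increment in order; because the increments are never rewritten in place, a later screwup can't silently destroy earlier history.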

Note: It was neat that much of the community was supportive. I see the article as really a thank you to them.


> Probably could've designed a good backup and restore strategy with the time that was invested in this piece

You do realize that it probably wasn't the engineering team writing this blog post, right?


I thought the engineering team was the reason they wrote it.


Instead of DVD-Rs, I now use write-protected USB flash drives (Netac U335 or similar) with their write-protect switches melted with a soldering iron. I know this doesn't protect against hardware failure, but most data loss actually gets caused by user actions or software bugs. Store several of them at different locations to protect from other threats.


We are considering using S4 (http://www.supersimplestorageservice.com/), it's probably the best place to store your important data.


No, too expensive and proprietary. I can't understand why people pay for these tools when you can build the same thing using open source technologies.

Here's a script I use at home for this:

  #!/bin/bash
  tar -cf - ~/Documents/ > /dev/null


Back in the dark ages (1980s or so), a professor at my uni managed to get a speaking slot for his presentation on "write-only memory". Good times.


Ahhh, absolutely adore alliteration.


That's a good idea. I haven't looked into them in a long time. Thanks for the reminder.


been there. crashed a client's website on a friday by updating the wrong plugins for a Joomla site.

given a sufficiently complex dependency chain for presented problems, anybody can be a 'team-member-1'.

mistakes happen at all levels.



