
Happens to all of us. Once I needed logs from the server. The log file was a few gigs and still in use, so I carefully duplicated it, grepped just the lines I needed into another file, and downloaded the smaller file.

During this operation, the server ran out of memory (presumably because of all the files I'd created), and before I knew it I'd managed to crash three services and corrupt the database, which was also on this host, on my first day. All while everyone else in the company was asleep :)

Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.



Seems unwise to have an employee doing anything with production servers on their first day, let alone while everyone else is asleep.


It does, but that was an exceptional role. The company needed emergency patches to a running product while they hired a whole engineering team, so I was the only one around doing things, and there wasn't any documentation for me to work from.

I actually waited until nightfall, just in case I bumped the server offline, because we had low traffic during those hours.


What's the story behind this company/job? Was it some sort of total dumpster fire?


I wouldn't classify it as that, but they'd had trouble in the past which led to a lot of their team leaving, and were now looking to recover from it.

I was only there for a short time though. Hopefully they figured things out.


Why did the DB get corrupted? Does ACID mean anything these days?


Not the original poster, but up until 2010 the default MySQL storage engine was MyISAM, which doesn't support transactions.


When a server runs out of memory, a lot of strange things can happen.

It can even fail in the middle of a transaction commit.

So transactions won't fix this.


No. That is exactly what a transactional DB is designed to prevent. The journal gets appended with both the old and the new data and physically written to disk, and only then is the primary data representation (data and B-tree blocks) updated in memory; eventually that changed data is written to the DB files on disk. If the app or DB crashes during any stage, it reconstructs the primary data from the journalled, committed changes. DBs shouldn't attempt to allocate memory during the critical phase, and should be able to recover from a failed allocation at any point by simply crashing and letting regular start-up recovery clean up. One problem on Linux, though, is memory overcommitting.

Edit: and another problem is disk drives/controller caches lying, reporting write completion before all the data has actually reached stable storage.
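
In toy form, the ordering is roughly this (a minimal Python sketch of the journal-first idea, not how any real engine lays out its files; every file name and function here is made up for illustration):

  import json, os

  DB_FILE = "data.json"   # stand-in for the primary data representation
  WAL_FILE = "data.wal"   # journal: always written and fsync'd first

  def commit(update):
      # 1. Append the change to the journal and force it to stable storage.
      with open(WAL_FILE, "a") as wal:
          wal.write(json.dumps(update) + "\n")
          wal.flush()
          os.fsync(wal.fileno())   # the change is durable from this point on
      # 2. Only then touch the primary data; a crash anywhere here is recoverable.
      apply_to_db(update)

  def apply_to_db(update):
      db = {}
      if os.path.exists(DB_FILE):
          with open(DB_FILE) as f:
              db = json.load(f)
      db.update(update)
      tmp = DB_FILE + ".tmp"
      with open(tmp, "w") as f:
          json.dump(db, f)
          f.flush()
          os.fsync(f.fileno())
      os.replace(tmp, DB_FILE)     # atomic swap, never a half-written data file

  def recover():
      # On start-up, replay the journal; re-applying an already-applied change
      # is harmless in this toy because each update is idempotent.
      if os.path.exists(WAL_FILE):
          with open(WAL_FILE) as wal:
              for line in wal:
                  apply_to_db(json.loads(line))
          os.remove(WAL_FILE)

Call commit() for each change and recover() on start-up; the only invariant that matters is that the journal record hits disk before the primary data is touched.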


Transactions should fix this. That's what the Write Ahead Log and similar techniques are for.


It was an older MongoDB in my case. :)



