Invent More, Toil Less (2016) [pdf] (usenix.org)
128 points by DyslexicAtheist on March 16, 2019 | 41 comments



The Google SRE book has an excellent description of toil and, unexpectedly, discusses that some amount of toil is beneficial.

In short, the authors of that section claim that it's not really possible for the SREs at Google to spend all their time solving novel problems through automation.

This makes sense: constantly solving new problems is hard. It takes lots of time and mental energy, and the outcome is inherently uncertain.

Google found that some amount of toil (roughly defined as repetitive tasks that are not particularly challenging and do not solve long term problems) is essential for the health of their engineers. Toil is boring yes, but can be relaxing, and as work that is inherently easier to accomplish, can help keep confidence that working on solving unknown problems can deplete.

I would have expected that Google would have absolutely minimal toil, given that they are leaders in the automation space, but if they've found that some amount of easier work is necessary, then it's probably true for anyone.


> Toil is boring yes, but can be relaxing, and as work that is inherently easier to accomplish, can help keep confidence that working on solving unknown problems can deplete.

I think it is important for all roles to have a cadence that mixes easy tasks in with the challenging ones so that every day doesn't seem like a drag.

For example, I prefer to start my day by knocking out a couple of easy bug fixes before diving into some challenging development.


I'm exactly the same way. If I can get a few things checked off my todo list early, I'm nearly guaranteed to at least feel like I had a good day.


That seems like a good strategy, but maybe we should also work towards a mindset of enjoying the process of working on hard problems, even when the work seems fruitless. Whenever I feel I'm spinning my wheels, I remind myself that's how I learned what I know, and that my best breakthroughs usually come when I think it's a dead end.


As a user of Google Cloud Platform who has suffered through many, many issues (nothing that takes down our systems, mind you, just things you would expect to be handled better, especially considering all the reverence for the company), I can say with some confidence that they do not handle automation to that extent. Still love GCP so far though, especially compared with my limited experience with AWS.


> toil (roughly defined as repetitive tasks that are not particularly challenging ...)

I guess an example of this could be annotating deep learning datasets ...


Maybe, but that actually provides at least a little lasting value.

The ultimate example of toil, which I'm sure Google has automated away at this point, is truncating old log files that would eventually take up all remaining disk space. Simply truncating the file does not solve the inherent problem, but buys you more time before you'll face the exact same situation again.
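To make the log example concrete, here is a minimal Python sketch of that chore. The function name `truncate_large_logs`, the `.log` suffix, and the size threshold are all assumptions made for illustration, not anything described in the SRE book or the paper.

```python
import os


def truncate_large_logs(log_dir, max_bytes):
    """Empty every *.log file in log_dir that is larger than max_bytes.

    Returns the names of the files that were truncated. This only buys
    time: whatever writes the logs will fill the disk again later.
    """
    truncated = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not (name.endswith(".log") and os.path.isfile(path)):
            continue  # skip directories and non-log files
        if os.path.getsize(path) > max_bytes:
            os.truncate(path, 0)  # cut the file down to zero bytes in place
            truncated.append(name)
    return truncated
```

Running something like this by hand every time a disk fills up is the toil; putting it on a schedule removes the manual step but still leaves the underlying cause (nothing rotates or bounds the logs) untouched.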


This is a good example of the "capability trap" dynamic[0][1]: there is always short-term pressure that crowds out long-term capability building. The longer you neglect the capability-building, the worse your capability gets, the higher the pressure.

The only way out is to acknowledge that you will take the hit in a "worse before better" phase.

[0] http://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.p...

[1] https://www.systemdynamics.org/assets/conferences/2017/proce...


Could you go more into detail about the 'taking a hit in the worse before better' way of getting out of it? My department is drowning in endless toil (from the parent article) right now, and I'd like to figure out some way to get out of it besides just leaving the company like so many others have (although I might end up doing that too, but maybe I can course correct the department just a little bit before doing so).


> Could you go more into detail about the 'taking a hit in the worse before better' way of getting out of it?

The case studies in the first paper are worth reading, but essentially you make it clear to everyone that you are going to take a hit on your primary production output in order to restore capability:

> Policy analysis showed that escaping the capability trap necessarily meant performance would deteriorate before it could improve: While continuing to repair breakdowns, the organization has to invest additional resources in planned maintenance, training and part quality, raising costs. Most importantly, increasing planned maintenance reduces uptime in the short run because operable equipment must be taken off-line for the planned maintenance to be done. Only later, as the Reinvestment loop begins to work in the virtuous direction, does the breakdown rate drop.

The reason that it's called the capability trap is that nobody wants to accept things getting worse. Everyone wants the improvements to be (1) free and (2) monotonic. But once you get stuck in the trap it's (1) expensive and (2) initially backwards. You slow primary production to make improvements and all the skeptics can see is that the numbers are worse, and didn't you promise to improve it? But the way out means going backwards on the primary metrics before you can free up capacity to improve capability.

And it doesn't even need skeptics to be hard. Even with all the good intentions in the world, it's very hard to avoid the temptation to prioritise your primary output over everything else. Sure, we should improve our CI/CD ... once we ship this feature. Sure, we should automate our disaster recovery ... but put out this fire first. Yes, we should create systems to manage the fleet better ... but we can't let you refuse anyone's requests, we have a business to run. Yes, of course we need reserve capacity to deal with uncertainty, but don't you dare run anything less than maximum utilisation ... and also don't you dare say we're at maximum utilisation when I give you more to do.
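The "worse before better" shape can be shown with a toy simulation. This is only a sketch, not the model from the papers: the function `simulate`, the starting capability of 0.5, and the decay and gain constants are all made-up assumptions, chosen just to illustrate the dynamic.

```python
def simulate(invest, steps=50):
    """Return per-step output for a team spending `invest` of its time on capability.

    All constants are hypothetical. Capability runs from 0 (everything
    manual) to 1 (toil fully automated away).
    """
    capability = 0.5
    outputs = []
    for _ in range(steps):
        # Lower capability means more of each period is lost to toil.
        toil = 0.6 * (1.0 - capability)
        # Whatever is left after toil and investment is primary output.
        outputs.append(max(0.0, 1.0 - toil - invest))
        # Capability decays when neglected and grows with investment.
        capability = min(1.0, capability * 0.98 + 0.2 * invest)
    return outputs


baseline = simulate(invest=0.0)  # never invest: output slowly erodes
reinvest = simulate(invest=0.2)  # invest 20% of time in capability
```

In this toy model the investing team starts out behind the non-investing one, because a fifth of its time is diverted, and only pulls ahead once capability has grown enough to offset that cost. The early stretch where `reinvest` sits below `baseline` is the period in which skeptics see only worse numbers.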


Dedicate some people to automate the toil away.

Review past decisions that led to the adoption of systems that create a lot of toil. Avoid that in the future. Change those systems or the incentives that led to them.

There's no secret or quick way; people have to make hard decisions.


The J curve


Love it. "Work with enduring value leaves a service permanently better, whereas toil is “running fast to stay in the same place.” Therefore, as a service grows, unchecked toil can quickly spiral to fill 100% of everyone’s time."


It depends on the point of view of management. In a lot of companies, managers don’t care about tech debt much and see “toil” as the rightful majority mode of work that engineers ought to focus on. They may simply lack the capacity to understand what an investment in paying down tech debt to free engineer time could buy them, or they may view that process as too risky, or they just may not care.

There are a lot of managerial reasons why this happens. It’s one of the biggest lines of questioning I pursue in job interviews, to try to understand what point of view people have elsewhere in the food chain about how (or if) engineers add value to the business.


There are two sides to this:

On one hand, good management should realize that automating repetitive tasks and paying down technical debt are how you add to a tech company's capital stock, and are the only reason why tech company valuations tend to go parabolic. If you're just trading money for labor and labor for customers' money, you have a consulting service business. These can be profitable, but you don't build an enduring asset this way, and the valuations that business owners usually think about when they decide to start a tech company are those that come from owning a monopoly asset that you can sell to multiple customers for virtually no additional cost.

On the other hand, the part that engineers are usually blind to is the business context that the company operates in. Tech moves fast. It's not uncommon for basically all of a company's founding assumptions to be obsolete 3 years later. New tech platforms are available; customers want different things; a new competitor has just demoed a killer feature. All of these have the possibility to render large swaths of a product's feature set obsolete. There's no sense investing engineering time in a codebase that's about to be killed anyway; just rack up the technical debt and declare bankruptcy. Worse, management often can't tell engineering about many of these external realities without killing the product: would you continue working for a company that said "Our competitive position is untenable. Pump out as many features before anyone external notices so we can sell the company" or "We need you to keep the lights on for our existing customers while this secret division over there builds the next generation of the product"?

It's often good to ask questions about the company's overall strategy and competitive position (and to do research on this yourself) before joining. While the owners usually won't tell you everything, it'll build trust if they can tell you some things. They should be able to think of these issues in terms of trade-offs: "What's technical debt?" is the wrong answer, but so is "We care deeply about technical debt and give our engineers as much time as they want for code cleanup" (the latter is a bullshit answer, designed to make the hire). Similarly, it builds trust if engineers can also recognize the business tradeoffs and accept that sometimes the right answer is to hack something together and ship it so the customers can get value while management figures out the next strategic move.


I think (and it depends on the corporate culture and industry) that the value of culling toil, i.e. using tech as a force multiplier for tech problems, is both beyond their grasp and a threat to the status quo.

Much like the type of fuel injection used in a combustion engine is irrelevant to most people that use their car.

If you are managing a hauling company and a brand-new type of fuel allows you to save 20%, you can see the value, but you also know that it means updating your fleet and infrastructure, with all the costs and hassle associated with that. So while you are keen on making those savings, you probably will not roll out a fleet update ASAP.

I think this is how management might see those situations in companies that are not pure IT players. Yes, there is some value in reducing the toil, but the perceived costs are too high for them to be fully on board, and I guess it's because their perspective is too rooted in the physical world (i.e., my example).

Not that there are no costs when automating toil management in IT, it just doesn't map 1 to 1 to the real world equivalents.


Shouldn't that be something like "innovation is about improving a process permanently, whereas toil is continuous, laborious work only for /status quo/ to be preserved"?


[flagged]


Could whoever's running this bot please fill out the about section in its profile?


If you’re bored with your work it’s a sign that you should be automating something, or else someone else will do it for you and you’ll be out of a job.


There are a limited (and contested) number of non-boring jobs. Eventually someone has to take out the garbage and clean the toilets.


Speaking of that, I wonder when there will be robots (i.e., with good software) that will help take out the garbage, clean the toilets, and possibly fix plumbing?


Lol, plumbing especially will never be automated


Plumbing is automation. Without plumbing, human labor is used to haul water from wells and cart away human waste.


I think there's some confusion here between plumbing as in the trade and plumbing as in the infrastructure.


everything is an automation of some more complex system on top of it


it might be in large future apartment buildings where the necessary robotics are built in during construction


You’ll have closed black box plumbing systems which are meant to fix themselves, but in fact constantly fail. Then you’ll have a cadre of rogue plumbers, who will fix the problems illegally and off the books. These plumbers will all look like Robert De Niro:

https://en.m.wikipedia.org/wiki/Brazil_(1985_film)


I think we'll eliminate pipes (e.g. in traditional homes) before that happens. We might as well be living in the fully sci-fi speculated future where nanobots control all physical structure and we have near-infinite power sources.


That would be interesting, I wonder if there would be nanobot-like "pipes" (or some other framework) built into the physical structure of homes?


Never is a pretty strong word.

You're asserting that robots will never reach human level performance for everyday tasks?


You inaccurately generalized my comment. And, plumbing is not an everyday task.


> You inaccurately generalized my comment.

I thought I might, but was hoping you would respond with a more correct generalization.


This is solvable with management and hiring a full time house keeper that is empowered to call a plumber.


That's one way. Though, I wonder about the groups of people who own houses and might want to fix plumbing more independently?


My workplace has lots of parties counting on the "process" being in place.


"parties" as in stakeholders, or "parties" as in joyful celebrations?


former :)


As someone doing both dev and ops I automate and refactor away any issues (some small yet important issues might require large rewrites in the software). But I also like to carry out manual tasks for users as it gives them a sense of service and I get to talk to users.


Is there any tool to count toil?


We consider working on alerts to be toil and measure it in amixr.io


jira?



