Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.
The Unix people have thought of these things. You can easily do them with command line tools.
> Operational Supportability: How do you monitor the operation?
Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
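A minimal sketch of both ideas, with made-up names (urls.txt for the input list, crawl/ for the output directory, progress.log for the log):

    # progress = files fetched so far vs. URLs in the input list
    echo "$(find crawl/ -type f | wc -l) of $(wc -l < urls.txt) pages fetched"

    # or have each xargs worker append a timestamped status line per URL
    xargs -n 1 -P 8 sh -c 'wget -q -P crawl "$1"; s=$?; echo "$(date -u) $s $1" >> progress.log' _ < urls.txt

tail -f progress.log then gives you a live view, and grepping it for non-zero statuses gives you a failure count.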
> Restart Recovery: Do you have the ability to restart the operation midway through if something fails?
wget has command-line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are completed entries in the destination directory.
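For instance (a sketch; urls.txt and crawl/ are made-up names, and the second variant assumes the input was processed roughly in order):

    # -nc ("no clobber") skips any URL whose output file already exists,
    # so re-running the same command after a failure just resumes
    xargs -n 1 -P 8 wget -nc -q -P crawl < urls.txt

    # or skip as many input lines as there are completed files on disk
    done=$(find crawl/ -type f | wc -l)
    tail -n +"$((done + 1))" urls.txt | xargs -n 1 -P 8 wget -q -P crawl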
> Maintainability: Can we run the same application on our desktop as on our production servers?
I'm not sure how this is supposed to be an argument against using the standard utilities that are on everybody's machine already.
> Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
Again, what? Extensibility is squarely in the wheelhouse of the approach you're complaining about.
Unix tools are composable. Functional languages (e.g. Clojure) are all about composability. While bash might be a reasonable glue language, I wonder why Clojure wouldn't be — and it could probably be as compact, if not terser.
The problem with the Hadoop approach is that the overhead of parallelizing across multiple hosts is serious, and this task fits neatly on one machine. A few GBs of data can and should be processed on one node; Hadoop is for terabytes.
I love Unix, but it's just a local minimum in the design space.
For example, its typical text-processing pipelines are hard to branch. I have hacked up some solutions, but never found them very elegant. I would love to hear some solutions to this. I ended up switching to Prismatic's Graph for Clojure.
The problem: you have a file, and you want to do one thing with the lines matching a REGEX and another thing with the lines that don't match.
How do you do that without iterating over the file twice? You can use a while loop, of course, but that defeats the point of using the shell.
I would love a two-way grep that writes matching lines to stdout and non-matching lines to stderr. I wonder if the grep maintainers would accept a new "--two-way" option.
Write to more than one fifo from awk. If you're composing a dag rather than a pipeline, fifos are one way to go.
Personally though, I'd output to temporary files. The extra cost in disk usage and lack of pipelining is made up for by the easier debugging, and most shell pipelines aren't so slow that they need that level of optimization.
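Either way it's a single awk pass (a sketch; REGEX and the file names are placeholders, and the /dev/stderr variant assumes gawk or a system that exposes that device):

    # matches to one file, everything else to another, in one pass
    awk '/REGEX/ { print > "matched.txt"; next } { print > "unmatched.txt" }' input.txt

    # or the "two-way grep" itself: matches to stdout, non-matches to stderr
    awk '/REGEX/ { print; next } { print > "/dev/stderr" }' input.txt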
wget isn't the only part of the puzzle you may need Restart Recovery for - the CPU-bound map/reduce portion may also need to recover from partial progress. Unix tools aren't well-designed for that.
> Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
> wget has command-line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are completed entries in the destination directory.
Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
This is one of those tasks that seems like you could glue together wget and some scripts and call it a day but you would ultimately discover the reasons why nobody does this in practice. At least not for anything but one-off crawl jobs.
Thought of another possible issue:
If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
> Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
This is kind of the premise of this discussion. You don't use Hadoop to process 2GB of data, but you don't build Googlebot using bash and wget. There is a scale past which it makes sense to use the Big Data toolbox. The point is that most people never get there. Your crawler is never going to be Googlebot.
> Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
It really depends what you're trying to do here. If the reason you're restarting the crawler is that, say, your internet connection flapped while it was running, or some server was temporarily giving spurious HTTP errors, then you want the failed URLs to be retried. If you're only restarting the crawler because you had to pause it momentarily and you want to carry on from where you left off, then you can easily record what the last URL you tried was and strip all of the previous ones from the list before restarting.
But I think what you're really running into is that we ended up talking about wget, and wget isn't really designed in the Unix tradition. The recursive mode in particular doesn't compose well. It should be at least two separate programs, one that fetches via HTTP and one that parses HTML. Then you can see the easy solution to that class of problems: when you fetch a URL, you write the URL and the retrieval status to a file, which you can parse later to do the things you're referring to.
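A sketch of that idea (all file names invented):

    # record each URL together with wget's exit status...
    while read -r url; do
        wget -q -P crawl "$url"
        echo "$? $url"
    done < urls.txt > status.log

    # ...then feed only the failures back in on the next run
    awk '$1 != 0 { print $2 }' status.log > retry.txt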
> If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
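A rough, bash-specific sketch of that (four workers; the fifo names, the hash, and the one-second politeness delay are all arbitrary):

    for i in 0 1 2 3; do
        mkfifo "q$i"
        # each worker fetches its URLs serially, with a delay between requests
        ( while read -r url; do wget -q -P crawl "$url"; sleep 1; done < "q$i" ) &
    done

    exec 3>q0 4>q1 5>q2 6>q3            # hold each fifo open for writing
    while read -r url; do
        host=${url#*://}; host=${host%%/*}
        fd=$(( $(printf '%s' "$host" | cksum | cut -d ' ' -f1) % 4 + 3 ))
        printf '%s\n' "$url" >&"$fd"    # same host always goes to the same worker
    done < urls.txt
    exec 3>&- 4>&- 5>&- 6>&-            # close the fifos so the workers see EOF
    wait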
> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain and split the list when the new list's length is >= n URLs and the domain on the current line differs from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
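In awk the splitting step could look something like this (a sketch; 1000 is an arbitrary chunk size and chunk.* are made-up names):

    # group URLs by host, then start a new chunk once the current one holds
    # >= 1000 URLs *and* the host changes, so no host spans two chunks
    sort -t/ -k3,3 urls.txt |
    awk -F/ 'BEGIN { chunk = 0 }
             count >= 1000 && $3 != prev { chunk++; count = 0 }
             { print > ("chunk." chunk); count++; prev = $3 }'

    for f in chunk.*; do wget --wait=1 -i "$f" & done; wait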
It looks like an easily solvable problem however you go about it.
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so that they can be checked at some point before being passed on to wget again.
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by doing grep -R 404 . (or whatever the actual unique error string is) and deleting any file containing the error message. (You'd be careful not to run that recursive delete on any unexpected data.)
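Something like this, with the usual caveat: 'Not Found' below stands in for whatever unique string the site's error page actually contains, so inspect one by hand first.

    grep -rl 'Not Found' crawl/                    # eyeball the list first
    grep -rlZ 'Not Found' crawl/ | xargs -0 rm --  # then delete those files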
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the header response info to disk (e.g. "FILENAME.header-info") you could whip something up to do what you are suggesting though.
Yeah, wget can store the response info to disk. Besides, even if it didn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
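It takes an option or two; a sketch assuming GNU wget (check your version for these flags):

    # --save-headers prepends the HTTP response headers to each saved file,
    # --content-on-error keeps error pages instead of discarding them,
    # and -o writes wget's own log (including status lines) to a file
    wget --save-headers --content-on-error -o crawl.log -P crawl -i urls.txt

    grep -rl '^HTTP/1\.[01] 404' crawl/   # files whose saved headers record a 404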
Here comes some bubble-bursting: I've led a team that built data processing tools exactly like this, and the performance and ease of manipulating vast amounts of text using classic shell tools is hard to beat. We had no problems with any of: operational supportability, restart recovery, or maintainability. Highly testable, even. No, it's not just cowboy-coded crappy shell scripts and pipelines. Sure, there's a discipline to building pipelined tooling well, just as with any other kind of software. Your problems seem to stem from a lack of disciplined software engineering rather than from the tools, or maybe just from an environment that encouraged high technical debt.
The kicker? We were using pipeline-based tooling ... running on a Hadoop cluster. Honestly, I'm a bit surprised to see such an apparent mindshare split (judging by some recent HN posts) between performant single-system approaches and techniques used in-cluster. The point that "be sure your data is really, truly big data" is obviously well made, and still bears repetition. Yet the logical follow-on is that these techniques are even more applicable to cluster usage. Why would anyone throw away multiple orders of magnitude of performance by going to a cluster-based approach?
Unix/POSIX backgrounds are pretty common among the Hacker News crowd. Not so in "Enterprise" development. (Beam me up Scottie, there's no intelligent life here, only risk avoidance)
Enterprise development is dominated by two or three trusted platforms: Windows (/.NET) and the JVM. POSIX systems are only useful insofar as they are a cheaper (or sometimes more reliable) place to host Java virtual machines. Enterprise dev groups generally have very limited exposure to, and a lot of fear of, things like the Bourne shell, AWK, Perl, and Python. These languages don't have Visual Studio or Eclipse to hold your hand while you make far-reaching refactorings like renaming a variable.
Sure, you and I would crawl log/data files trivially with a few piped commands, but that's a rare skill in most shops, at least since the turn of the century.
Ugh, that sounds cliche, but it's hard not to feel that way after being drowned in "Java or nothing" for so long at work.
I agree with @roboprog. Most software shops employ engineers who have no exposure to UNIX tools. Only a few hardcore engineers have exposure to, or interest in learning, UNIX tools. For the majority of engineers it is just a job. They simply use the same tool for everything, and they tend to use the tools that seem to get them into well-paying jobs. If Hadoop can get them good-paying jobs, they will lean on Hadoop for something in their current job, even if that job can be performed by a set of CLI utils. I have seen hundreds of résumé-builder projects in my past experience.
Does anybody without a unix/POSIX background even bother tinkering with PowerShell? Yes, it's a cool idea, as well, especially if your source data is in MS-Office, but I've not seen it put to much use.
The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example, but it's very common for shell scripts to have horrible error handling and checks for important things. The shell is just not suitable for extremely robust programs. I would bet that 80%+ of people who think they're good at shell scripting... aren't.
> The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example
The Steam bug is an example of utter incompetence, not of someone not being very, very good at it. Whoever is happy with shipping `rm -rf $VAR/` without extreme checking around it should get their computer driving license revoked.
> The shell is just not suitable for extremely robust programs.
Incorrect. "The shell" can go as robust as you can handle. In bash, `set -e` will kill your script if any of the sub-commands fail (although ideally you'll be testing $? (exit code of prev. op) at the critical junctions), `set -u` will error on usage of undefined variables, etc.
A huge part of the "glue" that holds your favourite linux distro together is bash files.
> I would bet that 80%+ of people who think they're good at shell scripting... aren't.
The same probably goes for driving[1], but this doesn't make cars any less robust.
> The same probably goes for driving[1], but this doesn't make cars any less robust.
I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure. They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's actually a good example of the point developer1 was making: cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
> I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure.
Maybe US cars :P
> They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's a box weighing 1-2 tons that travels at 100 km/h+. Millions (billions?) of km are driven every year. There will be accidents for both good drivers and bad. This won't change.
> cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
That's simply untrue - both points. Highly competent drivers will have accidents. I highly doubt you feel extreme danger when you get behind the wheel/in a car. The way you phrase it, one expects millions of fatalities daily.
Part of the journey into Linuxdom is learning a healthy dose of fear for that command. I always pause over the enter key for a few seconds, even when I'm sure I haven't typo'ed.
The sorts of bugs people experience with Java mostly result in a crashed/stalled/hung process. Bash bugs erase your entire file system. The thing about Bash is that it is trivially easy to make these sorts of mistakes; the language just isn't suitable for general-purpose scripting.
It shouldn't be able to erase your filesystem unless you are running as root or doing something equally stupid. That's pretty much common sense stuff for anyone that isn't a beginner.
Yeah the "common sense stuff for anyone that isn't a beginner" argument is repeated ad nauseam, and even the largest companies make this mistake in their largest products. Take Valve - they should know how to write good code, right? And yet, last week an article was on top of HN, outlining how they put:
"rm -rf '$STEAMROOT'/*" in their code, used to remove the library. But hey, no one checked if $STEAMROOT is not empty, so when it was for one user, Steam deleted all of his personal files, whole /home and /media directories next time it started.
I'm not saying that command line tools shouldn't be used, but sometimes they are just too powerful for some users, and stupid mistakes like this happen.
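For what it's worth, a single defensive parameter expansion would have caught it (a sketch, not Valve's actual code):

    # ${VAR:?} makes the shell abort with an error if STEAMROOT is unset or
    # empty, so the glob can never expand to /*
    rm -rf -- "${STEAMROOT:?STEAMROOT is unset or empty}"/*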
You're right to an extent, but this isn't relevant to the Java vs Bash discussion. The largest companies make this kind of mistake in whatever language they happen to use.
People delete data and screw things up in MapReduce jobs for Hadoop. A lot.
If you're worried about that, don't give the script permissions to access your entire filesystem. Easily handled with separate users, cgroups, assorted containerisation, and more.
> The shell is just not suitable for extremely robust programs.
Absolute statements like this are usually wrong, and this one is no exception. When Linux distros' init is mostly bash scripting, there is very little need to further prove that robust systems can be written in shell script without the language fighting the developer.
Wait, is it really a good argument for the shell-based approach when all major distros are switching to systemd due to the configuration/maintainability/boilerplate issues with bash init scripts?
I'm not going into the systemd vs. sysvinit discussion. For my argument, it is enough to recognize that bash-based sysvinit has been with us for roughly 20 years with no stability problems.
I think most of what was written also applies to any normal programming language. You could write this in Python, Ruby, JavaScript, Java or C# without any problems. The code would probably be easier to read, too. The only special thing is the web-page scraping, which could be done by a library, but the same thinking about scalability, and about using a single computer instead of a Hadoop cluster, still holds even if you're reading from file systems or databases.
Something to keep in mind is that while a single app might be best served by a single machine piping data, multiple apps working on the same data set probably wouldn't scale. Hadoop, for all its faults, does provide a nice, relatively simple programming platform to support multiple data processes.
>Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
Yeah, and then they produce some over-engineered monstrosity, late, over budget and barely able to run...
I look at this article as a criticism of Hadoop as the wrong tool for small data sets.
This starts to become a question of data locality and size. 1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35 TB of data, the answer starts to change.
The fact that shell commands were used makes for an easy demo that might be hard to support, but if a solution were written using a traditional language with threading or IPC instead of relying on hadoop you should always be faster, since you don't incur the latency costs of the network.
> That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35 TB of data, the answer starts to change.
Not at all, because data is being streamed. It could just as easily be 35TB and only use a few MB of RAM.
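For example, each stage of a pipeline like this holds only a line or two (plus a small counting table) in memory at a time, so the same command works on 2 GB or 35 TB of input (file names are placeholders):

    cat data/*.log | grep ERROR | awk '{ count[$1]++ } END { for (k in count) print count[k], k }'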
The I/O bandwidth of the system will be the limit when pushing 35 TB through a single system, even if it is streamed. You'll need more than one disk, and more than one network card, to do this in a timely fashion.
1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system.
It depends on what you do with the data. If you are processing the data in 512 KB chunks and each chunk takes a day to process (because of expensive computation), you probably do want to spread the work over some cluster.
I don't think of Hadoop as being built for high-complexity computation, but for high I/O throughput.
When you describe this kind of setup, I imagine things that involve proof by exhaustion. For example, prime-number search is something with a small input and a large calculation time. However, these solutions don't really benefit from Hadoop, since you don't really need the data-management facilities, and a simpler MPI solution could handle this better.
Search indexing could fit this description (URL -> results), but generally you want the additional network cards for throughput, and the disks to store the results. Then again, the aggregate space on disk starts looking closer to TB instead of GB. Plus, in the end you need to do something with all those pages.
I think the article said that you don't need to use Hadoop for everything and that it might be much faster to just use command line tools on a single computer. Of course you might find a use case where the total computing time is massive and in that case a cluster is better. I still don't think many use cases have that problem.
We are doing some simple statistics at work for much smaller data sizes and the computing time is usually around 10-100 ms so it could probably compute small batches at almost network speed.
Definitely. I was reacting to my parent poster, because size does not say everything. 1 TB can be small, 1 GB can be big; it depends on the amount of computation time needed for whatever processing of the data you do.
I hate developers who over engineer everything and then when it's time to perform some of that support and extensibility, they leave because maintenance is beneath them.
They put this behemoth together with a thousand moving parts and then walk away from it.
And I can't stand developers who over-engineer things. We have a couple of them at my company, and something that should take a few hours always takes several weeks, just because of all the reasons you mention. Most things don't need that kind of feature set and maintainability, and if they do in the future we can just rewrite them from scratch. The overall expected return on investment is still better, since we seldom need to.
Because in all too many companies, re-writing from scratch is a no-go, no matter how quickly and sloppily an initial solution was thrown together. I've worked on a prototype => production type of project, where the throwaway was never thrown away. (The initial team made some mistakes; chief among them was building one prototype of the whole system rather than one per major risk.)
This is a systemic problem. Engineering is always subordinate to business. This simply should not be the case. We desperately need new business organization models.
Quite the opposite, and quite simple: engineers over-engineer things in order to make them generic, and generic makes solutions robust. That's basic science. Unless the problem and solution are well understood, your investment won't guarantee a return at all.
Generic, by default, does not in any way make things more robust.
We've gone from engineering solutions to meet specific problems to engineering solution frameworks that (supposedly) will solve the problem and allow for any unknowns. The problem is, no matter how hard the engineer tries, he can never anticipate the unknowns to the extent that the application framework can support all of them.
We should go back to solving the specific problem at hand. In both scenarios you get the customer who wants a feature that absolutely doesn't fit with the current application, so a rewrite is necessary. And with the specific solution, you don't have nearly as many wasted man-hours.
No, developers over-engineer because setting up a 20-node Hadoop cluster is fun, whereas doing the same task in an hour in Excel means you have to move onto some other boring task.
Generic doesn't mean robust either. I don't know where you got that from; the two concepts are entirely unrelated.
Generic -> robust. I... I don't know how to explain that. Honestly, I haven't thought about the necessity of explaining things like this. It's... basic mathematics.
I'm sorry, but if you cannot explain it, you simply do not understand it yourself.
That's harsh, I get it, and I'm truly sorry, but that's a basic fact.
No. Look to safety-critical software for intuition on why.
Simpler is more reliable. Also, it's hard to know enough about a problem to make a generic solution until you've solved the problem 2-3 times already. But ... having solved a problem multiple times increases the risk that you will be biased towards seeing new problems as some instance of the old problem and therefore applying unsuitable "generic" solutions.
While it drives some point home, the chart sidesteps the question of robustness (a written script will run the same way twice, whereas human error, especially on routine tasks, may hit one hard) and of documentation (writing even a lightly commented script to do yearly maintenance is guaranteed to help your future self remember things one year from now).
That chart assumes 24-hour days. The reality of (my?) productivity is that I have perhaps six productive hours in a day. If I can save eight productive hours per month, that's sixteen days a year, not four.
What about the failed pages? How about shoving those onto a queue and retrying n times with an exponential backoff in between? What about the total number of failed pages? What about failed pages by site? Etc., etc.
But so what -- the principle is still sound. All I described is still a 100-line Python script, written in an afternoon, instead of three weeks of work bringing up EMR, installing and configuring Nutch, figuring out network issues around EMR nodes talking to the commodity internet, installing a persistent queue, performing remote debugging, building a task DAG in either code or (God help you) Oozie XML, and on and on.
Anybody can throw some crap together and make it stick. And it's a perfectly valid solution.
My issue is when there is criticism laid against those solutions which are actually engineered in a way that allows for supportability and extensibility. They are arguably far more important than execution time.
I think, in many people's minds, extensibility == pain; either lots of code configuration (hello java, ejb), or xml (hello hadoop, java, spring, ejb), or tons of code (hello java, c++), etc. When nice languages don't make things painful, it sometimes feels like it's wrong, or not really enough work, or in some other way insufficient. But people can mistake the rituals of programming for getting actual work accomplished.
Simple: because the standard utils are programs that do what they're supposed to do. If the problem's bounds are well within the domain of a standard util, then it's all good. Supportability and extensibility are way too generic for you to draw a line saying the standard utils can handle them all. After all, they are programs, not programming languages.
There are command line tools available to help the transition from 'hack' one-liner to a more maintainable/supportable solution. For instance, Drake (https://github.com/Factual/drake), a 'make for data' that does dependency checking, would allow for sensible restarts of the pipeline.
The O'Reilly Data Science at the Command Line book (linked elsewhere in the comments) has a good deal to say on the subject: turning one-liners into extensible shell scripts, using Drake, using GNU Parallel.
I've been using GNU Parallel for orchestrating remote scripts/tools on a bunch of machines in my compute and storage cluster. It's now my go-to tool for almost any remote ssh task that needs to hit a bunch of machines at once.
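For example (a sketch; hosts.txt is assumed to hold one user@host per line):

    # --trc = transfer the input file, return the named result, clean up after:
    # each .gz is shipped to some host, counted there, and the .out comes back
    parallel --sshloginfile hosts.txt --trc {.}.out 'zcat {} | wc -l > {.}.out' ::: logs/*.gz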
- There are things like pv(1) which allow you to monitor pipes (see the sketch after this list). Things like systemd open other interesting possibilities for implementing, grouping and monitoring your processes.
- Recovery could be implemented by keeping a logfile of completed steps, such as a list of completely processed files, or by moving processed files elsewhere in the file system (this could even be done in memory using ramfs or tmpfs). Of course, whether that's feasible depends on the case.
- Extensibility: Scripts and configurations can be done in shell syntax. Hook systems and frameworks of varying complexity exist. I agree that doing extensibility in shell code is going to turn out to be hazardous when done without a proper concept and understanding of the tools at hand.
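The pv bit from the first point is a one-liner, for instance (pages.tar.gz is a placeholder):

    # pv sits in the middle of the pipe and prints bytes, throughput and ETA
    pv pages.tar.gz | tar -xz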
I fully agree with all the operational/restart/features comments. However, I've often been surprised at how a little thought/research can build all these requirements on top of off-the-shelf components. I also agree that it is likely that one will eventually outgrow wget, but, for example, one may run out of business/pivot before that.
We don't really "come up" with the xargs/wget approach. The approach is already there, waiting to be utilized by someone who understands the tools. The "cool kids" don't like(or are not able) to understand the tools.
The author (I think) is trying to point out that these problems are already solved, decades ago, with existing UNIX tools.
I've implemented this. It isn't too bad up to a certain point. You have to be a bit careful about your filesystem/layout of files; lots of filesystems don't particularly like it when you have a few hundred million files in one directory.