Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.
The Unix people have thought of these things. You can easily do them with command line tools.
> Operational Supportability: How do you monitor the operation?
Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
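A minimal sketch of both ideas, with made-up names (urls.txt for the input list, crawl/ for the output directory, progress.log for the log):

    # progress = files fetched so far vs. URLs in the input list
    echo "$(find crawl/ -type f | wc -l) of $(wc -l < urls.txt) pages fetched"

    # or have each xargs worker append a timestamped status line per URL
    xargs -n 1 -P 8 sh -c 'wget -q -P crawl "$1"; s=$?; echo "$(date -u) $s $1" >> progress.log' _ < urls.txt

tail -f progress.log then gives you a live view, and grepping it for non-zero statuses gives you a failure count.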
> Restart Recovery: Do you have the ability to restart the operation midway through if something fails?
wget has command-line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are completed entries in the destination directory.
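For instance (a sketch; urls.txt and crawl/ are made-up names, and the second variant assumes the input was processed roughly in order):

    # -nc ("no clobber") skips any URL whose output file already exists,
    # so re-running the same command after a failure just resumes
    xargs -n 1 -P 8 wget -nc -q -P crawl < urls.txt

    # or skip as many input lines as there are completed files on disk
    done=$(find crawl/ -type f | wc -l)
    tail -n +"$((done + 1))" urls.txt | xargs -n 1 -P 8 wget -q -P crawl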
> Maintainability: Can we run the same application on our desktop as on our production servers?
I'm not sure how this is supposed to be an argument against using the standard utilities that are on everybody's machine already.
> Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
Again, what? Extensibility is squarely in the wheelhouse of the approach you're complaining about.
Unix tools are composable. Functional languages (e.g. Clojure) are all about composability. While bash might be a reasonable glue language, I wonder why Clojure wouldn't be — and it could probably be as compact, if not terser.
The problem with the Hadoop approach is that the overhead of parallelizing across multiple hosts is serious, and this task fits neatly on one machine. A few GBs of data can and should be processed on one node; Hadoop is for terabytes.
I love Unix, but it's just a local minimum in the design space.
For example, its typical text-processing pipelines are hard to branch. I have hacked up some solutions, but never found them very elegant. I would love to hear some solutions to this. I ended up switching to Prismatic's Graph for Clojure.
The problem: you have a file, and you want to do one thing with the lines matching a REGEX and another thing with the lines that don't match.
How do you do that without iterating over the file twice? You can use a while loop, of course, but that defeats the point of using the shell.
I would love a two-way grep that writes matching lines to stdout and non-matching lines to stderr. I wonder if the grep maintainers would accept a new "--two-way" option.
Write to more than one fifo from awk. If you're composing a dag rather than a pipeline, fifos are one way to go.
Personally though, I'd output to temporary files. The extra cost in disk usage and lack of pipelining is made up for by the easier debugging, and most shell pipelines aren't so slow that they need that level of optimization.
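Either way it's a single awk pass (a sketch; REGEX and the file names are placeholders, and the /dev/stderr variant assumes gawk or a system that exposes that device):

    # matches to one file, everything else to another, in one pass
    awk '/REGEX/ { print > "matched.txt"; next } { print > "unmatched.txt" }' input.txt

    # or the "two-way grep" itself: matches to stdout, non-matches to stderr
    awk '/REGEX/ { print; next } { print > "/dev/stderr" }' input.txt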
wget isn't the only part of the puzzle you may need Restart Recovery for - the CPU-bound map/reduce portion may also need to recover from partial progress. Unix tools aren't well-designed for that.
> Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
> wget has command-line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are completed entries in the destination directory.
Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
This is one of those tasks that seems like you could glue together wget and some scripts and call it a day but you would ultimately discover the reasons why nobody does this in practice. At least not for anything but one-off crawl jobs.
Thought of another possible issue:
If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
> Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
This is kind of the premise of this discussion. You don't use Hadoop to process 2GB of data, but you don't build Googlebot using bash and wget. There is a scale past which it makes sense to use the Big Data toolbox. The point is that most people never get there. Your crawler is never going to be Googlebot.
> Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
It really depends what you're trying to do here. If the reason you're restarting the crawler is that, say, your internet connection flapped while it was running, or some server was temporarily giving spurious HTTP errors, then you want the failed URLs to be retried. If you're only restarting the crawler because you had to pause it momentarily and you want to carry on from where you left off, then you can easily record what the last URL you tried was and strip all of the previous ones from the list before restarting.
But I think what you're really running into is that we ended up talking about wget, and wget isn't really designed in the Unix tradition. The recursive mode in particular doesn't compose well. It should be at least two separate programs, one that fetches via HTTP and one that parses HTML. Then you can see the easy solution to that class of problems: when you fetch a URL, you write the URL and the retrieval status to a file, which you can parse later to do the things you're referring to.
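A sketch of that idea (all file names invented):

    # record each URL together with wget's exit status...
    while read -r url; do
        wget -q -P crawl "$url"
        echo "$? $url"
    done < urls.txt > status.log

    # ...then feed only the failures back in on the next run
    awk '$1 != 0 { print $2 }' status.log > retry.txt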
> If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
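A rough, bash-specific sketch of that (four workers; the fifo names, the hash, and the one-second politeness delay are all arbitrary):

    for i in 0 1 2 3; do
        mkfifo "q$i"
        # each worker fetches its URLs serially, with a delay between requests
        ( while read -r url; do wget -q -P crawl "$url"; sleep 1; done < "q$i" ) &
    done

    exec 3>q0 4>q1 5>q2 6>q3            # hold each fifo open for writing
    while read -r url; do
        host=${url#*://}; host=${host%%/*}
        fd=$(( $(printf '%s' "$host" | cksum | cut -d ' ' -f1) % 4 + 3 ))
        printf '%s\n' "$url" >&"$fd"    # same host always goes to the same worker
    done < urls.txt
    exec 3>&- 4>&- 5>&- 6>&-            # close the fifos so the workers see EOF
    wait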
> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain and split the list when the new list's length is >= n URLs and the domain on the current line differs from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
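In awk the splitting step could look something like this (a sketch; 1000 is an arbitrary chunk size and chunk.* are made-up names):

    # group URLs by host, then start a new chunk once the current one holds
    # >= 1000 URLs *and* the host changes, so no host spans two chunks
    sort -t/ -k3,3 urls.txt |
    awk -F/ 'BEGIN { chunk = 0 }
             count >= 1000 && $3 != prev { chunk++; count = 0 }
             { print > ("chunk." chunk); count++; prev = $3 }'

    for f in chunk.*; do wget --wait=1 -i "$f" & done; wait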
It looks like an easily solvable problem however you go about it.
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so that they can be checked at some point before being passed on to wget again.
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by doing grep -R 404 . (or whatever the actual unique error string is) and deleting any file containing the error message. (You'd be careful not to run that recursive delete on any unexpected data.)
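Something like this, with the usual caveat: 'Not Found' below stands in for whatever unique string the site's error page actually contains, so inspect one by hand first.

    grep -rl 'Not Found' crawl/                    # eyeball the list first
    grep -rlZ 'Not Found' crawl/ | xargs -0 rm --  # then delete those files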
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the header response info to disk (e.g. "FILENAME.header-info") you could whip something up to do what you are suggesting though.
Yeah, wget can store the response info to disk. Besides, even if it didn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
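It takes an option or two; a sketch assuming GNU wget (check your version for these flags):

    # --save-headers prepends the HTTP response headers to each saved file,
    # --content-on-error keeps error pages instead of discarding them,
    # and -o writes wget's own log (including status lines) to a file
    wget --save-headers --content-on-error -o crawl.log -P crawl -i urls.txt

    grep -rl '^HTTP/1\.[01] 404' crawl/   # files whose saved headers record a 404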
Here comes some bubble-bursting: I've led a team that built data processing tools exactly like this, and the performance and ease of manipulating vast amounts of text using classic shell tools is hard to beat. We had no problems with any of: operational supportability, restart recovery, or maintainability. Highly testable, even. No, it's not just cowboy-coded crappy shell scripts and pipelines. Sure, there's a discipline to building pipelined tooling well, just as with any other kind of software. Your problems seem to stem from a lack of disciplined software engineering rather than from the tools, or maybe just from an environment that encouraged high technical debt.
The kicker? We were using pipeline-based tooling ... running on a Hadoop cluster. Honestly, I'm a bit surprised to see such an apparent mindshare split (judging by some recent HN posts) between performant single-system approaches and techniques used in-cluster. The point that "be sure your data is really, truly big data" is obviously well made, and still bears repetition. Yet the logical follow-on is that these techniques are even more applicable to cluster usage. Why would anyone throw away multiple orders of magnitude of performance by going to a cluster-based approach?
Unix/POSIX backgrounds are pretty common among the Hacker News crowd. Not so in "Enterprise" development. (Beam me up Scottie, there's no intelligent life here, only risk avoidance)
Enterprise development is dominated by two or three trusted platforms: Windows (/.NET) and the JVM. POSIX systems are only useful insofar as they are a cheaper (or sometimes more reliable) place to host Java virtual machines. Enterprise dev groups generally have very limited exposure to, and a lot of fear of, things like the Bourne shell, AWK, Perl, and Python. These languages don't have Visual Studio or Eclipse to hold your hand while you make far-reaching refactorings like renaming a variable.
Sure, you and I would crawl log/data files trivially with a few piped commands, but that's a rare skill in most shops, at least since the turn of the century.
Ugh, that sounds cliche, but it's hard not to feel that way after being drowned in "Java or nothing" for so long at work.
I agree with @roboprog. Most software shops employ engineers who have no exposure to UNIX tools. Only a few hardcore engineers have exposure to, or interest in learning, UNIX tools. For the majority of engineers it is just a job. They simply use the same tool for everything, and they tend to use the tools that seem to get them into well-paying jobs. If Hadoop can get them good-paying jobs, they will lean on Hadoop for something in their current job, even if that job can be performed by a set of CLI utils. I have seen hundreds of résumé-builder projects in my past experience.
Does anybody without a unix/POSIX background even bother tinkering with PowerShell? Yes, it's a cool idea, as well, especially if your source data is in MS-Office, but I've not seen it put to much use.
The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example, but it's very common for shell scripts to have horrible error handling and checks for important things. The shell is just not suitable for extremely robust programs. I would bet that 80%+ of people who think they're good at shell scripting... aren't.
> The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example
The Steam bug is an example of utter incompetence, not of someone not being very, very good at it. Whoever is happy with shipping `rm -rf $VAR/` without extreme checking around it should get their computer driving license revoked.
> The shell is just not suitable for extremely robust programs.
Incorrect. "The shell" can go as robust as you can handle. In bash, `set -e` will kill your script if any of the sub-commands fail (although ideally you'll be testing $? (exit code of prev. op) at the critical junctions), `set -u` will error on usage of undefined variables, etc.
A huge part of the "glue" that holds your favourite linux distro together is bash files.
> I would bet that 80%+ of people who think they're good at shell scripting... aren't.
The same probably goes for driving[1], but this doesn't make cars any less robust.
> The same probably goes for driving[1], but this doesn't make cars any less robust.
I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure. They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's actually a good example of the point developer1 was making: cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
> I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure.
Maybe US cars :P
> They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's a box weighing 1-2 tons that travels at 100 km/h+. Millions (billions?) of km are driven every year. There will be accidents for both good drivers and bad. This won't change.
> cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
That's simply untrue - both points. Highly competent drivers will have accidents. I highly doubt you feel extreme danger when you get behind the wheel/in a car. The way you phrase it, one expects millions of fatalities daily.
Part of the journey into Linuxdom is learning a healthy dose of fear for that command. I always pause over the enter key for a few seconds, even when I'm sure I haven't typo'ed.
The sorts of bugs people experience with Java mostly result in a crashed/stalled/hung process. Bash bugs erase your entire file system. The thing about Bash is that it is trivially easy to make these sorts of mistakes; the language just isn't suitable for general-purpose scripting.
It shouldn't be able to erase your filesystem unless you are running as root or doing something equally stupid. That's pretty much common sense stuff for anyone that isn't a beginner.
Yeah the "common sense stuff for anyone that isn't a beginner" argument is repeated ad nauseam, and even the largest companies make this mistake in their largest products. Take Valve - they should know how to write good code, right? And yet, last week an article was on top of HN, outlining how they put:
"rm -rf '$STEAMROOT'/*" in their code, used to remove the library. But hey, no one checked if $STEAMROOT is not empty, so when it was for one user, Steam deleted all of his personal files, whole /home and /media directories next time it started.
I'm not saying that command line tools shouldn't be used, but sometimes they are just too powerful for some users, and stupid mistakes like this happen.
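For what it's worth, a single defensive parameter expansion would have caught it (a sketch, not Valve's actual code):

    # ${VAR:?} makes the shell abort with an error if STEAMROOT is unset or
    # empty, so the glob can never expand to /*
    rm -rf -- "${STEAMROOT:?STEAMROOT is unset or empty}"/*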
You're right to an extent, but this isn't relevant to the Java vs Bash discussion. The largest companies make this kind of mistake in whatever language they happen to use.
People delete data and screw things up in MapReduce jobs for Hadoop. A lot.
If you're worried about that, don't give the script permissions to access your entire filesystem. Easily handled with separate users, cgroups, assorted containerisation, and more.
> The shell is just not suitable for extremely robust programs.
Absolute statements like this are usually wrong, and this one is no exception. When Linux distros' init is mostly bash scripting, there is very little need to further prove that robust systems can be written in shell script without the language fighting the developer.
Wait, is it really a good argument for the shell-based approach when all major distros are switching to systemd due to the configuration/maintainability/boilerplate issues with bash init scripts?
I'm not going into the systemd vs. sysvinit discussion. For my argument, it is enough to recognize that bash-based sysvinit has been with us for roughly 20 years with no stability problems.
I think most of what was written also applies to any normal programming language. You could write this in Python, Ruby, JavaScript, Java or C# without any problems. The code would probably be easier to read, too. The only special thing is the web-page scraping, which could be done by a library, but the same thinking about scalability, and about using a single computer instead of a Hadoop cluster, still holds even if you're reading from file systems or databases.
Something to keep in mind is that while a single app might be best served by a single machine piping data, multiple apps working on the same data set probably wouldn't scale. Hadoop, for all its faults, does provide a nice, relatively simple programming platform to support multiple data processes.
>Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
Yeah, and then they produce some over-engineered monstrosity, late, over budget and barely able to run...
I look at this article as a criticism of Hadoop as the wrong tool for small data sets.
This starts to become a question of data locality and size. 1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35 TB of data, the answer starts to change.
The fact that shell commands were used makes for an easy demo that might be hard to support, but if a solution were written using a traditional language with threading or IPC instead of relying on hadoop you should always be faster, since you don't incur the latency costs of the network.
> That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35 TB of data, the answer starts to change.
Not at all, because data is being streamed. It could just as easily be 35TB and only use a few MB of RAM.
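For example, each stage of a pipeline like this holds only a line or two (plus a small counting table) in memory at a time, so the same command works on 2 GB or 35 TB of input (file names are placeholders):

    cat data/*.log | grep ERROR | awk '{ count[$1]++ } END { for (k in count) print count[k], k }'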
The I/O bandwidth of the system will be the limit when pushing 35 TB through a single system, even if it is streamed. You'll need more than one disk, and more than one network card, to do this in a timely fashion.
1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system.
It depends on what you do with the data. If you are processing the data in 512 KB chunks and each chunk takes a day to process (because of expensive computation), you probably do want to spread the work over some cluster.
I don't think of Hadoop as being built for high-complexity computation, but for high I/O throughput.
When you describe this kind of setup, I imagine things that involve proof by exhaustion. For example, prime-number search is something with a small input and a large calculation time. However, these solutions don't really benefit from Hadoop, since you don't really need the data-management facilities, and a simpler MPI solution could handle this better.
Search indexing could fit this description (URL -> results), but generally you want the additional network cards for throughput, and the disks to store the results. Then again, the aggregate space on disk starts looking closer to TB instead of GB. Plus, in the end you need to do something with all those pages.
I think the article said that you don't need to use Hadoop for everything and that it might be much faster to just use command line tools on a single computer. Of course you might find a use case where the total computing time is massive and in that case a cluster is better. I still don't think many use cases have that problem.
We are doing some simple statistics at work for much smaller data sizes and the computing time is usually around 10-100 ms so it could probably compute small batches at almost network speed.
Definitely. I was reacting to my parent poster, because size does not say everything. 1 TB can be small, 1 GB can be big; it depends on the amount of computation time needed for whatever processing of the data you do.
I hate developers who over engineer everything and then when it's time to perform some of that support and extensibility, they leave because maintenance is beneath them.
They put this behemoth together with a thousand moving parts and then walk away from it.
And I can't stand developers who over-engineer things. We have a couple of them at my company, and something that should take a few hours always takes several weeks, just because of all the reasons you mention. Most things don't need that kind of feature set and maintainability, and if they do in the future we can just rewrite them from scratch. The overall expected return on investment is still better, since we seldom need to.
Because in all too many companies, re-writing from scratch is a no-go, no matter how quickly and sloppily an initial solution was thrown together. I've worked on a prototype => production type of project, where the throwaway was never thrown away. (The initial team made some mistakes; chief among them was building one prototype of the whole system rather than one per major risk.)
This is a systemic problem. Engineering is always subordinate to business. This simply should not be the case. We desperately need new business organization models.
Quite the opposite, and quite simple: engineers over-engineer things in order to make them generic, and generic makes solutions robust. That's basic science. Unless the problem and solution are well understood, your investment won't guarantee a return at all.
Generic, by default, does not in any way make things more robust.
We've gone from engineering solutions to meet specific problems to engineering solution frameworks that (supposedly) will solve the problem and allow for any unknowns. The problem is, no matter how hard the engineer tries, he can never anticipate the unknowns to the extent that the application framework can support all of them.
We should go back to solving the specific problem at hand. In both scenarios you get the customer who wants a feature that absolutely doesn't fit with the current application, so a rewrite is necessary. And with the specific solution, you don't have nearly as many wasted man-hours.
No, developers over-engineer because setting up a 20-node Hadoop cluster is fun, whereas doing the same task in an hour in Excel means you have to move onto some other boring task.
Generic doesn't mean robust either. I don't know where you got that from; the two concepts are entirely unrelated.
Generic -> robust. I... I don't know how to explain that. Honestly, I haven't thought about the necessity of explaining things like this. It's... basic mathematics.
I'm sorry, but if you cannot explain it, you simply do not understand it yourself.
That's harsh, I get it, and I'm truly sorry, but that's a basic fact.
No. Look to safety-critical software for intuition on why.
Simpler is more reliable. Also, it's hard to know enough about a problem to make a generic solution until you've solved the problem 2-3 times already. But ... having solved a problem multiple times increases the risk that you will be biased towards seeing new problems as some instance of the old problem and therefore applying unsuitable "generic" solutions.
While it drives some point home, the chart sidesteps the question of robustness (a written script will run the same way twice, whereas human error, especially on routine tasks, may hit one hard) and of documentation (writing even a lightly commented script to do yearly maintenance is guaranteed to help your future self remember things one year from now).
That chart assumes 24-hour days. The reality of (my?) productivity is that I have perhaps six productive hours in a day. If I can save eight productive hours per month, that's sixteen days a year, not four.
What about the failed pages? How about shoving those onto a queue and retrying n times with an exponential backoff in between? What about the total number of failed pages? What about failed pages by site? Etc., etc.
But so what -- the principle is still sound. All I described is still a 100-line Python script, written in an afternoon, instead of three weeks of work bringing up EMR, installing and configuring Nutch, figuring out network issues around EMR nodes talking to the commodity internet, installing a persistent queue, performing remote debugging, building a task DAG in either code or (God help you) Oozie XML, and on and on.
Anybody can throw some crap together and make it stick. And it's a perfectly valid solution.
My issue is when there is criticism laid against those solutions which are actually engineered in a way that allows for supportability and extensibility. They are arguably far more important than execution time.
I think, in many people's minds, extensibility == pain; either lots of code configuration (hello java, ejb), or xml (hello hadoop, java, spring, ejb), or tons of code (hello java, c++), etc. When nice languages don't make things painful, it sometimes feels like it's wrong, or not really enough work, or in some other way insufficient. But people can mistake the rituals of programming for getting actual work accomplished.
Simple: because the standard utils are programs that do what they're supposed to do. If the problem's bounds are well within the domain of a standard util, then it's all good. Supportability and extensibility are way too generic for you to draw a line saying the standard utils can handle them all. After all, they are programs, not programming languages.
There are command line tools available to help the transition from 'hack' one-liner to a more maintainable/supportable solution. For instance, Drake (https://github.com/Factual/drake), a 'make for data' that does dependency checking, would allow for sensible restarts of the pipeline.
The O'Reilly Data Science at the Command Line book (linked elsewhere in the comments) has a good deal to say on the subject: turning one-liners into extensible shell scripts, using Drake, using GNU Parallel.
I've been using GNU Parallel for orchestrating remote scripts/tools on a bunch of machines in my compute and storage cluster. It's now my go-to tool for almost any remote ssh task that needs to hit a bunch of machines at once.
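For example (a sketch; hosts.txt is assumed to hold one user@host per line):

    # --trc = transfer the input file, return the named result, clean up after:
    # each .gz is shipped to some host, counted there, and the .out comes back
    parallel --sshloginfile hosts.txt --trc {.}.out 'zcat {} | wc -l > {.}.out' ::: logs/*.gz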
- There are things like pv(1) which allow you to monitor pipes (see the sketch after this list). Things like systemd open other interesting possibilities for implementing, grouping and monitoring your processes.
- Recovery could be implemented by keeping a logfile of completed steps, such as a list of completely processed files, or by moving processed files elsewhere in the file system (this could even be done in memory using ramfs or tmpfs). Of course, whether that's feasible depends on the case.
- Extensibility: Scripts and configurations can be done in shell syntax. Hook systems and frameworks of varying complexity exist. I agree that doing extensibility in shell code is going to turn out to be hazardous when done without a proper concept and understanding of the tools at hand.
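The pv bit from the first point is a one-liner, for instance (pages.tar.gz is a placeholder):

    # pv sits in the middle of the pipe and prints bytes, throughput and ETA
    pv pages.tar.gz | tar -xz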
I fully agree with all the operational/restart/features comments. However, I've often been surprised at how a little thought/research can build all these requirements on top of off-the-shelf components. I also agree that it is likely that one will eventually outgrow wget, but, for example, one may run out of business/pivot before that.
We don't really "come up" with the xargs/wget approach. The approach is already there, waiting to be utilized by someone who understands the tools. The "cool kids" don't like(or are not able) to understand the tools.
The author (I think) is trying to point out that these problems are already solved, decades ago, with existing UNIX tools.
I've implemented this. It isn't too bad up to a certain point. You have to be a bit careful about your filesystem/layout of files; lots of filesystems don't particularly like it when you have a few hundred million files in one directory.