GNU Parallel Tutorial

CJefferson · on Nov 13, 2016

Feel I should mention this -- the confusing and non-clarified (I mailed and asked for clarification, I was told to stop using the software) licence which says if you use this as an academic, you must cite it or pay. I don't use it for that reason -- this request isn't scalable.

I either use xargs, or this reimplementation which had all the stuff I need:

https://github.com/mmstick/parallel

dahart · on Nov 13, 2016

Most clarification I've seen on the Parallel nag/citation message points at the original email thread Ole started to ask whether this was a good idea: http://lists.gnu.org/archive/html/parallel/2013-11/msg00006....

When I see it, I get irritated, but then I disable it and completely forget for another year.

It is not a legal requirement or license requirement to cite usage of parallel, and the GNU terms actually forbid making it a requirement:

https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation

The nag's wording is also annoying and wrong about academic tradition requiring citations of all tools used, whether or not the tool is integral to research. Academic tradition does require citing others' research you build your research on top of. It does not require citing MS Word because I wrote my paper in MS Word.

This is a small and stupid snag on what is otherwise a really awesome tool. My recommendation is to make the citation go away and continue using the tool, and if you use it in a way that your research actually depends on, then give the citation. If you just use gnu parallel to speed things up, and it in no way changes what you're doing or how you're thinking about it, then ignore it. Though I'm sure Ole would love to hear about projects that use parallel - what he's asking for is PR not citations, and blog posts or press or kind words that praise gnu parallel will also help him get what he needs.

CJefferson · on Nov 13, 2016

Sure GPL divide the restriction, but effectively the code is under two contradictory licences (GPL v3, and cite or pay). Why are you sure the GPL v3 is the one that "wins"?

dahart · on Nov 13, 2016

IANAL, but personally I'm certain myself that GPLv3 is the one and only license. The cite or pay is just a nag request, not a license. Also note it doesn't say the only options are either cite or pay exclusively, it says (obnoxiously) if you pay 10,000 EUR, feel free to not cite. If you don't pay, then you are only obligated by this request to feel guilty.

Again, Ole really just wants PR here, he made that clear in the email thread. He's only asking for help, in a forceful way. If you have a reason to give him some help, and you make heavy use of parallel, it would make his day and it wouldn't hurt you.

Otherwise, there is not legal requirement nor academic tradition to cite usage of parallel in academic papers, unless the papers build directly on gnu parallel.

I really wouldn't be surprised if part of this is Ole felt like there were research papers about parallelism that built on gnu parallel's efforts directly and should have cited him, that he's been short changed by a few academics.

But simply speeding up your data processing for doing something in a field that is not related to parallel processing at all, then parallel is simply another random tool in a huge bucket of tools to get your job done, and it has no place in an academic citation.

semi-extrinsic · on Nov 13, 2016

Completely agree with your last paragraph. Case in point: it's not common to cite the BLAS paper [1] either when you're just using BLAS for data processing, even though BLAS is arguably more "scientific" than GNU Parallel.

[1] http://dl.acm.org/citation.cfm?doid=355841.355847

RhysU · on Nov 14, 2016

I cited Ole in a paper. Took much less time than parsing through this thread. Ole wins. Ole keeps working on parallel. I win. The community should do everything it can to encourage the creators and custodians of scientific software to continue, as we all benefit from the ecosystem.

dahart · on Nov 15, 2016

Awesome! And I totally agree about encouraging anyone writing good software, including Ole.

FWIW, keep in mind a few things:

- Note that @CJefferson is afraid to use parallel because Ole made the license restrictions unclear by adding the citation request. Ole is discouraging some applications of his software by being so forceful about his request.

- Ole's citation is not from a scientific or peer-reviewed publication. It simply cannot be used in some contexts. Even the BLAS link above was in a research journal, but Parallel's is not. In many contexts, a citation of Parallel is highly inappropriate.

- The small amount of time it takes to either cite Ole or read and understand this thread is largely irrelevant to the question of whether a citation is either warranted or appropriate or allowed. Understanding the license is a one-time event, and it's very important for the license to be legally clear. Parallel's citation notice is causing confusion. The question is whether Parallel can and should be used without having to worry (forever) about the legal consequences of the contract I've agreed to by using the software.

- This approach isn't scalable. If other GNU tools, or other free software started using the same language as Ole, it would cause widespread problems.

temac · on Nov 13, 2016

Why do you think there are 2 licenses? Where did you see any "cite or pay" license?

claudius · on Nov 13, 2016

Meh. The website says "Please cite", not "you must cite". To derail this a bit further, I would actually be very interested in an academic version of the AGPL - that is, a proper license that defines "publication of data produced by the work" as "distribution of the work", meaning that if someone uses the software to produce some data and then publishes said data, they also must publish (or make available on request) all changes made to the software. Unfortunately, the very strict definition in the AGPL and the interlinking between it and the GPL make this difficult to add-on later.

And thus we are stuck with "please cite this so I can justify working on it, please?" instead of "You must cite this if you want to use it".

Honestly, I also don’t see how this request isn’t scalable, could you expand on that?

mikegerwitz · on Nov 13, 2016

The output of a program isn't a derivative work, so this can't be enforced with a copyright license; you'd need something else. Unless you make it a derivative work by outputting copyrighted material, but that'd be difficult with raw data.

Anyway: there's a wonderfully specific answer to this in the FAQ:

https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation

https://www.gnu.org/licenses/gpl-faq.en.html#GPLOutput

claudius · on Nov 13, 2016

I read this, but I honestly don’t understand it. Isn’t a license essentially a contract between the licensor and the licensee? And shouldn’t it hence be possible to write such a clause into a license? After all, lots of things are put into licenses which do not cover derivative works (e.g. conditional patent grants)?

I understand that I cannot add this to the GPL (due to the specific clauses of the GPL) and that RMS might not consider the resulting software free (though it would probably pass all of Debian’s freedom guidelines, for example), but it should be possible to have such a license in general?

mikegerwitz · on Nov 13, 2016

The issue with adding such a restriction to the license is that it has to work within the domain of Copyright. Assuming The author of foo holds copyright on foo, but not to the output _generated_ by foo---it's an original work (assuming that foo doesn't output anything that its author does actually hold a copyright on).

I guess a good example would be a Madlib-style program, where it asks you questions and fills in the blanks in a story, often resulting in something highly amusing. The original story containing the blanks is copyrighted. The output of this program would then be a derivative work, because the original story has been modified.

But consider that the program took a story of your own (the data its processing) and output statistics, such as the word count, frequency of certain words, grammar errors, etc. This is not a derivative work. Similarly, if GNU Parallel is being used to process your input, its output isn't a derivative work.

With that said, you can have a separate EULA-type thing---which is _outside_ the domain of copyright---that imposes these terms. But that is incompatible with the terms of the GPL.

paulmd · on Nov 13, 2016

This is not true, or at least not universally true. For example, gameplay videos are also considered copyrighted as they contain assets and other copyrighted elements. Your "mad libs" example probably falls under a similar classification. A completed Mad Libs would still contain large distinctive elements of a copyrighted work.

In these cases, the EULA actually may contain clauses that allow you to distribute gameplay videos. But if they do not contain exemptions, it's copyright that will restrain you, not the EULA.

http://www.develop-online.net/analysis/uploading-gameplay-co...

https://support.google.com/youtube/answer/138161?hl=en

Now in the specific case of GNU Parallel, I don't see how the output would contain any distinctive elements of the original program. As a counterexample you could not use GNU parallel to process its own source code and end up with your own copyright on the output, however.

mikegerwitz · on Nov 14, 2016

> Your "mad libs" example probably falls under a similar classification.

Yes, by stating that it's a derivative work, I meant that the output would be subject to the Madlibs copyright.

GNU Parallel's output isn't subject to the GPL.

zerocrates · on Nov 13, 2016

The scope of restrictions in a license don't have to fall only within copyright. Proprietary licenses routinely limit what you can do with or to the software, and don't restrict themselves to derivative works: the license can have "teeth" simply by revoking your copyright license to the original, unmodified software.

If you're thinking, well that still only is effective if your use involves making copies of the software, note that US courts have held that just using software and thereby causing it to be copied into RAM counts as making a copy for the purpose of copyright law (see MAI Systems v. Peak).

CJefferson · on Nov 13, 2016

The software prints:

If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

That sounds to me like cite or you will be taken to court for not citing.

If I cite parallel, should I cite Perl? C library? Linux kernel? Systemd?

claudius · on Nov 13, 2016

I would make a distinction between “scientific software” and “system software”. The latter is commonly used for all sorts of things and could in principle easily be replaced. The former is specifically adapted to produce the results at hand and the results (and only the results, not for example the typesetting of the paper) depend crucially on this software and its correctness.

Examples of the latter are e.g. Linux, libc, systemd etc.: All things used during production of the results, but not aimed at producing such results. The former are e.g. Matlab, the Intel MKL, ALPS, TensorFlow or similar tools and suites. A bug in the MKL may make your results wrong, while a but in the kernel will most likely just crash everything. In my experience, scientific results depend on a handful of software packets in such a way, and adding five-or-so citations to a paper is certainly no big deal and would make reproducing the results much easier.

I have no idea where GNU parallel sits there, it probably depends on how you use it – do you schedule your Monte Carlo runs with it or just rescale pictures for the website? Admittedly, it is a very grey area, but I don’t consider a request for citation unreasonable.

vacri · on Nov 13, 2016

It sounds more to my like it's telling me how to feel. "Should feel" and "are" are not synonymous.

emmelaich · on Nov 14, 2016

Disabling the notice requires you to type 'will cite'.

In other words, it effectively requires me to LIE.

No thanks.

avar · on Nov 13, 2016

It's licensed under the GPL, where you're not allowed to add any additional terms. The citation notice it spews out is annoying, but can be disabled, and you definitely don't have to include it anywhere.

The author just asks for citations because he's in academia, and citation backrubs is how anyone can justify getting time to do anything (including maintaining some software) in academia.

filmor · on Nov 13, 2016

2 questions:

- Isn't it common courtesy to cite important tools being used for data processing? Why is this a problem?

- Wikipedia says the tool is under GPLv3, so from a bare licence perspective you should be safe to use it for whatever you want, shouldn't it?

dahart · on Nov 13, 2016

> Isn't it common courtesy to cite important tools being used for data processing?

No, no it's not. It is common to cite previous work, the research you build on top of. It is not common at all to cite tools you use, unless those tools are directly related to your research. If by "important" you mean gnu parallel counts as previous work, then citation is more of a requirement than a courtesy, otherwise citing all the tools you used can be very discourteous to your readers.

It would be extra courteous to give parallel some PR if you love it and it really helped your research get done.

> Why is this a problem?

What if all gnu tools asked the same thing parallel does? What if using emacs and find and tar all asked you to cite them when using them, with obnoxious and demanding language? I use more than 100 free tools to get any paper implemented and written, my papers would be rejected if I tried to cite them all, the vast majority are not relevant to the research.

> you should be safe to use it for whatever you want

Yes, correct! GPL does not allow requiring citations.

beevai142 · on Nov 13, 2016

> Yes, correct! GPL does not allow requiring citations.

I understand that the copyright holder can in general specify whatever clauses in the license, even if this makes it self-contradictory, and is not bound by any statements by FSF. In this case, it becomes unclear what use of the software is actually allowed.

mikegerwitz · on Nov 13, 2016

Even if the software were licensed under two different licenses, as long as one of them were the GPL, you could choose that one. But even if you add additional terms to the GPL, you are free to remove them under Section 7:

"When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it."

https://www.gnu.org/licenses/gpl.html

Note that the GPL only applies to distributors of the software, not to users, so you wouldn't be bound to such restrictions anyway.

beevai142 · on Nov 13, 2016

The point is not that there are two licenses, but that there is one license, "GPL + additional requirements", where the copyright holder is free to add requirements contradicting the GPL.

For example, a copyright holder may say "MIT license, BUT, the Software shall be used for Good, not Evil.", and lawyers will balk. [1,2]

[1] http://www.json.org/license.html [2] https://www.cnet.com/news/dont-be-evil-google-spurns-no-evil...

mikegerwitz · on Nov 13, 2016

The copyright holder is free to use whatever license they wish, yes.

But if they choose to use the GPL, they must abide by its terms. They aren't free to rewrite parts of it---their license would then be a derivative work of the GPL, which isn't permissible. Section 7 would have to be modified to say that the user can't remove such terms.

If they use the GPL and impose extra terms, state that Section 7 is invalid, then there is a contradiction in the license, and it'd be up to a court to decide. I'm sure any sensible lawyer would tell them that they should use a different license; it's senseless using the GPL if you're going to try to do such a thing.

beevai142 · on Nov 13, 2016

I don't think Section 7 enables automatically ignoring non-permissive additional terms --- the "additional permissions" are defined in the section to be exceptions from conditions, not as further restrictions. Interpreting what exactly happens in this case probably requires a lawyer.

It's of course the case that using GPL and adding such clauses does not make any sense. GNU parallel does not appear to be doing this, as the license text is unmodified GPL with no additional clauses.

rpdillon · on Nov 13, 2016

Further down in section 7:

> All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term.

mikegerwitz · on Nov 13, 2016

The exception would be on Section 2:

"This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work."

dahart · on Nov 13, 2016

> there is one license, "GPL + additional requirements"

While these additional requirements may (or may not) be possible and legal, in the case of GNU Parallel, the license is unmodified GPLv3. The project would need to identify a separate license that inherits GPLv3 and explicitly states those additional requirements, but in this case it does not. GPLv3 is used directly and verbatim on the project home page https://www.gnu.org/software/parallel/

The text the executable spits out asks for citations, but it does not identify itself as a license, it does not state that citations are a legal requirement, and it does not state a relationship to GPLv3 nor carve out exceptions or additional requirements.

CJefferson · on Nov 13, 2016

The problem with point 1 is where to stop? Well parallel is written using Perl, so cite that? I have some bash and python scripts. Cite them? The Linux kernel? Systemd?

I would cite parallel if it is required for the reader to understand or reproduce my research (for example, I report timings which were sped up by parallelisation). Otherwise it's just a tool, Like bash, the C standard library or the Linux kernel.

Point 2, the author is giving the code two contradictory licences. I'm not sure which applies. Do I get to choose or does he? I'm not a lawyer.

mikegerwitz · on Nov 13, 2016

> (I mailed and asked for clarification, I was told to stop using the software)

By whom? Was it the author? If so, can you cite your question on a mailing list?

CJefferson · on Nov 13, 2016

Look through this thread. https://lists.gnu.org/archive/html/parallel/2016-03/msg00015...

mikegerwitz · on Nov 13, 2016

Thank you.

pi-rat · on Nov 13, 2016

Love this tool, often end up using it with imagemagick.

  # Resize all jpgs to 800x600 using 8 jobs/cores.
  parallel -j 8 convert {} -resize 800x600 {.}_small.jpg ::: *.jpg

  # Or get the help of a few servers (via SSH) to do the same job.
  parallel -S serverA,serverB -j 8 --transfer --return {.}_small.jpg convert {} -resize 800x600 {.}_small.jpg ::: *.jpg

dahart · on Nov 13, 2016

I do the same thing and resize all the pictures I ever import into both small & thumbnail versions.

Fun tip - if your resize script prints only the output file name to stdout, you can pipe the result into another parallel command, e.g.,

  parallel <resize to 800x600> ::: *.jpg | parallel <resize to thumbnail>

This way generating thumbnails runs on the small image instead of the full size image, saving more time.

Also, if you're on a mac, using sips is quite a bit faster than imagemagick, and sips comes built-in.

I've never used servers with parallel for image resizing, what are the benefits? I'd have guessed it would take longer and saturate the network, versus doing it locally. Is it useful for really long running jobs when you don't want to load your local cpu? Is it actually faster sometimes, or are there other more important reasons? I could see it being useful if I didn't have a local imagemagick install, but had access to servers with it there. Maybe it'd be useful in cases where I'm running docker environments for the job processing? What other use cases and scenarios have you run into?

pi-rat · on Nov 13, 2016

sips is a good tip, thanks!

I often sort and process images with a macbook on my lap curled up on the sofa. Resizing locally makes the mac blazing hot with fans whining - not comfortable and kills the battery. Transfer speed is not too bad over 802.11AC. So I use it mainly as a method of moving cpu intensive work away from my lap.

Also, I sometimes use

  parallel -S server,: .......

The semicolon adds the local machine to the list, it will saturate both the laptop and whatever it manages from the other computer. I have to admit I've never tested this scientifically, but it seems to be faster even with the overhead (gain of remote imagemagick seems to be more than cpu overhead of SSH file transfer).

dahart · on Nov 13, 2016

> I often sort and process images with a macbook on my lap curled up on the sofa. Resizing locally makes the mac blazing hot with fans whining - not comfortable and kills the battery. Transfer speed is not too bad over 802.11AC. So I use it mainly as a method of moving cpu intensive work away from my lap.

:) I have the exact same workflow: MacBook+Sofa. Totally trying out the server options today.

zevv · on Nov 13, 2016

Or, good enough for 95% of the use cases, simply use 'xargs' which is part of findutils and installed by default on all linux distributions. See the '-P/--max-procs' option in the man page for more details.

bandrami · on Nov 13, 2016

Right, but it's those remaining 5% of situations where you want parallel execution on multiple remote hosts that make this cool (xargs itself can be replaced 95% of the time with backticks, for that matter). Combining parallel, pies[1], and shepherd[2] makes for some very cool possibilities, which you can check out with the GNU System Distribution[3], which I've recently migrated all my personal systems to.

[1] http://www.gnu.org.ua/software/pies/

[2] https://www.gnu.org/software/shepherd/

[3] https://www.gnu.org/software/guix/

sigil · on Nov 13, 2016

You can use xargs for parallel remote execution too. In the simplest case:

    list-hosts |
    xargs \
    -P8 \
    -n1 \
    -I% \
    ssh % some-command

bandrami · on Nov 13, 2016

Sure, and you can also run an ssh command inside backticks too. There's something of a Turing tar pit here.

adrianN · on Nov 13, 2016

With xargs I always have the problem that lines get mangled when you have concurrent output, so you have to work around that. GNU parallel prevents this by some magic.

sigil · on Nov 13, 2016

The "magic" is simply output buffering inside GNU parallel...something you should be aware of if the concurrent programs are producing large amounts of output.

ole_tange · on Nov 14, 2016

"inside" = temporary files - not RAM. See comparison on https://www.gnu.org/software/parallel/parallel_design.html#B...

adrianN · on Nov 13, 2016

Reproducing that behavior with just bash and xargs is too difficult for me.

xorcist · on Nov 13, 2016

If you find yourself depending on the order of output lines, your are probably better off using something like Python Fabric instead.

psi-squared · on Nov 13, 2016

If I've read this correctly, the 'sem' mode lets you submit several lots of jobs with an overall limit on the total number of tasks running at a time (rather than one limit per lot of jobs).

That on its own is super useful for what I'm working on right now. But what would make it even more useful, is: can you get GNU make to use 'sem' instead of its own jobserver? That way I could run almost everything I need to under one overall task limit, and that would be really nice to have.

(For this reason, I'm a fan of the idea that every program with its own 'parallel execution' mode should be able to interact with a common jobserver. The 'make' jobserver is, as far as I know, the simplest, and should be pretty easy to support: http://make.mad-scientist.net/papers/jobserver-implementatio... )

dahart · on Nov 13, 2016

I don't think you can use a different job server for make, all you can do is have a make rule that launches parallel or something else, but then you lose dependency tracking on the sub-tasks.

Are you running parallel make tasks where each task is also doing something multi-threaded or parallel? Like using make -j 8 won't work for you?

Make does have the -l load average task limiter when but I've never gotten it to work reliably, it always starts way too many jobs at first and chokes for a while before calming down. Often that won't work for me, but maybe it will help you?

psi-squared · on Nov 13, 2016

My current workflow has a lot of "Run make -j<lots> to build, followed by parallel -j<lots> to run all the tests", but sometimes I want to compare/test multiple different versions of the code (in a way which, sadly, doesn't work well with incremental builds). In that case it'd be nice to be able to just spin everything up and not have to worry about overloading/under-loading the machine I'm working on.

I know what you mean about the load average limiter - parallel behaves like that too. I think the --delay option to parallel is supposed to solve that (I haven't tried it - will try tomorrow), but I don't know if make has anything similar.

Finally, on further reading, it definitely seems technically possible, even if it hasn't been done so far. The make documentation has a section on the jobserver protocol, which looks complete enough to write both the client and server parts: https://www.gnu.org/software/make/manual/html_node/Job-Slots...

So if nothing exists so far, it's something I might look into writing myself.

dahart · on Nov 13, 2016

> Finally, on further reading, it definitely seems technically possible https://www.gnu.org/software/make/manual/html_node/Job-Slots.... So if nothing exists so far, it's something I might look into writing myself.

Holy moly, that's kinda crazy, but would be fun, you should totally do it! Looking forward to your blog post! ;)

rdtsc · on Nov 13, 2016

GNU parallel is pretty cool.

Another little thing I realized a while back is that `make` (yes the crusty old make + -j flag) can be used to parallelize jobs. We do it for compiling usually, but it can be used for other jobs as well.

ole_tange · on Nov 14, 2016

Fun fact: GNU Parallel was originally a wrapper script that used `make` for the parallelization: https://www.gnu.org/software/parallel/history.html

rdtsc · on Nov 14, 2016

Ha! Really cool. That was a fun fact!

Make was right there under my nose I just never imagined using it for anything but compiling and building things. In that case I was forced by circumstances (was developing on a constrained ancient version of RHEL), couldn't use GNU Parallel and someone suggested `make`. The use case of obvious once a co-worker mentioned it. But it was definitely It was one of the memorable "thinking outside the box" example as they say.

oftenwrong · on Nov 13, 2016

I am a long-time user of gnu parallel, but I feel it has a bit of "featuritis". Just look at all of those options in the man page!

ec109685 · on Nov 13, 2016

Unless it is hurting the product, is it more like featureful?

dagw · on Nov 13, 2016

While I kind of agree with you, I've yet to find 3 people who agree on which features are unnecessary and which are killer features without which gnu parallel would be useless.

janoc · on Nov 13, 2016

Please, don't:

"Install the newest version using your package manager or with this command:

  (wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

"

facepalm

dahart · on Nov 13, 2016

I think this discussion has already gone the rounds on HN multiple times, and it usually gets pointed out that running any installer at all has more or less the same risks as pipe something from the internet straight to the shell -- while this looks scary, it's actually no less safe than how we install any other software.

I agree it doesn't look ideal and may not be best practice, but what do you feel is a better realistic alternative? What is the main issue for you? Is it the lack of a hash or checksum to verify what you downloaded, and make sure you didn't get a malicious site or a compromised package?

heinrich5991 · on Nov 13, 2016

This is HTTP though, not HTTPS. You're not just trusting the authors, but also trusting the network.

dahart · on Nov 13, 2016

Yes, absolutely! So package validation is the issue for you. But how certain are you about everything else you install using HTTPS? (I'm honestly asking, not being rhetorical.) And what is stopping anything else you install from downloading something over HTTP, and executing it?

gravypod · on Nov 13, 2016

Who is pi.dk? This isn't using https? If you have a package manager--that has developer signatures for every release and has been checked by multiple others in the community--when even recommend using this?

This is far less secure then running apt-get, yum, or even pacman.

vurpo · on Nov 15, 2016

The download is not over HTTPS, so you also have to trust the network. Also, the package you're downloading is not signed (which it would be if it came from a software repository, which almost all major operating systems have) so you have to trust whoever runs that website.

IgorPartola · on Nov 13, 2016

It's downloaded over HTTP.

haddr · on Nov 13, 2016

Wow! Parallel is simply unmatched!