Why use a library for it? Can it not be done with just Python? And if using the library is much different from normal Python, does it mitigate Python's problems with functional programming? (For example, single-expression-only lambdas and no TCO.)
I also do use some functional concepts in my Python work, but do not use a library for it. Only procedures or functions. No additional dependencies.
> Most programmers have written code exactly like this over and over again, just like they may have repeated the map control pattern. When we identify code as a groupby operation we mentally collapse the detailed manipulation into a single concept.
If you don't use a library, then you have to re-write something like groupby many times, I would expect. Or WORSE, you don't even use the pattern, writing "code exactly like this over and over again".
You probably know this, but for the other readers, I'd like to note that "groupby" specifically is part of Python's standard library (in the "itertools" module).
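Note that the stdlib version behaves differently from the Toolz one: it groups *consecutive* runs of equal keys, so you normally sort first. A minimal illustration:

```python
from itertools import groupby

# itertools.groupby only groups *consecutive* items with equal keys,
# so the input here is already ordered by the key (first letter).
words = ["apple", "ant", "bear", "bat", "cat"]
grouped = {k: list(g) for k, g in groupby(words, key=lambda w: w[0])}
# grouped == {"a": ["apple", "ant"], "b": ["bear", "bat"], "c": ["cat"]}
```

On unsorted input the same key can appear as several separate groups, which is the usual gotcha with this function.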
Huh, interesting. This does not remind me of the database groupby operation, but rather of partition, like in SRFI-1. I mean, in natural language it is clear to me why they'd name it groupby, but in programming terms I think partition is more appropriate, as groupby is already "blocked" by the database operation.
Often one only needs one of the partitions though, which is when filter is sufficient. Otherwise I guess one can easily write partition oneself and then use that function over and over again, without resorting to a library.
But perhaps it is a good example: you do not have to write partition in every project, and if the additional dependency is OK to have and the library is indeed a good one, why not.
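For what it's worth, writing a partition oneself, along the lines of SRFI-1's, is only a few lines (names here are my own, not from any library):

```python
def partition(pred, seq):
    """Split seq into (matching, non_matching) by predicate, SRFI-1 style."""
    yes, no = [], []
    for item in seq:
        (yes if pred(item) else no).append(item)
    return yes, no

evens, odds = partition(lambda n: n % 2 == 0, range(10))
# evens == [0, 2, 4, 6, 8], odds == [1, 3, 5, 7, 9]
```

A single pass over the input, versus two passes with a pair of filter calls.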
> This does not remind me of the database groupby operation, but rather of partition, like in SRFI-1. I mean, in natural language it is clear to me why they’d name it groupby, but in programming terms I think partition is more appropriate, as groupby is already “blocked” by the database operation.
But…this is exactly what a database GROUP BY does. (You’ll always have aggregations in the SELECT clause which work on the data in the groupings, but the GROUP BY itself just specifies splitting the dataset up into this kind of groupings.)
> Otherwise I guess one can easily write partition oneself and then use that function over and over again, without resorting to a library.
Yeah, literally all a library does is avoid having to rewrite code that someone has already written once.
Ah I see. Perhaps what confused me is the need to always have an aggregation, which is specific to relational databases (all? most? some?). Of course the aggregation has to work on something, and that might be the same as a partitioning. Thanks for clearing that up!
The downside of a library is often that it comes with its own dependencies and a lot of things you might not need. In general you should not buy into using a library just because one is available that, among other things, offers the one procedure you need. The decision to use a library deserves a little more thought.
In some databases like Postgres, you can do something like “SELECT x, array_agg(y) GROUP BY x” to get the exact same effect of this groupby operation in toolz.
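The plain-Python equivalent of that `array_agg` query is short enough that no library is strictly needed; a sketch (data is made up for illustration):

```python
from collections import defaultdict

# Roughly what SELECT x, array_agg(y) ... GROUP BY x computes:
# for each distinct x, collect the list of y values.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
groups = defaultdict(list)
for x, y in rows:
    groups[x].append(y)
# dict(groups) == {"a": [1, 3], "b": [2, 4]}
```

Unlike itertools.groupby, this works on unsorted input, which is also how the Toolz version behaves.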
The source code for that function (groupby in toolz) is bizarre! Creating a defaultdict where the entries are append-to-list functions, calling them, then going over the dictionary again to extract the underlying list objects. Does anyone know what this pattern is for, and why one wouldn’t just create a defaultdict(list)?
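For readers who haven't looked at the source, the pattern being described looks roughly like this (a sketch of the idea, not the actual toolz code):

```python
from collections import defaultdict

def groupby_append_trick(key, seq):
    # The dict values are bound list.append methods, so the hot loop
    # calls the cached method directly instead of doing an attribute
    # lookup (.append) on every iteration.
    d = defaultdict(lambda: [].append)
    for item in seq:
        d[key(item)](item)
    # Second pass: swap each append method for its underlying list,
    # reachable via the bound method's __self__ attribute.
    return {k: v.__self__ for k, v in d.items()}

result = groupby_append_trick(len, ["a", "bb", "cc", "ddd"])
# result == {1: ["a"], 2: ["bb", "cc"], 3: ["ddd"]}
```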
> why one wouldn’t just create a defaultdict(list)?
One shouldn't even do that. groupby is supposed to assume that the input is already sorted by the given key, so it can be implemented as a generator. What's more, it should be implemented as a generator (that's the way the Python stdlib's itertools.groupby does it), to avoid having to realize the entire iterable at once.
The Toolz version has a different purpose than the one from Python's stdlib (itertools). One is for unsorted input, while the other is for sorted input. The Toolz version is not a replacement, and the documentation states as much.
The idea is that looking up the `append` method on every iteration of the grouping loop, i.e. doing

    d[key(item)].append(item)

takes more time than rebuilding the _rv_ groups dictionary with lists instead of `append` methods afterwards.
Of course, one should run a benchmark to see how much longer the input sequence needs to be than the number of resulting groups for this to pay off.
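Such a benchmark is easy to set up with timeit; a rough sketch (sizes and key function are arbitrary choices):

```python
import timeit
from collections import defaultdict

data = list(range(100_000))
key = lambda n: n % 10  # many items, few groups

def plain():
    # Straightforward version: attribute lookup of .append per item.
    d = defaultdict(list)
    for item in data:
        d[key(item)].append(item)
    return dict(d)

def append_trick():
    # Toolz-style version: cached bound append methods, plus a
    # second pass to unwrap the lists afterwards.
    d = defaultdict(lambda: [].append)
    for item in data:
        d[key(item)](item)
    return {k: v.__self__ for k, v in d.items()}

print("plain:       ", timeit.timeit(plain, number=20))
print("append trick:", timeit.timeit(append_trick, number=20))
```

Both functions produce identical groupings; only the timing differs, and by how much will depend on the input-size-to-group-count ratio.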