Why use a library for it? Can it not be done with just Python? And if using the library is much different from normal Python, does it mitigate Python's problems with functional programming? (For example, single-expression-only lambdas and no TCO.)
I also do use some functional concepts in my Python work, but do not use a library for it. Only procedures or functions. No additional dependencies.
> Most programmers have written code exactly like this over and over again, just like they may have repeated the map control pattern. When we identify code as a groupby operation we mentally collapse the detailed manipulation into a single concept.
If you don't use a library, then you have to re-write something like groupby many times, I would expect. Or WORSE, you don't even use the pattern, writing "code exactly like this over and over again".
You probably know this, but for the other readers, I'd like to note that "groupby" specifically is part of Python's standard library (in the "itertools" module).
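Note that the stdlib version behaves differently from the Toolz one: it groups *consecutive* runs of equal keys, so you normally sort first. A minimal illustration:

```python
from itertools import groupby

# itertools.groupby only groups *consecutive* items with equal keys,
# so the input here is already ordered by the key (first letter).
words = ["apple", "ant", "bear", "bat", "cat"]
grouped = {k: list(g) for k, g in groupby(words, key=lambda w: w[0])}
# grouped == {"a": ["apple", "ant"], "b": ["bear", "bat"], "c": ["cat"]}
```

On unsorted input the same key can appear as several separate groups, which is the usual gotcha with this function.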
Huh, interesting. This does not remind me of the database groupby operation, but rather of partition, like in SRFI-1. I mean, in natural language it is clear to me why they'd name it groupby, but in programming terms I think partition is more appropriate, as groupby is already "blocked" by the database operation.
Often one only needs one of the partitions though, which is when filter is sufficient. Otherwise I guess one can easily write partition oneself and then use that function over and over again, without resorting to a library.
But perhaps it is a good example: you do not have to write partition in every project, and if the additional dependency is OK to have and the library is indeed a good one, why not.
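For what it's worth, writing a partition oneself, along the lines of SRFI-1's, is only a few lines (names here are my own, not from any library):

```python
def partition(pred, seq):
    """Split seq into (matching, non_matching) by predicate, SRFI-1 style."""
    yes, no = [], []
    for item in seq:
        (yes if pred(item) else no).append(item)
    return yes, no

evens, odds = partition(lambda n: n % 2 == 0, range(10))
# evens == [0, 2, 4, 6, 8], odds == [1, 3, 5, 7, 9]
```

A single pass over the input, versus two passes with a pair of filter calls.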
> This does not remind me of the database groupby operation, but rather of partition, like in SRFI-1. I mean, in natural language it is clear to me why they’d name it groupby, but in programming terms I think partition is more appropriate, as groupby is already “blocked” by the database operation.
But…this is exactly what a database GROUP BY does. (You’ll always have aggregations in the SELECT clause which work on the data in the groupings, but the GROUP BY itself just specifies splitting the dataset up into this kind of groupings.)
> Otherwise I guess one can easily write partition oneself and then use that function over and over again, without resorting to a library.
Yeah, literally all a library does is avoid having to rewrite code that someone has already written once.
Ah I see. Perhaps what confused me is the need to always have an aggregation, which is specific to relational databases (all? most? some?). Of course the aggregation has to work on something, and that might be the same as a partitioning. Thanks for clearing that up!
The downside of a library is often that it comes with its own dependencies and a lot of things you might not need. In general you should not buy into using a library just because one is available that, among other things, offers the one procedure you need. The decision to use a library deserves a little more thought.
In some databases like Postgres, you can do something like “SELECT x, array_agg(y) GROUP BY x” to get the exact same effect of this groupby operation in toolz.
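The plain-Python equivalent of that `array_agg` query is short enough that no library is strictly needed; a sketch (data is made up for illustration):

```python
from collections import defaultdict

# Roughly what SELECT x, array_agg(y) ... GROUP BY x computes:
# for each distinct x, collect the list of y values.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
groups = defaultdict(list)
for x, y in rows:
    groups[x].append(y)
# dict(groups) == {"a": [1, 3], "b": [2, 4]}
```

Unlike itertools.groupby, this works on unsorted input, which is also how the Toolz version behaves.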
The source code for that function (groupby in toolz) is bizarre! Creating a defaultdict where the entries are append-to-list functions, calling them, then going over the dictionary again to extract the underlying list objects. Does anyone know what this pattern is for, and why one wouldn’t just create a defaultdict(list)?
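For readers who haven't looked at the source, the pattern being described looks roughly like this (a sketch of the idea, not the actual toolz code):

```python
from collections import defaultdict

def groupby_append_trick(key, seq):
    # The dict values are bound list.append methods, so the hot loop
    # calls the cached method directly instead of doing an attribute
    # lookup (.append) on every iteration.
    d = defaultdict(lambda: [].append)
    for item in seq:
        d[key(item)](item)
    # Second pass: swap each append method for its underlying list,
    # reachable via the bound method's __self__ attribute.
    return {k: v.__self__ for k, v in d.items()}

result = groupby_append_trick(len, ["a", "bb", "cc", "ddd"])
# result == {1: ["a"], 2: ["bb", "cc"], 3: ["ddd"]}
```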
> why one wouldn’t just create a defaultdict(list)?
One shouldn't even do that. groupby is supposed to assume that the input is already sorted by the given key, so it can be implemented as a generator. What's more, it should be implemented as a generator (that's the way the Python stdlib's itertools.groupby does it), to avoid having to realize the entire iterable at once.
The Toolz version has a different purpose than the one from Python's stdlib (itertools). One is for unsorted input, while the other is for sorted input. The Toolz version is not a replacement, and the documentation states as much.
The idea is that looking up the `append` method on every iteration of the grouping loop, i.e. doing

    d[key(item)].append(item)

takes more time than rebuilding the _rv_ groups dictionary with lists instead of `append` methods afterwards.
Of course, one should run a benchmark to see how much longer the input sequence needs to be than the number of resulting groups for this to pay off.
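Such a benchmark is easy to set up with timeit; a rough sketch (sizes and key function are arbitrary choices):

```python
import timeit
from collections import defaultdict

data = list(range(100_000))
key = lambda n: n % 10  # many items, few groups

def plain():
    # Straightforward version: attribute lookup of .append per item.
    d = defaultdict(list)
    for item in data:
        d[key(item)].append(item)
    return dict(d)

def append_trick():
    # Toolz-style version: cached bound append methods, plus a
    # second pass to unwrap the lists afterwards.
    d = defaultdict(lambda: [].append)
    for item in data:
        d[key(item)](item)
    return {k: v.__self__ for k, v in d.items()}

print("plain:       ", timeit.timeit(plain, number=20))
print("append trick:", timeit.timeit(append_trick, number=20))
```

Both functions produce identical groupings; only the timing differs, and by how much will depend on the input-size-to-group-count ratio.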