Counting Things in Python: A History (treyhunner.com)
208 points by th on Nov 10, 2015 | 51 comments



What a beautiful, understated piece of writing. This particular problem seems to come up a lot when thinking about how to use new Python classes (defaultdict, Counter) and affordances (list comprehensions, +=). It's nice to see it captured in timeline format.


Worth noting that Counter itself uses what the article calls the "get method" approach, but with a common performance optimization (caching a bound method):

    def _count_elements(mapping, iterable):
        mapping_get = mapping.get
        for elem in iterable:
            mapping[elem] = mapping_get(elem, 0) + 1


Also worth noting that those lines are immediately followed by:

    try:                                    # Load C helper function if available        
        from _collections import _count_elements
    except ImportError:
        pass


The author seems to misunderstand one part of The Zen of Python:

| Simple is better than complex.

by saying:

> Our code is more complex (O(n²) instead of O(n)), less beautiful, and less readable

The Zen is not about computational complexity! It's about the complexity of the source code.

The code in question is:

    color_counts = dict((c, colors.count(c)) for c in set(colors))
And while I agree that this is inefficient and shouldn't be used in a library, I find this to be very readable and would always prefer that code if I know that I'm dealing only with small lists. It translates neatly to a natural-language description of the problem:

    Give me a dictionary that maps each distinct color
    to the number of times it occurs in the list.
So I don't think this violates the guide "Simple is better than complex". Rather, it is a good example where it makes sense to introduce additional complexity to improve the performance of an often-used helper function.


I don't think you should get into the habit of writing O(n^2) algorithms if the O(n) solution is not much more complex. Unless the constants are very dissimilar, you probably hurt your performance already for a few hundred elements. Writing reusable code includes using algorithms that are ok for a wide range of input sizes.
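To make the trade-off concrete, here's a sketch of both approaches side by side, using the article's color list. The quadratic version rescans the whole list once per distinct color; the single-pass version touches each element exactly once:

```python
colors = ["brown", "red", "green", "yellow", "yellow", "brown", "brown", "black"]

# O(n^2): colors.count(c) scans the full list for every distinct color
quadratic = dict((c, colors.count(c)) for c in set(colors))

# O(n): a single pass with dict.get
linear = {}
for c in colors:
    linear[c] = linear.get(c, 0) + 1

assert quadratic == linear  # same result, very different scaling
```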


I have no idea how the author found this code more complex. Probably he's still more used to imperative programming, but functional programming styles and one-liners will mostly greatly simplify the understanding of the code.


The try block method would be considered bad in most languages, and I hope it is considered bad in Python as well. Using exception handling as part of a normal flow of control is bad style, bad taste, and bad for performance.

EDIT: I'm glad to see the Python documentation addresses this: https://docs.python.org/2/faq/design.html#how-fast-are-excep...


Essentially, in CPython, you're already paying for an exception check on either a lot of the opcodes or all of them, so you might as well use it. Statements are very rich in Python, so generally to get the best CPython performance the name of the game is to get as much out of each one as you can, and minimize the total number of them. (I have to specify CPython because PyPy can be much more complicated.)

Certainly nowadays that would be in bad style, but, back in 1.5 it wouldn't have been so bad. Also, when 1.5 was out, 200MHz was still a pretty decent machine! The little things could matter quite a lot.


It's not considered quite that bad. In many languages, using exceptions for non-exceptional flow control is bad style. Python is a bit more ambivalent about this, for instance take a look at the iterator protocol which uses an exception to signal the iterator is done.

https://docs.python.org/2/library/stdtypes.html#iterator-typ...
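A minimal sketch of that protocol (the class name is mine): __next__ raises StopIteration to signal exhaustion, and for loops and constructors like list() catch it internally.

```python
class CountDown:
    """Minimal iterator: raises StopIteration when exhausted."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

print(list(CountDown(3)))  # [3, 2, 1]
```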


That turned out to be not that great of an idea, and it's changed in Python 3.5: https://www.python.org/dev/peps/pep-0479/


That's not a change to the iterator protocol. It's a change to 'StopIteration handling inside generators'.

The following will still raise a StopIteration in Python 3.5+:

    >>> next(iter([]))


Yep, the fundamental termination mechanism seems to be the same. Years and years ago I got into some silly irc nerdgument with a python expert (I think one of the twisted people) about the ugliness of this design and for a moment I thought I got to triumphantly yell 'Told you so!' a decade later. Alas, not the case.


I take it you prefer the explicit test for the end of iteration?

As an historic note, the StopIteration form grew out of the earlier iterator form, which called __getitem__ with successive integers until there was an IndexError. That may explain a bias towards an exception-based approach.
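A sketch of that older sequence protocol (the class name is mine). It still works today, because iter() falls back to it when __iter__ is absent:

```python
class Squares:
    # Pre-iterator "sequence protocol": iteration calls __getitem__
    # with 0, 1, 2, ... until IndexError is raised.
    def __getitem__(self, i):
        if i >= 4:
            raise IndexError(i)
        return i * i

print(list(Squares()))  # [0, 1, 4, 9]
```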


I probably do, although in practice, given how well (and composable) python comprehensions/iterators/generators have turned out, getting all worked up about some implementation detail now seems a bit churlish and pointless.


Exceptions in python have historically been very cheap.


That's not true.

There was a paper at the 1997 Python conference on this topic titled "Standard Class Exceptions in Python". A copy is at https://web.archive.org/web/20030610173145/http://barry.wars... . It evaluated the performance of try/except vs. has_key and concluded:

> This indicates that the has_key() idiom is usually the best one to choose, both because it is usually faster than the exception idiom, and because its costs are less variable.

The take-home lesson is that actually raising an exception in Python 1.5 was about 10x more expensive than a function call, but the try/except block when there is no exception was not expensive.
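A quick way to re-check that claim on a modern interpreter is a sketch with the stdlib timeit module (exact numbers will vary by machine; only the ratio matters):

```python
import timeit

setup = "d = {'key': 1}"

# try/except that never raises: roughly the cost of the lookup alone
no_raise = timeit.timeit(
    "try:\n    d['key']\nexcept KeyError:\n    pass",
    setup=setup, number=100000)

# try/except that raises every time: pays for raising and catching
always_raise = timeit.timeit(
    "try:\n    d['missing']\nexcept KeyError:\n    pass",
    setup=setup, number=100000)

print(no_raise, always_raise)  # raising is typically several times slower
```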


Interesting. My information was not only out of date; it was also wrong. The dangers of cargo-culting, although in my case, more theoretical than real, as I never had performance-sensitive python code in production.


It really depends. Usually, you would do something like this in Python:

    def whatever(some_string):
        try:
            return some_string.split()
        except AttributeError:
            return some_other_parsing_stuff(some_string)

Instead of checking the type of some_string, or seeing if it has the method split. The reason: it's more straightforward, and it instantly tells the reader this function is meant to handle strings, and it will split them. If it gets a non-string for some reason, oh well, it'll still handle it.

You would check for values in a circumstance like this:

    def dict_breaker(some_dict):
        if 'items' in some_dict:
            return parse_some_items(some_dict['items'])


The latter isn't actually the typical Python style. As the article discusses, Python generally prefers "EAFP"[1] to "LBYL"[2], e.g.

  def dict_breaker(some_dict):
      try:
          return parse_some_items(some_dict['items'])
      except KeyError:
          pass
    
Or even

  def dict_breaker(some_dict):
      try:
          items = some_dict['items']
      except KeyError:
          pass
      else:
          return parse_some_items(items)
These versions may be more efficient (only one hash computation and one interaction with the hashmap, though this probably won't be visible with integers/short strings), and don't suffer from race conditions in concurrent code where the hashmap is modified between the check and the use (yes, this can occur even with the GIL).

[1]: https://docs.python.org/2/glossary.html#term-eafp [2]: https://docs.python.org/2/glossary.html#term-lbyl


Is there a reason you chose to `pass` instead of the more explicit `return None`? The former seems like it would be less idiomatic since its return value is not explicitly stated.


To emulate the parent exactly e.g. maybe it is just the prefix of the "dict_breaker" function and other things happen later if the key can't be found.


I would rephrase it. The try-block method in this example is not useful. There is only one exception we care about, and that's a missing key (the dict equivalent of a 404).


> Per the Zen of Python, “there should be one– and preferably only one– obvious way to do it”. This is an aspirational message.

"aspirational" is an optimistic way to describe it. As the article illustrates, every ~2 years the "pythonic" approach is different. Innovation is good, but it hurts re-use of code and of skills.


I was expecting a larger difference in times, about 3.5x is the largest spread. https://gist.github.com/treyhunner/0987601f960a5617a1be

Nice article.


This makes me appreciate autovivification and casting in perl so that you can just say "$color_counts{$color} += 1" without all the initialization.
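For comparison, the defaultdict(int) approach from the article gets Python close to that one-liner: missing keys spring into existence with a default of 0, so no explicit initialization is needed.

```python
from collections import defaultdict

colors = ["brown", "red", "green", "yellow", "yellow", "brown", "brown", "black"]

color_counts = defaultdict(int)  # missing keys default to int() == 0
for color in colors:
    color_counts[color] += 1

print(dict(color_counts))
```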


Yes! This was one of the biggest things I missed when I moved to Ruby. I try and tell people how great autovivification is but unless they've coded with it the feature just sounds strange. But it lets you build some really great data structures on the fly!


I love it as well, and would hate to do without it, but it does come with its own warts. Such as this:

    use Data::Dumper;
    my %h;
    if ( $h{foo}{bar}{baz} ) { say "Never happens"; }
    say Dumper \%h;
And you get this:

    $VAR1 = {
              'foo' => {
                         'bar' => {}
                       }
            };


You'd want to use the exists operator in that case. It checks if the hash key is present without auto vivifying it.

  my %hash = ();
  if (exists $hash{foo}) {print "This doesn't run!";}


I'm well aware of exists, and how it functions, and that it in no way solves the problem presented. Exists tests for the existence of a hash key, so you can tell if it exists but is possibly undefined, but it does not stop autovivification in any way.

Your example doesn't cause autovivification even without exists. Autovivification is the automatic creation of the underlying hashes and arrays in a multiple level data structure when they are used while accessing a nested data-structure.

For example, given an empty hash %hash, $hash{foo} does not cause autovivification, but $hash{foo}{bar} will automatically create an empty hash and assign a reference to it to $hash{foo}.


Fwiw Perl 6 only autovivifies anything, including intermediary data structure levels, when writing to a data structure. So nothing happens in this case:

  my %h;
  dd %h
  if %h<foo><bar><baz> { say "Never happens" }
  dd %h
prints:

  Hash $var = {}
  Hash $var = {}


Nice, I wasn't aware of that feature, but the features I'm still not aware of in Perl 6 could fill a small book, since I still haven't gotten around to using it for much. :/


Asking as a Perl ignoramus, what does this give you over the defaultdict(int) example from the article?

Does autovivification mean that the default value to use depends on the operation? eg 0 for addition and "" for concatenation?


I'm a Python ignoramus, but I think they're the same, except for Perl's being more concise. How would you do a 3-level hash in Python, though?

What you call "value to use depends on the operation" is called casting. Depending on the context, a variable can be different values. For example a variable that has been defined but nothing else is treated as 0 if you try to add a number to it. It's an empty string if you use it in a string operation.

Wikipedia explains it well: autovivification is the automatic creation of new arrays and hashes as required every time an undefined value is dereferenced. Perl autovivification allows a programmer to refer to a structured variable, and arbitrary sub-elements of that structured variable, without expressly declaring the existence of the variable and its complete structure beforehand.


> I'm a Python ignoramus, but I think they're the same, except for Perl's being more concise.

Isn't everything always more concise in Perl? No matter what language it's compared to :)

> How would you do a 3-level hash though in Python ?

You might/should be able to wrangle that with some customisation of the default_factory or a defaultdict subclass, but yeah the arbitrary depth wouldn't be as 'plug n play' as with Perl.
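One common sketch of that wrangling is a recursively self-defaulting defaultdict (the helper name `tree` is mine):

```python
from collections import defaultdict

def tree():
    # Each missing key autovivifies another tree() level.
    return defaultdict(tree)

d = tree()
d['foo']['bar']['baz'] = 1  # intermediate dicts spring into existence
print(d['foo']['bar']['baz'])  # 1
```

Note that, like the Perl example above, merely *reading* a nested key (e.g. in an if test) also vivifies the intermediate levels, so it shares the same wart.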

> What you call "value to use depends on the operation" is called casting. Depending on the context, a variable can be different values. For example a variable that has been defined but nothing else is treated as 0 if you try to add a number to it. It's an empty string if you use it in a string operation.

Yeah Python's strong typing (usually) balks at automatic casting. You'd need to choose what the default is.


> Isn't everything always more concise in Perl? No matter what language it's compared to :)

Ha, you need more exposure to things like APL :)


So how do I know if "1" gets added to a 0 or to an empty string?


Perl uses separate operators for mathematical addition and string concatenation.

  1 + 1 == 2;
  1 . 1 eq 11;
It uses separate operators for most mathematical and string operations, for that matter. There isn't much of a meaningful difference between strings and numbers in Perl -- it converts between them depending on the operation.


I suppose there's some type inference there, and you'll get a default of 0 since 1 is an integer, not a string.


I miss that in Perl a lot.


  $ txr
  This is the TXR Lisp interactive listener of TXR 123.
  Use the :quit command or type Ctrl-D on empty line to exit.
  1> [hash-update [group-by identity
                            '(brown red green yellow yellow
                              brown brown black)]
                  length]
  #H(() (green 1) (red 1) (brown 3) (black 1) (yellow 2))
Form a hash by grouping like items into lists. The identity function is the key in the hash and the basis for equality, so the keys are colors, and the values are lists of colors.

Then update the hash values by filtering through the length function.


I guess this is off topic, but neat language.

But that algorithm allocates a bunch of intermediate lists and iterates through the hash table when it doesn't need to. Here it is in common lisp:

    (defun count-elements (lst)
      (loop
         with rval = (make-hash-table)
         for val in lst
         do
           (incf (gethash val rval 0))
         finally (return rval)))


It's better (by default) to optimize for programmer time and succinct expression unless something else takes priority, like running time, memory use, image size, etc.

The TXR version of the iterative approach:

  1> (let ((h (hash)))
       (each ((c '(brown red green yellow yellow brown brown black)))
         (inc [h c 0]))  ;; (inc (gethash c h 0)) can be used
       h)
  #H(() (black 1) (yellow 2) (green 1) (red 1) (brown 3))
Or:

  2> (let ((h (hash)))
       (mapdo (do inc [h @1 0])
              '(brown red green yellow yellow brown brown black))
       h)
  #H(() (black 1) (yellow 2) (green 1) (red 1) (brown 3))
I toyed with porting some loop macro implementation to TXR Lisp. I looked at some historic sources and just went "gack". Followed by, "what am I thinking".

I think in the end I would go for something like iterate, but even more Lispy. (Iterate still has "for x in y" type syntax in the clauses, just with parentheses around it; I would drop the infix "in" cruft and just have a symbol followed by nothing but semantic arguments)


I've come up with a function group-reduce which generalizes the construction of hashes from sequences, while streamlining the syntax. The idea is that the hash table entries are accumulators, and we can combine the grouping operation which distributes items into hash keys with reduction over the accumulators. For instance, take the integers 1 through 10, group them into even and odd, and sum the groups:

  1> [group-reduce (hash) evenp + (range 1 10) 0]
  #H(() (t 30) (nil 25))
Hence, evens add to 30, odds to 25.

Now with this, we can obtain a histogram easily, because a left (or right) reduce/fold can count items:

  2> [reduce-left (do inc @1)
       '(brown red green yellow yellow brown brown black) 0]
  8
The (do inc @1) gives us (lambda (blah . rest) (inc blah)) which can take two arguments (suitable for reduce) and just returns (+ 1 blah). We can use this reducer function with group-reduce:

  3> [group-reduce (hash) identity (do inc @1)
       '(brown red green yellow yellow brown brown black) 0]
  #H(() (black 1) (yellow 2) (green 1) (red 1) (brown 3))
group-reduce will be in the next release of TXR (124).

I deliberately made it accept an existing hash so it can be called repeatedly on the same hash to accumulate multiple jobs. That also answers the question of where to specify the hash table attributes.

One last detail is that there is an optional argument we are omitting in the group-by call: a filter function which is optionally applied to each element in the table after the accumulation is done. For instance, we can specify this as nreverse, and then we can express group-by using group-reduce.

First naive attempt: oops cons arguments wrong way:

  1> [group-reduce (hash) evenp cons (range 1 10) nil]
  #H(() (t (((((nil . 2) . 4) . 6) . 8) . 10)) (nil (((((nil . 1) . 3) . 5) . 7) . 9)))
Fix with flipargs. Better, but the groups are consed up in reverse:

  2> [group-reduce (hash) evenp [flipargs cons] (range 1 10) nil]
  #H(() (t (10 8 6 4 2)) (nil (9 7 5 3 1)))
Now the optional argument kicks in to fix this:

  3> [group-reduce (hash) evenp [flipargs cons] (range 1 10) nil nreverse]
  #H(() (t (2 4 6 8 10)) (nil (1 3 5 7 9)))
Compare with group-by:

  4> [group-by evenp (range 1 10)]
  #H(() (t (2 4 6 8 10)) (nil (1 3 5 7 9)))


Pedantry: this won't work with string keys, since make-hash-table defaults to an eql test, and distinct string objects aren't eql.


One way to do it in Perl 6:

  > <brown red green yellow yellow brown brown black>.Mix
  mix(red, yellow(2), brown(3), black, green)
See http://doc.perl6.org/type-composite.html for a little more info about Mix and other composite types shipped in stock Perl 6 distributions. (Warning: this end user doc is still very incomplete and immature.)


Would be interesting doing the same exercise for Java as well. We've been through Hashtable, HashMap, basic for loops, Enumeration, Iterator, foreach loops, generics, and now we have lambdas.


just for fun...

  import itertools
  colors = ["brown", "red", "green", "yellow", "yellow", "brown", "brown", "black"]
  dict([(color, len(list(grp))) for color, grp in itertools.groupby(sorted(colors))])
or

  dict([(color, len(list(filter(lambda c: c==color, colors)))) for color in set(colors)])
...because sometimes job security is important too.


Or this (not efficient but fun, and using a dict comprehension and no itertools):

    {a: sum(a == b for b in colors) for a in set(colors)}


I thought this was going to end around 2.4 with the .count(c) method, since you don't really need the dict at that point.


collections.Counter() was the slowest option according to the benchmark from 2010 http://stackoverflow.com/questions/2522152/python-is-a-dicti...


Great blog post Trey!



