
One of the things that most pisses me off about numpy, pandas, and the whole scipy ecosystem is that everything is "immediate mode": all operations evaluate _instantly_ to new arrays or dataframes. There's no opportunity for any component to evaluate a whole expression tree and optimize it, e.g., by loop hoisting.
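To make that concrete, here's roughly what a chained numpy expression does today (a sketch; the array size is arbitrary):

  import numpy as np

  a = np.random.rand(10_000_000)
  # Each call below materializes a full 10M-element temporary; no
  # component ever sees the whole expression, so nothing can fuse it.
  b = np.sin(a)   # temporary 1
  c = np.sqrt(b)  # temporary 2
  d = np.cos(c)   # result, after two wasted allocations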

The right way to design a data analysis DSL is to do it the way Python's Dask does: build an operation graph and execute it as the last step. The trouble is that Dask doesn't quite get it right either: as part of graph formation it computes the sizes of operands, and computing operand sizes can itself involve huge amounts of computation, so Dask, in effect, is also an "immediate mode" system.
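For reference, a minimal sketch of Dask's deferred style (shape and chunking chosen arbitrarily):

  import dask.array as da

  x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
  y = ((x + 1) ** 2).mean()  # builds a task graph; nothing executes yet
  result = y.compute()       # scheduling and execution happen here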

What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space. Query languages are infinitely nicer for analytics than data processing libraries.



Apache Spark has a great 'lazy' story, and I recommend it as a compute supplement to pandas and numpy. I'm always fascinated by the lengths the authors go to in optimising queries, from the cluster down to one's machine. Also, the integration with Apache Arrow over the years has borne much fruit, to the extent that a pyspark user can cross over to Pandas with little serde cost.
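A minimal sketch of the lazy pattern (the parquet path, column names, and aggregation are invented for illustration):

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.getOrCreate()
  spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

  df = spark.read.parquet("subscribers.parquet")
  agg = (df.filter(F.col("active"))
           .groupBy("segment")
           .agg(F.avg("usage").alias("avg_usage")))  # lazy: just a plan so far
  pdf = agg.toPandas()  # Catalyst optimises the plan; Arrow keeps serde cheap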

Fun story: about 18 months ago I was consulting for a telco. One of my tasks was to run a segmentation model on their subscriber base. They wanted me to use SAS on their cluster, but because I had 'brought my infra/laptop', I told them I could do it with Python instead. I got the gist of it working in Pandas, but it was 'slow'. They asked me if I was going to run it on a sample, but I said I'd port it to Spark in the evening.

The following morning I presented them with results. The port to Spark took about 2 hours (I hadn't used SparkML before), and the job took an hour to run on the whole population.


It really is remarkable that Python's numerical computing libraries have such poor performance. When doing chains of elementwise operations on large arrays (such as the elementwise toy example `cos(sqrt(sin(pow(array, 2))))`), Julia appears to outperform Python by a factor of 2! Numpy cannot avoid computing each intermediate array, which means it has to allocate a ton of wasteful memory. Meanwhile, Julia does the smart thing: it coalesces all the operations into one and applies that single operation elementwise, allocating only a single new array.
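What fusion buys you, sketched in Python (the pure-Python loop is only to show the shape of the fused kernel, not its speed):

  import numpy as np

  a = np.random.rand(1_000_000)

  # numpy today: three temporaries plus the result
  slow = np.cos(np.sqrt(np.sin(a ** 2)))

  # a fused kernel: one pass over the data, one allocation
  fused = np.empty_like(a)
  for i in range(a.size):
    fused[i] = np.cos(np.sqrt(np.sin(a[i] ** 2)))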

Pandas likewise does not defer computations, which means that Boolean expressions referencing the same data multiple times must make multiple passes over that data. Absurd.
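For example (a sketch; column and thresholds arbitrary):

  import pandas as pd

  df = pd.DataFrame({"x": range(1_000_000)})
  # Each comparison scans the column separately, then & scans both results:
  mask = (df["x"] > 10) & (df["x"] < 1000)
  subset = df[mask]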


numpy appears to have optional arguments for the storage location of outputs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c... - Could you elaborate a little further? The syntax might not be as nice as another language or framework, but it's not "unavoidable".
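Going just by the docs, the pattern would look something like this (an untested sketch; array size arbitrary):

  import numpy as np

  a = np.random.rand(10_000_000)
  buf = np.empty_like(a)
  np.power(a, 2, out=buf)  # write into a preallocated buffer
  np.sin(buf, out=buf)     # reuse it for every stage of the chain
  np.sqrt(buf, out=buf)
  np.cos(buf, out=buf)     # same result, no intermediate allocations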

Disclaimer: have never used numpy, have used python fairly extensively.


You are right about the `out` argument, I'd forgotten about that. But even avoiding the wasteful memory allocations, numpy is still about 60% slower than Julia, since it makes multiple passes over the input data. (If there's a way to get numpy to make just a single pass over the data and remain performant, I'd love to know.)


The first thing I'd reach for is to refactor into a list comprehension. Looks like this is the proper way to do it:

  import numpy as np

  for x in np.nditer(a, op_flags=['readwrite']):
    x[...] = np.cos(np.sqrt(np.sin(x ** 2)))
That's some gnarly syntax though. I've never seen that ellipsis operator before.

Edit: just read about `Ellipsis`. I'm a fan, even if it's sort of nonstandard across libraries. Those readwrite flags are a travesty, but at least you can paper over them with a helper function.

Something like:

  import numpy as np

  def np_apply(a, f):
    for x in np.nditer(a, op_flags=['readwrite']):
      x[...] = f(x)

  np_apply(a, lambda x: np.cos(np.sqrt(np.sin(x ** 2))))
or

  import numpy as np

  def np_apply(a, *fns):
    for x in np.nditer(a, op_flags=['readwrite']):
      y = x
      for f in fns:
        y = f(y)
      x[...] = y

  np_apply(a, lambda x: x ** 2, np.sin, np.sqrt, np.cos)
Edit3: There's a way to turn this into "pythonic" list comprehension code, but it would probably only make it look prettier, not more performant.


Yes, you can use the out arguments, but doing so 1) complicates your code, and 2) isn't necessarily a win. You have to do the same number of memory write bus cycles whether you write to a specified output array or to some new array.


I agree with point (1) wholeheartedly - by default the syntax is ugly unless you wrap it in a utility. (2) isn't true as far as I can tell - you can keep reusing the same output buffer, or double-buffer if the output can't be the same as the input.

Either way see https://news.ycombinator.com/item?id=22786958 where I figured out the syntax to do it in place, which shouldn't result in any allocations at all and should solve both your concerns.


Regarding #2: whether you use the same buffer or a new buffer, you still have to actually write to the memory. The bottleneck at numpy scale isn't the memory allocation (mmap or whatever) but the actual memory writes (saturating the memory bus), and you need to perform the same number of writes no matter which destination array you use.


It should still be a perf gain:

1. Memory allocation isn't free.

2. Doing multiple loops over the buffer (once per operation) is going to be slower than doing all the operations at once, both because of caching and because of the opportunity to keep values in a register and write only at the end (though who knows whether the Python interpreter will do that).


The Python interpreter is too stupid to do operation coalescing. That's the whole point of my initial comment.


JAX supports this, as do other performance-oriented numerical libraries. There are hoops, of course.
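Roughly (a sketch; under jax.jit, XLA traces the whole expression and can fuse the elementwise chain into one kernel):

  import jax
  import jax.numpy as jnp

  @jax.jit
  def fused(a):
    return jnp.cos(jnp.sqrt(jnp.sin(a ** 2)))

  out = fused(jnp.arange(1_000_000, dtype=jnp.float32))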


> Julia appears to outperform Python by a factor 2

Now I have some more reading to do. Thanks.


Julia is very, very good. I say this as a person who has invested about a decade in the Python data science ecosystem. Numpy will probably always be copy-heavy, whereas Julia's JIT can optimize as aggressively as the impressive type system allows.


Have you heard of Weld [1]?

It is an ongoing effort to build an intermediate representation that numerical libraries could share to produce an optimized instruction tree.

[1] https://github.com/weld-project/weld


I never understood why we limit ourselves to sending SQL to databases. Instead of sending SQL we should be sending actual logical plans to the database. These logical plans could then be optimized by the database and executed. This would solve two issues I have with SQL:

1. Without CTEs, SQL isn't that expressive, and with CTEs, queries can be hard to understand. Logical query plans, on the other hand, are rather easy to understand and can express more queries than SQL can.

2. SQL is hard to modify programmatically. To most software, SQL is simply a string without any inherent meaning. A logical query plan, on the other hand, is a tree structure that can be introspected and modified.

That's basically the same thing you want, except that you want it implemented on dataframes while I want it implemented for database systems.
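For illustration, a toy logical plan as plain data (the node names are invented):

  from dataclasses import dataclass

  @dataclass
  class Scan:
    table: str

  @dataclass
  class Filter:
    child: object
    predicate: str

  @dataclass
  class Project:
    child: object
    columns: list

  # "SELECT name FROM users WHERE age > 30" as an explicit, introspectable
  # tree that software can build, inspect, and rewrite without parsing strings:
  plan = Project(Filter(Scan("users"), "age > 30"), ["name"])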


Isn't the alternative, in the end, a computation DSL and an optimizing compiler? GPU HLLs like Halide, Futhark, etc. show the way here.


>What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space.

Eigen does this via expression templates; it essentially optimises the computation graph at compile time.
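A toy Python analogue of the idea (Eigen does this with C++ templates at compile time; the names here are invented):

  class Expr:
    def __add__(self, other):
      return Add(self, other)

  class Vec(Expr):
    def __init__(self, data):
      self.data = data
    def at(self, i):
      return self.data[i]

  class Add(Expr):
    def __init__(self, lhs, rhs):
      self.lhs, self.rhs = lhs, rhs
    def at(self, i):  # evaluate lazily, element by element
      return self.lhs.at(i) + self.rhs.at(i)

  # a + b + c builds a tree; one loop then evaluates the fused expression
  a, b, c = Vec([1, 2]), Vec([3, 4]), Vec([5, 6])
  expr = a + b + c
  result = [expr.at(i) for i in range(2)]  # [9, 12]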


> Eigen does this via expression templates; it essentially optimises the computation graph at compile time.

That's wrong too. The right query execution plan can depend heavily on the sizes and statistical distributions of the operands. Eigen doesn't have enough information at compile time to make good decisions in all cases. This kind of computation really calls for a JIT.


Do you have an opinion on Julia?



