What's wrong with statistics in Julia?

vegabook · on Nov 29, 2014

Is Julia seriously targeting the R user base? Or, if it is honest with itself, is it going after the matlab people first. My sense is that engineers (Matlab users) and to a certain extent scientists (Python rules) will be drawn in, but the stats crowd requires subtly different priorities, that seem to be alluded to here. Graphics are one such priority. The excellent ggplot2 gets all the glory, but base graphics are mega-fast, extremely robust, and deeply customizable, and there is the unsung hero of grid graphics which provides an extraordinary set of abstractions for building things like ggplot2 and/or Lattice. My point is that so much graphical quality speaks directly to one of the key requirements of statisticians, where at the end of the analysis, communication is usually required. This is much less the case for engineers or financiers (big matlab users) for example, where correct and fast answers are the endpoint. Where is Julia on graphics? Last time I checked it was still trying to interface to R and/or Matplotlib.

The other thing that intrigues me is Julia's scalar computations being "at least" as important as vectors. This has the whiff of For loops (an immediate eye-roller for R people) accustomed to vectorization everywhere and essentially, exclusively. I am not suggesting that Julia doesn't do vectors well, just that, like any set of priorities, it is not catering first for statisticians, whose requirements are often quite different from those of scientists and engineers who use Matlab and Python.

complicated · on Nov 30, 2014

In regards to graphics, there is really nice work being done by Daniel Jones. His Gadfly.jl (https://github.com/dcjones/Gadfly.jl) package implements the grammar of graphics and it can already produce high quality plots similar to ggplot2.

I will also disagree with your statement about Julia not catering to statisticians. Especially as datasets grow, performance keeps becoming more and more important. And this essentially means that statisticians implementing new methods in R packages have to drop down to C++ via Rcpp anyway. And this is exactly the problem Julia tackles. Of course, you could argue that practitioners do not care about the trouble the method developer went through in order to optimize the code via Rcpp. Yet, the practitioner will just use the language which offers the statistical tools he needs, thus Julia will draw him in, if it can first attract method developers.

There's also the people working on systems biology. People there usually use Matlab to do all of their simulations, modelling, fitting the data, etc. Yet, for some of the data, such as gene expression data, which is fed into the models, it is a lot easier to analyze/normalize it with packages available in R/BioConductor. Here Julia could also provide a unifying interface, so you won't need to be constantly switching languages.

vegabook · on Nov 30, 2014

For my own use case, and I use 5-10-giga-plus financial tick datasets, I find Numpy to outperform Julia 3:1. R users who need perf have already found other ways (like me). I think this Julia performance argument is hugely overrated. When I don't need Numpy, R gives me more than everything I will eve need and I don't see where these developers are going to come from to reinvent the wheel for a language which does not target their user base. I just don't see what Julia brings to the table. Now if it had gone the whole hog and give us full infrastructure for ingesting massive streaming data, then it would have skated to where the puck is going to be (see Clojure/Storm). As it stands, the only main benefit I see for Julia, is as an open source (and better) replacement for Matlab.

complicated · on Nov 30, 2014

Actually, I think your point rather showcases that Julia could be useful for the statistics community! Due to performance reasons you occasionally have to abandon R to use Numpy, but for everything else you use R. Thus, in a sense you still have the problem that Julia tries to solve: You constantly have to jump between two different languages depending on the problem size.

Also, just like you chose Numpy, someone else might choose Julia. So this turns into a Numpy versus Julia comparison. And here I feel like (and this is not really an argument, just my gut feeling) Julia might be better at attracting people with a non-CS background, who want to implement some statistical methods or analyze their biological datasets.

vegabook · on Nov 30, 2014

Yes and yes. Julia is better because, having seen the mistakes of 3 x 20 year old languages, it is creating an (incrementally) better PythonRMatlab base language (not libraries). Had Julia dropped into my lap 5-10 years ago I would have been a fool to use anything else. The fact is though, it is very late to a party where everyone has already danced the night away, and just as the music is about to change (massively parallel multi node functional).

sgt101 · on Nov 30, 2014

I think (this is gut - not an authority!) that Julia is very well placed for the massively parallel multi-node functional future of which you speak. lazy.jl and the cluster implementations such as spark.jl are evolving.

infinite8s · on Nov 30, 2014

With true coroutine support it is only a matter of time before someone writes a pytoolz like package for Julia.

stewbrew · on Nov 30, 2014

"It's only a matter of time" in about any language. This isn't a valid argument.

yummyfajitas · on Nov 30, 2014

Julia does vectors well, but scalar computations are very important. The problem with vector computations:

    z = x*(v+w)

This involves a memory allocation for v+w, and a second one for the product x times result. One way to avoid this in numpy is careful use of out:

    z = zeros(shape=x.shape, dtype=float)
    add(v,w,out=z)
    multiply(x,z,out=z)

An even faster way, which only requires traversing the arrays once, and is more readable in my opinion, is a simple for loop:

    for i=1:len(x)
      z[i] = x[i]*(v[i]+w[i])
    end

This also lets you more easily put if statements inside loops, etc. You can also do accumulative calculations without creating intermediate arrays at all. I.e. one way to create a random walk is:

    walk = cumsum(bernoulli(p).rvs(1024))
    end_result = walk[-1]

Another way is:

    end_result = 0
    for i=1:1024
      if rand() < p
        end_result += 1
      end
    end

makeset · on Dec 1, 2014

This is not a problem with vector computations, but the particular implementation. The vector expression in your example is by far the most readable. It is literally the math you had in mind.

De-vectorization in code is like embedding ASM code: you had to write it out because the compiler sucks. Good language design should favor lucid and concise syntax, and good compiler implementation should make it not necessary to circumvent it for performance. In this case, the compiler should be implementing those vector expressions without allocating unnecessary memory.

WallWextra · on Nov 30, 2014

I've sort of been assuming that Julia is capable of inlining the vectorized notation into a single traversal over the array, at least for simple types. Is that not true?

EDIT: Not yet, I guess. http://pastebin.com/Tw5PuCcJ

data_scientist · on Nov 30, 2014

This problem is addressed by numexpr, which also handle multithreading.

dnautics · on Nov 30, 2014

gadfly interfaces with d3/svg, which while currently at the primitive stage, will ultimately be extremely powerful.

While julia 'prefers' for loops according to the docs, I tend to write things in vectorized form.... I have a toy neural net library I play around with (https://github.com/ityonemo/julia-ann/blob/master/NN.jl) - you'll note that the "nnlog" function is one line, vectorized, and the meat of the evaluated is a dot product. I did once write a benchmark comparing the unvectorized, for loop form of these matrix multiplications, and it was slower than the vectorized form (I didn't bother trying to figure out why).

nkurz · on Nov 29, 2014

I found some development discussion about the changes to missing values here: https://github.com/JuliaLang/julia/issues/8423

I've yet to try out Julia, but seeing productive and intelligent discussion like this followed quickly by execution certainly inspires confidence that it's a language with considerable promise.

username223 · on Nov 30, 2014

> In a future blog post, I’ll describe how Nullable works in greater detail.

This is the part that interests me. If you're not using sentinel values like NaN, then it seems like you're left with pointers (terrible) or tags and tag tests (also pretty bad). If Julia can't use the processor's SIMD instructions (or the GPU in the near future), it's not suitable for inner loops. Do you special-case "Nullable{Double}" to use NaN as its "NA" value?

johnmyleswhite · on Nov 30, 2014

Thankfully, the solution paths you're enumerating aren't the only viable solutions to the problem of representing missing values. Because Julia has value types (called immutables) and because Julia can compile functions based on their input arguments' types so that the functions don't need to do any run-time tag checking, a Nullable{Float64} value can be stored using at most 16 bytes -- 8 bytes for a Boolean flag indicating whether a value is present or not and 8 bytes for the actual 64-bit floating point value. The latter 8 bytes can store arbitrary bits if the Boolean flag is set to indicate that the Float64 value is missing.

If you're interested in the details of how Nullable{T} works, you can start with http://github.com/JuliaLang/julia/pull/8152 and then read through Julia's compiler code to understand how immutables are represented in memory and how functions that employ them are compiled to machine code.

UPDATE: Looking back at my answer, I should make clear that, when you want to work with SIMD or GPU operations, one can (and we will) use a representation of arrays of Nullable objects that places the values and the Boolean flags into two separate arrays. The values array is then identical to a normal 64-bit floating point array and allows all of the same operations. Setting the Boolean flags appropriately requires a bit more work. (How much work is strongly dependent on the exact computation being performed.)

nkurz · on Nov 30, 2014

Yikes. While this sidesteps some of the issues in the other approaches, it also doubles your memory usage and halves your potential SIMD performance. Maybe this is better than the downsides of the alternatives, but as someone working on optimizing large and slow R code with C++ extensions speaking to someone who is trying to replace R with something better, this seems like a choice that may haunt you in the future. If you can, I'd keep your options open for a "Structure of Arrays" approach or "buddy bitmap" that keeps your data contiguous and compact.

johnmyleswhite · on Nov 30, 2014

We'll keep re-evaluating this over time, but I'm currently not sure it's that serious a problem. For many numerically intensive computations, you'll want to move over to a pure array of 64-bit floating point values before the long-running computation begins, at which point you can adopt NaN == NA semantics if appropriate. I believe the solution I outlined in my updated comment above may already describe the buddy bitmap solution you're referring to. It certainly describes a solution that I would refer to as a struct-of-arrays.

username223 · on Nov 30, 2014

Thanks for the informative reply.

> ... when you want to work with SIMD or GPU operations, one can (and we will) use a representation of arrays of Nullable objects that places the values and the Boolean flags into two separate arrays.

"When" is pretty much "always" with arrays. Trusting a sufficiently smart compiler and runtime to get rid of the tag checks makes me nervous, though. This optimization needs to be user-predictable and/or -controllable to be truly useful. If only compiler gurus can understand when slow, tag-checking, scalar loops will be replaced with fast SIMD code, something has gone wrong.

johnmyleswhite · on Nov 30, 2014

We will probably be pushing people very heavily towards always using specialized arrays for working with nullable objects.

I'm a little surprised by your comment about sufficiently smart compilers, since Julia seems to have taken the exact opposite approach to compiler design. The language is systematically designed to require little intelligence from the compiler. Being able to safely drop tag checks inside of a function's body seems like a fairly simple part of the language's design, since the type of all inputs to a function and all function-local variables are, in general, easily knowable as invariants before a function call even begins evaluating.

peatmoss · on Nov 30, 2014

I've got a perhaps naïve question related to:

> In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables.

Namely, why is an embedded SQLite database not used for all things tabular in languages like R/Julia/Foo? I was thinking about this as I was attempting to reconstruct a visualization in Racket using their (pretty good!) 2d plotting facilities and lamenting not having a tabular data structure.

SQLite is embeddable. It has fast in-memory database support. It can operate reasonably quickly on data that is stored to disk. It supports indexing. NULL values already exist to represent missingness. SQLite allows for callbacks to user-supplied functions that I'd imagine could be created relatively easy in something like Julia.

As a side benefit, it seems like a SQLite-oriented tabular data store could be extended, like dplyr has done, to support other databases.

When I think about the use cases I've encountered where I've found myself reaching for DataFrame or data.frame, I am struggling to think how a tightly integrated SQLite wouldn't work.

Are there Computer Science reasons why this is a silly idea? I know Pandas claims nominally better performance than SQLite in some circumstances, but then again SQLite has also recently seen some substantial performance gains.

johnmyleswhite · on Nov 30, 2014

I am definitely going to use embedded SQLite in some of future prototypes. The only problem is that SQLite doesn't, to my knowledge, support all of the features I'd like Julia to support -- including columns that contain arrays, dictionaries and even complete tables. But, that said, it supports >90% of the features I'd be interested in.

peatmoss · on Nov 30, 2014

Ah, that's a very succinct reason. Because my needs are somewhat ordinary, I often forget about the notion that columns might be anything other than vectors containing single values. I wonder if some of the more complex datatypes could be added to SQLite in the same sort of way that Spatialite has added geospatial geometries as column types. In fact, I wonder if some of those geospatial types might be abused to push past the 90% use case barrier.

Hmm... Now I'm wondering what level of extension to SQLite would be required to create a shared library that could be plugged in to any of the various languages that need dataframe functionality. DataLite?

johnmyleswhite · on Nov 30, 2014

I'm not sure. I think, for many users, SQLite is sufficient, which is why I'll be using it for prototype work in the future. Trying to add things like arrays seems like a lot of work, since you also need to add things like HQL's EXPLODE at the same time.

tlmr · on Nov 30, 2014

Check out continuum's blaze project for python. Provides table and array abstractions for a variety of backends, including SQL solutions.

soegaard · on Dec 1, 2014

Racket comes with support for SQLite (the index of Scribble is built with help from SQlite for example). That is, it is possible to make such a tabular data structure with a reasonable effort.

It would be interesting to hear what kind of features such a tabular data structure. A post on the Racket mailing list is welcome.

peatmoss · on Dec 4, 2014

Thank you for the example of SQLite being used with Scribble. I was playing a little bit with the sqlite interface, but using it rather naively to pass queries and return values.

Not quite sure I have the Racket chops to implement something like a data.frame abstraction over an in-memory SQLite database (or even dplyr style query construction). Maybe a project for after I get off Hacker News and finish up a couple of articles that have needed finishing...

howeman · on Nov 30, 2014

I'd be interested in seeing things you think are done correctly as well. We're working to build a statistics package in go as well (github.com/gonum/stat). It's good that people are taking a fresh look. Our capabilities are clearly limited at the moment, and the features of Go are quite different from those of Julia, but there's still a lot that can be learned in common.

johnmyleswhite · on Nov 30, 2014

To echo what I said in the post, I think using Nullable as the representation of missing values is the right strategy. I think it achieves sufficiently strong performance while also providing a lot of other desirable features.

As for other things that we've done well, I think the Distributions package is one of the best designed interfaces to probability distributions I've ever seen in any programming language. I don't know if Go supports value types, but the existence of value types in Julia allows each distribution to have its own distinct type without having to pay any cost for maintaining object identity. When combined with Julia's ability to do multiple dispatch, the Distributions package allows one to encode analytic knowledge about probability distributions directly into the type system. And existence of parametric types lets you talk about things like location-scale families in a natural way.

I'd also encourage you to look at the optimization libraries that have been built for Julia. The language makes it easy to do automatic differentiation, which is probably the most under-utilized tool in all of scientific computing. In general, I think there's really high quality work being done on optimization -- which I see as essential for enabling ad hoc statistical modeling.

nyir · on Nov 29, 2014

Isn't the rather big flexibility of R to not eagerly evaluate, or to rewrite function arguments entirely (and apparently scope manipulation, that's a new one for me) one of the major points of critic? It always seemed to me that having "proper" macros is a selling point rather than only an approximation/emulation.

jacobolus · on Nov 30, 2014

> Isn't the rather big flexibility of R to not eagerly evaluate, or to rewrite function arguments entirely (and apparently scope manipulation, that's a new one for me) one of the major points of critic [sic]?

For anyone coming from a programming background, the way R deals with scope and function arguments is incredibly confusing and frustrating, and causes all kinds of nasty bugs. It seems pathological.

For several of the academic statisticians I’ve talked to though, R seems to be their first programming language, and they learn (in a slightly fuzzy way, but enough to do their work) the behavior of its APIs without thinking too deeply about issues of scope or eager vs. lazy argument evaluation, etc. However it works is “just the way it is.” When their expectations about what it should do are violated, they futz with the code until it starts working again and call it a day. People are typically using R to solve narrow concrete problems instead of building systems out of it, so most end-users don’t really have practice with thinking about API design. (R is similar to Matlab in this way.)

Personally I think many of R’s idioms are net negative for the language because they make the abstractions much less simple/clean, and make it more difficult to solve problems in general/adaptable ways. But for someone who already has substantial amounts of time invested in learning R’s API quirks, and doesn’t know any other way, it might not seem like a big deal.

stewbrew · on Nov 30, 2014

R (or rather S) wasn't intended to be a general purpose programming language. It was written as a glue language (or rather a DSL) for low-level code for _interactive_ use in the domain of statistics. The goal to have a language that should be used interactively in a REPL IMHO explains many of those "quirks". You should take R for what it is.

johnmyleswhite · on Nov 30, 2014

A fun test to see if R users understand how R works is to ask them to predict the behavior of the following snippet of code:

foo <- function(a, b) { if (a == 0) { return(1) } else { return(2) } }

foo(0, stop("Error!"))

foo(1, stop("Error!"))

Hansi · on Nov 30, 2014

Why? Seems to simple. I normally go straight for drop=FALSE the most evil default issue of the language.

jghn · on Nov 30, 2014

Agreed. There are many WTF things I can think of in R, this is absolutely not one of them.

johnmyleswhite · on Nov 30, 2014

My experience asking people about that snippet is that very few people can correctly predict the results of that function without actually executing the code. What interests me about the question is that many people's mental model of how R programs are evaluated is often too vague to enable them to produce sharp predictions about program behavior prior to executing code.

RA_Fisher · on Nov 30, 2014

This is funny because I was heavily skeptical yet completely reversed my opinion within 10 minutes. I correctly predicted the results, but I program R professionally. Then I asked my partner who's a non-programmer to read the snippet and her prediction was that the program would stop. That led me to reflect back on your statement. I think the key word is "sharp". I correctly predicted the result, but that's only because I trust lazy-eval heavily. Even then, it's hard to really cement trust in lazy-eval (for me) as it's difficult to mentally map. That's not exactly sharp after all!

jghn · on Nov 30, 2014

"but I program R professionally"

To be fair to JMW, I used to as well - and perhaps more importantly I was writing end user software in R and not doing my own data wrangling.

coldtea · on Nov 30, 2014

>My experience asking people about that snippet is that very few people can correctly predict the results of that function without actually executing the code.

Which is still not any serious barrier in using R succesfully.

stewbrew · on Nov 30, 2014

Lazy evaluation in R isn't that difficult to grasp once you accept the fact and don't superimpose an evaluation model dominant in most other languages. It turns out, this kind of lazy evaluation can actually be quite useful.

jghn · on Nov 30, 2014

My first answer was correct, although i'll admit to having run the snippet afterwards as I figured there was some trick in there that I couldn't see.

johnmyleswhite · on Nov 30, 2014

I didn't mean to imply I thought you would guess the wrong answer, but rather that most R users do not understand delayed evaluation well enough to understand how it will impact their code. Sorry if my comment seemed rude.

jghn · on Nov 30, 2014

I didn't find it rude, I just have trouble seeing it as a snippet which would really cause confusion. If you have seen people get confused by it though, I suppose it is.

johnmyleswhite · on Nov 29, 2014

I fully share that viewpoint, but there are many R users who believe that R's idiosyncratic evaluation rules are essential to writing "readable" code.

stewbrew · on Nov 30, 2014

They sometimes allow to write functions for which you would have to use macros in other languages.

lottin · on Nov 30, 2014

One of Julia's main selling points has been speed. They said there was no reason a dynamic language had to be slow, but now we see that as soon as they start adding basic features, such as support for missing values, it's beginning to take a toll on performance. I wonder if Julia 1.0 will be any faster than R or Python.

ninjin · on Nov 30, 2014

> ...now we see that as soon as they start adding basic features, such as support for missing values, it's beginning to take a toll on performance.

Of course there is a price to pay for adding dynamic features, but only when you use them. This has been the case for Julia as long as I can remember (for example, fields of type `Any`). As far as I can see, isn't the whole point of Julia that you can go towards the dynamic end of the spectrum if you need to, while still being able to generate blazingly fast code by adhering to type constraints wherever it is necessary? All this within the same language, rather than with R and Python where you need to write extensions in a lower level language and deal with a C API.

johnmyleswhite · on Nov 30, 2014

I'd encourage you to read the post again, since it's primarily focused on describing design patterns that allow one to support missing values without taking a performance hit.

gajomi · on Nov 30, 2014

I am a big fan of much of the work going on with statistics in Julia.

I'd like to point out another (related) thing though that is wrong with the state of statistics in Julia, in my opinion. Actually my problem has less to do with statistics, per se, than the mathematical foundations of statistical calculations. There are, of course, many different approaches to statistical inference (the Bayesian vs. frequentist camps infamous among these), but the calculations all come down to reasoning about probabilities which is a well posed but not always easy task. The Julia developers, in their wisdom, have recognized this, and as such have put together Distributions.jl the purpose of which is to provide datatypes and utility functions over probability distributions (plus some vestigial methods about maximum likelihood inference, which thankfully stay out of the way). If you haven't seen it, check it out. It's got a nice design, I think.

But there is presently a clear server limitation: the parameterized type hierarchy. The requirement is that every distribution have support over a set in which all members are either Univariate, Multivariate or Matrixvariate with elements in the fields being either Discrete or Continuous. This obviously misses the general picture of the kinds of sets from which one draws random variables, which plays into issues that the article mentions (if the probability distribution datatype can't model the data you have to do ad hoc things to account for it). Indeed a huge chunk of the issues currently open in the Distributions github page essential boil down to problems with representing the sets and/or spaces from which elements in the distribution are drawn:

https://github.com/JuliaStats/Distributions.jl/issues/147 https://github.com/JuliaStats/Distributions.jl/issues/309 https://github.com/JuliaStats/Distributions.jl/issues/224 https://github.com/JuliaStats/Distributions.jl/issues/283

End users have a variety of well developed ideas in mind about the sets that their samples belong to and even the spaces from which they are drawn from which currently exceed the representational capacity of the existing types. In my opinion the way to fix this is to start a separate library that focuses on types and methods for representing and manipulating sets and spaces (i.e. topological information attached to sets). This could then be consumed by the Distributions people as well as others modeling things outside of probability.

johnmyleswhite · on Nov 30, 2014

I think it would be great to develop support for generic sets in a separate package that could eventually be merged into Distributions. I'm not sure that I think all of the issues you've linked to are related to that problem, though. For example, the issue about using Vector{Rational} as an input to the Categorical distribution type's constructor is a relatively minor issue since there are very few cases in which you'd want to evaluate the PDF of a Categorical variable using rational numbers rather than floating-point numbers. As such, it's not clear that you'd want to have a Categorical{Rational} type that's separate from the current Categorical type (which could be called the Categorical{Float64} type).

gajomi · on Nov 30, 2014

>I'm not sure that I think all of the issues you've linked to are related to that problem, though

It's probably true that many of these issues could be resolved without the need for this generic sets fantasy of mine. The resolution that you propose for Categorical{Rational} I think makes the right choice, given the tradeoffs that one currently has to make in designing these things.

The particular reason that I considered this issue to be related to the problem of the representation of sets is that if it is true that both the support of such a distribution and the values of the pdf were rationals one could support exact computation of moments and/or cumulants and/or coefficients of generating functions and the like. The "exactness" of these procedures becomes important if there is important information in the factorization of said moments/coefficients/whatever. In fact in these cases one actually might care to know not just that they are Rationals but Rationals with such and such restrictions, which is where more general type parameters would really start to shine. So perhaps someday somebody will want it...

johnmyleswhite · on Nov 30, 2014

That's true if Rational means Rational{BigInt}, but the default in Julia is for Rational to mean Rational{Int}, which can potentially produce less reliable answers that using Float64.