vsbuffalo's comments | Hacker News

I really like all of the plotting systems in R. First, I used base graphics for a few years, and loved it. You learn your way around par() and commit esoteric argument names to memory (oma, mar, mgp, mfrow, etc.). It feels powerful: you're just drawing on a screen, and its history traces back to the original pen plotters. Second, I learned lattice. You can't help but fall in love with lattice after a year or two of creating panel plots in base graphics. The biggest learning curve with lattice is panel functions, but once you learn to throw a browser() into a panel function and inspect the variables in scope, you can do anything. Somewhere on a dusty bookshelf is a well-worn lattice book I splurged on while taking an R course at UCD.
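
To give a flavor of what I mean by panel functions, here's a minimal sketch (the data frame d and its columns x, y, and grp are made up for illustration):

    library(lattice)

    # one panel per level of grp; the panel function is called once per panel
    xyplot(y ~ x | grp, data = d,
           panel = function(x, y, ...) {
             # browser()  # uncomment to drop into the debugger and poke around
             panel.xyplot(x, y, ...)
             panel.lmline(x, y, col = "grey40")
           })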

I like this article because I think the author has a point about production graphics. If you're placing lines, points, and labels on a screen, you can create anything. You can draw polygons and arcs. It's like drawing with raw SVG. But I'd have a hard time thinking of an exploratory data analysis situation in which I wouldn't reach for ggplot2 first. Since it looks at the data frame's column types (integers, factors, numerics), it automatically matches these to the appropriate type of color gradient. Coloring a scatter plot by a potential confounder is one additional argument to aes(), e.g. aes(x, y, color=other_col). More than once during EDA I've done this and seen some horrifying pattern in data that shouldn't be there. That's a powerful tool for one extra function argument: the cost of checking for a confounder with color (or shape) is essentially zero.

I'd make the case that this is a more costly operation in base graphics, and is thus much less likely to be done. You may already have your plots in a for loop to create panels, plus a few extra lines for adjusting margins and axes (rather than a single facet_wrap(~col)). It takes a lot of code to set that up, so there's already a lot of cruft when you just need a quick inspection. Then you need to create a color vector of the appropriate size and map it onto the data. Sure, it's easy-ish, but it takes at least twice as long as color=some_col, as in the sketch below. In EDA visualization, I want every single barrier to checking a confounder to be as small as possible, which is exactly what ggplot2 gives me.
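
Roughly, the comparison I have in mind (again with a made-up data frame d, numeric columns x and y, and a factor grp):

    library(ggplot2)

    # ggplot2: coloring by a potential confounder is one extra aesthetic
    ggplot(d, aes(x = x, y = y, color = grp)) +
      geom_point()

    # base graphics: build the color mapping by hand, then add a legend
    cols <- rainbow(nlevels(d$grp))
    plot(d$x, d$y, col = cols[as.integer(d$grp)], pch = 19)
    legend("topright", legend = levels(d$grp), col = cols, pch = 19)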

That said, I really liked this article because I do agree that going from EDA visualization to production is a hassle. Just after reading this, I remade some production ggplots with base graphics and love the simple aesthetic, which takes a fair amount of fiddling to mirror in ggplot2.
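
For what it's worth, getting part of the way toward that plain base-graphics look in ggplot2 involves something like the following (a sketch, not the exact theme I used; d is again a made-up data frame):

    library(ggplot2)

    ggplot(d, aes(x = x, y = y)) +
      geom_point(shape = 1) +            # open circles, like base plot()'s default
      theme_classic() +                  # drop the grey panel and grid lines
      theme(axis.line = element_line(color = "black"),
            panel.grid = element_blank(),
            legend.position = "none")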

What I really long for is a lower-level data-to-visualization mapping (like d3's) in R. d3 is a pain to learn, but it's really the only data abstraction, low-level as it is, that seems limitless in what it can do. I keep hoping a general data-join grammar like d3's becomes the norm, built on top of base plotting (the analogue of SVG elements), with abstractions like ggplot2 for tabular data built on top of that.


What do you think of Bokeh: https://github.com/DataWookie/MonthOfJulia


> that is seemingly limitless in what it does and can do

With great power comes terrible debugging!

Have you checked out rBokeh? Full browser interactivity, support for many more points than D3 (and way more if you turn on WebGL), support for both server-based and serverless interactivity, all straight from R. http://hafen.github.io/rbokeh/


There are quite a lot of R htmlwidgets which interface with d3.[0]

[0] http://www.htmlwidgets.org/showcase_metricsgraphics.html


You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states; they are the entire population. Similarly, the Correlates of War [3] data is every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.

Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)


> Similarly, the Correlates of War [3] data is every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.

It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is indeed the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest, but is taken to be a representative sample of a broader population (and while there may be specific reasons to criticize this in particular cases, they are reasons other than "it's the whole population, not a sample").


Exactly. There is an interpretation in which the "population" is a mathematical ideal process (with potentially infinite information content) and any real, physical manifestation of it is considered a "sample".

The old-school interpretation is stricter and considers both the "population" and the "sample" to be real, physical things. That's understandable, because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies, etc. (The word "statistics" itself derives from "state.")

Somehow, frequentist statisticians are usually very conservative, set in one way of thinking, and reluctant to even entertain an alternative interpretation or paradigm. I'm not sure why that is.


As an economist, I am also aware of the logical contortions we have to go through to be able to run regressions on historical data (i.e., pretty much all economic data). None of this applies here. The data-generating process consists of the minds of the writers.

For your reasoning to be applicable here, you have to put together a model of the data-generating process from which proper inference can be derived. What exactly are the assumptions on P(word_i | character_j) that make it compatible with these particular tests' assumptions?
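
(The simplest assumption of that sort, which I suspect these tests implicitly lean on, is something like independent draws from a fixed per-character word distribution, roughly

    P(w_1, \dots, w_n \mid \text{character } j) = \prod_{i=1}^{n} P(w_i \mid \text{character } j),

which is a strong claim about how dialogue actually gets written.)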


This isn't the most useful statistical figure. The heart feature is novel, so the increase could be due entirely to folks trying it out. A more meaningful figure would come from looking at folks with constant favoriting habits (with stars, that is) and seeing how their behavior changed. Personally, I'm more reluctant to heart tweets, as my Twitter account is mostly professional and it feels a bit unprofessional to "heart" a colleague's tweet.


I shared this in light of the recent article by Matt Ridley, "The Myth of Basic Science" [1], which I find less honest and filled with anecdotal examples that fit his particular narrative.

[1] http://www.wsj.com/articles/the-myth-of-basic-science-1445613954


That article is pretty interesting, although I disagree with the conclusion.

He's basically arguing that science is deterministic and therefore doesn't need public funding.


Completely agree. I can't understand how professional developers are able to give up their text editors for IPython. Don't get me wrong, I love IPython/Jupyter, but the slick interface comes with huge productivity drops due to the lack of a real text editor.


I use IPython in Emacs with elpy.


Sounds awesome! Do you have a resource for setting that up?


See Noodler's "Ben Bernanke" inks [1], which are fast-drying and meant for lefties. The name is a play on the fast-drying ink Bernanke is using to print more money.

[1]: http://www.gouletpens.com/noodlers-bernanke-black-3oz-bottle...


I love fountain pens and highly encourage everyone to try them (a good notebook helps). First, ballpoint pens are wasteful: 1.6 billion pens a year are thrown away [1]. Fountain pens are reusable, ink is comparatively cheap and lasts forever, and finding your ink is a fun and personal experience (I really like the "bulletproof" Noodler's inks, which are waterproof, bleach-proof, etc.). Fountain pens themselves last forever too, which is why folks still hunt around for 40+ year-old used ones.

Second, it really does make writing fun. I hated writing: my handwriting is messy, it's slow, and it's not as easy as typing. As the article argues, a good fountain pen makes it much better and, in my experience, much more enjoyable.

Third, it doesn't need to be expensive. Get a Lamy Safari (EF), a Lamy converter, and a bottle of Noodler's ink. I also love my Faber-Castell Loom [2] (it's the smoothest pen I own), and I carry a Kaweco AL Sport [3] everywhere (it's the perfect pocket pen).

[1] http://www.epa.gov/superfund/students/clas_act/haz-ed/ff06.p...

[2] http://www.gouletpens.com/faber-castell-loom-metallic-orange...

[3] http://www.jetpens.com/Kaweco-AL-Sport-Fountain-Pen-Fine-Nib...


Well done. Though shouldn't the sidenotes use HTML5's <aside></aside>? How does the vertical alignment of the sidenote callout and the sidenote itself work?


I agree.

It looks like sidenotes are floated right and given a negative right margin. (https://github.com/daveliepmann/tufte-css/blob/master/tufte....)


<aside> doesn't position itself relative to the text.

There really ought to be a native HTML reference / footnote / sidenote entity with appropriate default styling.

My take, responsive:

http://imgur.com/a/TXpis

(Using floats and negative margins.)


Science isn't about taking obvious relationships as fact; it's about using evidence and induction to show support for certain hypotheses over others. The geocentric model of the universe was once "obvious".


I find this deeply depressing: "the most heavily studied problem in computer science: how to get people to buy more things".


I don't find it depressing at all. I find it empowering. It's the vehicle we use to drive technological progress. We've figured out ways to have people voluntarily give us money to fund that progress, and so far this has been phenomenally successful. I was just thinking this morning about NASA's use of AWS to crunch data. That was driven by Amazon's ability to get people to buy more things.


This is like saying we should create more government bureaucracy, because the software and computing tools companies build to deal with it will, in the end, get us to Mars faster.

