I'm one of the authors of the blog post as well as this new API. Feel free to as...

bonobo3000 · on Feb 17, 2015

Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with that, for example if a SchemaRDD with cols [1,2,3] on parquet files [X,Y,Z] is cached, and then I create a new one with a subset of the cols say [1,2] on the same files [X,Y,Z], the new SchemaRDDs physical plan would refer to the files on disk instead of an in memory columnar scan. I'm wondering if DataFrames handle this differently and implications for caching.

For some context - In our case, loading a reasonable set of data from HDFS can take upto 10-30 mins so keeping a cached copy of the most recent data with certain columns projected is important.

data_scientist · on Feb 18, 2015

One useful feature is to consider the neighbors rows, for example for differentiating a time series. Do you plan to make an efficient method for that?

hadley · on Feb 18, 2015

How does the API communicate between the client and the server? Any interest in talking to it from R? (I'd be happy to help)

djcater · on Feb 18, 2015

Are there any timelines for when this (and Spark in general) will fully support ORC files (including predicate-pushdown)?

pwendell · on Feb 18, 2015

Very likely in Spark 1.4. Hortonworks has been helping out with this, we just need some internal refacotring to the API to make it work.

century19 · on Feb 17, 2015

Are DataFrames RDDs with a new DSL?

rxin · on Feb 17, 2015

In a way yes. It is a little bit more than that because DataFrames internally are actually "logical plans". Before execution, they are optimized by an optimizer called Catalyst and turn into physical plans.

century19 · on Feb 17, 2015

Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because using this new DSL allows Spark to more precisely plan what needs to happen for DataFrames?

I guess this means DataFrames should be used all the time in the future, or will there still be a reason to use plain RDDs in the future?

You guys are doing great work !

rxin · on Feb 17, 2015

Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this to become the primary API users use. You can still fall back to the vanilla RDD API (afterall DataFrame can be viewed as RDD[Row]) for stuff that is not expressible with DataFrames.

rkrzr · on Feb 18, 2015

Could you give an example of something that could not be expressed with DataFrames? Would e.g. tree-structured data be a bad fit for DataFrames, since it doesn't fit well with the tabular nature?