Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with it. For example, if a SchemaRDD with columns [1,2,3] over parquet files [X,Y,Z] is cached, and I then create a new one with a subset of those columns, say [1,2], over the same files [X,Y,Z], the new SchemaRDD's physical plan refers to the files on disk instead of an in-memory columnar scan. I'm wondering whether DataFrames handle this differently, and what the implications are for caching.
For some context: in our case, loading a reasonable set of data from HDFS can take up to 10-30 minutes, so keeping a cached copy of the most recent data with certain columns projected is important.
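To make the scenario concrete, here is a minimal sketch of the pattern described above (1.2-era SchemaRDD API; the HDFS path and column names are hypothetical, and this assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Cache the full projection over parquet files X, Y, Z.
val full = sqlContext.parquetFile("hdfs:///data/events")
full.registerTempTable("events")
sqlContext.cacheTable("events")
sqlContext.sql("SELECT col1, col2, col3 FROM events").count()  // materialize the cache

// A new SchemaRDD built directly from the same files, projecting a column subset.
// In 1.2, inspecting its plan shows whether it hits the in-memory columnar cache
// (InMemoryColumnarTableScan) or goes back to disk (ParquetTableScan).
val subset = sqlContext.parquetFile("hdfs:///data/events")
println(subset.queryExecution.executedPlan)
```

Note that queries routed through the cached table name (`sqlContext.sql("... FROM events")`) are planned against the cached relation, whereas a fresh `parquetFile(...)` call builds a new logical plan that the planner does not automatically match against the cache.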
In a way, yes. It is a little more than that, because DataFrames are internally "logical plans". Before execution, they are optimized by an optimizer called Catalyst and turned into physical plans.
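You can see this pipeline directly with `explain(true)` on a DataFrame (Spark 1.3+), which prints the parsed logical plan, the Catalyst-optimized plan, and the final physical plan. A sketch, assuming a registered table named `events`:

```scala
// filter and select build up a logical plan; nothing executes yet.
val df = sqlContext.table("events")
  .filter("col1 > 10")
  .select("col1", "col2")

// Prints logical, analyzed, optimized, and physical plans.
// Catalyst can, e.g., push the filter and column pruning down into the scan.
df.explain(true)
```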
Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because the new DSL lets Spark plan more precisely what needs to happen for DataFrames?
I guess this means DataFrames should be used all the time going forward, or will there still be a reason to use plain RDDs?
Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this to become the primary API users use. You can still fall back to the vanilla RDD API (after all, a DataFrame can be viewed as RDD[Row]) for stuff that is not expressible with DataFrames.
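As a sketch of that fallback: `df.rdd` hands you an `RDD[Row]`, on which you can run arbitrary Scala code that Catalyst cannot see or optimize (`someCustomPredicate` below is a hypothetical user function, not a Spark API):

```scala
import org.apache.spark.sql.Row

val df = sqlContext.table("events")

// Drop down to the RDD API for logic not expressible as DataFrame operations.
val custom = df.rdd.mapPartitions { rows: Iterator[Row] =>
  // Arbitrary per-partition Scala logic; opaque to the optimizer.
  rows.filter(row => someCustomPredicate(row))
}
```

The trade-off is that once you are in RDD land, you lose Catalyst's optimizations (predicate pushdown, column pruning, etc.) for that part of the computation.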
Could you give an example of something that could not be expressed with DataFrames?
Would, e.g., tree-structured data be a bad fit for DataFrames, since it doesn't map well onto their tabular nature?