Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with it. For example, if a SchemaRDD with columns [1,2,3] on Parquet files [X,Y,Z] is cached, and I then create a new SchemaRDD with a subset of those columns, say [1,2], on the same files [X,Y,Z], the new SchemaRDD's physical plan refers to the files on disk instead of an in-memory columnar scan. I'm wondering whether DataFrames handle this differently, and what the implications are for caching.
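To make the scenario concrete, here is a minimal sketch of it, assuming the Spark 1.2 `SQLContext` API. The HDFS path, the table names, and the column names `col1` and `col2` are placeholders for illustration only, not our real ones.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CachedProjectionRepro {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("cached-projection-repro"))
    val sqlContext = new SQLContext(sc)

    // Placeholder path standing in for parquet files [X,Y,Z].
    val path = "hdfs:///path/to/parquet"

    // First SchemaRDD: all columns, cached in the in-memory columnar format.
    val full = sqlContext.parquetFile(path)
    full.registerTempTable("events_full")
    sqlContext.cacheTable("events_full")
    // Run a query so the cache is actually populated.
    sqlContext.sql("SELECT COUNT(*) FROM events_full").collect()

    // Second SchemaRDD: same files, subset of the columns.
    val again = sqlContext.parquetFile(path)
    again.registerTempTable("events_subset")
    val projected = sqlContext.sql("SELECT col1, col2 FROM events_subset")

    // Inspect the physical plan. In my case it refers to the files on disk
    // rather than an in-memory columnar scan.
    println(projected.queryExecution.executedPlan)

    sc.stop()
  }
}
```

Querying the cached table directly (`SELECT col1, col2 FROM events_full`) is the comparison point. The question is about the case where the second SchemaRDD is built independently from the same files.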
For some context: in our case, loading a reasonable set of data from HDFS can take 10-30 minutes, so keeping a cached copy of the most recent data with certain columns projected is important.