
I'm one of the authors of the blog post as well as this new API. Feel free to ask me anything.



Have there been any changes to the in-memory columnar caching used by SchemaRDDs in 1.2? I noticed some problems with it: for example, if a SchemaRDD with cols [1,2,3] on parquet files [X,Y,Z] is cached, and I then create a new one with a subset of the cols, say [1,2], on the same files [X,Y,Z], the new SchemaRDD's physical plan refers to the files on disk instead of an in-memory columnar scan. I'm wondering whether DataFrames handle this differently, and what the implications are for caching.

For some context: in our case, loading a reasonable set of data from HDFS can take up to 10-30 minutes, so keeping a cached copy of the most recent data with certain columns projected is important.
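
To make the scenario concrete, here is a minimal sketch using the Spark 1.3 DataFrame API, assuming a spark-shell session where `sqlContext` is in scope; the path and column names are made up:

    // Cache a projection of three columns from the Parquet files.
    val wide = sqlContext.parquetFile("hdfs:///data/events")
      .select("col1", "col2", "col3")
    wide.cache()
    wide.count() // materializes the in-memory columnar cache

    // A narrower projection over the same files -- the question is whether
    // its physical plan reads the cached columns or the Parquet files on disk.
    val narrow = sqlContext.parquetFile("hdfs:///data/events")
      .select("col1", "col2")
    narrow.explain() // inspect the chosen physical plan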


One useful feature would be operating on neighboring rows, for example to differentiate a time series. Do you plan to add an efficient method for that?
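
For reference, a sketch of what this could look like with a lag-style window function (window functions only landed later, in Spark 1.4; the `ts` DataFrame and its columns are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    // Hypothetical DataFrame `ts` with columns "time" and "value".
    // Note: ordering a window without a partition pulls all rows into a
    // single partition, so this only sketches the semantics.
    val w = Window.orderBy("time")
    val diffed = ts.withColumn("diff", ts("value") - lag("value", 1).over(w))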


How does the API communicate between the client and the server? Any interest in talking to it from R? (I'd be happy to help)


Are there any timelines for when this (and Spark in general) will fully support ORC files (including predicate-pushdown)?


Very likely in Spark 1.4. Hortonworks has been helping out with this; we just need some internal refactoring of the API to make it work.


Are DataFrames RDDs with a new DSL?


In a way, yes. It is a little more than that, because DataFrames are internally actually "logical plans". Before execution, they are optimized by an optimizer called Catalyst and turned into physical plans.
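
You can see both plans for yourself. A quick sketch, where `people` is a hypothetical DataFrame:

    // explain(true) prints the logical plan, the plan after Catalyst's
    // optimizations, and the physical plan that will actually run.
    val adults = people.filter(people("age") > 21).select("name")
    adults.explain(true)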


Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because using this new DSL allows Spark to more precisely plan what needs to happen for DataFrames?

I guess this means DataFrames should be the default going forward, or will there still be a reason to use plain RDDs?

You guys are doing great work!


Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this becoming the primary API that users use. You can still fall back to the vanilla RDD API (after all, a DataFrame can be viewed as an RDD[Row]) for anything that is not expressible with DataFrames.
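
For example, a sketch of dropping down to the RDD API; `df` and its "name" column are hypothetical:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Every DataFrame exposes its rows as an RDD[Row], so arbitrary
    // RDD transformations remain available.
    val names: RDD[String] = df.rdd.map { case Row(name: String, _*) => name }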


Could you give an example of something that could not be expressed with DataFrames? Would, e.g., tree-structured data be a bad fit, since it doesn't mesh well with their tabular nature?



