
Are DataFrames RDDs with a new DSL?



In a way, yes. It is a little more than that, because a DataFrame is internally a "logical plan". Before execution, the plan is optimized by an optimizer called Catalyst and turned into a physical plan.
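A minimal sketch of what that looks like in Scala (assuming a Spark 1.3-era SQLContext, an existing SparkContext named sc, and made-up data): filter and select only build up a logical plan, and explain(true) prints the logical, Catalyst-optimized, and physical plans.

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 19))).toDF()

    // filter/select execute nothing yet: they record a logical plan that
    // Catalyst optimizes (column pruning, predicate pushdown, ...) and
    // turns into a physical plan.
    val adults = df.filter(df("age") > 21).select("name")

    adults.explain(true)   // prints logical, optimized, and physical plans
    adults.show()          // triggers execution of the physical plan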


Normal RDDs won't benefit from this optimisation, only DataFrames? Is that because using this new DSL allows Spark to more precisely plan what needs to happen for DataFrames?

I guess this means DataFrames should be used all the time going forward, or will there still be a reason to use plain RDDs?

You guys are doing great work!


Indeed, DataFrames give Spark more semantic information about the data transformations, and thus can be better optimized. We envision this becoming the primary API for users. You can still fall back to the vanilla RDD API (after all, a DataFrame can be viewed as RDD[Row]) for stuff that is not expressible with DataFrames.
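A minimal sketch of that fallback, under the same assumptions as the example above (Spark 1.3-era API, existing SparkContext sc, made-up data): df.rdd exposes the RDD[Row] view, where arbitrary Scala code can run but Catalyst can no longer see inside the closure to optimize it.

    import org.apache.spark.sql.{Row, SQLContext}

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("Ann", 34), ("Bo", 19))).toDF("name", "age")

    // Drop down to the RDD[Row] view and run arbitrary Scala over each row.
    // This logic has no DataFrame-DSL equivalent, and Catalyst cannot
    // optimize inside the closure.
    val tagged = df.rdd.map { case Row(name: String, age: Int) =>
      (name, if (age % 2 == 0) "even-age" else "odd-age")
    }

    tagged.collect().foreach(println)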


Could you give an example of something that could not be expressed with DataFrames? Would e.g. tree-structured data be a bad fit for DataFrames, since it doesn't fit the tabular model well?



