Spark, the platform, seems awesome. I'm somewhat less convinced by MLlib - I'm not sure there are as many use cases for distributed machine learning as people seem to think (and I would bet that a good many of the companies that use distributed ML don't really need it). I've seen a lot of tasks that could be handled by simpler, faster algorithms on a large workstation (you can get 250 GB of RAM from AWS for like $4.00/hr). I'd love to hear counterarguments, though!



While fitting the algorithm itself might not often benefit from partitioned data, I see two upsides to using Spark for predictive modeling.

First, it makes it easy to do the feature extraction and the model fitting in the same pipeline, which makes it possible to cross-validate the impact of the hyper-parameters of the feature extraction step. Feature extraction generally starts from a collection of large, raw datasets that need to be filtered, joined and aggregated (for instance a log of user clicks, sessionized by user id over temporal windows, then geo-joined to GIS data via a geoip resolution of the IP address of the user agent). While the raw click logs and geographical databases might be too big to process efficiently on a single node, the resulting extracted features (e.g. user session statistics enriched with geo features) are typically much smaller and could be processed on a single node to build a predictive model. However, Spark RDDs make it natural to trace provenance, hence trivial to rebuild downstream models when tweaking the upstream operations used to extract the features. The native caching features of Spark make that kind of workflow very efficient with minimal boilerplate (e.g. no manual file versioning).
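To make that concrete, here's a rough sketch in PySpark using the old RDD-based MLlib API. The log path, the parser and the per-user aggregation are hypothetical placeholders (and I've left out the geo-join for brevity); the point is just that extraction and fitting live in one job, with the small feature RDD cached so downstream model tweaks don't recompute the raw log scan:

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="feature-pipeline-sketch")

def parse_click(line):
    # Placeholder parser: pretend each log line is "user_id,timestamp,label".
    user_id, ts, label = line.split(",")[:3]
    return user_id, (float(ts), float(label))

def session_stats(events):
    # Placeholder per-user aggregation: event count plus last observed label.
    events = sorted(events)                      # order session events by time
    n_events = float(len(events))
    label = events[-1][1]
    return [n_events, events[-1][0]], label

raw = sc.textFile("hdfs:///logs/clicks/*")       # large raw log, cluster-sized
features = (raw.map(parse_click)
               .groupByKey()                     # sessionize per user id
               .mapValues(session_stats)
               .values()
               .map(lambda fl: LabeledPoint(fl[1], fl[0])))
features.cache()  # small extracted features, reused across model tweaks

model = LogisticRegressionWithSGD.train(features, iterations=50)
```

Because the feature RDD carries its lineage back to the raw log, changing an upstream step (say, the sessionization window) and re-running just works, and the cache means untouched stages aren't recomputed.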

Second, while the underlying ML algorithm might not always benefit from parallelization in itself, there are meta-level modeling operations that are both CPU-intensive and embarrassingly parallel, and those can benefit greatly from a compute cluster such as Spark. The canonical cases are cross-validation and hyper-parameter tuning.
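For instance, here's a hypothetical grid search where each parameter combination is fit and scored independently on an executor with a single-node learner. This sketch assumes scikit-learn is installed on the workers and that the (already-extracted) dataset is small enough to ship with the closure; for anything bigger you'd broadcast it instead:

```python
from itertools import product

from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sc = SparkContext(appName="grid-search-sketch")

# Toy dataset standing in for the single-node-sized extracted features.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

grid = list(product([0.01, 0.1, 1.0, 10.0], ["l1", "l2"]))

def fit_and_score(params):
    # Each grid point is an independent, CPU-intensive task: a full
    # 5-fold cross-validation run with a non-distributed learner.
    C, penalty = params
    clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    return params, cross_val_score(clf, X, y, cv=5).mean()

# One partition per grid point; X and y are shipped via closure capture.
scores = sc.parallelize(grid, len(grid)).map(fit_and_score).collect()
best_params, best_score = max(scores, key=lambda ps: ps[1])
```

The model fitting itself stays single-node; the cluster just eats the grid.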


The benefit of Spark and related systems is that you get flexible infrastructure that can handle a wide range of tasks reasonably well. You pay once for infrastructure, training, devops, and so on.

You can optimise any particular use case to perform better than Spark does, but then you incur those costs again for every project you create.



