Show HN: Smart Fruit – A Python schema-based machine learning library

IanOzsvald · on June 30, 2018

I've recently written a related library - given a DataFrame it'll run sklearn's RandomForest to check which columns predict other columns. The goal is to learn which relationships exist within a DataFrame. Typically in the exploratory process in machine learning we want to learn how the data holds together - this tool helps with that discovery exercise. It'll auto-LabelEncode text and allows classification or regression. There are two example Notebooks (Titanic & Boston) to show what it is doing. Correlations (Pearson, Spearman, Kendall) can also be calculated. The RandomForest result can show non-linear relationships that aren't exposed by correlations. https://github.com/ianozsvald/discover_feature_relationships
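To make the idea concrete, here is a rough sketch of the "which columns predict which" approach. It is not the library's actual API: the function name `pairwise_predictability`, its parameters, and the synthetic `demo` data are all assumptions made for illustration, and it only handles numeric columns with a regressor. It assumes pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def pairwise_predictability(df, n_estimators=50, cv=3, seed=0):
    """Score how well each single column predicts each other column.

    Hypothetical helper: returns a DataFrame of mean cross-validated R^2,
    indexed by feature column, with one column per target column.
    """
    scores = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for feature in df.columns:
        for target in df.columns:
            if feature == target:
                continue  # leave the diagonal as NaN
            model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
            r2 = cross_val_score(model, df[[feature]], df[target], cv=cv, scoring="r2")
            scores.loc[feature, target] = r2.mean()
    return scores


# Synthetic data (an assumption for the demo): x_squared depends on x non-linearly.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
demo = pd.DataFrame({
    "x": x,
    "x_squared": x ** 2 + rng.normal(0, 0.1, 300),
    "noise": rng.normal(0, 1, 300),
})

scores = pairwise_predictability(demo)
pearson = demo.corr(method="pearson")
```

On this toy data the forest should score `x` as a strong predictor of `x_squared`, while the Pearson correlation between the two stays close to zero, which is the kind of non-linear relationship a correlation matrix alone would hide.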

pX0r · on June 30, 2018

I liked the boldness of this idea. But 'something' needs to select the sklearn model, tune its hyper-params - how long can you keep it all hidden away from the user?

The training phase can be considerably long. Have you thought of some kind of an async wrapper that Smart Fruit might provide or will the user be expected to code it up?
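For what it's worth, a user could code this up with the standard library alone. The sketch below is only an illustration of the pattern and does not use Smart Fruit's API: `slow_train` is a hypothetical stand-in for whatever long-running training call is being wrapped.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_train(rows):
    """Hypothetical stand-in for a long-running training call."""
    time.sleep(0.2)  # pretend to fit a model
    mean = sum(rows) / len(rows)
    return lambda: mean  # a trivial "model" that always predicts the mean


executor = ThreadPoolExecutor(max_workers=1)

# submit() returns a Future immediately; training runs on a worker thread.
future = executor.submit(slow_train, [1.0, 2.0, 3.0])

# ... the caller is free to do other work here ...

model = future.result()  # block only when the model is actually needed
prediction = model()
executor.shutdown()
```

A thread pool keeps the example simple; for CPU-bound pure-Python training, a `ProcessPoolExecutor` may be a better fit.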

This is more of a user experience comment - when the interface is designed to feel as if one is interacting with a DB / ORM, the user may come to assume that the outcomes will be deterministic... While the returned results will remain deterministic as long as the training data, model and hyper-parameters remain the same, it won't feel as deterministic when any of these is updated... I am not sure if I communicated my concern clearly. I am trying to understand who the intended end-user of this package is...

ghukill · on June 30, 2018

I would propose a potential user as someone interested in some of the meta considerations and patterns of statistical reasoning, aka machine learning. There is a vast amount of particulars behind how the second hand on my watch operates (e.g. vibrating quartz, digital), but I can use that mostly reliable device to investigate higher-level phenomena, like calculating the distance of planets by timing their movement. This library opens a direct line to these algorithms such that one might intuit, and apply, their high-level behavior; just as I could not time planets if consumed with the fidelity and reliability of resonating quartz, it would slow my ability to explore this kind of reasoning if I were concerned with the minutiae.

That said, all points taken. If this sparks interest in someone, as it stands, it would be on them to dig into all the considerations you've outlined.

ghukill · on June 30, 2018

I love it. Pasted in the column headers to `iris.data` from the Iris website. Voila, up and running per the instructions on GitHub. For prototyping / exploring ideas, for the syntactic layman who is conceptually familiar, what a boon.

zitterbewegung · on June 30, 2018

This looks like a good porcelain for sklearn. Many, including myself, find it intimidating at times.