Show HN: Appelpy – library for easier regression modelling in Python

mfarragher · on Nov 22, 2019

Hi there, I'm a data scientist and economist who uses the main Python stats libraries regularly. I was frustrated by how long it takes to fit and diagnose basic regression models, so I began working on a package called Appelpy (Applied Econometrics Library for Python).

The aim: Make regression modelling as easy as pie.

Now that I've tightened up my code coverage and fleshed out some documentation – ReadTheDocs and notebook tutorials that can be viewed with Binder & Nbviewer – I'm sharing the library more widely!

The library is built upon Statsmodels but I've tried to make a more cohesive interface for regression modelling, with model diagnostics in mind especially. Model diagnostics are the time-consuming and repetitive part of regression modelling in Python, but through Appelpy diagnostics can be done with minimal code.

This is the first project I've released on PyPI and I was working on it to hone my software engineering skills, so I'm interested in tips and feedback.

- Mark

mushufasa · on Nov 22, 2019

Thanks. The lack of something like this has been keeping me drawn to R for pure statistics work. I'll definitely give this a try.

Canadauni · on Nov 22, 2019

I know you reference your introductory notebook throughout your docs, but I think it would be helpful to include some of the plots inline in your docs.

Seeing that this lib is built on top of statsmodels, my first response was that I'll just keep using statsmodels. However, the simplicity of the diagnostic plots seems like a really nice value add. Showcasing those plots right in your docs might make it more attractive to people checking out your project for the first time.

mfarragher · on Nov 22, 2019

Yes, that's a good point on the plots. I'll try to make them more prominent in future docs.

These are some things I've included in the library which aren't implemented in Statsmodels:

- Breusch-Pagan studentized test of heteroskedasticity (available in R)

- Standardized / beta coefficients (still an open feature request in https://github.com/statsmodels/statsmodels/issues/3857 )

- Leverage vs residuals squared plot (there's an influence plot but not something similar to Stata's lvr2plot)
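
As a rough illustration of the second item, standardized (beta) coefficients can be computed by hand from an ordinary least squares fit: each slope is rescaled by the ratio of the regressor's standard deviation to the dependent variable's standard deviation. The sketch below uses plain NumPy rather than Appelpy or Statsmodels, and the function name `standardized_coefficients` is hypothetical, not part of either library.

```python
import numpy as np


def standardized_coefficients(X, y):
    """Return OLS slopes rescaled to standard-deviation units.

    X: (n, k) array of regressors, without a constant column.
    y: (n,) array holding the dependent variable.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Add an intercept column and fit by least squares.
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes = coefs[1:]  # drop the intercept

    # beta_j * sd(x_j) / sd(y), using sample standard deviations.
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)


# Made-up data, purely for demonstration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta = standardized_coefficients(x.reshape(-1, 1), y)
```

With a single regressor, the standardized coefficient coincides with the Pearson correlation between `x` and `y`, which makes for a quick sanity check.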

Even the most common metric I use for assessing models – root MSE – isn't stored in the Statsmodels object summary. To assess an OLS model in Statsmodels I find myself writing a lot of repetitive code, yet in Stata the commands are fairly succinct.
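
For reference, root MSE is the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal NumPy sketch, where `root_mse` is a hypothetical helper name and not an Appelpy or Statsmodels function:

```python
import numpy as np


def root_mse(y, fitted, n_params):
    """Root mean squared error of a fitted regression.

    y: observed values.
    fitted: model predictions for the same observations.
    n_params: number of estimated parameters, including the intercept.
    Hypothetical helper for illustration only.
    """
    resid = np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float)
    dof = len(resid) - n_params  # residual degrees of freedom
    return float(np.sqrt(np.sum(resid ** 2) / dof))


# Made-up numbers: one residual of -2, the rest zero, two parameters.
example = root_mse([1, 2, 3, 4], [1, 2, 3, 6], n_params=2)
```

With a fitted Statsmodels OLS results object, taking the square root of its `mse_resid` attribute should give the same quantity.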

Other things I also added to make encoding of variables easier:

- InteractionEncoder

- DummyEncoder (to cover different ways of treating missing values)
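
To give a sense of what these encoders cover, here is how the same steps might look in plain pandas. This is only a sketch under my own assumptions: it does not use Appelpy's actual classes, and the column names and data are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing category value.
df = pd.DataFrame({
    "educ": [10.0, 12.0, 16.0, 12.0],
    "exper": [5.0, 3.0, 1.0, 8.0],
    "region": ["north", "south", np.nan, "north"],
})

# Interaction term: elementwise product of two numeric columns.
df["educ#exper"] = df["educ"] * df["exper"]

# Treatment 1: the missing category gets its own indicator column.
dummies_nan_col = pd.get_dummies(df["region"], prefix="region", dummy_na=True)

# Treatment 2: a missing category becomes a row of zeros
# across all of the dummy columns.
dummies_zero_row = pd.get_dummies(df["region"], prefix="region")
```

The two treatments lead to different design matrices, so the choice matters once the dummies are used as regressors.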

The more I thought about these missing features, the more I thought they could be wrapped up in a more coherent way. :-)