Five methods for Filtering data with multiple conditions in Python (kanoki.org)
64 points by min2bro on Jan 21, 2020 | hide | past | favorite | 22 comments


Title should probably clarify that this is with Pandas, that's much more specific and less generally useful than "in Python".

Original title: "Pandas dataframe filter with Multiple conditions"


To help readability I tend to do something like this:

  f1 = (df["col1"] == condition1)

  f2 = (df["col2"] == condition2)

  df[f1 & f2]

This is equivalent to the 'pandas boolean indexing multiple conditions' method.
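A complete runnable version of this pattern (the data and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c"],
    "col2": [1, 2, 1, 1],
})

# Name each condition, then combine with & (and) or | (or).
# The parentheses matter: without them, operator precedence between
# == and & gives surprising results.
f1 = df["col1"] == "a"
f2 = df["col2"] == 1

result = df[f1 & f2]
print(result)
```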


I do the same without the parentheses.


This is probably gonna be sacrilege to the Pythonians, but I often wish there was support for some SQL-like syntax when working with (pandas) data frames. It certainly would make the process a lot smoother for some tasks.


TileDB[1] offers an embedded SQL experience for Python[2]. We use MariaDB in embedded mode to allow running SQL against TileDB arrays. This can be combined with pandas.read_sql to load the results directly into a pandas DataFrame. I wrote this implementation for the embedded SQL, so I'm happy to answer any questions about it. The embedded SQL is in its early stages, but it should be fully functional for any queries that MariaDB itself supports. TileDB and the embedded SQL are both open source: TileDB is MIT licensed and the embedded SQL is under the GPL.

[1] https://tiledb.com/

[2] https://docs.tiledb.com/developer/api-usage/embedded-sql

(disclosure: TileDB, inc. employee)


I'd love if someone took the next step from Pandas being influenced by R to port dplyr-esque syntax to it.


I've been working on a library over the past year that does exactly that, including generating dbplyr style SQL queries!

Would love your feedback :)

https://github.com/machow/siuba


This looks interesting! Can I ask, what's the origin of the name? Both "siu" and "siuba"? Would be a useful addition to the docs, IMO.


Ah, thanks for pointing out the lack of explanation. I'll add one to the readme.

It's a transliteration of the Cantonese word for minibus, 小巴 :).

edit: siu (小) means little!


The dplyr syntax is IMO the best feature of the tidyverse


Isn't that just `pd.DataFrame.query` as outlined in the article? Or do you mean actual full-fat SQL?
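For reference, `DataFrame.query` is the closest built-in to an SQL-ish WHERE clause (the data here is invented for illustration):

```python
import pandas as pd

# Hypothetical data to demonstrate DataFrame.query's SQL-ish syntax.
df = pd.DataFrame({
    "Salary_in_1000": [90, 120, 150],
    "Age": [45, 62, 30],
})

# query() takes a boolean expression as a string; plain `and`/`or`
# work here, unlike in boolean indexing where you need & and |.
subset = df.query("Salary_in_1000 >= 100 and Age < 60")
print(subset)
```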


You could try dumping your dataframes to an in-memory SQLite instance using the to_sql method and then running queries on that. Not sure how performant the to_sql bit would be, but I'd expect joins in SQLite to be blazing fast compared to pandas' joins.
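A minimal sketch of that round trip, assuming made-up data and table names:

```python
import sqlite3

import pandas as pd

# Hypothetical data; the point is round-tripping through in-memory SQLite.
df = pd.DataFrame({"name": ["A", "B", "C"], "score": [10, 20, 30]})

conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)

# Full SQL is now available, including joins against other tables
# loaded the same way.
result = pd.read_sql("SELECT name FROM scores WHERE score > 15", conn)
print(result)
conn.close()
```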



This seems to be about doing filtering with Pandas, not pure python. The title should probably be changed to reflect this.


Yeah, it looks like the code this person uploaded isn't escaping the HTML, or it's being unescaped when it should be escaped.

  df.loc[(df['Salary_in_1000'] >= 100) &
         (df['Age'] < 60) &
         (df['FT_Team'].str.startswith('S')),
         ['Name', 'FT_Team']]


Would have been nice to see a comparison of performance, or at least which is suggested style.


My guess is that eval and query have the worst performance, since they need to be interpreted first. As for the other methods, I'm not too sure how they rank. Personally, for style I agree with the first comment on the linked page, which is to split up the conditions for readability:

  f1 = (df["col1"] == condition1)

  f2 = (df["col2"] == condition2)

  df[f1 & f2]
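For what it's worth, a quick-and-dirty way to compare the methods yourself (the data is synthetic and absolute timings depend entirely on your machine and pandas version):

```python
import time

import numpy as np
import pandas as pd

# Synthetic data large enough for differences to show up.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({"a": rng.integers(0, 100, n),
                   "b": rng.integers(0, 100, n)})

def bench(label, fn, reps=5):
    # Average wall-clock time over a few repetitions; crude but indicative.
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    print(f"{label}: {(time.perf_counter() - start) / reps:.4f}s")

bench("boolean indexing", lambda: df[(df["a"] > 50) & (df["b"] < 50)])
bench("loc",              lambda: df.loc[(df["a"] > 50) & (df["b"] < 50)])
bench("query",            lambda: df.query("a > 50 and b < 50"))
```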


One thing that really surprises me: NONE of these methods work with grouped DataFrames.

But grouping data is extremely common in data analysis.

Basically, the strategy with grouped data, is taking the loc approach, and sprinkling in a bunch of additional .transform calls. :/
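A sketch of that transform-then-loc strategy, with made-up data (filtering down to rows whose group mean exceeds a threshold):

```python
import pandas as pd

# Hypothetical data: keep rows belonging to teams whose mean score > 10.
df = pd.DataFrame({"team": ["x", "x", "y", "y"],
                   "score": [5, 9, 12, 20]})

# transform() broadcasts the group aggregate back onto the original rows,
# so the resulting Series aligns with df and can be used in .loc.
group_mean = df.groupby("team")["score"].transform("mean")
result = df.loc[group_mean > 10]
print(result)
```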


I wonder if this could be improved with SQL


I strongly prefer .query() for legibility and because it can be used in a pipe. My only problem is that flake8 often won't detect the use of a variable inside the query string. Has anyone else come across this before?
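For anyone unfamiliar, the variable reference in question uses query's `@` prefix (data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 15, 25]})

# Local variables are referenced inside the query string with an @ prefix.
# Because the reference lives inside a string, linters like flake8 can't
# see it and may flag `threshold` as unused.
threshold = 10
result = df.query("price > @threshold")
print(result)
```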


If you supply a function to `.loc`, you can use it in pipes pretty easily. E.g.

  my_dataframe.loc[lambda df: df['col'] > 0.8]

Would this help work around your issue?
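A runnable sketch of that callable-in-.loc style inside a method chain (the frame and columns are hypothetical):

```python
import pandas as pd

# Hypothetical pipeline: the lambda receives the intermediate frame,
# so the filter composes with earlier steps instead of referencing an
# outer variable that may no longer match the chained result.
df = pd.DataFrame({"col": [0.5, 0.9, 0.95]})

result = (
    df
    .assign(doubled=lambda d: d["col"] * 2)
    .loc[lambda d: d["col"] > 0.8]
)
print(result)
```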


Some speed comparison on a larger dataset would be interesting



