Five methods for Filtering data with multiple conditions in Python (kanoki.org)
64 points by min2bro on Jan 21, 2020 | hide | past | favorite | 22 comments


Title should probably clarify that this is with Pandas, that's much more specific and less generally useful than "in Python".

Original title: "Pandas dataframe filter with Multiple conditions"


To help readability I tend to do something like this:

  f1 = (df["col1"] == condition1)

  f2 = (df["col2"] == condition2)

  df[f1 & f2]

This is equivalent to the 'pandas boolean indexing multiple conditions' method.
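A complete runnable version of this pattern (the data and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c"],
    "col2": [1, 2, 1, 1],
})

# Name each condition, then combine with & (and) or | (or).
# The parentheses matter: without them, operator precedence between
# == and & gives surprising results.
f1 = df["col1"] == "a"
f2 = df["col2"] == 1

result = df[f1 & f2]
print(result)
```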


I do the same without the parentheses.


This is probably gonna be sacrilege to the Pythonians, but I often wish there was support for some SQL-like syntax when working with (pandas) data frames. It certainly would make the process a lot smoother for some tasks.


TileDB[1] offers an embedded SQL experience for Python[2]. We use MariaDB in embedded mode to allow running SQL against TileDB arrays. This can be combined with pandas.read_sql to load the results directly into a pandas DataFrame. I wrote this implementation for the embedded SQL, so I'm happy to answer any questions about it. The embedded SQL is in its early stages, but it should be fully functional for any queries that MariaDB itself supports. TileDB and the embedded SQL are both open source: TileDB is MIT licensed and the embedded SQL is under the GPL.

[1] https://tiledb.com/

[2] https://docs.tiledb.com/developer/api-usage/embedded-sql

(disclosure: TileDB, inc. employee)


I'd love if someone took the next step from Pandas being influenced by R to port dplyr-esque syntax to it.


I've been working on a library over the past year that does exactly that, including generating dbplyr style SQL queries!

Would love your feedback :)

https://github.com/machow/siuba


This looks interesting! Can I ask, what's the origin of the name? Both "siu" and "siuba"? Would be a useful addition to the docs, IMO.


Ah, thanks for pointing out the lack of explanation. I'll add one to the readme.

It's a transliteration of the Cantonese word for minibus, 小巴 :).

edit: siu (小) means little!


The dplyr syntax is IMO the best feature of the tidyverse


Isn't that just `pd.DataFrame.query` as outlined in the article? Or do you mean actual full-fat SQL?
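For reference, `DataFrame.query` is the closest built-in to an SQL-ish WHERE clause (the data here is invented for illustration):

```python
import pandas as pd

# Hypothetical data to demonstrate DataFrame.query's SQL-ish syntax.
df = pd.DataFrame({
    "Salary_in_1000": [90, 120, 150],
    "Age": [45, 62, 30],
})

# query() takes a boolean expression as a string; plain `and`/`or`
# work here, unlike in boolean indexing where you need & and |.
subset = df.query("Salary_in_1000 >= 100 and Age < 60")
print(subset)
```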


You could try dumping your dataframes to an in-memory SQLite instance using the to_sql method and then running queries on that. Not sure how performant the to_sql bit would be, but I'd expect joins in SQLite to be blazing fast compared to pandas' joins.
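A minimal sketch of that round trip, assuming made-up data and table names:

```python
import sqlite3

import pandas as pd

# Hypothetical data; the point is round-tripping through in-memory SQLite.
df = pd.DataFrame({"name": ["A", "B", "C"], "score": [10, 20, 30]})

conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)

# Full SQL is now available, including joins against other tables
# loaded the same way.
result = pd.read_sql("SELECT name FROM scores WHERE score > 15", conn)
print(result)
conn.close()
```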



This seems to be about doing filtering with Pandas, not pure python. The title should probably be changed to reflect this.


Yeah, it looks like the code this person uploaded isn't escaping the HTML, or it's being unescaped when it should be escaped.

  df.loc[(df['Salary_in_1000'] >= 100) &
         (df['Age'] < 60) &
         (df['FT_Team'].str.startswith('S')),
         ['Name', 'FT_Team']]


Would have been nice to see a comparison of performance, or at least which is suggested style.


My guess is that eval and query have the worst performance, since they need to be interpreted first. As for the other methods, I'm not too sure how they rank. Personally, for style I agree with the first comment on the linked page, which is to split up the conditions for readability:

  f1 = (df["col1"] == condition1)

  f2 = (df["col2"] == condition2)

  df[f1 & f2]
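For what it's worth, a quick-and-dirty way to compare the methods yourself (the data is synthetic and absolute timings depend entirely on your machine and pandas version):

```python
import time

import numpy as np
import pandas as pd

# Synthetic data large enough for differences to show up.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({"a": rng.integers(0, 100, n),
                   "b": rng.integers(0, 100, n)})

def bench(label, fn, reps=5):
    # Average wall-clock time over a few repetitions; crude but indicative.
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    print(f"{label}: {(time.perf_counter() - start) / reps:.4f}s")

bench("boolean indexing", lambda: df[(df["a"] > 50) & (df["b"] < 50)])
bench("loc",              lambda: df.loc[(df["a"] > 50) & (df["b"] < 50)])
bench("query",            lambda: df.query("a > 50 and b < 50"))
```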


One thing that really surprises me: NONE of these methods work with grouped DataFrames.

But grouping data is extremely common in data analysis.

Basically, the strategy with grouped data, is taking the loc approach, and sprinkling in a bunch of additional .transform calls. :/
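A sketch of that transform-then-loc strategy, with made-up data (filtering down to rows whose group mean exceeds a threshold):

```python
import pandas as pd

# Hypothetical data: keep rows belonging to teams whose mean score > 10.
df = pd.DataFrame({"team": ["x", "x", "y", "y"],
                   "score": [5, 9, 12, 20]})

# transform() broadcasts the group aggregate back onto the original rows,
# so the resulting Series aligns with df and can be used in .loc.
group_mean = df.groupby("team")["score"].transform("mean")
result = df.loc[group_mean > 10]
print(result)
```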


I wonder if this could be improved with SQL


I strongly prefer .query() for legibility and because it can be used in a pipe. My only problem is that flake8 often won't detect the use of a variable inside the query string. Has anyone else come across this before?
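For anyone unfamiliar, the variable reference in question uses query's `@` prefix (data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 15, 25]})

# Local variables are referenced inside the query string with an @ prefix.
# Because the reference lives inside a string, linters like flake8 can't
# see it and may flag `threshold` as unused.
threshold = 10
result = df.query("price > @threshold")
print(result)
```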


If you supply a function to `.loc`, you can use it in pipes pretty easily. E.g.

  my_dataframe.loc[lambda df: df['col'] > 0.8]

Would this help work around your issue?
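A runnable sketch of that callable-in-.loc style inside a method chain (the frame and columns are hypothetical):

```python
import pandas as pd

# Hypothetical pipeline: the lambda receives the intermediate frame,
# so the filter composes with earlier steps instead of referencing an
# outer variable that may no longer match the chained result.
df = pd.DataFrame({"col": [0.5, 0.9, 0.95]})

result = (
    df
    .assign(doubled=lambda d: d["col"] * 2)
    .loc[lambda d: d["col"] > 0.8]
)
print(result)
```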


Some speed comparison on a larger dataset would be interesting



