Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Nice article.

In Python, I have been finding Polars nicer to use:

  (purchases
     .filter(pl.col("amount") <= pl.col("amount").median().over("country") * 10)
     .group_by("country")
     .agg(total = (pl.col("amount") - pl.col("discount")).sum())
  )
Not as compact as the R example but gets a bit closer compared to the pandas approach.

- https://pypi.org/project/polars/

- https://github.com/pola-rs/polars/



Why not SQL for pure declarative queries? Here's llm-hallucinated sql query of the polars example:

    SELECT country, SUM(amount - discount) AS total
    FROM purchases
    WHERE amount <= (
        SELECT MEDIAN(amount) * 10
        FROM purchases
        WHERE country = purchases.country
    )
    GROUP BY country;
It might be just an issue of familiarity but sql seems the most straightforward and easy to understand for me.


Probably because the article wasn't about comparing to SQL, or any other database, but rather looked at the R vs Python debate specifically?


What is wrong suggesting an alternative approach that makes the solution more readable?

Using an appropriate DSL for the problem may be useful. In Python:

    df.to_sql('purchases', db, index=False)
    print(*db.execute(query))
    # -> ('Canada', 270) ('USA', 8455)
e.g., we can use regexes to query text. Python is a general-purpose language, you can query text without using regexes but it would be insanity to ignore regexes completely (I don't know how easy is to invoke regexes from R). Another example, bash pipeline can be embedded in Python ("generate --flag | filter arg | sink") without reimplementing it in pure Python (you can do it but it would be ugly). No idea how easy it is to invoke shell commands from R. SQL is just another DSL in this case -- use it in Python when it makes the solution more readable.


It looks like llm hallucinated the query that doesn't group by country to get the median. Here's version generated after asking to fix it:

    SELECT p.country, SUM(p.amount - p.discount) AS total
    FROM purchases p
    JOIN (
        SELECT country, MEDIAN(amount) *  10 AS median_amount
        FROM purchases
        GROUP BY country
    ) m ON p.country = m.country
    WHERE p.amount <= m.median_amount
    GROUP BY p.country;


You get into a lot of other problems that are straightforward in pandas/R but very difficult in SQL.


It is not either or. Use Python where it is strong, and execute SQL queries from Python where appropriate.

Python as a glue language is one of its strong sides.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: