
ryanmonroe · on June 7, 2024

> Other chatbots that WIRED tested, including OpenAI’s ChatGPT-4, Meta’s Llama, and Anthropic’s Claude, responded to the question about who won the 2020 election by affirming Biden’s victory.

ryanmonroe · on Jan 9, 2024

DuckDB maintains a performance benchmark of open source database-like tools, including Polars and Pandas

https://duckdblabs.github.io/db-benchmark/

nerdponx · on Jan 9, 2024

I'd like to see Kdb in here.

ryanmonroe · on Dec 21, 2023

Good article. Some smaller changes you could make to the final function: in the last line, `as.data.frame(do.call(cbind, out_list))` is used to convert the list to a data.frame. Passing it to `cbind` converts the list to a matrix (i.e. combines it into one long vector internally), and then `as.data.frame` converts it back to a list (as noted in the article, data frames are lists). Instead, you can use `as.data.frame(out_list)` to make your list a data frame directly and avoid the round trip through a matrix. The `unlist(lapply(split(cvec, groups), aggfun))` is also doing a lot of work. If you don't mind using an external package*, `collapse::BY(cvec, groups, aggfun)` is much faster (doesn't require converting `groups` to factor, doesn't copy `cvec`'s contents like `split`).

Here's some completely not-the-point-of-the-article code review since I can't help myself. If you can set up earlier steps to give you a named list for `col_grouping`, and use `lapply`, the code is a little more concise:

    efficient_flow_agg <- function(dat, col_grouping, gpcol_name="GroupMembership") {
      make_postproc <- function(gp, groups) {
        gp$preproc(dat[gp$which_cols]) |>
          lapply(collapse::BY, groups, gp$aggfun) |>
          gp$postproc()
      }
      col_grouping |>
        lapply(make_postproc, groups = dat[[gpcol_name]]) |>
        as.data.frame()
    }

* I had previously written here that `tapply` is probably faster, but apparently `tapply` does exactly `unlist(lapply(split(x, g), f))` anyway? wtf R. Strange there's not something like `collapse::BY` in base R.
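A minimal base R sketch of that equivalence, using made-up data (`x`, `g`, and the choice of `sum` are illustrative, not from the article). It only checks that the two approaches give the same per-group values; it says nothing about speed:

```r
# Illustrative data: six values in two groups
x <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "b", "a", "b", "a", "b")

# Split into per-group vectors, aggregate each, flatten to a named vector
by_split <- unlist(lapply(split(x, g), sum))

# tapply returns a 1-d array, but the group names and values match
by_tapply <- tapply(x, g, sum)

identical(names(by_split), names(by_tapply))
all.equal(unname(by_split), as.vector(by_tapply))
```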

franklin_p_dyer · on Dec 21, 2023

Thanks for the feedback! The business with `cbind` is a facepalm, I'll definitely fix that. I don't think it will affect performance much since that last step won't be repeated many times, but it makes me cringe now knowing how redundant it is.

Good advice on `col_grouping` as well, accessing those components of an aggregation rule by index rather than by name is a bad code smell and decreases readability for sure.

kuhewa · on Dec 22, 2023

Main issue with cbind()ing to matrix then data.frame is conversion of all columns to the same type and potential loss of information
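A small sketch of that type coercion, assuming a list with one character and one numeric element (the `out_list` contents here are hypothetical, not the article's data):

```r
# Hypothetical list with mixed column types
out_list <- list(id = c("a", "b"), total = c(1.5, 2.5))

# cbind builds a matrix, so every element is coerced to a single type
via_matrix <- as.data.frame(do.call(cbind, out_list))

# Converting the list directly keeps each column's own type
direct <- as.data.frame(out_list)

is.numeric(via_matrix$total)  # FALSE: the numbers were turned into text
is.numeric(direct$total)      # TRUE
```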

ryanmonroe · on July 13, 2022

I think “pay” is commonly understood to mean monetary, not adjusted for location, risk, time, or transportation cost, and understood to be stated in terms of total rather than hourly. Saying people are accepting lower pay to work remotely seems like a pretty objective way of describing the decision being made. What you’re describing are the motives, the fact that pay is not the whole picture. I don’t think we need to recast what “pay” means to understand that there’s more to be considered than just that.

chiefalchemist · on July 13, 2022

> I don’t think we need to recast what “pay” means to understand that there’s more to be considered than just that.

Yes, we do. Because a fair number of people are math / numbers "shy" and don't do the calculations. That favors the employers. Painting "pay" more accurately levels the playing field.

ryanmonroe · on March 3, 2022

The options are not "must walk to this place at this specific time" and "never do any exercise or take a break during your day". This type of dichotomy is drawn in so much of the back-to-office discussion. Mandating something that happens to involve, in part, something that could be beneficial, is not in itself an argument for the mandate when you can take that part by itself anyway. Like, even if it's a zoom meeting you could, at that same exact time right before the meeting, take a walk for five minutes. The mandate isn't helping you out in that respect.

ryanmonroe · on Feb 24, 2022

If you pay well and post a junior level position on LinkedIn with the salary range included, there is approximately 0% chance “no one applies”

ryanmonroe · on Feb 24, 2022

How can you write an entire article about how hiring is difficult in some field, and only mention salary in a short parenthetical with no concrete numbers? (“salaries have risen in some cities by as much as 10 percent”)

ryanmonroe · on Jan 26, 2022

To reference variables in the outer scope, you would do

    mutate(df, b = .env$a + 1)

And if you have a string (contained in a_var) which identifies a variable you can do

    mutate(df, b = .data[[a_var]] + 1)

You could argue these feel clumsy, but I wouldn’t say it’s “hard” to do either of these things with dplyr.
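For a self-contained sketch of both pronouns, assuming dplyr is installed and using a made-up `df` whose column `a` deliberately shares its name with an outer variable:

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3))
a <- 100        # outer-scope variable with the same name as the column
a_var <- "a"    # string naming a column

# A bare `a` resolves to the column first
from_column <- mutate(df, b = a + 1)

# .env$a forces the outer-scope variable
from_env <- mutate(df, b = .env$a + 1)

# .data[[a_var]] looks up the column named by the string
from_string <- mutate(df, b = .data[[a_var]] + 1)
```

Here `from_column$b` and `from_string$b` are both `2, 3, 4`, while `from_env$b` is `101` for every row.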

krumbie · on Jan 26, 2022

I don't think it's just about whether it's hard to do, your syntax example looks short enough and one can memorize these two patterns relatively quickly.

However, both patterns are another special case of how identifiers are resolved in the expression. Aren't `.env` and `.data` both valid variable and column names? So what happens if I have a column named `.data`?

Another example, which is the reason why we chose the `:column` style to refer to columns in `DataFramesMeta.jl` and `DataFrameMacros.jl`:

What happens if you have the expression `mutate(df, b = log(a))`? Both `log` and `a` are symbols, but `log` is not treated as a column. Maybe that's because it's used in a function-like fashion? Maybe because R looks at the value of `log` and `a` in their scope and sees that `log` is a function and `a` isn't?

In Julia DataFrames, it's totally valid to have a column that stores different functions. With the dplyr like syntax rules it would not be possible to express a function call with a function stored in a column, if the pattern really is that function syntax means a symbol is not looked up in the dataframe anymore.

In Julia DataFrameMacros.jl for example, if you had a column named `:func` you could do `@transform(df, :b = :func(:a))` and it would be clear that `:func` resolves to a column.

This particular example might seem like a niche problem, but it's just one of these tradeoffs that you have to make when overloading syntax with a different meaning. I personally like it if there's a small rule set which is then consistently applied. I'd argue that's not always the case with dplyr.

ryanmonroe · on Jan 26, 2022

I hadn't thought of that tradeoff. After testing just now, if you have a column named `.data` or `.env` those constructs work as if there was no such column, and actually in that case `mutate(df, b = .data + 1)` is an error.

Personally I'll happily take not being able to use those as column names if it means I can avoid always typing : before every in-data variable, but your comment gave me a better understanding of why it would be bad for some other person or scenario, perhaps where short term ease-of-use is lower on the list of priorities.

For your second example, it doesn't come up in R because a data frame column cannot be a function. Columns must be vectors (including lists), and you could have a vector where one or all elements are functions, but the column itself cannot be a function (functions are not vectors), so there's no ambiguity there. To call a function stored in your data frame you'd have to access an element of the column, and any access method, e.g. `[[` or `$`, would make the resulting set of characters invalid as the name of an object (without backticks, which would then disambiguate the intent).

    df <- tibble(x = list(function(x) x + 1))
    df %>% 
      mutate(y = x[[1]](3))

Separate from dplyr, in R when you use `(` to call a function it searches only for functions by that name.

    log <- 3
    log(1)
    # 0

    frog <- 3
    frog(3)
    # Error in frog(3) : could not find function "frog"
    
    log <- function(x) x^2
    log(1)
    # 1

pdeffebach · on Jan 27, 2022

In Julia you could have an `AbstractVector` type also be callable, or more likely a vector of callable objects (and the operation is performed row-wise).

I agree it's unlikely that a user will name their column `.data`. But it certainly saves developer effort from thinking about these issues.

The larger concern, really, is that Julia needs to know which things are columns and which things are variables in an expression at parse time in order to generate fast code for a DataFrame. It needs to do this without inspecting the data frame, since the data frame's contents aren't known at parse time.

One option would be to make all literals columns. But then you run into issues with things like `missing`, which would have to be escaped or not recognized as a column. It's hard to predict all the problems there, and any escaping rules would definitely have to be more complicated than R's. So we require `:` and take the easy way out, which has the added benefit for new users who might get confused about the variable-column distinction.

pdeffebach · on Jan 26, 2022

It would be interesting to profile the 2nd version though. Assuming the non-standard evaluation has performance benefits (which it does in DataFramesMeta.jl), are you eliminating those benefits when you use

    .data[[a_var]]

?

ryanmonroe · on Jan 25, 2022

What are the wages now? What is the wage you want? Seems odd to ask people to sign a petition for higher pay without these basic details.

ryanmonroe · on Jan 3, 2022

Can do this in iOS also, Settings > Focus