Some thoughts on dataframe. - Don't put methods on it like pandas.DataFrame. You...

orlp · on April 5, 2020

> Don't put methods on it like pandas.DataFrame. You won't get the API right the first many times and you'll end up with a million methods.

I was actually thinking about this for a bit. Since the data layout seems unlikely to ever backwards-incompatibly change, can you not define the API entirely in traits on the struct, and package them into one super trait (e.g. DataFrameAPIVersion1)?

Then you can simply import the DataFrame struct as well as the version of the API you'll use, and be 100% compatible with any other code regardless of API version they use to manipulate the dataframes.
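A minimal sketch of what such a versioned trait API could look like. Every name here (`DataFrame`, `SelectV1`, `HeadV1`, `DataFrameApiV1`, `first_rows_of`) is hypothetical and the column layout is an assumption made only for illustration:

```rust
// Hypothetical data layout: named columns of f64 values.
pub struct DataFrame {
    pub columns: Vec<(String, Vec<f64>)>,
}

// Individual pieces of version 1 of the API, each defined as a trait.
pub trait SelectV1: Sized {
    fn select(self, names: &[&str]) -> Self;
}

pub trait HeadV1: Sized {
    fn head(self, n: usize) -> Self;
}

// Super trait that bundles everything belonging to API version 1.
pub trait DataFrameApiV1: SelectV1 + HeadV1 {}

impl SelectV1 for DataFrame {
    // Keeps only the columns whose name appears in `names`.
    fn select(mut self, names: &[&str]) -> Self {
        self.columns.retain(|(name, _)| names.contains(&name.as_str()));
        self
    }
}

impl HeadV1 for DataFrame {
    // Keeps at most the first `n` rows of every column.
    fn head(mut self, n: usize) -> Self {
        for (_, values) in &mut self.columns {
            values.truncate(n);
        }
        self
    }
}

impl DataFrameApiV1 for DataFrame {}

// Code written against version 1 only needs the super trait as a bound.
fn first_rows_of<T: DataFrameApiV1>(df: T, names: &[&str], n: usize) -> T {
    df.select(names).head(n)
}

fn main() {
    let df = DataFrame {
        columns: vec![
            ("a".to_string(), vec![1.0, 2.0, 3.0]),
            ("b".to_string(), vec![4.0, 5.0, 6.0]),
        ],
    };
    let out = first_rows_of(df, &["a"], 2);
    assert_eq!(out.columns.len(), 1);
    assert_eq!(out.columns[0].1, vec![1.0, 2.0]);
}
```

A later API version could add its own traits and super trait next to these without touching the struct.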

j88439h84 · on April 5, 2020

Then application-specific functions would need a different syntax from the built-in methods, like df.pipe(f). I think that is not desirable. Better to have them all be detached from the frame.

orlp · on April 5, 2020

That's not true. You can implement your own traits for foreign types.
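A small sketch of the idea: a trait defined in your own crate can be implemented for a type defined elsewhere, so application-specific operations get the same method-call syntax as everything else. Here `Vec<f64>` stands in for a dataframe type from another crate, and the `Demean` trait is a hypothetical example:

```rust
// A trait defined locally. `Vec<f64>` is a foreign type (it lives in the
// standard library) and stands in for a DataFrame from another crate.
trait Demean {
    fn demean(self) -> Self;
}

impl Demean for Vec<f64> {
    // Subtracts the mean from every element, reusing the input buffer.
    fn demean(mut self) -> Self {
        if self.is_empty() {
            return self;
        }
        let mean = self.iter().sum::<f64>() / self.len() as f64;
        for x in &mut self {
            *x -= mean;
        }
        self
    }
}

fn main() {
    let column = vec![1.0, 2.0, 3.0];
    // Called with ordinary method syntax, no `pipe` needed.
    let centered = column.demean();
    assert_eq!(centered, vec![-1.0, 0.0, 1.0]);
}
```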

aldanor · on April 5, 2020

"Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.

One of the common pure alternatives is to make them lazy - i.e., all those pure methods don't do anything but rather just collect the information on what to do with your data. But then you need to write an execution engine for an arbitrary computation graph, which ain't easy.
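To make the lazy idea concrete, here is a deliberately tiny sketch. It only records a linear list of element-wise operations, not an arbitrary computation graph, and all names (`Op`, `LazyColumn`, `collect`) are hypothetical:

```rust
// The recorded operations. A real engine would need a graph, not a list.
enum Op {
    Scale(f64),
    Offset(f64),
}

struct LazyColumn {
    data: Vec<f64>,
    plan: Vec<Op>,
}

impl LazyColumn {
    fn new(data: Vec<f64>) -> Self {
        LazyColumn { data, plan: Vec::new() }
    }

    // These methods do no numeric work; they only extend the plan.
    fn scale(mut self, k: f64) -> Self {
        self.plan.push(Op::Scale(k));
        self
    }

    fn offset(mut self, c: f64) -> Self {
        self.plan.push(Op::Offset(c));
        self
    }

    // Executes the whole plan in a single pass over the data.
    fn collect(self) -> Vec<f64> {
        let LazyColumn { mut data, plan } = self;
        for x in &mut data {
            for op in &plan {
                match op {
                    Op::Scale(k) => *x *= *k,
                    Op::Offset(c) => *x += *c,
                }
            }
        }
        data
    }
}

fn main() {
    let result = LazyColumn::new(vec![1.0, 2.0]).scale(2.0).offset(1.0).collect();
    assert_eq!(result, vec![3.0, 5.0]);
}
```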

orlp · on April 5, 2020

> "Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.

This isn't necessarily true due to Rust's strong ownership. Methods can take `self` by value (without a reference), which means the method takes ownership of the value it is called on. In the example below `double` does not copy the vector's data, yet it is pure from the caller's point of view:

    #[derive(Clone)]
    struct A {
        v: Vec<i32>,
    }
    
    impl A {
        fn double(mut self) -> Self {
            for x in &mut self.v {
                *x *= 2;
            }
            self
        }
    }
    
    fn main() {
        let a = A { v: vec![2, 3, 4] };
        let b = a.double();
        dbg!(b.v[1]);
    }

If you now tried to access `a`, the compiler would error out, saying you're trying to use a moved-from variable. If you still wanted to keep the original `a` around, you would simply write `let b = a.clone().double()`.

nevi-me · on April 5, 2020

Yes, that's the approach that I'm taking with the library, but with a simple execution flow. One of the things that used to bite me a lot with Pandas was having to manually free memory from intermediate dataframes that I would create. I'm expecting Rust to work better in this case, because when executing a 3-step pipeline, step 2 consumes the output from step 1, and the dataframe is freed from memory.

I'll write an experiment on this in the coming weeks/months, to see if my assumption would work.
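One possible shape for such an experiment, as a sketch only. It assumes each step allocates a new frame and takes its input by value; `Frame`, `step`, `run_pipeline` and the global counters are hypothetical names used to observe how many frames are alive at once:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Number of frames currently alive, and the highest value that count reached.
static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

struct Frame {
    values: Vec<f64>,
}

impl Frame {
    fn new(values: Vec<f64>) -> Self {
        let live = LIVE.fetch_add(1, Ordering::SeqCst) + 1;
        PEAK.fetch_max(live, Ordering::SeqCst);
        Frame { values }
    }
}

impl Drop for Frame {
    fn drop(&mut self) {
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

fn live_frames() -> usize {
    LIVE.load(Ordering::SeqCst)
}

// A step owns its input, builds a new frame, and the input is dropped
// (its memory freed) when the function returns.
fn step(input: Frame, f: impl Fn(f64) -> f64) -> Frame {
    Frame::new(input.values.iter().map(|&x| f(x)).collect())
}

// Three-step pipeline; returns the final values and the peak live-frame count.
fn run_pipeline(values: Vec<f64>) -> (Vec<f64>, usize) {
    let step1 = Frame::new(values);
    let step2 = step(step1, |x| x * 2.0);
    let step3 = step(step2, |x| x + 1.0);
    (step3.values.clone(), PEAK.load(Ordering::SeqCst))
}

fn main() {
    let (out, peak) = run_pipeline(vec![1.0, 2.0]);
    assert_eq!(out, vec![3.0, 5.0]);
    // Only a step's input and output coexist; no manual freeing is needed.
    assert_eq!(peak, 2);
    assert_eq!(live_frames(), 0);
}
```

In this sketch at most two frames exist at any moment, regardless of how many steps the pipeline has.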

uryga · on April 5, 2020

i use pure functions wherever i can, but is this always the right approach when dealing with large dataframes? i imagine when you're chaining a few methods, it'd generate a large number of intermediate results that immediately get discarded/transformed again:

    frame.foo_columns().bar_rows().baz()

the result of .foo_columns() is basically linear – it gets passed to .bar_rows() immediately, with no other references to it. maybe this'd be a good place for rust's safe mutability magic?

wtetzner · on April 8, 2020

If each of those methods consumes `self` instead of taking a reference to it, then the methods are effectively pure, and don't require copying.
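A quick way to check this claim is to compare the address of the heap buffer before and after a chain of consuming methods. The sketch below assumes a hypothetical `Frame` wrapping a `Vec<i32>`; the method names are borrowed from the chain quoted above and their bodies are made up:

```rust
struct Frame {
    values: Vec<i32>,
}

impl Frame {
    // Each method takes `self` by value, mutates in place, and hands it back.
    fn foo_columns(mut self) -> Self {
        for x in &mut self.values {
            *x += 1;
        }
        self
    }

    fn bar_rows(mut self) -> Self {
        self.values.reverse();
        self
    }

    fn baz(mut self) -> Self {
        for x in &mut self.values {
            *x *= 2;
        }
        self
    }
}

// Returns the transformed values and whether the original buffer was reused.
fn chain_reuses_buffer(values: Vec<i32>) -> (Vec<i32>, bool) {
    let frame = Frame { values };
    let before = frame.values.as_ptr();
    let out = frame.foo_columns().bar_rows().baz();
    let same = out.values.as_ptr() == before;
    (out.values, same)
}

fn main() {
    let (values, same) = chain_reuses_buffer(vec![1, 2, 3]);
    assert_eq!(values, vec![8, 6, 4]);
    // Moving the Frame moves only the Vec handle, not the heap data.
    assert!(same);
}
```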

uryga · on April 9, 2020

yeah, that's what i meant by "linear" (as in linear types, similar to rust's ownership stuff, just terminology i'm more familiar with)