JDF.jl – Julia DataFrames serialization format

jstx1 · on July 15, 2023

I don't see the value of a language-specific format when we have things like parquet.

ChrisRackauckas · on July 15, 2023

The repo is pretty clear on its intentions to not be a language-specific format:

> Development Plans

> I fully intend to develop JDF.jl into a language neutral format by version v0.4. However, I have other OSS commitments including R's {disk.frame} and hence new features might be slow to come onboard. But I am fully committed to making JDF files created using JDF.jl v0.2 or higher loadable in all future JDF.jl versions.

It's beyond v0.4 and at v0.5.1 right now so it seems delayed, but the repo describes intentions to be a language-neutral format after the dust settles. I think we're just taking a snapshot too early in the timeframe to really know what this project is truly all about. And given that the author is an OSS dev for multiple languages, this intention makes a lot of sense.

thumbuddy · on July 15, 2023

I think Julia is one of those researcher languages like Haskell so it's probably just someone having fun, writing a paper, or whatever.

VHRanger · on July 15, 2023

Julia is still niche, but it's definitely used in production?

The NY Fed have their main model of the economy in Julia (ported from Matlab)

agumonkey · on July 15, 2023

this https://juliahub.com/case-studies/ny-fed/ ?

VHRanger · on July 15, 2023

This:

https://frbny-dsge.github.io/DSGE.jl/latest/

thumbuddy · on July 16, 2023

Sounds like research to me? I dunno...

Buttons840 · on July 15, 2023

I don't agree with this. Julia does not have new ideas the same way Haskell does, and Julia has done a lot of work to be practical, like having a good way of calling Python and R code from early on in its life. If Julia has done anything new, it's probably related to being fast, but that's not exactly a novel idea.

mrfox321 · on July 15, 2023

Sure it does.

Have you seen the type system used to generically dispatch matrices to GPUs or cpus.

Or the dispatching to give you auto-differentiation?

Maybe it's not your definition of "new ideas", but they are really useful and original.

gugagore · on July 15, 2023

Julia is unique in using multiple dispatch pervasively. It did not invent multiple dispatch, but it certainly popularized it.

thumbuddy · on July 16, 2023

How come you never see anything practical made in Julia in production somewhere that isn't a pile of research mathematics? I personally can't think of a single tool or company. It's not like it's a young language either? I have seen dozens of niche papers written using it though.

Being fair, I have seen dozens of not hundreds of products and mass deployed projects written in similar languages. Even Haskell made its way to a lot of desktops in the Linux world, and Haskell is super niche research territory imo.

ChrisRackauckas · on July 16, 2023

The clinical analysis for the COVID-19 vaccines was pretty practical and that was Pumas used by Moderna. There's a JuliaCon talk about that from the Director Head of Pharmacometrics at Moderna. There's Formula 1 usage, there's a some satellites running Julia onboard, etc. Some are documented here https://juliahub.com/case-studies/ though some people post what they are doing on the Discourse and Slack and so those end up being the most comprehensive source. Of course you can always argue whether 30 public examples is enough, or 50, or etc. but it's far from 0. And of course, most examples are never for public release.

thumbuddy · on July 23, 2023

Fifteen years ish and there's maybe ten-thirty production use cases that kinda can't be verified because they all sound like research projects doesn't sound super great to me. No offense, but I do wonder if all of those cases could have been accomplished more easily/sustainably with another tool and if they exist simply because someone was selling the language to a company as a contractor as a form of vendor lock in or something.

I guess I'll be curious once I install some software or an OS and see it brings in Julia in as a dependency or something. Otherwise I worry it's Matlab 2.0 with less of a mindshare...

akdor1154 · on July 15, 2023

If the parquet julia lib is slow, just go and fix that? Sorry I don't mean to discourage people trying cool new things, but seriously parquet is great, and I'd much rather see a diversity of great parquet implementations than a diversity of language specific parquet competitors.

markkitti · on July 15, 2023

The comparison in the README is to the parquet package for R.

braza · on July 15, 2023

It’s sad that Julia lost the so much preference in the Data Ecosystem. I think the syntax it’s clean, it’s fast, a lovely community and a clear vision on what needs to be done.

goerz · on July 15, 2023

Has it “lost” preference? I was more under the impression that it just hasn’t yet gained as much popularity as it deserves based on its technical merits, due to being quite a young language and the entrenchment of a large existing Python ecosystem.

moomin · on July 15, 2023

Isn’t the Python data frame format becoming an established format? I mean, there’s already commercial products that talk it natively.

constantcrying · on July 15, 2023

Python is definitely more popular and its formats are naturally more widely spread.

Python is also a very bad language for many of the things it is used for (anything to do with numerics, e.g. machine learning) and having a competing format from a language which is, at least in that regard, far superior seems like a good thing.

jstx1 · on July 15, 2023

It's not competing if it's limited to Julia. And saving a dataframe isn't on anyone's list of technical concerns anyway.

It's a big stretch to say that a tiny Julia library which does a thing we can already do well in Julia and Python and just about any other language has any impact on the language ecosystems and how they compare to each other.

constantcrying · on July 15, 2023

>It's not competing if it's limited to Julia.

Sure, but it should.

keithalewis · on July 15, 2023

It is impossible to have an intelligent discussion with someone who only knows Python, FSVO know. Dunning-Kruger in action.

tomrod · on July 15, 2023

I have extensive work in Matlab, C, Fortran, R, Python, Julia, SAS (IML), and am learning Go and Rust. I have played with Java and Kotlin on non-serious projects.

I typically use Python for data-related tasks.

Julia is just okay. It lost its steam due to long beta, slow time to first plot after 1.0 release, and generally not being a strategically adopted language by ML innovation groups. A great example of what is versus what was hoped is John Myles White's work[0], where the headline is "worked on large community Python projects" but is one of the originators of Julia's fundamental data packages.

I've had lots of intelligent people who use code as a tool instead of as a craft -- for data folks, their craft is working with data and not the tooling around it. For folks like this, Python is simply a better option in today's environment.

[0] https://www.johnmyleswhite.com/about/

crabbone · on July 15, 2023

> For folks like this, Python is simply a better option in today's environment.

By what metric? Me, being involved with the infrastructure of research projects, I see how much time and effort is wasted just by working around the sheer stupidity of Python and its tooling daily. Because researchers generally don't see the infrastructure work as important, they also tend not to associate the effort it takes to get the infrastructure decently functional with their own work. So, they tend to think that it doesn't matter what language environment they are using (and thus prefer the familiar one).

Inevitably, this spills into researcher's work anyways. One of the typical problems (related to Python tooling) I see is this: the project worked for a while with a set of dependencies with which it was initially created, and one day everything explodes because some dependency screw something up. Non-infra people usually have hard time figuring out what broke and how to fix it, so they choose the path of least resistance: either not build / test project automatically anymore (only on individual developer's machines) or try to "freeze" the dependencies, preventing the update from breaking the old (broken) stuff from being exposed, or "fixing" by mindlessly copying some StackOverflow "solution" that does some asinine thing, but allows for the project to keep "working".

In the end of the day, Python-based projects tend to have very short shelf life. Often by the time they near completion they are already so behind the "latest and greatest" that others, who might have wanted to adopt them, don't want to do that because they'd have to make special downgrades in their infrastructure to even try it. Larger Python projects are almost guaranteed not to work with other projects because of overlapping and conflicting dependencies. Making Python-based projects available to non-programmers is another painful experience that usually ends in a failure.

So, you might achieve higher development velocity on a particular stretch in your development journey, but you definitely didn't account for the whole thing, definitely not with an eye for sustainability / longevity.

johnmyleswhite · on July 16, 2023

Given you're referring to me and my career, can you explain what you mean about "worked on large community Python projects"? I don't see that text on the page you're citing. Is it an interpretation of "I’ve been part of teams working on a variety of things, including extensions to the Python language and A/B testing"?

tomrod · on July 16, 2023

Hey John! Yes, you're accurate here in how I interpreted your work on extensions in the Python language. Let me know if that is in error, would love to hear an alternative opinion to Julia's growth and status. My subjective take has been that the initial excitement of Julia has waned a bit (a shame, it's a good language) and that it's not seeing strong adoption.

EDIT: That said, I just looked up the TIOBE index and it appears to be growing. My subjective take is likely in error here!

johnmyleswhite · on July 16, 2023

I still use Julia a lot and still think it is the best bet for a dynamic language with acceptable speed. I don't work on many OSS projects anymore, so I'm not an active contributor, but I'm not contributing to other languages in my spare time either. The projects you see in my "about" page are my paid efforts.

I think Julia's adoption is still not at all comparable to Python or R, but neither of those languages were comparably popular at this point in their own lifetimes. I think people underestimate how slow change is here.

Julia absolutely could fail to compete in the long run, but it is still growing and it is improving.

tomrod · on July 16, 2023

> . I don't work on many OSS projects anymore, so I'm not an active contributor, but I'm not contributing to other languages in my spare time either. The projects you see in my "about" page are my paid efforts.

I appreciate the clarification and correction. Apologies for muddying the water.

keithalewis · on July 15, 2023

Are you disagreeing with something I said?

tomrod · on July 15, 2023

Indeed.

ssivark · on July 15, 2023

The world is a very heterogeneous place, and we're all immersed in our own bubbles -- each with its own assumptions and needs. "Data related" is a fairly vague description.

- It appears that there's substantial community effort invested in smoothing the pathway for scientific modeling and related ML. So that's one area where I expect anyone picking up Julia will see quick+substantial wins.

- Recent Julia versions (1.9, 1.10 on the way) have made tremendous progress in TTFX: https://julialang.org/blog/2023/04/julia-1.9-highlights/

- I happen to work on problems (with more of a computation/simulation flavor, rather than "data related") where Python is extremely painful and Julia happens to be the best fit by far (and rapidly getting even smoother!).

> I've had lots of intelligent people who use code as a tool instead of as a craft

Agreed -- so there's no point being fanatical about it. Folks can keep an eye out and periodically re-evaluate whether the ecosystem has reached maturity for their needs. For those who want to "use libraries", it will take a bit longer than those who want to "write libraries".

Julia aims to solve the two-language problem. If you live in Python and never have to touch another language like C or Fortran, then the value proposition of Julia is going to feel much more tenuous. But there are a set of people (library developers) who experience the ecosystem in the diametrically opposite way: https://twitter.com/dillonniederhut/status/16791406806799728...

> and generally not being a strategically adopted language by ML innovation groups.

I see the tremendous engineering effort being sunk to massage different (mutually incompatible) subsets of Python into some shape amenable for ML -- supporting new algorithms on top, and more efficient program execution at the bottom (translating from software to hardware).

I've heard the claim that it's sometimes better to delay a solution and let the users feel the pain before they open their minds to the solution. That is what I'm reminded of when I think of ML and Julia. If ML users don't see the value of Julia yet, that's okay -- they might once they dig themselves in deeper.

If they manage to solve all ML problems with Python (with C and what not), that's fine too! I think the world is a bigger place where there are also other interesting things going on, and Julia is helping a lot of people do things they couldn't accomplish otherwise :-)

--

There's no fundamental reasons for languages to "age", unless they happen to be tied to some unsuitable assumptions in how they model the world -- aspects that cannot be rearchitected without fundamentally changing the language. Barring that problem, languages only get more mature.

The problem with Python is not that it is three decades old, but that its revealed priorities (model of the world and software) is out of sync with many of today's needs. Even if we forget advanced things like ML, there's still a bunch of really basic stuff that make Python painful: https://twitter.com/dmimno/status/1679474354579488771

I think the design of Julia is much more robust in those aspects, but we also need to see how multiple dispatch plays out for very large codebases+teams. Meanwhile, I expect Julia will keep improving rapidly in the use cases it supports.

tomrod · on July 15, 2023

Excellent then that you agree that people who use Python as their coding language aren't necessarily Dunning-Kruger exhibits

keithalewis · on July 16, 2023

Pro tip: you can identify who is posting by their username. If you are responding to me you left out the word 'only'. Even the guy you cited is calling you out. Time to slink away?

tomrod · on July 16, 2023

Parent comment here was not in response to you, but to ssivark's comment.

My point earlier was that people who solely use Python aren't DK exhibits in action simply because they know one language. Further, they may do so as a consequence of tech stack used in their roles or in their organizations. Given Python's adoption in the data domain, many people in the space use code simply as a tool to get things delivered.

keithalewis · on July 17, 2023

And should not be relied on to suggest data formats that can be useful in other languages, as others in this thread have pointed out.

Python is useful for people who have no formal training, but they are leveraging off the work of others. I hope you can understand how experts do not attach much weight to people still trying to figure out how computers actually work.

tomrod · on July 17, 2023

And this point:

> experts do not attach much weight to people still trying to figure out how computers actually work

is a much weaker form of this point:

> It is impossible to have an intelligent discussion with someone who only knows Python, FSVO know. Dunning-Kruger in action.

The first point is true for anyone focused on a single language, outside of perhaps Assembly. The second is simply wrong.

jzwinck · on July 15, 2023

There is no practical Pandas DataFrame binary format on disk. They have "tofile" which is not portable across machines. They have "to_hdf" which produces files no other library understands in an intuitive way. They mostly recommend "to_csv" with all the usual problems of CSV.

NumPy does have an official documented format which is straightforward to read and write from other languages if you write some code yourself, but still doesn't have very wide support outside of Python.

benglish11 · on July 15, 2023

Arrow is the more universal data format. It is already used to serialize non-python data structures in libraries like spark , snowflake-snowpark and polars. Pandas can read/write in arrow format as well. In pandas 2.0 the numpy backend can even be replaced with a pyarrow backend. I think this directly contradicts your statement about “no practical pandas binary format exists”. That may have been true a few years ago but certainly is not an issue any more.

aldanor · on July 15, 2023

The sole fact that pandas, while trying to catch up to polars, now sort-of supports arrow as it's in-memory backend doesn't make it a "pandas dataframe binary format" in any possible way.

ayhanfuat · on July 15, 2023

The whole point of Arrow is that it is language agnostic; so of course it is not a "pandas binary format". However, pandas didn't start supporting Arrow "to try to catch up to polars". The creator of pandas, Wes McKinney, came up with the Arrow idea because of the interoperability limitations he came across while working with pandas and the big data ecosystem. Supporting Arrow has been in the works for a long time: https://wesmckinney.com/blog/pandas-and-apache-arrow/ Pandas is an older and larger library with less maneuver capability and millions of users relying on things not being broken.

dijksterhuis · on July 15, 2023

Parquet is a binary file format and is very widely supported.

I think there’s even an Excel plug-in for parquet these days.

radarsat1 · on July 15, 2023

> I think there’s even an Excel plug-in for parquet these days.

While CSV comes with plenty of known problems, its one advantage, which apparently remains a significant one, is that it doesn't require any plugins or special software support at all, just a text editor. It seems that's a big enough feature that no attempt to fully replace it has really worked out. Amazing when you think about it, the power of plain text encoding.

Of course I agree that it's often not the best choice but somehow it remains incredibly useful and can even be hard to make arguments against it on a team project because it just kind or works well enough in a lot of cases and is simply a lowest common denominator when people can't agree on what format to use.

crabbone · on July 15, 2023

> Python data frame format

That is not a thing. The closest would be the NumPy binary format, which is "nothing to write home about". It's not good in any particular way. Some sort of a stop-gap solution for when you really want to save the contents of a NumPy array very much, and you don't want to deal with writing some commonly supported format that's (hopefully) optimized for some particular task.

juujian · on July 15, 2023

Apache Arrow sits in a similar niche, but it has support for multiple languages.

tjrgergw · on July 15, 2023

What exactly is the advantage of this vs Arrow.jl?

ChrisRackauckas · on July 15, 2023

If I'm not mistaken, Arrow is an in-memory format and not necessarily a storage format. Though it is common to use Arrow with the Parquet serialization format, and the benchmarks here show a much faster read performance than Parquet while achieving around the same file size. From my read, that seems to be the impetus for the project (with comparisons made on Linkedin by the author as well) and so in theory Arrow and this could be complementary like Arrow and Parquet if someone setup the links.

zokier · on July 15, 2023

> the benchmarks here show a much faster read performance than Parquet

That felt bit surprising.. is parquet slow to read in general or is the julia implementation just slow?

tjrgergw · on July 16, 2023

Arrow.jl, when compression is off, is just MMAP. I don't know how it can be faster than that.

tjrgergw · on July 16, 2023

The benchmark comparison is vs Feather.jl.

SilverBirch · on July 15, 2023

The difficulty with serialization formats is that you're serializing for a reason. What if I want to read these dataframes into C++ or Python or some other language? It's pretty rare for pure performance to be the only consideration in choosing a serialization format - and unfortunately all the other reasons favour well-supported, long-established projects that can afford to hand out libraries in every language you could want.

ElectronBadger · on July 16, 2023

Thanks, added to the Big Book of Julia (https://adamwysokinski.codeberg.page/bbj).

zarkenfrood · on July 15, 2023

Looks interesting and the syntax appears to be very similar to R data.table. It will be interesting to keep an eye on once it gets to v1.0.

amj7e · on July 15, 2023

This is a pretty cool package and very useful. Is the project stale? What are the plans going forward?

markkitti · on July 15, 2023

There was a release 25 days ago. The HN submitter might even be the author.

amj7e · on July 16, 2023

Yeah, but it was just a project.toml update. There haven't been serious updates to the src for a long time that's why Im asking. Hopefully, it resumes.

zX41ZdbW · on July 15, 2023

Interesting to see a comparison with ClickHouse Native format.

bobbylarrybobby · on July 15, 2023

I think the initialism JDF may already be taken

https://en.m.wikipedia.org/wiki/Jewish_Defense_League

derstander · on July 15, 2023

> I think the initialism JDF may already be taken

Acronyms are lossy compression. I’d bet there’s a JDF that even predates the one you reference in one of your descendant comments.

I’d be interested in knowing (though admittedly not enough to look) if there’s even a three letter acronym using the English alphabet that’s not already in use.

eindiran · on July 15, 2023

That's "JDL", which isn't the same 3 letters.

bobbylarrybobby · on July 15, 2023

Whoops, I meant https://www.jewishdefensefund.org/