I have been interested in "git for binary data" for a while, mostly for ML/computer vision purposes.
I've tried quite a few systems. Of course, there's git-lfs (which commits "pointer" files and keeps the actual blobs in a cache), which I do use sometimes - but it has quite a few things I don't like. It doesn't give you a lot of control over where the files are stored and how the storage is managed on the remote side. The way it works also means there'll be two copies of your data, which is not great for huge datasets.
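For anyone who hasn't looked inside one, a pointer file is just a tiny text stub like this (the oid and size here are made up for illustration):

    version https://git-lfs.github.com/spec/v1
    oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    size 104857600

The real content lives under .git/lfs/objects, and checkout materializes a second full copy in the working tree - that's the duplication I mean.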
Git-annex (https://git-annex.branchable.com/) is pretty great, and ticks almost every checkbox I want. Unlike git-lfs, it uses symlinks instead of pointer files (by default) and gives you a lot of control over managing multiple remote repositories. On the other hand, using it outside of Linux (e.g., MacOS) has always been a bit painful, especially when trying to collaborate with less technical users. I also get the impression that the main developer doesn't have much time for it (understandably - I don't think he makes any money off it, even if there were some early attempts).
My current solution is DVC (https://dvc.org/). It's explicitly made with ML in mind, and implements a bunch of stuff beyond binary versioning. It does lack a few of git-annex's features, but has the ones I care about most - namely, a fair amount of flexibility in how the remote storage is implemented. And the thing I like most is that it can work either like git-lfs (with pointer files), like git-annex (with soft- or hard-links), or -- my favorite -- with reflinks, when running on filesystems that support them (e.g. APFS, btrfs). It's also being actively developed by a team at a company, though so far there don't seem to be any paid features or services around it.
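If you want to try the reflink mode, it's a one-line config. A minimal sketch, with placeholder paths:

    dvc init
    dvc config cache.type reflink,hardlink,symlink,copy   # try reflinks first, fall back otherwise
    dvc add data/images            # writes data/images.dvc and links the data from the cache
    git add data/images.dvc .gitignore
    git commit -m "track dataset with dvc"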
Pachyderm (https://www.pachyderm.com) also seems quite interesting, and pretty ideal for some workflows. Unfortunately it's also more opinionated, in that it requires Docker to use its filesystem, as far as I can tell.
Edit: a rather different alternative I've resorted to in the past -- which of course lacks a lot of the features of "git for binary data" -- is simply to do regular backups of the data with either borg or restic, which are pretty good deduplicating backup systems. Both allow you to mount past snapshots with FUSE, which is a nice way of accessing earlier versions of your data (read-only, of course). These days this kind of thing can be done with ZFS or btrfs snapshots as well.
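For reference, the snapshot-and-mount workflow is just a couple of commands in both (repo paths are placeholders):

    # borg
    borg create /backups/repo::dataset-2021-06 ~/datasets
    borg mount /backups/repo::dataset-2021-06 /tmp/snap   # browse read-only via FUSE
    borg umount /tmp/snap

    # restic
    restic -r /backups/restic-repo backup ~/datasets
    restic -r /backups/restic-repo mount /tmp/snap        # snapshots appear under /tmp/snap/snapshots/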
+1 for DVC. Setting up the backing store can be some extra work if you are doing that yourself, but after that it's a breeze.
What do you use for the backing store?
Git-lfs has been a pain in my seat since my first use of it. Most of the issues stem from the pointer files, which have to be cleaned/smudged every time matching files are staged or checked out.
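For context, this is the machinery I mean: `git lfs track "*.bin"` drops a filter line into .gitattributes, and `git lfs install` wires up the clean/smudge filters in your git config, roughly like this:

    # .gitattributes (written by `git lfs track "*.bin"`)
    *.bin filter=lfs diff=lfs merge=lfs -text

    # git config (written by `git lfs install`)
    [filter "lfs"]
        clean = git-lfs clean -- %f
        smudge = git-lfs smudge -- %f
        process = git-lfs filter-process
        required = true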
Haven't used git-annex myself, but I have heard from coworkers that cross-OS is a pain.
Mostly S3. I used to do SSH, but these days I can afford to keep the data in the cloud. I do appreciate the possibility of migrating to other stores if needed in the future, though - might have to soon, for $reasons.
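Pointing DVC at S3 (and migrating later) is only a couple of commands, for what it's worth - bucket names below are placeholders:

    dvc remote add -d s3remote s3://my-bucket/dvc-store
    dvc push                                  # upload the cache to S3

    # later, moving to another store:
    dvc remote add gcsremote gs://other-bucket/dvc-store
    dvc push -r gcsremote
    dvc remote default gcsremote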
Actually, a lot has changed in git-annex in the last few years. It now supports git pointer files like git-lfs, which makes things easier when you want to modify binary files. In fact, it can even use git-lfs servers as one of its back-ends. I still prefer symlink mode, though, because operations on it are faster: they bypass the smudge filter.
Also, git-annex uses reflink copies whenever possible, on zfs, btrfs, or apfs. And since people were talking about p2p and git: git-annex does this amazing trick of syncing directly with other git-annex repos, even with a checked-out branch. There is no need at all for a separate server.
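Both of those are just a couple of commands; the remote name and path below are placeholders:

    # pointer-file style working tree, instead of symlinks:
    git annex adjust --unlock

    # direct repo-to-repo sync, no central server needed:
    git remote add laptop user@laptop:annex.git
    git annex sync --content laptop     # pushes git branches and annexed data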
I have used git-annex for years on OSX, and have not found it to be deficient in any way compared to linux.
Yeah, git-annex has a lot of cool features that I have yet to see in other systems. I still use it for some things. My main pain point on MacOS was that the symlink mode didn't work well with some apps that didn't understand symlinks. Obviously this is not git-annex's fault, but it still made it so I couldn't use it. I think I could try again at some point and see if I could get it to use reflinks -- maybe it's a version issue.
I also had weird conflicts with line endings (the whole CR/LF annoyance) in some of the metadata files git-annex uses, which I couldn't fix no matter how many .gitconfigs I tweaked. Again, this is not really git-annex's fault, I think.