
I'm a little confused about what git annex is - I think it's perhaps a basic file sync utility that uses git behind the scenes?


Yeah it solves the problem of putting big binaries that don't compress well inside git.

If you've ever tried that, git will start to choke around a few gigabytes (I think it's the packing/diffing algorithms). GitHub recommends that you keep repos under 1 GB and definitely under 5 GB, and they probably have a hard limit.

So what git annex does is store the content of big files under .git/annex and replace them in your working tree with symlinks, and then it has algorithms for managing and syncing the big files. I don't love symlinks and neither does the author, but it seems to work fine. I just do ls -L -l instead of ls -l to follow the symlinks.
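Roughly, the flow looks like this (file name made up; from memory, so double check against the docs):

    cd my-repo
    git annex init "laptop"          # set up the annex in an existing git repo
    git annex add big-dataset.tar    # content moves under .git/annex/objects,
                                     # the working-tree path becomes a symlink to it
    git commit -m "add big dataset"  # the symlink, not the content, gets committed
    ls -lL big-dataset.tar           # -L follows the symlink to the real blob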

I think package repos are something like 300 GB, which should be easily manageable by git annex. And again you don't have to check out everything eagerly. I'm also pretty certain that git annex could support 3TB or 30TB repos if the file system has enough space.

For container images, I think you could simply store layers as files, which would save some space across many versions.

There's also git LFS, which GitHub supports, but git annex seems more truly distributed, which I like.


With git-annex the size of the repo isn't as important, it's more about how many files you have stored in it.

I find git-annex to become a bit unwieldy at around 20k to 30k files, at least on modest hardware like a Raspberry Pi or a Core i3.

(This hasn't been a problem for my use case, I've just split things up into a couple of annex repos)


> I find git-annex to become a bit unwieldy at around 20k to 30k files

Oh well there goes my plan to use it for the 500 million files we have in cloud storage!


I only recently started using it, but I think most of the limitation on metadata is from git itself. (Remember all the metadata is in git; the data is in the "annex".)

This person said they put over a million files in a single git repo, and pushed it to GitHub, and that was in 2015.

https://www.monperrus.net/martin/one-million-files-on-git-an...

I'm using git annex for a repository with 100K+ files and it seems totally fine.

If you're running on a Raspberry Pi, YMMV, but IME Raspberry Pis are extremely slow at tasks like compiling CPython, so it wouldn't surprise me if they're also slow at running git.

I remember measuring this: a Raspberry Pi with a 5x lower clock rate than an Intel CPU (700 MHz vs. 3.5 GHz) was more like fifty times slower, not 5 times slower.

---

That said, 500 million is probably too many for one repo. But I would guess not all files need a globally consistent version, so you could have multiple repos.

Also, df --inodes on my 4 TB and 8 TB drives shows 240 million inodes, so you would most likely have to format a single drive in a special way. But it's not out of the question. I think the sync algorithms would probably get slow at that number.

It's definitely not a cloud storage replacement now, but I guess my goal is to avoid cloud storage :) That said, git annex is complementary to the cloud and has S3 and Glacier backends, among many others.
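For example, hooking up an S3 bucket is roughly a one-liner (remote and bucket names made up):

    git annex initremote mys3 type=S3 encryption=shared bucket=my-annex-bucket
    git annex copy --to=mys3 big-dataset.tar   # push content; git annex records that mys3 has it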


Thanks for the info! I thought I was joking about the possibility of using git-annex in our case, but you've made me realize that it's not out of the realm of possibility.

We could certainly shard our usage e.g. by customer - they're enterprise customers so there aren't that many of them. We wouldn't be putting the files themselves into git anyway - using a cloud storage backend would be fine.

We currently export directory listings to BigQuery to allow us to analyze usage and generate lists of items to delete. We used to use bucket versioning but found that made it harder to manage - we now manage versioning ourselves. git-annex could potentially help manage the versioning, at least, and could also provide an easier way to browse and do simple queries on the file listings.
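Guessing at how we'd actually use it (remote names made up), the kinds of queries git-annex can answer directly seem to be things like:

    git annex whereis some/path                        # which remotes have this file's content
    git annex find --in=gcs-remote                     # everything recorded as present in one remote
    git annex find --largerthan=1GB --not --copies=2   # large files with fewer than two known copies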


Git annex is pretty flexible, more of a framework for storing large files in git than a basic sync utility. ("Large" meaning larger than you'd want to directly commit to git.) If you're running Git Annex Assistant, it does pretty much work as basic file sync of a directory. But you can also use it with normal git commits, like you would Git LFS. Or as a file repository as chubot suggested. The flexibility makes it a little difficult to get started.

The basic idea is that each file targeted by `git annex add` gets replaced by a symlink pointing to its content. The content is managed by git annex and lives as a checksum-addressable blob in .git/annex. The symlink is staged in git to be committed and tracked by the usual git mechanisms. Git annex keeps a log of which host has (had) which file in a branch named "git-annex". (There is an alternate non-symlink mechanism for Windows that I don't use and know little about.)
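Concretely (file name made up):

    readlink data/image001.tif            # symlink points into .git/annex/objects/...,
                                          # named by the content's checksum
    git annex whereis data/image001.tif   # reads the location log on the git-annex branch:
                                          # every repo recorded as having this content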

I use git annex in the git LFS-like fashion to store experimental data (microscope images, etc.) in the same repository as the code used to analyze it. The main downside is that you have to remember to sync (push) the git-annex branch _and_ copy the annexed content, as well as pushing your main branch. It can take a very long time to sync content when the other repository is not guaranteed to have all the content it's supposed to have, since in that scenario the existence and checksum of each annexed file has to be checked. (You can skip this check if you're feeling lucky.) Also, because partial content syncs are allowed, you do need to run `git annex fsck` periodically and pay attention to the number of verified file copies across repos.
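My push routine, roughly, assuming the other repository is the "origin" remote:

    git annex sync                 # push/pull/merge the main and git-annex branches
    git annex copy --to=origin .   # actually transfer the annexed content
    git annex fsck                 # verify local checksums and copy counts
    git annex fsck --from=origin   # or verify what the remote really has (the slow part)

(git annex sync --content rolls the first two steps into one.)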


The website (https://git-annex.branchable.com/) has many details, including scenarios to explain why it can be useful. git-annex is not so much a backup/sync utility; it's more a tool to track your files as they exist across multiple repositories. Instead of having remote storages that happen to hold the same files, with git-annex the relationship is inverted: you have files, and each one can be stored on multiple storages. You can follow where they are, push them to or get them from any storage that has them, and remove them from one place if it's short on free space, knowing that other storages still have them...
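In practice that model looks something like this (remote name made up):

    git annex get photos/raw/ --from=nas   # fetch content from a remote that has it
    git annex drop photos/raw/             # free local space; only succeeds if enough
                                           # other copies are known to exist elsewhere
    git annex numcopies 2                  # never allow a drop below 2 known copies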

There was a project to back up the Internet Archive using git-annex (https://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE...). Basically the source project would create repositories of files, and users like you and me would act as remote repositories; we would get content and announce that we have it, so that everyone would know a valid copy exists on our server.


In one way it's a very distributed/decentralized alternative to git-lfs.



