In almost every database-backed application I've ever built, someone, at some point, inevitably asks for the ability to see what changes were made, when, and by whom.
My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.
I've run this for a few toy projects and it seems to work really well. I'm ready to try it with something production-scale the next time the opportunity presents itself.
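A minimal sketch of what I mean, assuming SQLite and a cron-driven script; the database, table, and repo names here are placeholders, not anything from a real project:

```python
# Sketch: dump a table to CSV in a git repo and commit it on a schedule.
import csv
import sqlite3
import subprocess
from datetime import datetime, timezone

DB_PATH = "app.db"        # hypothetical application database
REPO = "table-history"    # an existing git repo that holds the snapshots
TABLE = "customers"       # hypothetical table to track

def snapshot():
    conn = sqlite3.connect(DB_PATH)
    cur = conn.execute(f"SELECT * FROM {TABLE} ORDER BY 1")  # stable order => clean diffs
    headers = [col[0] for col in cur.description]
    with open(f"{REPO}/{TABLE}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(cur)
    conn.close()
    subprocess.run(["git", "-C", REPO, "add", f"{TABLE}.csv"], check=True)
    # `git commit` exits non-zero when nothing changed, so don't treat that as fatal
    subprocess.run(
        ["git", "-C", REPO, "commit", "-m",
         f"snapshot {datetime.now(timezone.utc).isoformat()}"],
        check=False,
    )

if __name__ == "__main__":
    snapshot()   # run from cron (or any scheduler) as often as you need
```

Because the dump is ordered, plain text, `git diff` between any two commits shows exactly which rows changed between snapshots.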
> My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.
How does that solve the problem? All this would do is give you snapshots of the state of the DB at particular points in time.
The only way to know who changed what when is to keep track of those changes as they are made, and the best way to do that is to stick the changes in an eventlog table.
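A minimal sketch of that eventlog idea, using SQLite so it runs standalone; the schema and helper are illustrative, not a claim about how anyone's production system does it:

```python
# Application-level eventlog: every write also records who changed what and when.
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
CREATE TABLE eventlog (
    id          INTEGER PRIMARY KEY,
    occurred_at TEXT NOT NULL,   -- UTC, ISO 8601
    actor       TEXT NOT NULL,   -- who made the change
    table_name  TEXT NOT NULL,
    row_id      INTEGER NOT NULL,
    action      TEXT NOT NULL,   -- 'insert' | 'update' | 'delete'
    changes     TEXT NOT NULL    -- JSON of the changed values
);
""")

def update_plan(account_id, new_plan, actor):
    with conn:  # one transaction: the change and its log entry commit together
        conn.execute("UPDATE accounts SET plan = ? WHERE id = ?", (new_plan, account_id))
        conn.execute(
            "INSERT INTO eventlog (occurred_at, actor, table_name, row_id, action, changes) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), actor,
             "accounts", account_id, "update", json.dumps({"plan": new_plan})),
        )

conn.execute("INSERT INTO accounts (id, email, plan) VALUES (1, 'a@example.com', 'free')")
update_plan(1, "pro", actor="sally")
print(conn.execute("SELECT actor, action, changes FROM eventlog").fetchall())
```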
> "I store all dates in spreadsheets in ISO format"
As in SQL, the application's native date format (which is usually a number with an appropriate display format) would be the natural choice, and a YYYY-MM-DD string would be a distant second. ISO (presumably 8601) format is too broad, and admits of a mix of formats that are not simply sortable (quick, which comes first: 2021-W14-2 or 2021-04-07?).
Eh, when people say ISO they mostly mean YYYY-MM-DD.
If a date is going to be string prepended to something else, or if the spreadsheet is going to be exported to CSV, or parsed by some tool, then storing dates in YYYY-MM-DD can be a very justifiable thing to do in a spreadsheet.
Also Excel likes to format dates using the system locale, which is, well, stupid. So I have to enter YYYY-MM-DD as a custom format anyway, or change the locale on the column to China. (Or Cherokee, who apparently also properly format their dates!)
(Also, Excel is smart enough that if you set a column to a YYYY-MM-DD custom format, it'll still let you use all the date-based pivot table stuff as expected.)
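For what it's worth, the sortability point is easy to demonstrate: plain YYYY-MM-DD strings sort correctly as text, which is why they survive CSV export and ad-hoc tooling, while other ISO 8601 spellings break that property:

```python
# YYYY-MM-DD is fixed-width and zero-padded, so lexicographic order matches chronological order.
from datetime import date

dates = ["2021-04-07", "2020-12-31", "2021-01-15"]
assert sorted(dates) == sorted(dates, key=date.fromisoformat)
print(sorted(dates))  # ['2020-12-31', '2021-01-15', '2021-04-07']

# Mixing in other ISO 8601 spellings breaks it:
# "2021-W14-2" is 2021-04-06, yet it sorts *after* "2021-04-07" as a string.
print(sorted(["2021-04-07", "2021-W14-2"]))  # ['2021-04-07', '2021-W14-2']
```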
He was a radical-honesty type of guy. It was all on the table and we had lots of success stories. We would give anyone a chance, which let a lot of self-taught, entry-level people get work. I got a ton of experience mentoring people as well (difficult people, too). Unfortunately, he got screwed over by a shotgun clause in a business agreement and lost the business to an ex-partner. That's what led me to my current job, though, where I've been happy for over 10 years and it's much more stable.
Assuming by number you mean epoch seconds/milliseconds, a number can store the instant in time, but not a TIMEZONE. Obviously you could do bit magic if you shift it over (so you don't mess up sorting), but that's true for any data and a sufficiently large number type.
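A toy illustration of that "shift it over" idea, not a recommendation: keep the instant in the high bits and the UTC offset in the low bits, so numeric ordering is still dominated by the instant. The bit layout here is made up for the sketch:

```python
# Pack (epoch seconds, UTC offset in minutes) into one integer that still sorts by instant.
def encode(epoch_seconds: int, offset_minutes: int) -> int:
    assert epoch_seconds >= 0
    assert -720 <= offset_minutes <= 840          # UTC-12:00 .. UTC+14:00
    return (epoch_seconds << 11) | (offset_minutes + 1024)

def decode(packed: int) -> tuple[int, int]:
    return packed >> 11, (packed & 0x7FF) - 1024

stamps = [encode(1_617_753_600, -300), encode(1_617_753_599, 120)]
assert sorted(stamps)[0] == stamps[1]             # the earlier instant sorts first
print([decode(s) for s in stamps])
```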
I did that in DB2 or Oracle (or both, don’t remember - it was 15 years ago and the project used both) because I needed millisecond time resolution and the database date time field did not support milliseconds.
Between the options of storing (date field, millisecond field) and (date as an ISO 8601 text field with milliseconds, without time zone), I chose the latter, and I can't tell for sure, but it was almost surely the better choice.
Surely you can do both, so you can still perform sane sorting/indexing on "real" timestamps and then use the extra precision string timestamp for drilling down?
I saw no benefit in storing two copies. That's a recipe for problems - e.g. "update" code that doesn't update both, or somehow updates them differently (which one is right?). You can probably have T-SQL or some "check" constraint to enforce equivalence, but ... why?
For most practical uses, an ISO8601 textual representation is as good, though a little less space efficient.
I remember one design I reviewed that wanted to use XML as something to be queried using simple text comparison (this was before some databases added explicit support for XML).
Meh, I don't see much wrong with this honestly. There are some cases where using text as the type is the only way. Bet you the junior dev worked with a lot of messy data in the past, where it wasn't always possible to do a direct datestr -> SQL date type.
Just point him in the right direction and move on.
Yeah, I did explain why that wasn't "how we did it", and tried not to sound like a know-it-all (which I am definitely not).
I am way less judgemental now and a lot of the comments here are making me rethink what I thought about him saying this. He had a lot of other issues and resigned after a couple of days so I guess we'll never know what he was capable of.
Hah, amazing and thanks for sharing. I think it's mostly that you sounded like you were holding a petty grudge over some small technical thing that can make sense. Maybe he resigned because he felt overly judged?
No, he had a day drinking and absentee problem. My boss talked with him, offered him another chance, but he resigned. He really didn't like following requirements either but I think it was just a symptom of his overall problems. He was going through something and we were just a stop on his journey.
SQL databases usually come with their own date types that are implemented with integers behind the curtain. They take up less space and are easier to sort than text fields.
EDIT: also I wouldn’t consider this egregious. If the Senior explains and the person was happy to learn something then that’s a good outcome. If they are stubborn about it, then that wouldn’t be great.
But which very likely store less information than an ISO timestamp does. (The integers stored are usually POSIX timestamps.) So, you might have to store, e.g., timezone information elsewhere. (And that might also be a good thing. It might not. It depends. Without knowing more, the DB's type is probably the right choice.)
I've seen far worse things than this done to a SQL database though…
Except that ISO-8601 doesn’t include anything as useful as a political time zone, just an offset from UTC. That’d be good if UTC offset was useful for anything, but instead it just adds to the confusion.
At least in PostgreSQL, it is standard practice to use the timestamptz type, which stores the timestamp normalized to UTC (despite the name, it doesn't keep the zone itself). Not only does it use less space on disk than text, but you can do a lot of operations with it natively in the database itself: sort in various ways, compare to NOW(), construct time intervals, and even ensure data validity with CHECK constraints.
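A rough sketch of those operations via psycopg2; the connection string and table are placeholders, and the CHECK against now() is only there to show the mechanism (Postgres accepts it, but like any CHECK it is only evaluated at write time):

```python
# Sketch of native timestamptz operations: sorting, comparing to NOW(), intervals, CHECKs.
import psycopg2

conn = psycopg2.connect("dbname=scratch")   # placeholder connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        at   timestamptz NOT NULL DEFAULT now(),
        CHECK (at <= now() + interval '1 day')   -- data-validity check in the DB itself
    )
""")
cur.execute("INSERT INTO events (name) VALUES ('signup')")

# Native sorting, comparison to NOW(), and interval arithmetic:
cur.execute("""
    SELECT name, at, now() - at AS age
    FROM events
    WHERE at > now() - interval '7 days'
    ORDER BY at DESC
""")
print(cur.fetchall())
conn.commit()
```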
That reminds me of https://www.enterpriseready.io/ where you can find a guide of which features, such as an audit log, you may not have though about when building your side project, B2C startup, etc. that will facilitate its uptake by companies.
I would imagine the functional programming advocates here already understand this. For the rest, audit logs are at least a baby step towards that. Progress!
Completely agreed on the inevitability of that ask.
At the risk of extreme overkill in the other direction, something like Debezium [1] doing change monitoring and dumping into S3 might be a viable industrial-strength approach. I haven't used it in production, but I've been looking for an appropriate time to try it.
Usually businesses use auditing tables to track this, typically by setting up triggers that insert a row into the auditing table any time a delete, update, or insert occurs on the original.
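A runnable sketch of that trigger pattern, using SQLite so it works anywhere; the table and column names are invented, and a real setup would also record the acting user (which the database alone usually can't see):

```python
# Trigger-based audit table: every UPDATE/DELETE on the base table writes an audit row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT);

CREATE TABLE customers_audit (
    audit_id   INTEGER PRIMARY KEY,
    changed_at TEXT DEFAULT (datetime('now')),
    action     TEXT,
    id         INTEGER,
    name       TEXT,
    plan       TEXT
);

CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_audit (action, id, name, plan)
    VALUES ('update', NEW.id, NEW.name, NEW.plan);
END;

CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
    INSERT INTO customers_audit (action, id, name, plan)
    VALUES ('delete', OLD.id, OLD.name, OLD.plan);
END;
""")

conn.execute("INSERT INTO customers (id, name, plan) VALUES (1, 'Acme', 'free')")
conn.execute("UPDATE customers SET plan = 'pro' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
print(conn.execute("SELECT action, id, plan FROM customers_audit").fetchall())
```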
That isn't how this works. Based on OP's description, the fact that Sally made a change is gone, since the last editor was Billy. It looks like Billy made the change to both columns 1 and 2.
"Just so you know, we only take a snapshot every X minutes... so there's a chance Billy wasn't the person who made that change. If you had told me you were planning to fire people as a result of this tracking system I might have spent more than five minutes hacking it together with a cronjob and a git repository."
Since this is being picked apart a bit in the comments, I should clarify that I don't think this is the BEST solution to this problem. It's the solution that gives you 90% of the value with 5% of the work.
> It's the solution that gives you 90% of the value with 5% of the work.
It's not, though. History tables don't provide only 111% of the value of this, and don't take 20× the effort (the ratios a 90%-for-5% claim would imply). They have a deterministic relation to base tables (so generating the schema from that of the base tables, and the code necessary for maintenance, is a scriptable mechanical transform if you have the base table schema).
OTOH, lots of places I've seen approach this by haphazard addition of some of what I think of as “audit theater” columns to base tables (usually some or all of “created”, “created_at”, “last_updated”, “updated_by”), so it's also not the worst approach I've seen.
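A toy version of that mechanical transform, in SQLite dialect: read the base table's columns and emit the history table plus an update trigger. A real generator would handle types, keys, deletes, and the actor column properly; this only shows that it's scriptable:

```python
# Generate a history table and trigger from an existing base table's schema.
import sqlite3

def history_ddl(conn, table):
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    names = [row[1] for row in info]
    defs = ", ".join(row[1] + " " + row[2] for row in info)   # "name TYPE" pairs
    col_list = ", ".join(names)
    new_list = ", ".join("NEW." + c for c in names)
    return f"""
    CREATE TABLE {table}_history (
        history_id INTEGER PRIMARY KEY,
        changed_at TEXT DEFAULT (datetime('now')),
        operation  TEXT,
        {defs}
    );
    CREATE TRIGGER {table}_hist_upd AFTER UPDATE ON {table} BEGIN
        INSERT INTO {table}_history (operation, {col_list})
        VALUES ('update', {new_list});
    END;
    """

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL, status TEXT)")
conn.executescript(history_ddl(conn, "invoices"))
conn.execute("INSERT INTO invoices VALUES (1, 99.0, 'draft')")
conn.execute("UPDATE invoices SET status = 'sent' WHERE id = 1")
print(conn.execute("SELECT operation, id, status FROM invoices_history").fetchall())
```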
Heh something to that effect. I’m not at Uber anymore so can’t reference the article which makes this assertion. It is quite well written, starts off with a super simple version of configuration management system and describes how it progressively evolves to encompass 80% of Git features.
I should write a SQL as a NoSQL database blog post, showing a trivial one-table schema that does all you could ever want in a NoSQL, especially after choosing to ignore that it still has SQL.
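Something like this, presumably; a sketch of the one-table version in PostgreSQL with a JSONB column (connection string and names are placeholders):

```python
# "NoSQL" in one SQL table: key lookups, document queries, and a GIN index.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=scratch")   # placeholder connection string
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        key text PRIMARY KEY,
        doc jsonb NOT NULL
    );
    CREATE INDEX IF NOT EXISTS docs_doc_idx ON docs USING gin (doc);
""")
cur.execute(
    "INSERT INTO docs (key, doc) VALUES (%s, %s) "
    "ON CONFLICT (key) DO UPDATE SET doc = EXCLUDED.doc",
    ("user:1", Json({"name": "sally", "plan": "pro"})),
)
# Query by key, or by any field inside the document:
cur.execute("SELECT key, doc FROM docs WHERE doc @> %s", (Json({"plan": "pro"}),))
print(cur.fetchall())
conn.commit()
```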
This may or may not be the case anymore[1], but for a long while Postgres managed to outperform MongoDB when restricted to just using single-JSON-column tables.
One of the larger relational DBs beating the OG NoSQL will always fill me with amusement.
1. Ed. looks like this is still the case, but please remember this is MongoDB specifically. AFAICT MongoDB is mostly a dead branch at this point and a comparison against Redis or something else that's had continued investment might be more fair. I'm just happy chilling in my SQL world - I'm not super up-to-date on the NoSQL market.
> The reason why it says stupid in the man-pages is that it makes no assumptions about what content you store in it.
That's not true. The assumption that you are storing text files is very much built in to the design of git, and specifically, into the design of its diff and merge algorithm. Those algorithms treat line breaks as privileged markers of the structure of the underlying data. This is the reason git does not play well with large binary files.
The diffing/merging stuff that you see when you use the git tool is really just some sugar on top of the underlying data which is being stored. If you look at what a commit actually contains, it's just a hash of the tree root, one or more parent commits identified by their hash, and some other metadata (author, date, etc). There's nothing about the commit that cares about the contents of its tree or its parents' trees. It's the git tooling on top of those structures which does.
In git world this distinction is referred to as the "porcelain" vs "plumbing", the plumbing being the underlying structures and porcelain being the stuff most people actually use (diffs, merges, rebases, etc...)
> one or more parent commits identified by their hash
The reason that matters is because you need that information to do merges. So merges are integral to git. They are not "porcelain". They are, arguably, the whole point.
Nothing in the "plumbing" defines what merge/diff algorithm is used or is even needed. You can pick any of the merge strategies, or plug in any diff utility.
The plumbing cares only about one thing, that you have a valid data structure, a data structure that among many things tracks "what was/were the previous state/s before this commit?" as a pointer to the previous commit/s, nothing more. I can construct a git tree by hand / shell script just piping in full text/binary files, and git's porcelain will happily spit out "diffs" of the structure.
Your porcelain on top of that decides what to do with that information; standard builds admittedly just provide diff tooling that focuses on text, while more expansive UIs can diff some binary data (e.g. GitHub shows diffs of images using a UI more suited to that data, with side-by-side, swiping, or fading to allow someone to examine the changes).
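To make the plumbing/porcelain split concrete, here's a sketch (Python shelling out to git) that builds two commits purely with plumbing commands and only asks the porcelain for a diff afterwards; it assumes git is on PATH and user.name/user.email are configured, and should be run in an empty scratch directory:

```python
# Build commits with plumbing only (hash-object, mktree, commit-tree); diff via porcelain.
import subprocess

def git(*args, **kw):
    return subprocess.run(["git", *args], check=True, capture_output=True,
                          text=True, **kw).stdout.strip()

git("init", "-q", ".")

def commit(content, parent=None):
    blob = git("hash-object", "-w", "--stdin", input=content)
    tree = git("mktree", input=f"100644 blob {blob}\tdata.txt\n")
    args = ["commit-tree", tree, "-m", "snapshot"]   # needs a configured git identity
    if parent:
        args += ["-p", parent]
    return git(*args)

first = commit("hello\n")
second = commit("hello world\n", parent=first)
git("update-ref", "refs/heads/main", second)

# The plumbing never computed a diff; the porcelain derives one on demand:
print(git("diff", first, second))
```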
> The assumption that you are storing text files is very much built in to the design of git
I haven't dived deep into git's source code in a while, but the last time I did, this argument would only really hold true for packfiles, where delta encoding is used to compact the .git directory's contents, as that does indeed seem tuned for generic text content rather than binary data.
That isn't related to the diff/merge functionality of git though; that is a feature used to reduce disk space by reducing loose objects (full copies of the previous versions of files), trading increased compute time (rebuilding a previous version of a file from a set of diffs) for reduced disk space.
Would you agree that keeping track of the ancestor(s) of a commit, and the fact that there can be multiple ancestors, is central to git's design? If so, why is that the case? Why is keeping track of ancestry, and the ability to have multiple ancestors, central to the design?
It's so you can do diff/merge. So yes, diff/merge is not directly part of the core. But the core is designed specifically to support diff/merge. That's the whole point. That's the reason the core data structures of git are designed the way they are. It's also the reason that git comes with a diff/merge algorithm built-in, and 99.999% of the people who use git use the default diff/merge. And the default diff/merge assumes text.
So yes, you can use git to store binary files, and you can even modify it to do something sane with them without ripping out everything and starting over from scratch. But that's not what git was originally designed for. That the design is adaptable to other use cases is a testament to Linus's foresight, but it does not change the fact that the assumption of text is woven deeply into the design. Dividing up the design into "plumbing" and "porcelain" does not change this. This part of the porcelain is as much a part of the design as the plumbing is.
Tangentially, I started a re-implementation of git in Common Lisp[1], and have completed parsers for most of the file formats except delta-encoded objects.
Does anyone happen to know of an implementation or tests for delta encoding I could consult that is available under an MIT-like license? (BSD, Apache v2, etc.)
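Not an answer to the licence question, but for reference, the delta *application* half is small enough to sketch from the documented copy/insert opcode format; treat this as my reading of the pack delta format rather than a verified implementation:

```python
# Apply a git packfile delta to a source buffer: two size varints, then copy/insert opcodes.
def read_size(buf, i):
    """Little-endian base-128 varint: 7 bits per byte, MSB = continuation."""
    size, shift = 0, 0
    while True:
        byte = buf[i]; i += 1
        size |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return size, i

def apply_delta(source: bytes, delta: bytes) -> bytes:
    src_size, i = read_size(delta, 0)
    tgt_size, i = read_size(delta, i)
    assert src_size == len(source)
    out = bytearray()
    while i < len(delta):
        op = delta[i]; i += 1
        if op & 0x80:                      # copy a span from the source
            offset = size = 0
            for bit in range(4):           # bits 0-3: which offset bytes follow
                if op & (1 << bit):
                    offset |= delta[i] << (8 * bit); i += 1
            for bit in range(3):           # bits 4-6: which size bytes follow
                if op & (1 << (4 + bit)):
                    size |= delta[i] << (8 * bit); i += 1
            size = size or 0x10000         # size 0 means 64 KiB
            out += source[offset:offset + size]
        elif op:                           # insert `op` literal bytes
            out += delta[i:i + op]; i += op
        else:
            raise ValueError("opcode 0 is reserved")
    assert len(out) == tgt_size
    return bytes(out)

# Tiny self-test: copy "hello " from the source, then insert "git".
delta = bytes([11, 9, 0x90, 6, 3]) + b"git"
assert apply_delta(b"hello world", delta) == b"hello git"
```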
I like it a lot so far, and I think it could be used for "cloud" stuff, not just backups.
I'd like to see Debian/Docker/PyPI/npm repositories in git annex, etc.
It has lazy checkouts which fits that use case. By default you just sync the metadata with 'git annex sync', and then you can get content with 'git annex get FILE', or git annex sync --content.
Can anyone see a reason why not? Those kinds of repositories all seem to have weird custom protocols. I'd rather just sync the metadata and do the query/package resolution locally. It might be bigger that way, but you only have to fully sync it once, and the rest is incremental.
Yeah it solves the problem of putting big binaries that don't compress well inside git.
If you've ever tried that, then git will start to choke around a few gigabytes (I think it's the packing/diffing algorithms). Github recommends that you keep repos less than 1 GB and definitely less than 5 GB, and they probably have a hard limit.
So what git annex does is simply store symlinks to big files inside .git/annex, and then it has algorithms for managing and syncing the big files. I don't love symlinks and neither does the author, but it seems to work fine. I just do ls -L -l instead of ls -l to follow the symlinks.
I think package repos are something like 300 GB, which should be easily manageable by git annex. And again you don't have to check out everything eagerly. I'm also pretty certain that git annex could support 3TB or 30TB repos if the file system has enough space.
For container images, I think you could simply store layers as files, which would save some space across many versions.
There's also git LFS, which github supports, but git annex seems more truly distributed, which I like.
I only recently started using it, but I think most of the limitation on metadata is from git itself. (Remember all the metadata is in git; the data is in the "annex".)
This person said they put over a million files in a single git repo, and pushed it to Github, and that was in 2015.
I'm using git annex for a repository with 100K+ files and it seems totally fine.
If you're running on a Raspberry Pi, YMMV, but IME Raspberry Pi's are extremely slow at tasks like compiling CPython, so it wouldn't surprise me if they're also slow at running git.
I remember measuring, and a Raspberry Pi with a 5x lower clock rate than an Intel CPU (700 MHz vs. 3.5 GHz) was more like fifty times slower, not 5 times slower.
---
That said, 500 million is probably too many for one repo. But I would guess not all files need a globally consistent version, so you can have multiple repos.
Also df --inodes on my 4T and 8T drives shows 240 million, so you would most likely have to format a single drive in a special way. But it's not out of the ballpark. I think the sync algorithms would probably get slow at that number.
It's definitely not a cloud storage replacement now, but I guess my goal is to avoid cloud storage :) Although git annex is complementary to the cloud and has S3 and Glacier back ends, among many others.
Thanks for the info! I thought I was joking about the possibility of using git-annex in our case, but you've made me realize that it's not out of the realm of possibility.
We could certainly shard our usage e.g. by customer - they're enterprise customers so there aren't that many of them. We wouldn't be putting the files themselves into git anyway - using a cloud storage backend would be fine.
We currently export directory listings to BigQuery to allow us to analyze usage and generate lists of items to delete. We used to use bucket versioning but found that made it harder to manage - we now manage versioning ourselves. git-annex could potentially help manage the versioning, at least, and could also provide an easier way to browse and do simple queries on the file listings.
Git annex is pretty flexible, more of a framework for storing large files in git than a basic sync utility. ("Large" meaning larger than you'd want to directly commit to git.) If you're running Git Annex Assistant, it does pretty much work as basic file sync of a directory. But you can also use it with normal git commits, like you would Git LFS. Or as a file repository as chubot suggested. The flexibility makes it a little difficult to get started.
The basic idea is that each file targeted by `git annex add` gets replaced by a symlink pointing to its content. The content is managed by git annex and lives as a checksum-addressable blob in .git/annex. The symlink is staged in git to be committed and tracked by the usual git mechanisms. Git annex keeps a log of which host has (had) which file in a branch named "git annex". (There is an alternate non-symlink mechanism for Windows that I don't use and know little about.)
I use git annex in the git LFS-like fashion to store experimental data (microscope images, etc.) in the same repository as the code used to analyze it. The main downside is that you have to remember to sync (push) the git annex branch _and_ copy the annexed content, as well as pushing your main branch. It can take a very long time to sync content when the other repository is not guaranteed to have all the content it's supposed to have, since in that scenario the existence and checksum of each annexed file has to be checked. (You can skip this check if you're feeling lucky.) Also, because partial content syncs are allowed, you do need to run `git annex fsck` periodically and pay attention to the number of verified file copies across repos.
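For anyone wondering what that symlink mechanism amounts to, here's a stripped-down imitation (not git-annex itself; the real key format and directory layout differ), meant to be run inside a scratch git repo on a filesystem with symlink support:

```python
# Replace a large file with a symlink into a checksum-addressed object store.
import hashlib
import os
from pathlib import Path

ANNEX = Path(".git/annex/objects")

def annex_add(path: str) -> Path:
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    key = f"SHA256-{digest}"
    obj = ANNEX / key[:3] / key          # fan out by prefix, like most object stores
    obj.parent.mkdir(parents=True, exist_ok=True)
    if not obj.exists():
        p.replace(obj)                   # move the content into the store
    else:
        p.unlink()                       # content already present; drop the duplicate
    p.symlink_to(os.path.relpath(obj, p.parent))
    return obj                           # the symlink is what gets committed to git

# Usage: annex_add("experiment/raw_image.tif"), then `git add` the symlink as usual.
```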
The website (https://git-annex.branchable.com/) has many details, including scenarios to explain why it can be useful. git-annex is not so much a backup/sync utility, it's more a tool to track your files if they exist on multiple repositories. Instead of having remote storages holding files that happen to be the same, with git-annex the relationship is inverted: you have files, and each one can be stored on multiple storages. You can follow where they are, push them/get them from any storage that has it, remove them from one place if it's short on free space knowing that other storages still have it...
There was a project of backing up the Internet Archive using git-annex (https://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE...). Basically the source project would create repositories of files, and users like you and me would be remote repositories; we would get content and claim that we have it, so that everyone would know this repository has a valid copy on our server.
No lolz here! I actually need exactly this as I'll soon be starting work on a CMS-style editor for Git-based static sites, and it's all Ruby to boot. Awesome sauce!
Yes, it was. It was originally a front end and back end (query server) for CouchDB. But CouchDB 2 really went in a different direction to what I expected with query servers so I gave up and decided to pivot in a different direction with the code base.
I haven't used it myself, but based on their blog, Dolt looks like a nicer way to share tables of data in a git-like way, since you get both branches and merges and SQL support.
I don't know the name of this type of learning style but I would eat up a tutorial on Git that started with the plumbing commands like this and worked their way up. I think this is oddly one of the clearest explanations of Git I've ever read. I know that wasn't really the point of the post but still.
This doesn't start with the plumbing commands, but it's along those lines and seemed helpful to some new-to-git folks at work when we started switching to it last year: https://eagain.net/articles/git-for-computer-scientists/
There’s nothing like writing it yourself for understanding something. This tutorial goes through reimplementing the basics of git in a few lines of Python, it’s surprisingly simple and doable in an afternoon: https://wyag.thb.lt/
> You can query by key … and that’s about it. The only piece of good news here is that you can structure your data in folders in such a way that you can easily get content by prefix, but that’s about it. Any other query is off limits, unless you want to do a full recursive search. The only option here is to build indices specifically for querying. You can do this on a scheduled basis if staleness is of no concern or you can use git hooks to update indices as soon as a commit happens.
Isn't the point of a database the ability to query? Why else would you want "Git as a NoSql database?" If this is really what you're after, maybe you should be using Fossil, for which the repo is an sqlite database that you can query like any sqlite database.
How could you reimplement the poorest of a poor man's SQL on top of a key/value store? The idea I came up with is: whatever fields you want to query by need an index as a separate key. Maybe this would allow JOINs? Probably not?
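A sketch of that "index as a separate key" idea, with a dict standing in for the key/value store (in the git case the keys would be file paths):

```python
# Secondary index over a plain key/value store: (field, value) -> set of primary keys.
store = {}   # primary records: key -> record
index = {}   # secondary index

def put(key, record):
    old = store.get(key, {})
    for field, value in old.items():                 # un-index the old values
        index.get((field, value), set()).discard(key)
    store[key] = record
    for field, value in record.items():              # index the new values
        index.setdefault((field, value), set()).add(key)

def find(field, value):
    return [store[k] for k in index.get((field, value), ())]

put("user:1", {"name": "sally", "plan": "pro"})
put("user:2", {"name": "billy", "plan": "free"})
print(find("plan", "pro"))   # [{'name': 'sally', 'plan': 'pro'}]
```

A JOIN then degenerates into an index lookup per side plus an intersection of key sets, so it's possible, but you end up writing the query planner yourself.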
The point of a database is to hold data, and there are many key/value databases for which you can only query by key unless you add a secondary index. (e.g. Bigtable, DynamoDB, Riak, Redis, etc).
Fascinating article. Looks like it was posted to HN at least six times before today and got no traction. In my browser, the examples are sometimes a concatenation of the command and the output. Does anyone else see this?
How about StackExchangeQL? Write your data as comments on StackExchange questions. To update the data, reply to your comment. It's like NoSQL because somebody else is hosting it for you and there's no schema.
I've been very curious to explore this type of use case with askgit (https://github.com/augmentable-dev/askgit), which was designed for running simple "slice and dice" queries and aggregations on git history (and change stats) for basic analytical purposes. I've been curious about how this could be applied to a small text+git based "db". Say, for regular JSON or CSV dumps.
You don't even need to write your key/value to a file. Git accepts committing an empty commit and allows any message to go with it, so the content of your database can be the git commit message.
I wrote about that concept before and tried to implement it, but never took it to its full potential (too many side projects):
https://www.emadelsaid.com/git-as-messages-queue/
https://github.com/emad-elsaid/gitmq
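A sketch of that approach, assuming it runs inside a scratch git repository with an identity configured; the key=value message convention is made up for the example:

```python
# Key/value store where each "write" is an empty commit and `git log` is the read path.
import subprocess

def git(*args):
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

def put(key, value):
    git("commit", "--allow-empty", "-m", f"{key}={value}")

def latest(key):
    # log is newest-first, so the first matching subject line wins
    for line in git("log", "--pretty=%s").splitlines():
        if line.startswith(key + "="):
            return line.split("=", 1)[1]

put("plan:user1", "free")
put("plan:user1", "pro")
print(latest("plan:user1"))   # "pro" -- the most recent commit for that key
```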
Isn't the problem being solved here already solved by event sourcing[0]? It lets you 're-create' your changes and is ultimately supposed to let you log each change in a way that's replayable backwards and forwards.
Kind of, but there’s more to it. “Event sourcing” is the pattern being implemented by git’s immutable object database, but there’s also all the hashing and reference management.