Git as a NoSql Database (2016) (kenneth-truyers.net)
143 points by EntICOnc on April 5, 2021 | 135 comments



In almost every database-backed application I've ever built someone, at some point, inevitably asks for the ability to see what changes were made when and by whom.

My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.

I've run this for a few toy projects and it seems to work really well. I'm ready to try it with something production-scale the next time the opportunity presents itself.
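
For the curious, a minimal sketch of that approach, run from cron (database, table, and repo path here are all hypothetical):

    #!/bin/sh
    # snapshot.sh -- dump one table and commit only when it changed
    set -e
    cd /srv/db-snapshots            # an existing git repository
    pg_dump --data-only --column-inserts -t customers mydb > customers.sql
    git add customers.sql
    # `git diff --cached --quiet` exits non-zero when the staged dump differs
    git diff --cached --quiet || git commit -m "snapshot $(date -u +%Y-%m-%dT%H:%M:%SZ)"

One caveat: the dump's row order needs to be stable between runs, otherwise the diffs (and blame) get noisy.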


> My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.

How does that solve the problem? All this would do is give you snapshots of the state of the DB at particular points in time.

The only way to know who changed what when is to keep track of those changes as they are made, and the best way to do that is to stick the changes in an eventlog table.


A developer joined a team I was on once. The boss was infamous for hiring anyone with a pulse and assigning them a small client project. Sink or swim.

As the senior dev it was my job to provide info as needed and subtly assess the new, potential rockstar code ninjas.

Anyway, this new guy says "I store all my dates in sql as text in iso format so I can easily sort them".


That is... almost not horrible?

If he had said anything other than "in sql" it would be the correct answer.

"I store all dates in spreadsheets in ISO format"

"I prepend ISO dates to marketing campaign names so I can easily sort them in the horrible web frontends that doesn't have a sort by date option"

"I start all my blog entries with ISO dates..."

I mean, just... so close!


> "I store all dates in spreadsheets in ISO format"

As in SQL, the application's native date format (which is usually a number with an appropriate display format) would be the natural choice, and a YYYY-MM-DD string would be a distant second. ISO (presumably, 8601) format is too broad, and admits of a mix of formats that are not simply sortable (quick, which comes first: 2021-W14-2 or 2021-04-07?).


Eh, when people say ISO they mostly mean YYYY-MM-DD.

If a date is going to be string prepended to something else, or if the spreadsheet is going to be exported to CSV, or parsed by some tool, then storing dates in YYYY-MM-DD can be a very justifiable thing to do in a spreadsheet.

Also Excel likes to format dates using the system locale, which is, well, stupid. So I have to enter YYYY-MM-DD as a custom format anyway, or change the locale on the column to China. (Or Cherokee, who apparently also properly format their dates!)

(Also, Excel is smart enough that if you set a column to a YYYY-MM-DD custom format, it'll let you use all the date-based pivot table stuff as expected.)


It's actually a decent hiring process, provided the new hires are made aware of it in advance.


He was a radical-honesty type of guy. It was all on the table and we had lots of success stories. We would give anyone a chance, which let a lot of self-taught, entry-level people get work. I got a ton of experience mentoring people as well (difficult people, too). Unfortunately, he got screwed over by a shotgun clause in a business agreement and lost the business to an ex-partner. That's what led me to my current job, though, where I've been happy for over 10 years and it's much more stable.


> I store all my dates in sql as text in iso format so I can easily sort them

Strong SQLite vibes... (SQLite doesn't have a date type; it has built-in functions that accept ISO-format text or Unix timestamps.)


As a Unix person, aren't dates just numbers? Why is sorting complex? Just parse input and render timestamps.


It's pretty miserable when you're trying to look up a date and all you get are numbers that require a conversion step to be readable.


Assuming by number you mean epoch seconds/milliseconds, a number can store the instant in time, but not a TIMEZONE. Obviously you could do bit magic if you shift it over (so you don't mess up sorting), but that's true for any data and a sufficiently large number type.


Time zone information makes this a bit tricky.


That's a good point. We were talking MS SQL but yes, that would make way more sense.


I did that in DB2 or Oracle (or both, don’t remember - it was 15 years ago and the project used both) because I needed millisecond time resolution and the database date time field did not support milliseconds.

Between the options of storing (date field, millisecond field) and (date as ISO 8601 text field with milliseconds, without time zone), I chose the latter, and while I can't tell for sure, it was almost surely the better choice.


Surely you can do both, so you can still perform sane sorting/indexing on "real" timestamps and then use the extra precision string timestamp for drilling down?


Iso8601 sorts just as well.

I saw no benefit in storing two copies. That's a recipe for problems - e.g. "update" code that doesn't update both, or somehow updates them differently (which one is right?). You can probably have T-SQL or some "check" constraint to enforce equivalence, but ... why?

For most practical uses, an ISO8601 textual representation is as good, though a little less space efficient.


> Iso8601 sorts just as well.

No, it doesn’t, since all of the following are valid ISO 8601 representations of a date:

20210408 2021-04-07 2021-W14-1 2021W143 2021-097 2021098

(Now, consistently using any one of the ISO 8601 formats is sortable...)


Point taken.

I always recommend and use the specific ISO8601 format

    YYYY-MM-DDThh:mm:ss.sssZ
(or drop the ".sss" if not needed), and be 100% consistent about it.

But you are right, I shouldn't refer to this as "use ISO8601" because it is only one of the numerous possible formats described by that spec.


Sounds like someone who's realised how crappy most databases' time handling is.


I once saw an app that had a text column as date, then 3 other columns for day month and year so they could sort by date.


I remember one design I reviewed that wanted to use XML as something to be queried using simple text comparison (this was before some databases added explicit support for XML).


Not XML, but the classic 'never parse HTML with regex'

https://stackoverflow.com/questions/1732348/regex-match-open...

:)


Meh, I don't see much wrong with this honestly. There are some cases where using text as the type is the only way. Bet you the junior dev worked with a lot of messy data in the past, where it wasn't always possible to do a direct datestr -> SQL date type.

Just point him in the right direction and move on.


Yeah, I did explain why that wasn't "how we did it", and tried not to sound like a know it all (I am definitely not this).

I am way less judgemental now and a lot of the comments here are making me rethink what I thought about him saying this. He had a lot of other issues and resigned after a couple of days so I guess we'll never know what he was capable of.


Hah, amazing and thanks for sharing. I think it's mostly that you sounded like you were holding a petty grudge over some small technical thing that can make sense. Maybe he resigned because he felt overly judged?


No, he had a day drinking and absentee problem. My boss talked with him, offered him another chance, but he resigned. He really didn't like following requirements either but I think it was just a symptom of his overall problems. He was going through something and we were just a stop on his journey.


Fair enough. Cheers.


Why is this bad?


SQL databases usually come with their own date types that are implemented as integers behind the curtains. They take up less space and are easier to sort than text fields.

EDIT: also I wouldn’t consider this egregious. If the Senior explains and the person was happy to learn something then that’s a good outcome. If they are stubborn about it, then that wouldn’t be great.


But which very likely store less information than an ISO timestamp does. (The integers stored are usually POSIX timestamps.) So, you might have to store, e.g., timezone information elsewhere. (And that might also be a good thing. It might not. It depends. Without knowing more, the DB's type is probably the right choice.)

I've seen far worse things than this done to a SQL database though…


Except that ISO-8601 doesn’t include anything as useful as a political time zone, just an offset from UTC. That’d be good if UTC offset was useful for anything, but instead it just adds to the confusion.


A lot of this was really cleared up for me by Evan Czaplicki's documentation for the Elm time library: https://github.com/elm/time


UTC offset is way more useful than those 3-4 character timezone abbreviations.

Just have a look at the list from Wikipedia [1] and notice how many repeats there are.

As an example AMT could be either UTC+4 or UTC-4.

And this doesn't even include the various local acronyms that people use around the world.

[1] https://en.wikipedia.org/wiki/List_of_time_zone_abbreviation...


I didn't downvote you, but just for reference, the time zone abbreviations you are supposed to use are in the tzdata db, and are listed at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

If you don't use these, you will inevitably be somewhat wrong due to leap issues, political changes in time, or the like.


Political zones are not the same thing as the abbreviated zones.

IANA time zone identifiers are fully-qualified and unambiguous like America/Los_Angeles or Asia/Kolkata.


Thanks, I have never heard the term “political zones” before, so I made an assumption.


At least in PostgreSQL, it is standard practice to use the timestamptz type, which saves a timestamp together with a time zone. Not only does it use less space on disk, but you can do a lot of operations with it natively in the database itself: sort in various ways, compare to NOW(), construct time intervals, and even ensure data validity with checks.


Timestamptz doesn't store a time zone in postgres. It converts all timestamps to utc. It's a common misconception.


> But which very likely store less information than an ISO timestamp does.

If you are trying to store dates and not timestamps, that’s what you want, otherwise equality tests might not work right.

> (The integers stored are usually POSIX timestamps.)

For date types, if they exist distinctly, usually not.


I'm going to guess because sql has rich datetime support. At least the majority of modern databases do.


You can then do things like greater than a date, or between X date and Y date, and get some use out of that column.


ITT HN collectively learns what an "audit log" is.


That reminds me of https://www.enterpriseready.io/ where you can find a guide to features, such as an audit log, that you may not have thought about when building your side project, B2C startup, etc., but that will facilitate its uptake by companies.

For example, for audit logs: https://www.enterpriseready.io/features/audit-log/


I would imagine the functional programming advocates here already understand this. For the rest, audit logs are at least a baby step towards that. Progress!


Completely agreed on the inevitability of that ask.

At risk of extreme overkill the other way, something like Debezium [1] doing change data capture and dumping into S3 might be a viable industrial-strength approach. I haven't used it in production, but have been looking for an appropriate time to try it.

[1] https://debezium.io/


Usually businesses use auditing tables to track this, typically by setting up triggers to insert a row into the auditing table any time a delete, update, or insert occurs on the original table.


So each commit is exactly one change to a relevant table?

I guess you can't just plug this into a common system of rotating logs easily, as there might be several changes between the log rotations.

Also, I guess you'd need a more user-friendly interface to actually display who made the change from the git repo.

Anyway, interesting solution.


So you commit to Git.

Sally makes a change to column 1 of record 1.

Billy makes a change to column 2 of record 1 a nanosecond later.

You commit to Git again.

Your boss wants to know who changed column 1 of record 1.

You report it was Billy.

Billy is fired.


You report it could have been Sally or Billy


That isn't how this works. Based on OP's description, the fact that Sally made a change is gone, since the last editor was Billy. It looks like Billy made the change to both column 1 and column 2.


You tell your boss:

"Just so you know, we only take a snapshot every X minutes... so there's a chance Billy wasn't the person who made that change. If you had told me you were planning to fire people as a result of this tracking system I might have spent more than five minutes hacking it together with a cronjob and a git repository."


Your boss let changes go through the production DB directly

Your boss decides to fire people as a solution

Your boss should be fired


I’m assuming the change Sally made was an unethical or illegal change.


Your boss doesn't micro manage

You should be taking care of this, not Billy


Firing a team member for a team mistake is micromanagement.


I’m assuming Billy was fired because Sally made an unethical or illegal change.


Billy did nothing wrong!


Since this is being picked apart a bit in the comments, I should clarify that I don't think this is the BEST solution to this problem. It's the solution that gives you 90% of the value with 5% of the work.


> It's the solution that gives you 90% of the value with 5% of the work.

It's not, though. History tables provide a lot more than 111% of the value of this, and they don't take 20× the effort. They have a deterministic relation to base tables (so generating their schema from that of the base tables, and the code necessary for maintenance, is a scriptable mechanical transform if you have the base table schema).

OTOH, lots of places I've seen approach this by haphazard addition of some of what I think of as "audit theater" columns to base tables (usually some or all of "created", "created_at", "last_updated", "updated_by"), so it's also not the worst approach I've seen.


A running joke at Uber is that every single configuration management system will end up re-implementing a half-assed version of Git.


Were their exact words "an ad hoc, informally specified, bug-ridden, slow implementation of half of Git"?


Heh, something to that effect. I'm not at Uber anymore so I can't reference the article which makes this assertion. It is quite well written; it starts off with a super simple version of a configuration management system and describes how it progressively evolves to encompass 80% of Git's features.


Why not have a history table and use SQL triggers to insert the new data as JSON every time someone changes something?


Indeed. I have on multiple occasions written scripts to automatically generate the triggers and audit tables.
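
For anyone who hasn't seen the pattern, a minimal sketch of the kind of thing those scripts generate, assuming Postgres 11+ and a hypothetical `orders` table (all names invented for illustration):

    CREATE TABLE orders_audit (
      id         bigserial   PRIMARY KEY,
      changed_at timestamptz NOT NULL DEFAULT now(),
      changed_by text        NOT NULL DEFAULT current_user,  -- DB role, not the app user
      operation  text        NOT NULL,                       -- INSERT / UPDATE / DELETE
      row_data   jsonb       NOT NULL                        -- the full row as JSON
    );

    CREATE FUNCTION orders_audit_fn() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_audit (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
        RETURN OLD;
      END IF;
      INSERT INTO orders_audit (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_audit_trg
      AFTER INSERT OR UPDATE OR DELETE ON orders
      FOR EACH ROW EXECUTE FUNCTION orders_audit_fn();

Capturing the application-level user (rather than the DB role) usually means passing it in via something like a session variable, which is where these setups get less mechanical.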


I should write a SQL as a NoSQL database blog post, showing a trivial one-table schema that does all you could ever want in a NoSQL, especially after choosing to ignore that it still has SQL.
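
A sketch of roughly what that one-table schema looks like, assuming Postgres with JSONB (table and data invented):

    CREATE TABLE docs (
      id  text  PRIMARY KEY,
      doc jsonb NOT NULL
    );
    -- a GIN index makes @> containment queries fast
    CREATE INDEX docs_doc_gin ON docs USING gin (doc);

    INSERT INTO docs VALUES ('user:42', '{"name": "Ada", "tags": ["admin"]}');
    SELECT doc FROM docs WHERE doc @> '{"tags": ["admin"]}';  -- "find docs tagged admin"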


This may or may not be the case anymore[1] - but for a long while Postgres managed to outperform MongoDB when restricted to just using single-JSON-column tables.

One of the larger relational DBs beating the OG NoSQL will always fill me with amusement.

1. Ed. looks like this is still the case, but please remember this is MongoDB specifically. AFAICT MongoDB is mostly a dead branch at this point and a comparison against Redis or something else that's had continued investment might be more fair. I'm just happy chilling in my SQL world - I'm not super up-to-date on the NoSQL market.


I can't find any sources for this. Can you provide one?


Here's one (read the article in detail; apparently this holds up to pretty recently): https://www.enterprisedb.com/news/new-benchmarks-show-postgr...

Also, here's SiSense sensationalizing it for easy digest: https://www.sisense.com/blog/postgres-vs-mongodb-for-storing...


Someone wrote a SQL wrapper around InnoDB & called it MySQL XD


They should sell that to a large tech corp, I bet they'd make a killing.


SQLite and the JSON1 extension get you pretty close to emulating a more powerful NoSQL database.


I'm aware. My comment was a joke, but evidently it's been well-received.


You joke, but this is basically how Facebook uses MySQL for the main user table and for the majority of Entities: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...


Postgres has good support for JSON.


check out RedBeanPHP - it does some interesting stuff for schema liquidity


> The reason why it says stupid in the man-pages is that it makes no assumptions about what content you store in it.

That's not true. The assumption that you are storing text files is very much built in to the design of git, and specifically, into the design of its diff and merge algorithm. Those algorithms treat line breaks as privileged markers of the structure of the underlying data. This is the reason git does not play well with large binary files.


The diffing/merging stuff that you see when you use the git tool is really just some sugar on top of the underlying data which is being stored. If you look at what a commit actually contains, it's just a hash of the tree root, one or more parent commits identified by their hash, and some other metadata (author, date, etc). There's nothing about the commit that cares about the contents of its tree or its parents' trees. It's the git tooling on top of those structures which does.

In git world this distinction is referred to as the "porcelain" vs "plumbing", the plumbing being the underlying structures and porcelain being the stuff most people actually use (diffs, merges, rebases, etc...)
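
You can see all of this directly with the plumbing (the hashes will be whatever your repo contains):

    git cat-file -p HEAD            # raw commit: tree <hash>, parent <hash>, author,
                                    # committer, blank line, message
    git cat-file -p 'HEAD^{tree}'   # the tree: one line per entry (mode, type, hash, name)
    git cat-file -t <any-hash>      # the type of any object: commit, tree, blob, or tag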


Yes, I get that. But:

> one or more parent commits identified by their hash

The reason that matters is because you need that information to do merges. So merges are integral to git. They are not "porcelain". They are, arguably, the whole point.


Nothing in the "plumbing" defines what merge/diff algorithm is used or is even needed. You can pick any of the merge strategies, or plug in any diff utility.

The plumbing cares only about one thing, that you have a valid data structure, a data structure that among many things tracks "what was/were the previous state/s before this commit?" as a pointer to the previous commit/s, nothing more. I can construct a git tree by hand / shell script just piping in full text/binary files, and git's porcelain will happily spit out "diffs" of the structure.

Your porcelain on top of that decides what to do with that information. Standard builds admittedly just provide diff tooling that focuses on text; more expansive UIs can diff some binary data (e.g. GitHub shows diffs of images using a UI more suited to that data, with side-by-side, swiping, or fading to let someone examine the changes).

> The assumption that you are storing text files is very much built in to the design of git

I haven't dived deep into git's source code in a while, but the last time I did, this argument would only really hold true for packfiles, where delta encoding is used to compact the .git directory's contents, as that does indeed seem tuned for generic text content instead of binary data.

That isn't related to the diff/merge functionality of git though; that is a feature used to reduce disk space by reducing loose objects (full copies of the previous versions of files), trading increased compute time (rebuilding a previous version of a file from a set of diffs) for reduced disk space.
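
Concretely, a commit can be built from the plumbing alone, with no diff machinery anywhere in the chain (a sketch; file name, content, and message invented):

    blob=$(printf 'hello\n' | git hash-object -w --stdin)        # store content, get its hash
    git update-index --add --cacheinfo "100644,$blob,greeting.txt"
    tree=$(git write-tree)                                       # snapshot the index as a tree
    commit=$(echo 'initial snapshot' | git commit-tree "$tree")  # later commits add -p <parent>
    git update-ref refs/heads/main "$commit"                     # point a branch at it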


You're not wrong, but you're missing the point.

Would you agree that keeping track of the ancestor(s) of a commit, and the fact that there can be multiple ancestors, is central to git's design? If so, why is that the case? Why is keeping track of ancestry, and the ability to have multiple ancestors, central to the design?

It's so you can do diff/merge. So yes, diff/merge is not directly part of the core. But the core is designed specifically to support diff/merge. That's the whole point. That's the reason the core data structures of git are designed the way they are. It's also the reason that git comes with a diff/merge algorithm built-in, and 99.999% of the people who use git use the default diff/merge. And the default diff/merge assumes text.

So yes, you can use git to store binary files, and you can even modify it to do something sane with them without ripping out everything and starting over from scratch. But that's not what git was originally designed for. That the design is adaptable to other use cases is a testament to Linus's foresight, but it does not change the fact that the assumption of text is woven deeply into the design. Dividing up the design into "plumbing" and "porcelain" does not change this. This part of the porcelain is as much a part of the design as the plumbing is.


Tangentially I started a re-implementation of git in Common Lisp[1], and have completed parsers for most of the file formats except delta encoded objects.

Does anyone happen to know of an implementation or tests for delta encoding I could consult that is available under an MIT-like license? (BSD, Apache v2, etc.)

[1]: https://github.com/fiddlerwoaroof/cl-git


I just used git-fsck, and my implementation is here (both read and write, hosted using git9 on 9front):

http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...

http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...

As far as I'm aware, the OpenBSD implementation of git, Game of Trees, is adapting this code too.


Thanks, delta encoding stumped me a bit because the relevant git documentation was a bit ambiguous.

Maybe I’ll implement this tonight.


I just started using git annex, which is an extension for large files.

https://git-annex.branchable.com/

I like it a lot so far, and I think it could be used for "cloud" stuff, not just backups.

I'd like to see Debian/Docker/PyPI/npm repositories in git annex, etc.

It has lazy checkouts which fits that use case. By default you just sync the metadata with 'git annex sync', and then you can get content with 'git annex get FILE', or git annex sync --content.

Can anyone see a reason why not? Those kinds of repositories all seem to have weird custom protocols. I'd rather just sync the metadata and do the query/package resolution locally. It might be bigger that way, but you only have to fully sync it once, and the rest is incremental.


I'm a little confused about what git annex is - I think it's perhaps a basic file sync utility that uses git behind the scenes?


Yeah it solves the problem of putting big binaries that don't compress well inside git.

If you've ever tried that, then git will start to choke around a few gigabytes (I think it's the packing/diffing algorithms). Github recommends that you keep repos less than 1 GB and definitely less than 5 GB, and they probably have a hard limit.

So what git annex does is simply store symlinks to big files inside .git/annex, and then it has algorithms for managing and syncing the big files. I don't love symlinks and neither does the author, but it seems to work fine. I just do ls -L -l instead of ls -l to follow the symlinks.

I think package repos are something like 300 GB, which should be easily manageable by git annex. And again you don't have to check out everything eagerly. I'm also pretty certain that git annex could support 3TB or 30TB repos if the file system has enough space.

For container images, I think you could simply store layers as files, which will save some space across many versions.

There's also git LFS, which github supports, but git annex seems more truly distributed, which I like.
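
If it helps make that concrete, the day-to-day flow looks roughly like this (file name and remote invented; the exact symlink target will differ):

    git annex add big-dataset.tar      # the file becomes a symlink into .git/annex/objects/...
    git commit -m "add dataset"        # git tracks the symlink, not the content
    git annex copy big-dataset.tar --to myserver   # push the content to another remote
    git annex drop big-dataset.tar     # free local space once enough copies exist elsewhere
    git annex get big-dataset.tar      # pull the content back later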


With git-annex, the size of the repo isn't as important; it's more about how many files you have stored in it.

I find git-annex to become a bit unwieldy at around 20k to 30k files, at least on modest hardware like a Raspberry Pi or a core i3.

(This hasn't been a problem for my use case, I've just split things up into a couple of annex repos)


> I find git-annex to become a bit unwieldy at around 20k to 30k files

Oh well there goes my plan to use it for the 500 million files we have in cloud storage!


I only recently started using it, but I think most of the limitation on metadata is from git itself. (Remember all the metadata is in git; the data is in the "annex".)

This person said they put over a million files in a single git repo, and pushed it to Github, and that was in 2015.

https://www.monperrus.net/martin/one-million-files-on-git-an...

I'm using git annex for a repository with 100K+ files and it seems totally fine.

If you're running on a Raspberry Pi, YMMV, but IME Raspberry Pi's are extremely slow at tasks like compiling CPython, so it wouldn't surprise me if they're also slow at running git.

I remember measuring, and a Raspberry Pi with a 5× lower clock rate than an Intel CPU (700 MHz vs. 3.5 GHz) was more like fifty times slower, not five times slower.

---

That said, 500 million is probably too many for one repo. But I would guess not all files need a globally consistent version, so you can have multiple repos.

Also df --inodes on my 4T and 8T drives shows 240 million, so you would most likely have to format a single drive in a special way. But it's not out of the ballpark. I think the sync algorithms would probably get slow at that number.

It's definitely not a cloud storage replacement now, but I guess my goal is to avoid cloud storage :) Although git annex is complementary to the cloud and has S3 and Glacier back ends, among many others.


Thanks for the info! I thought I was joking about the possibility of using git-annex in our case, but you've made me realize that it's not out of the realm of possibility.

We could certainly shard our usage e.g. by customer - they're enterprise customers so there aren't that many of them. We wouldn't be putting the files themselves into git anyway - using a cloud storage backend would be fine.

We currently export directory listings to BigQuery to allow us to analyze usage and generate lists of items to delete. We used to use bucket versioning but found that made it harder to manage - we now manage versioning ourselves. git-annex could potentially help manage the versioning, at least, and could also provide an easier way to browse and do simple queries on the file listings.


Git annex is pretty flexible, more of a framework for storing large files in git than a basic sync utility. ("Large" meaning larger than you'd want to directly commit to git.) If you're running Git Annex Assistant, it does pretty much work as basic file sync of a directory. But you can also use it with normal git commits, like you would Git LFS. Or as a file repository as chubot suggested. The flexibility makes it a little difficult to get started.

The basic idea is that each file targeted by `git annex add` gets replaced by a symlink pointing to its content. The content is managed by git annex and lives as a checksum-addressable blob in .git/annex. The symlink is staged in git to be committed and tracked by the usual git mechanisms. Git annex keeps a log of which host has (had) which file in a branch named "git-annex". (There is an alternate non-symlink mechanism for Windows that I don't use and know little about.)

I use git annex in the git LFS-like fashion to store experimental data (microscope images, etc.) in the same repository as the code used to analyze it. The main downside is that you have to remember to sync (push) the git annex branch _and_ copy the annexed content, as well as pushing your main branch. It can take a very long time to sync content when the other repository is not guaranteed to have all the content it's supposed to have, since in that scenario the existence and checksum of each annexed file has to be checked. (You can skip this check if you're feeling lucky.) Also, because partial content syncs are allowed, you do need to run `git annex fsck` periodically and pay attention to the number of verified file copies across repos.


The website (https://git-annex.branchable.com/) has many details, including scenarios to explain why it can be useful. git-annex is not so much a backup/sync utility, it's more a tool to track your files if they exist on multiple repositories. Instead of having remote storages holding files that happen to be the same, with git-annex the relationship is inverted: you have files, and each one can be stored on multiple storages. You can follow where they are, push them/get them from any storage that has it, remove them from one place if it's short on free space knowing that other storages still have it...

There was a project of backing up the Internet Archive by using git-annex (https://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE...). Basically the source project would create repositories of files, and users like you and me would be remote repositories; we would get content and claim that we have it, so that everyone would know this repository has a valid copy on our server.


In one way it's a very distributed/decentralized alternative to git-lfs.



No lolz here! I actually need exactly this as I'll soon be starting work on a CMS-style editor for Git-based static sites, and it's all Ruby to boot. Awesome sauce!


Maybe you can take some inspiration from FrontAid CMS https://frontaid.io/


Have you played with netlifycms.org? It's a Git based CMS that interfaces directly with GitHub apis.


lolz!


Looks like it is inspired by CouchDB


Yes, it was. It was originally a front end and back end (query server) for CouchDB. But CouchDB 2 really went in a different direction to what I expected with query servers so I gave up and decided to pivot in a different direction with the code base.

https://github.com/ioquatix/relaxo-query-server FYI.


I haven't used it myself, but based on their blog, Dolt looks like a nicer way to share tables of data in a git-like way, since you get branches and merges as well as SQL support.

https://www.dolthub.com/


A big part of the magic of git is that I can run it locally. And sync up with my team much later.

I’m not seeing a way to do anything similar with dolthub.




I don't know the name of this type of learning style, but I would eat up a tutorial on Git that started with the plumbing commands like this and worked its way up. I think this is, oddly, one of the clearest explanations of Git I've ever read. I know that wasn't really the point of the post, but still.


This doesn't start with the plumbing commands, but it's along those lines and seemed helpful to some new-to-git folks at work when we started switching to it last year: https://eagain.net/articles/git-for-computer-scientists/

If you're already comfortable with git, the Git Internals section of the book may be better: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects



There’s nothing like writing it yourself for understanding something. This tutorial goes through reimplementing the basics of git in a few lines of Python, it’s surprisingly simple and doable in an afternoon: https://wyag.thb.lt/


This article/video by Mary Rose Cook is helpful:

https://maryrosecook.com/blog/post/git-from-the-inside-out


Write yourself a git https://wyag.thb.lt/


many people have done it but here's another one https://jwiegley.github.io/git-from-the-bottom-up/


> You can query by key … and that’s about it. The only piece of good news here is that you can structure your data in folders in such a way that you can easily get content by prefix, but that’s about it. Any other query is off limits, unless you want to do a full recursive search. The only option here is to build indices specifically for querying. You can do this on a scheduled basis if staleness is of no concern or you can use git hooks to update indices as soon as a commit happens.

Isn't the point of a database the ability to query? Why else would you want "Git as a NoSql database?" If this is really what you're after, maybe you should be using Fossil, for which the repo is an sqlite database that you can query like any sqlite database.


> Isn't the point of a database the ability to query?

I was thinking about this in terms of the new CloudFlare Durable Objects open beta: https://blog.cloudflare.com/durable-objects-open-beta/ Storing data is really only half the battle.

How could you reimplement the poorest of a poor man's SQL on top of a key/value store? The idea I came up with is: whatever fields you want to query by need an index as a separate key. Maybe this would allow JOINs? Probably not?
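
A toy version of that secondary-index idea, using redis-cli purely as a stand-in for any key/value store (keys and values invented):

    # the primary record
    redis-cli SET user:42 '{"name": "Ada", "email": "ada@example.com"}'
    # the "index": one extra key per field you want to query by
    redis-cli SET idx:email:ada@example.com user:42
    # "query by email" is then two lookups: index key -> primary key -> record
    redis-cli GET idx:email:ada@example.com    # -> user:42
    redis-cli GET user:42

Joins would be N such lookup chains stitched together client-side, which is roughly why they're painful on plain key/value stores.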


The point of a database is to hold data, and there are many key/value databases for which you can only query by key unless you add a secondary index. (e.g. Bigtable, DynamoDB, Riak, Redis, etc).


Fascinating article. Looks like it was posted to HN at least six times before today and got no traction. In my browser, the examples are sometimes a concatenation of the command and the output. Does anyone else see this?


yes, i see this as well. at first it was confusing, but later i understood it was the output.


How about StackExchangeQL? Write your data as comments on StackExchange questions. To update the data, reply to your comment. It's like NoSQL because somebody else is hosting it for you and there's no schema.


Gerrit is using Git as a NoSql database. All the users, projects, reviews, ... are stored in Git repositories[1].

One of the main advantages of this approach is that replication becomes trivial: it's just a matter of running `git fetch` on all the repositories.

The main disadvantage is that NoteDB only has one implementation (Java), and no CLI to play with it.

[1]: https://gerrit-review.googlesource.com/Documentation/note-db...


Or for a SQL database with Git versioning semantics:

https://github.com/dolthub/dolt


I've been very curious to explore this type of use case with askgit (https://github.com/augmentable-dev/askgit), which was designed for running simple "slice and dice" queries and aggregations on git history (and change stats) for basic analytical purposes. I've been curious about how this could be applied to a small text+git based "db". Say, for regular JSON or CSV dumps.

This also reminds me of Dolt: https://github.com/dolthub/dolt which I believe has been on HN a couple times


You don't even need to write your key/value to a file. Git accepts empty commits and allows any message to be attached to them, so the content of your database can be the git commit message. I wrote about that concept before and tried to implement it, but never took it to its full potential (too many side projects): https://www.emadelsaid.com/git-as-messages-queue/ https://github.com/emad-elsaid/gitmq
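
The mechanics are just this (payloads invented):

    git commit --allow-empty -m '{"event": "user_signed_up", "id": 42}'
    git commit --allow-empty -m '{"event": "order_placed", "id": 7}'
    git log --reverse --format='%H %s'   # replay the "messages", oldest first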


For completeness here’s Git as a SQL Database (2018): https://opensource.com/article/18/11/gitbase


Related https://irmin.org/ (and in development since 2013)

> A distributed database built on the same principles as Git (and which can use git as a backend)


Is Irmin more like a database engine? I.e. you have to write code in OCaml in order to use it???


Isn't the problem being solved here already solved by event sourcing [0]? It lets you 're-create' your changes and is, ultimately, supposed to let you effectively log each change in such a way that it's replayable backwards and forwards.

[0]: https://martinfowler.com/eaaDev/EventSourcing.html


Kind of, but there’s more to it. “Event sourcing” is the pattern being implemented by git’s immutable object database, but there’s also all the hashing and reference management.


Thank you, was a nice read (particularly the conclusion, which was insightful and made sense)


Most NoSQL databases can guarantee at least two of consistency, availability, and partition tolerance; this seems to drop the ball on all three.


I can't tell if this is sarcasm.


In what ways does git implement CAP better than a NoSQL database?


No system can get all three of CAP, that’s the whole point of CAP. Hence why your comment sounds like a joke of some sort.


> Most noSQL can guarantee at least two of consistency, availability and partition tolerance...

You mean "at most" two.


ah yes, thank you. brain fart on my part.


"Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should."



