LiteFS: a FUSE-based file system for replicating SQLite (github.com/superfly)
241 points by sysbot on July 26, 2022 | 65 comments



If a "FUSE to replicate SQLite" solution came from anywhere else, I'd be quite skeptical, but there is a lot of very interesting tech coming out of fly.io these days and Ben certainly knows this space well. It still feels a little like a hack and piercing of layers of abstraction (less so than, say, litestream).

I love it when at first glance it isn't clear if a project is a crazy idea from someone just goofing around vs a highly leveraged crazy idea that will be a foundational part of a major technology shift.

I suspect it's the latter, and that the strategy is to layer this on top of Litestream to create an easy way to use SQLite transparently in a widely distributed multi-node environment (Litestream providing the backups and/or read-only replication to remote sites, with LiteFS handling low-latency local access in a cluster, POP, or data center).

Cool stuff. It will be fun to see where fly takes this :)


Thanks for the vote of confidence! I can understand the "hack" feel -- it's a trade-off. If I wrote it the "proper" way and integrated directly into the SQLite source or used a VFS then it'd be a lot harder to deploy for most folks. By making it a FUSE file system, someone can use it without really knowing much about it from the application's perspective.

As for strategy, it unfortunately doesn't work to layer with Litestream as backups need some strict control over who is the current primary. Instead, I'm adding S3 replication support [1] directly into LiteFS. LiteFS also uses a different transactional file format called LTX so it wouldn't be compatible with Litestream. The LTX format is optimized for compactions so point-in-time restores can be nearly instant.

The end goal isn't much of a secret. We want to let folks spin up nodes in regions across the world, automatically connect to one another, and have the whole thing have the ease of a single node app. We still have a ways to go on that vision but that's what we're going for.

[1] https://github.com/superfly/litefs/issues/18


> I'm adding S3 replication support directly into LiteFS.

Nice! There's a lot of value one can get out of a blob store, despite it seeming at odds with block-device-dependent systems like most SQL DBMSes.

When a database at BigCloud layered replication (real-time backups) atop S3, they did so by shipping both the WAL and the on-disk files. For write-heavy tables, the WAL was streamed every second, and on-disk files (snapshots) every 30 minutes (or at an appropriate size-based threshold).

WAL streaming also doubled as a foundation for them to build materialized views, support real-time triggers, and act as an online data-verification layer, while S3 itself served as insurance against hardware errors (memory, CPU, network, disk) and data corruption.

https://web.archive.org/web/20220712155558/https://www.useni... (keyword search S3)
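For a rough picture, the shipping loop described above boils down to a pair of timers. A hypothetical Go sketch (the upload helpers stand in for the actual S3 calls; offset bookkeeping and the size-based threshold are omitted):

  package main

  import (
      "log"
      "time"
  )

  // Hypothetical stand-ins for the real S3 uploads.
  func uploadWALSegment() error { return nil } // ship WAL bytes written since the last offset
  func uploadSnapshot() error   { return nil } // ship a full on-disk snapshot

  func main() {
      walTick := time.NewTicker(1 * time.Second)   // WAL streamed every second
      snapTick := time.NewTicker(30 * time.Minute) // snapshots every 30 minutes
      defer walTick.Stop()
      defer snapTick.Stop()

      for {
          select {
          case <-walTick.C:
              if err := uploadWALSegment(); err != nil {
                  log.Printf("wal ship failed: %v", err)
              }
          case <-snapTick.C:
              if err := uploadSnapshot(); err != nil {
                  log.Printf("snapshot ship failed: %v", err)
              }
          }
      }
  }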

Elasticsearch / OpenSearch does something similar but it only implements snapshot-based replication to S3 (periodic backups).

https://web.archive.org/web/20190722153122/https://www.micro... / https://archive.is/Q5jUj (docs)


> a highly leveraged crazy idea that will be a foundational part of a major technology shift

Has anything other than the Cloud presented a true foundational shift in how applications are built? Kubernetes, Serverless, Blockchain, React, Swift: these things are big, but not big enough.

I think we just like pretending every little thing is the next big thing.


LiteFS author here (also Litestream author). I'm happy to answer any questions folks have about how it works or what's on the roadmap.


Very cool. I've been working on CRDTs for SQLite to allow for conflict-free replication. It's still super early, but I have a prototype that works and, if you're up to it, I would love to pick your brain on the details of SQLite WALs and journaling.

https://www.loom.com/share/0934f93364d340e0ba658146a974edb4


Right on, I always thought a CRDT SQLite database would be awesome. I thought about looking at the session extension[1] to handle the merges but I didn't get very far into it.

Feel free to ping me at benbjohnson@yahoo.com if you want to chat WAL & journaling. I think it'd be difficult to implement CRDT at the journal level since it's all physical pages. Merging data in pages is going to sometimes cause splits and that'll be hairy to deal with.

[1] https://www.sqlite.org/sessionintro.html


That was my assumption, but it's been suggested a few times to implement it there. Nice to hear some validation that it isn't a great spot.


Why a FUSE filesystem instead of just using SQLite's VFS features? FUSE adds an extra user-kernel-user context switch, which is expensive compared to a quick in-process call to a VFS driver.


I was going to raise that point exactly.

As someone who spends an awful amount of time using FUSE, my recommendation is to only use it in cases where the software that interacts with the file system isn't easily changeable. For example, for Buildbarn which I maintain (https://github.com/buildbarn/bb-remote-execution), I need to use it. It's infeasible to change arbitrary compilers and tests to all interact with a network distributed build cache. Designing the FUSE file system was a pretty heavy investment though, as you really need to be POSIXly correct to make it all work. The quality of implementations of FUSE also varies between OSes and their versions. macFUSE, for example, is quite different from Linux FUSE.

Given that SQLite already has all of the hooks in place, I would strongly recommend using those. In addition to increasing portability, it also makes it easier to set up/run. As an example, it's pretty hard to mount a FUSE file system inside of a container running on Kubernetes without risking locking up the underlying host. Doing the same thing with the SQLite VFS hooks is likely easy and also doesn't require your container to run with superuser privileges.


The VFS system is great and I plan to support that as well. SQLite is simple to get started with, but it can be complicated to integrate a VFS depending on the language you're using. My goal is to make the deployment as simple as possible and also to support legacy applications.

There's some interesting work I'm looking at with the VFS and WASM where you could deploy something like this across pure serverless functions like Vercel or Deno.


It just so happens that I am working on a versioned file system for SQLite based on its VFS interface. I'm planning to release it in a week's time. It will be in a pre-alpha stage and it's more of a proof of concept.

Having worked with FUSE in the past, there is always some performance penalty, but the advantage is that there is no change in the application.

The VFS approach needs changes in the application but has less overhead. Trade-offs!


Would doing the VFS thing require participating applications to be linked to a LiteFS library?


That's a good question. I thought that was the case but I just double checked and it looks like you can load a VFS as an extension [1].

[1] https://www.sqlite.org/loadext.html


I can confirm that it does. I am working on it as I write this :).

Technically, an extension and VFS are separate things.

An extension, when loaded, allows you to register the VFS with SQLite. It can also register virtual tables.


Thanks for sharing this!

As WAL-mode SQLite on networked filesystems is a no-go, once LiteFS gets WAL support I think this will be a savior for many operators with existing SQLite deployments.

Could we even host LiteFS with the underlying DBs on a filesystem that otherwise wouldn't play well with WAL (say NFS, Ceph, or Gluster)? An effectively single-node deployment that would achieve redundancy through the FS.


That would be an amazing hack. "LiteFS makes sqlite work on NFS".


Kudos on the release! I've done almost exactly the same thing in Rust with WAL only; writes are appended to a distributed WAL which achieves consensus via Raft and distributed time leases. It all worked so neatly that I was impressed. It was for a client though, so it's unfortunately proprietary.


Thanks! That sounds like a fun project. Did you run the WAL writes through Raft or did you just use Raft for leader election?


Awesome! Glad to see it.

I was very excited for the experimental support for live replication in Litestream, but as I understand that didn't pan out. This looks to be the successor, with fanout replication an explicit feature.

Using a FUSE layer that's detecting changes is sure to have some performance tradeoffs. What benchmarks are under way? Do you need any help?


Thanks! In hindsight, I'm glad I pulled the live replication in Litestream and left it focused on disaster recovery only. A lot of what I'm doing in LiteFS would be much more difficult if I didn't have control over the write path. I'm able to perform rolling checksums to verify database integrity across replicas and that would be nearly impossible with Litestream (which runs as a separate process).

The FUSE layer does have performance trade-offs. Read queries aren't affected as much since the OS page cache is in the kernel and avoids the FUSE layer entirely. Most databases are under 10GB (or honestly probably less than 1GB) so you can fit most of that in the page cache for any moderately sized VM. The write side will incur a performance hit but I haven't taken benchmarks yet. There's a lot that can still be optimized around that.

The biggest help that I could use is just to get feedback on how it feels to use and any ways I can improve it. The end goal is to make it ridiculously simple to spin up and "just work" so any ideas towards that goal would be awesome.


I'm curious, was there a specific reason that FUSE was used instead of the VFS layer inside SQLite itself? I realize that it would mean that a custom module would have to be loaded, and maybe that was undesirable. Full disclosure: I have not read the link yet as I am a little pressed for time, but I will later tonight when I have time. If it's mentioned in the readme, feel free to ignore me :) I ask because I have been building a VFS for SQLite to handle multiple distributed writers and replicating data blocks around to different places.


It's a good question. It's mainly about reducing friction for the end user. I wrote up a related response here: https://news.ycombinator.com/item?id=32243305


Given a situation where replication is desirable, is using SQLite + LiteFS a better choice than just replicated Postgres?


Postgres is great and replicating it can work as well. One benefit to SQLite is that it's in-process, so you avoid most per-query latency and don't have to worry as much about N+1 query performance issues. From my benchmarks, I see latency from an application node to a Postgres node as high as 1 millisecond -- even if both are in the same region. SQLite, on the other hand, has per-query latency overhead of about 10-20µs.

Another goal is to simplify deployment. It's a lot easier and cost-effective to just deploy out 10 application nodes across a bunch of regions rather than having to also deploy Postgres replicas in each of those regions too.
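For a rough feel of that difference, here's a minimal Go sketch that measures per-query latency of in-process SQLite (assuming the mattn/go-sqlite3 driver; database/sql's pooling adds a little overhead of its own, and exact numbers will vary by machine):

  package main

  import (
      "database/sql"
      "fmt"
      "log"
      "time"

      _ "github.com/mattn/go-sqlite3"
  )

  func main() {
      db, err := sql.Open("sqlite3", "file:bench.db?_journal_mode=WAL")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)`); err != nil {
          log.Fatal(err)
      }
      if _, err := db.Exec(`INSERT INTO t (v) VALUES ('hello')`); err != nil {
          log.Fatal(err)
      }

      const n = 10000
      start := time.Now()
      for i := 0; i < n; i++ {
          var v string
          if err := db.QueryRow(`SELECT v FROM t WHERE id = 1`).Scan(&v); err != nil {
              log.Fatal(err)
          }
      }
      fmt.Printf("avg per-query latency: %v\n", time.Since(start)/n)
  }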


SQLite v Postgres here is apples v oranges. Postgres is multi-reader multi-writer whereas SQLite is single-writer multi-reader among other things. They are both very fine databases but solve different use cases.


From https://www.sqlite.org/isolation.html :

> Isolation And Concurrency: SQLite implements isolation and concurrency control (and atomicity) using transient journal files that appear in the same directory as the database file. There are two major "journal modes". The older "rollback mode" corresponds to using the "DELETE", "PERSIST", or "TRUNCATE" options to the journal_mode pragma. In rollback mode, changes are written directly into the database file, while simultaneously a separate rollback journal file is constructed that is able to restore the database to its original state if the transaction rolls back. Rollback mode (specifically DELETE mode, meaning that the rollback journal is deleted from disk at the conclusion of each transaction) is the current default behavior.

> Since version 3.7.0 (2010-07-21), SQLite also supports "WAL mode". In WAL mode, changes are not written to the original database file. Instead, changes go into a separate "write-ahead log" or "WAL" file. Later, after the transaction commits, those changes will be moved from the WAL file back into the original database in an operation called "checkpoint". WAL mode is enabled by running "PRAGMA journal_mode=WAL".

> In rollback mode, SQLite implements isolation by locking the database file and preventing any reads by other database connections while each write transaction is underway. Readers can be active at the beginning of a write, before any content is flushed to disk and while all changes are still held in the writer's private memory space. But before any changes are made to the database file on disk, all readers must be (temporarily) expelled in order to give the writer exclusive access to the database file. Hence, readers are prohibited from seeing incomplete transactions by virtue of being locked out of the database while the transaction is being written to disk. Only after the transaction is completely written and synced to disk and committed are the readers allowed back into the database. Hence readers never get a chance to see partially written changes.

> WAL mode permits simultaneous readers and writers. It can do this because changes do not overwrite the original database file, but rather go into the separate write-ahead log file. That means that readers can continue to read the old, original, unaltered content from the original database file at the same time that the writer is appending to the write-ahead log. In WAL mode, SQLite exhibits "snapshot isolation". When a read transaction starts, that reader continues to see an unchanging "snapshot" of the database file as it existed at the moment in time when the read transaction started. Any write transactions that commit while the read transaction is active are still invisible to the read transaction, because the reader is seeing a snapshot of database file from a prior moment in time.

> An example: Suppose there are two database connections X and Y. X starts a read transaction using BEGIN followed by one or more SELECT statements. Then Y comes along and runs an UPDATE statement to modify the database. X can subsequently do a SELECT against the records that Y modified but X will see the older unmodified entries because Y's changes are all invisible to X while X is holding a read transaction. If X wants to see the changes that Y made, then X must end its read transaction and start a new one (by running COMMIT followed by another BEGIN.)

Or:

  -- Cancel the tx, e.g. because a different dbconn thread detected updated data before the tx was to be COMMITted.
  ROLLBACK;

  -- Replay the tx
  BEGIN;
  -- replay the same SQL statements
  COMMIT;
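A minimal Go sketch of the WAL-mode snapshot behavior described above (assuming the mattn/go-sqlite3 driver; X is the reader, Y the writer):

  package main

  import (
      "context"
      "database/sql"
      "fmt"
      "log"

      _ "github.com/mattn/go-sqlite3"
  )

  func main() {
      ctx := context.Background()
      db, err := sql.Open("sqlite3", "file:demo.db?_journal_mode=WAL")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()
      db.SetMaxOpenConns(2) // make sure X and Y really are two separate connections

      if _, err := db.ExecContext(ctx, `
          CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT);
          INSERT OR REPLACE INTO t (id, v) VALUES (1, 'old');
      `); err != nil {
          log.Fatal(err)
      }

      x, _ := db.Conn(ctx) // reader
      y, _ := db.Conn(ctx) // writer
      defer x.Close()
      defer y.Close()

      // X starts a read transaction; the snapshot is taken at the first read.
      x.ExecContext(ctx, "BEGIN")
      var v string
      x.QueryRowContext(ctx, "SELECT v FROM t WHERE id = 1").Scan(&v)
      fmt.Println("X before Y's update:", v) // "old"

      // Y commits a change while X's read transaction is still open.
      if _, err := y.ExecContext(ctx, "UPDATE t SET v = 'new' WHERE id = 1"); err != nil {
          log.Fatal(err)
      }

      x.QueryRowContext(ctx, "SELECT v FROM t WHERE id = 1").Scan(&v)
      fmt.Println("X inside its snapshot:", v) // still "old"

      x.ExecContext(ctx, "COMMIT")
      x.QueryRowContext(ctx, "SELECT v FROM t WHERE id = 1").Scan(&v)
      fmt.Println("X after a new read:", v) // "new"
  }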


This is saying that SQLite allows reads and writes to happen simultaneously, but it's still single-writer. There's a WIP branch to add concurrent writes.

> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultanously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.

https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...


Shit. I was wrong. Thank you for sharing.


Does LiteFS need to do "pattern matching" to know what SQLite writes (e.g. whether it's a txn or not)? Litestream seems to simply use the SQLite API. Just curious how you think through this.


Transactions start with the creation of the journal file and end when it's unlinked so there's some "pattern matching" but it's not terribly complicated. Once the file is unlinked, LiteFS can read and verify the journal and pull the change set of pages from the database file.

Litestream interacts with the SQLite API for locking but it uses a regular file descriptor to read the WAL. It has to do similar parsing of pages but it's a bit more complicated since it needs to re-verify its position every time it reads from the WAL.
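To make that journal lifecycle concrete, here's a rough outside-the-filesystem illustration in Go using fsnotify. This is not how LiteFS itself hooks in (it intercepts these operations in its FUSE layer); it's just a way to watch the same create/unlink events from the sidelines:

  package main

  import (
      "log"
      "strings"

      "github.com/fsnotify/fsnotify"
  )

  func main() {
      watcher, err := fsnotify.NewWatcher()
      if err != nil {
          log.Fatal(err)
      }
      defer watcher.Close()

      // Watch the directory holding the database; in rollback mode SQLite
      // creates "<db>-journal" next to the database file.
      if err := watcher.Add("./data"); err != nil {
          log.Fatal(err)
      }

      for {
          select {
          case ev := <-watcher.Events:
              if !strings.HasSuffix(ev.Name, "-journal") {
                  continue
              }
              switch {
              case ev.Op&fsnotify.Create != 0:
                  log.Printf("transaction started: %s created", ev.Name)
              case ev.Op&fsnotify.Remove != 0:
                  // In DELETE journal mode the unlink marks the commit point;
                  // this is where LiteFS reads and verifies the changed pages.
                  log.Printf("transaction finished: %s unlinked", ev.Name)
              }
          case err := <-watcher.Errors:
              log.Println("watch error:", err)
          }
      }
  }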


I am under the impression that if you are in WAL mode, no rollback journal is created?


Yes, there is no rollback journal in WAL mode, but the WAL is its own file.
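Easy to see for yourself; a small Go sketch (again assuming the mattn/go-sqlite3 driver):

  package main

  import (
      "database/sql"
      "fmt"
      "log"
      "os"

      _ "github.com/mattn/go-sqlite3"
  )

  func exists(path string) bool {
      _, err := os.Stat(path)
      return err == nil
  }

  func main() {
      db, err := sql.Open("sqlite3", "file:wal-demo.db?_journal_mode=WAL")
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      // Any write forces the WAL (and shared-memory) file into existence.
      if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (x INTEGER)`); err != nil {
          log.Fatal(err)
      }

      fmt.Println("wal-demo.db-wal exists:    ", exists("wal-demo.db-wal"))     // true
      fmt.Println("wal-demo.db-journal exists:", exists("wal-demo.db-journal")) // false
  }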


A while ago, when I discovered Litestream, I played around with it and just set up cloning to a local file://; is this the same idea?


Sorta. LiteFS sets up a FUSE mount so once you have that in place then you can create a database within that mount directory and it is automatically replicated.


Can you describe one or more use cases?


One main advantage of LiteFS is being able to deploy a SQLite-based application and not have downtime during deploys. This is especially true in ephemeral systems like Kubernetes where pods will roll over.

Another use case is moving data close to users. If you're only targeting users in a single country then it's not as big of a deal but RTT from the US to Europe is ~100ms and US to Asia is ~250ms. That's a big latency hit depending on what you're trying to do.


It might be interesting for embedded scenarios where you need an always-available database but can't guarantee there will be a connection available consistently to a central database server.


I have an adjacent problem, and I haven't been able to find anyone who has a fix for me.

One perfectly reasonable use case for a read replica of a database is a bastion server. Database + web server on a machine that is firewalled both from the internet and from the business network. With read only access there is a much smaller blast radius if someone manages to compromise the machine.

The problem is that every single replication implementation I've seen expects the replicas to phone home to the master copy, not for the master copy to know of the replicas and stream updates to them. This means that your bastion machine needs to be able to reach into your LAN, which defeats half the point.

The most important question is "what options exist to support this?", but I think the bigger question is why we treat replicas as if they are full peers of the system of record when so often they are not (mechanically or philosophically), and in some cases couldn't be even if we wanted to (e.g., a database without multi-master support).


That's an interesting idea. I had plans to introduce "candidates" [1] (e.g. nodes that could become the primary) but I like the idea of reversing the communication and connecting from primary to replica. I added an issue [2] to the LiteFS project to track it. Thanks!

[1] https://github.com/superfly/litefs/issues/16

[2] https://github.com/superfly/litefs/issues/24


> The problem is that every single replication implementation I've seen expects the replicas to phone home to the master copy, not for the master copy to know of the replicas and stream updates to them. This means that your bastion machine needs to be able to reach into your LAN, which defeats half the point.

You can set up a PostgreSQL replica to be driven purely off of archive logs. It does not need direct access to the source database, as it can pull the archive files with read-only access from a third location (e.g. a file server or S3) that the source database server pushes to. The catch is that it will only be updated when a WAL file is pushed, which can be driven either by size (automatically on an "active" database) or time (every N seconds or minutes). If you're fine with potentially being a minute behind the source, you can easily set this up.


Look into reverse SSH tunnelling. SSH from primary to secondary, which then connects back to the primary through the already-established SSH connection.


That's the closest I've been able to come up with, but it does have the problem that anything local can typically connect to that tunnel. In the bastion situation we generally can't assume that the machine isn't compromised. Otherwise why did we put it outside of the firewall?

To be fair, there are a number of ways a hostile endpoint can screw with another server even just by screwing around with TCP protocol behavior, so perhaps I'm putting too fine a point on it.


> a database without multi-master support

I believe Cassandra does not have a master/follower architecture; it follows a ring-based structure.


Yeah, Cassandra uses a Distributed Hash Table with Consistent Hashing.

When a new node is added, it takes responsibility for a segment of the ring. When a node is removed, its segment gets redistributed.

However, it's in no way a drop-in replacement for an RDBMS and requires careful planning around application read and write patterns, which is very different from your typical RDBMS. It definitely can't be used in this scenario - every node needs to be able to access every other node, and the client must be able to access at least one node.
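For anyone unfamiliar, the ring idea boils down to consistent hashing. A toy Go sketch (no virtual nodes or replication factor, so just the core idea):

  package main

  import (
      "fmt"
      "hash/fnv"
      "sort"
  )

  // Ring is a toy consistent-hash ring: each node owns the arc of hash space
  // between its predecessor's position and its own.
  type Ring struct {
      points []uint32
      nodes  map[uint32]string
  }

  func hashKey(s string) uint32 {
      h := fnv.New32a()
      h.Write([]byte(s))
      return h.Sum32()
  }

  func (r *Ring) Add(node string) {
      p := hashKey(node)
      r.points = append(r.points, p)
      sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
      if r.nodes == nil {
          r.nodes = map[uint32]string{}
      }
      r.nodes[p] = node
  }

  func (r *Ring) Remove(node string) {
      p := hashKey(node)
      delete(r.nodes, p)
      for i, v := range r.points {
          if v == p {
              r.points = append(r.points[:i], r.points[i+1:]...)
              break
          }
      }
      // Keys that hashed to the removed segment now fall through to the next
      // node on the ring -- that's the redistribution on node removal.
  }

  func (r *Ring) Owner(key string) string {
      k := hashKey(key)
      i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
      if i == len(r.points) {
          i = 0 // wrap around the ring
      }
      return r.nodes[r.points[i]]
  }

  func main() {
      var r Ring
      r.Add("node-a")
      r.Add("node-b")
      r.Add("node-c")
      fmt.Println("user:42 ->", r.Owner("user:42"))
      r.Remove("node-b")
      fmt.Println("user:42 ->", r.Owner("user:42")) // may move to the next node
  }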


Try Litestream; I think it's a push system.


> LiteFS is intended to provide easy, live, asynchronous replication across ephemeral nodes in a cluster. This approach makes trade-offs as compared with simpler disaster recovery tools such as Litestream and more complex but strongly-consistent tools such as rqlite.

I think rqlite having a single binary that handles Raft / consensus _and_ includes SQLite makes it simpler. Beyond 'hello world', Consul isn't trivial to run, and Fly have blogged about this [0].

0. https://fly.io/blog/a-foolish-consistency/


rqlite is very cool, but it's also much more ambitious than LiteFS; it's Raft consistency for every instance of the database, where Litestream/LiteFS is replication (for single-writer multi-reader setups, where reads are answered quickly from edges or read caches, and writes are funneled to a central node --- with LiteFS, an elected central node). Raft is, of course, more powerful, but it's also its own whole thing to manage and monitor.

The advantage of LiteFS/Litestream is that, for the most part, the database is "just" SQLite. You can't really say that, to the same extent, about rqlite.

I hope rqlite takes off! It's a good project.

We've spent a lot of time at Fly.io wrestling with Consul, but that's because we abuse it. That's what the article is about: we shoehorned Consul into a part of our architecture where we're taxed for features it has that we don't actually use (the overwhelming majority of all the data we have in Consul is stuff for which there's a single, continually available source of truth for the data, and Consul was just a "convenient" way to replicate it). Consul is great for the stuff it's meant for.

I wouldn't hesitate to reach for Consul in a new design. I just wouldn't use it for the thing we used it for.


And I /just/ got my infrastructure bits and pieces running Litestream! Guess I'll have to figure out if it's worth switching to this -- my gut reaction is no, since I only really run one pod at a time, so Litestream serves the purpose of not only saving the database offsite but also restoring it. But I will be keeping a very close eye on this thanks in part to my love of SQLite.

Hats off to Ben and Fly.io, you're doing some cool stuff.


Thanks! Yeah, if you don't need multiple replicas then Litestream should work just fine. I'd say stick to that for now.


Having been a happy user developing solutions around SQLite for a good amount of time, I find all these "enterprisy" hacks / extensions curious.

There are great solutions out there that handle these things and have for a long time.

I know SQLite has become the new hotness, but I really do not want SQLite to get good at all these things because then it would no longer be great at what it does marvelously already.


This is about as non-enterprisey as it gets. It's built to make sqlite work better for tiny little node.js apps running on very cheap hosting.


From the first lines of the description of the project on GitHub:

> """LiteFS is a FUSE-based file system for replicating SQLite databases across a cluster of machines."""

> """Leader election: currently implemented by Consul using sessions """

That sounds enterprisy to me.


I can see Consul feeling enterprise-y. If you want to distribute SQLite without downtime then you need to move the primary around and that requires leader election. We have Consul already running at Fly.io but I feel like a lot of folks have either Consul or etcd running for some part of their system out there so the idea was to be able to attach onto that.
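For anyone who hasn't used it, session-based leader election in Consul boils down to a TTL'd session plus a lock acquire on a key. A rough Go sketch using the official consul/api client; the key name and TTL are made up, and this is not LiteFS's actual code:

  package main

  import (
      "log"
      "os"

      "github.com/hashicorp/consul/api"
  )

  func main() {
      client, err := api.NewClient(api.DefaultConfig())
      if err != nil {
          log.Fatal(err)
      }

      // A session with a TTL; if this node dies and stops renewing,
      // the session expires and the lock is released for another candidate.
      sessionID, _, err := client.Session().Create(&api.SessionEntry{
          Name:     "primary-election-demo",
          TTL:      "15s",
          Behavior: api.SessionBehaviorDelete,
      }, nil)
      if err != nil {
          log.Fatal(err)
      }
      go client.Session().RenewPeriodic("15s", sessionID, nil, nil)

      hostname, _ := os.Hostname()
      acquired, _, err := client.KV().Acquire(&api.KVPair{
          Key:     "demo/primary", // hypothetical key, not LiteFS's actual layout
          Value:   []byte(hostname),
          Session: sessionID,
      }, nil)
      if err != nil {
          log.Fatal(err)
      }
      if acquired {
          log.Println("this node is the primary")
      } else {
          log.Println("another node holds the primary lock; acting as a replica")
      }
  }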


Db sharding and replication is a fascinating subject and a matter of deep interest to me.

Ben and Matt - appreciate your contributions in this area. I'm interested in making contributions along with you. Please let me know if you are looking for help. Much Thanks.


Thanks! I think sharding is really interesting -- especially with a lightweight database like SQLite. I'm not looking for contributions right now but I would love to hear any feedback on the approach taken with LiteFS. I want to make it as easy to run as possible.


I'll give LiteFS a run soon. SQLite is cross-platform, but it seems like LiteFS is not. True?

https://github.com/superfly/litefs


Yes, that's correct. It's Linux only right now although I'll probably support macOS too (via MacFUSE).


Tangentially related: I'd like to use litestream but my SQLite files are several gigabytes, is there a way to lazily download the db only once it's being accessed? (using something like userfaultfd maybe? just thinking out loud)


Verneuil (https://github.com/backtrace-labs/verneuil) offers that for read replicas, because Backtrace also has some multi-GB SQLite DBs. It's a VFS (loadable at runtime as a SQLite extension), instead of a filesystem.

I don't remember if that's in the docs; commit that adds the configuration field https://github.com/backtrace-labs/verneuil/commit/027318ba74... and commit for the enum https://github.com/backtrace-labs/verneuil/blob/e6697498f3ba...

(the actual implementation is nothing special https://github.com/backtrace-labs/verneuil/commit/b6bdfcf7bc...)

Issue for the feature: https://github.com/backtrace-labs/verneuil/issues/12


I'm not sure how that could be implemented with Litestream since it runs as a separate process. It could be possible with LiteFS where it just pages in data on the fly. That's on the roadmap but it's probably still a ways out.


> is there a way to lazily download the db only once it's being accessed?

Not sure what your full scenario is, but since you mentioned "lazily download", you might have luck here: https://news.ycombinator.com/item?id=27016630

tl;dr: using HTTP range requests and a smaller page size might be the way to go.
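The core trick in that approach is just ordinary HTTP range reads at page granularity. A minimal Go sketch (the URL is a placeholder and the page size must match the hosted database's PRAGMA page_size):

  package main

  import (
      "fmt"
      "io"
      "log"
      "net/http"
  )

  func main() {
      const (
          url      = "https://example.com/data.db" // placeholder URL
          pageSize = 4096                          // must match the hosted db's page size
          pageNo   = 1                             // SQLite pages are numbered from 1
      )

      // Fetch exactly one database page with a Range request.
      start := int64(pageNo-1) * pageSize
      req, err := http.NewRequest("GET", url, nil)
      if err != nil {
          log.Fatal(err)
      }
      req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, start+pageSize-1))

      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          log.Fatal(err)
      }
      defer resp.Body.Close()

      page, err := io.ReadAll(resp.Body)
      if err != nil {
          log.Fatal(err)
      }
      // Page 1 begins with the 100-byte database header, whose first
      // 16 bytes are the magic string "SQLite format 3\x00".
      fmt.Printf("status=%s bytes=%d\n", resp.Status, len(page))
  }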


Very cool, you should add raft so every node could be the primary if the primary fails. You just need to add election and a few minor state things on top of what’s there already, I think.


LiteFS, a tool for edge-case data loss, unintended service hangs and data corruption /s

this kind of scheme was explored and naturally selected away in the pre-Internet networking era (whose applied knowledge was mostly lost and unavailable to newer generations after the boomer purge that began in the 2000 market crash) .. this kind of scheme should always be isolated to tightly coupled machines operating over fault tolerant and RELATIVELY high-speed links (like a cluster of boards interconnected via PCIe, fibre or Thunderbolt, each with [controller] ECC memory) not between WAN zones.

But in a tightly-coupled cluster-type environment, better solutions would exist like Redis Cluster (and RediSQL) that could be further upgraded with some kind of shared pagefile.

But for testing near-obsolete non-serverless cloud code against edge cases, consider building a testbed of 4 or 5 Raspberry Pis talking to each other over (really) slow LoRa links... then amp up the RFI, thermal, and vibration procedures and add all the code to maintain data integrity as you monitor for the various failure scenarios.


Looks pretty neat, and I'm a big fan of Litestream, but I can't help but feel that requiring a 400k LOC election mechanism (2M if you include deps) for your 180k LOC database is slightly excessive.



