Notes from the Architect (2016) (varnish-cache.org)
95 points by genericlemon24 on Aug 4, 2021 | 51 comments



See also: a reply from antirez (of Redis fame) - http://oldblog.antirez.com/post/what-is-wrong-with-2006-prog...


Don't overlook PHK's replies in the comments there


This. Note that PHK's comments have proven entirely correct, and Redis had to implement at least a limited form of multithreading.

I would be very careful about Antirez's older advice on this topic area. At one point he was designing his own "VM" algorithm that would work at 256 byte granularity, an idea that was never viable.


I believe that what happened in the last 20 years actually shows that multiplexing, for very short-lived requests, won: multithreading is very often used, at this point, only to scale the single multiplexed approach to multiple cores. On-disk databases still use one-thread-per-query models, but hardly anything that serves small objects uses the operating system's paging as a cache, including on-disk stores, which have their own paging mechanism and never trust the OS to do it.

The 256 byte thing was an old story, pre-memcached, that I recounted at some point, and it has nothing to do with Redis. I implemented an on-disk cache for slow SQL queries: since the filesystem had issues with inodes, I created 256 directories, each nested with 256 more (so 256*256), in order to lower the number of files in every directory. This played a lot better with the filesystems of 20 years ago.

Btw, the main point here is: nobody is using, at this point, the OS paging as a caching layer in databases; that was my point back then, and I don't see how it is not true. And, to serve small objects, multiplexing won, and threading serves as a way to scale multiplexing itself (see memcached, for instance).

IMHO it is more interesting to provide arguments than to attack individuals you may not like :)


It's not only the "VM" experiments but generally the very basic approach to persistence that's kept Redis from being an actual database and has most users (hopefully) only using it as a cache.

Around the same period as this blog post antirez had also announced that he had written a B-Tree that could be potentially used to store Redis data. The code is still up: https://github.com/antirez/otree – needless to say serious storage engines like LevelDB or RocksDB are more than a single 1,000-line C file, while this looked more like a weekend project.

The persistence model that has stayed in Redis is also a pretty basic approach: fork + dump all memory to a single file. Write every command to a file and compact it by doing the same fork + scan/flush. None of this is efficient, and with tens of GB of cache you get long pauses during fork() as the page table gets copied to the child process (yes, just the table, not the memory pages themselves).
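
Roughly, the fork-and-dump idea looks like this (a bare sketch, not Redis's actual code; the dataset and dump file are placeholders). The child writes out a copy-on-write snapshot while the parent keeps serving, but the fork itself still copies the whole page table:

    /* Sketch only: snapshotting in-memory state by forking, so the child
     * writes a copy-on-write view of memory to disk while the parent keeps
     * serving. The dataset and dump file below are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char  *dataset;                        /* pretend in-memory DB */
    static size_t dataset_len = 64 * 1024 * 1024; /* 64 MiB */

    int main(void)
    {
        dataset = malloc(dataset_len);
        if (!dataset) return 1;
        memset(dataset, 0, dataset_len);

        pid_t pid = fork();            /* copies the page table, CoW pages */
        if (pid == 0) {
            /* Child: sees a frozen snapshot; the parent's later writes
             * are not visible here. */
            FILE *f = fopen("dump.rdb", "wb");     /* hypothetical file */
            if (f) {
                fwrite(dataset, 1, dataset_len, f);
                fclose(f);
            }
            _exit(0);
        }

        /* Parent: keeps serving; any page it touches gets duplicated by
         * the kernel, which is where the extra memory and latency go. */
        waitpid(pid, NULL, 0);
        free(dataset);
        return 0;
    }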

All this to say that his advice on disk I/O is to be taken in the context of his own experience in this domain.


Yeah, I completely agree. Redis makes a great cache with richer operations, but less efficiency than memcached. Using it as a system of record in any context will come back to haunt you in the worst way.

I first ran into the more minimal append-only log persistence pattern with the JVM project Prevayler. It's not a totally unreasonable idea, but it has to be implemented carefully.

The fork-as-copy-on-write mechanism initially seems clever, but in practice it is extremely brittle. There's a reason classic databases have a lot of sophisticated logic around checkpointing and truncating the write-ahead log. It's not a problem that can be oversimplified.

I wouldn't be totally down on minimal btree implementations, but 1,000 LOC is pushing it. Even LMDB, as minimalist as it is, is around 10k as I recall.


I've noticed this sort of thing from Antirez before, and it makes his comments seem dubious unless they come from his specific area of specialization.

He made some claims about Lua here some time ago, and if you read his code, it was clear he didn't know how to use its C API and was criticizing it over something he could have just read the manual for.

Software is hard, though, and there’s always someone who is more right, so, oh well.

Maybe a good strategy is to openly ask others if there are better solutions the reader knows of and to share them.


Yeah, I don't want to be overly negative, but Not Invented Here syndrome is a recurring theme with his work. Unfortunately that's quite prevalent among software developers as a whole. It's somewhat understandable: it's easier to just run with your imagination and start hacking something together than dig through research papers, textbooks, and blog posts.


My claim about the Lua C API was that, being a stack-based API, it is hard to use. It's a tradeoff between simplicity and usability from C. Please show me the code I wrote that shows how my ideas about the lack of usability of the Lua C interface are due to me not reading the documentation.

Note: interpreters are not entirely out of my specialization: I wrote a Tcl interpreter that is still in use in embedded systems, participated in the Tcl community for a long time, reimplemented and extended the Joy concatenative language, and so forth.



> Well, today computers really only have one kind of storage, and it is usually some sort of disk, the operating system and the virtual memory management hardware has converted the RAM to a cache for the disk storage.

> So what happens with squids elaborate memory management is that it gets into fights with the kernels elaborate memory management, and like any civil war, that never gets anything done.

Ctrl+F "mmap" on this article doesn't return anything, so... is it saying it's better to mmap files and have the files store in-memory data structures, like Cap'n Proto does?

(If it's entirely in-memory with no file on disk, then it's surely comparing apples to oranges? If a program needs persistent data, it needs the disk.)

Also: is mmaping a file and using it directly from memory what databases should do today to avoid fighting the OS disk cache?


Mmaping directly readable data is certainly a way of avoiding the problems PHK is talking about, and can be reasonably efficient. Sendfile can be even better if you can structure things to have a metadata index that's in memory or mmaped, and never need to read the actual sent data.

But mmap is no panacea. Its big flaw is that the kernel often doesn't know as much as your app about what you're doing. So when you touch an out-of-core page and it page faults, your thread just gets locked up. Worse, the kernel's read-ahead logic may make the wrong guess about how many pages you're going to read, or about when you want data written back out. You can try to mitigate this using madvise and msync, but in practice it's kinda brittle.
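
For illustration, the hint calls look roughly like this (a sketch, assuming some existing file "data.bin"; madvise/msync are advisory only, which is exactly why this is brittle):

    /* Sketch only: hinting the kernel about access patterns on a
     * file-backed mapping. "data.bin" is a placeholder; madvise() and
     * msync() are requests the kernel is free to handle as it likes. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) < 0) return 1;

        void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        /* Expecting sequential access: tunes the kernel's read-ahead. */
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        /* Ask for the range to be faulted in ahead of time. */
        madvise(p, st.st_size, MADV_WILLNEED);

        /* Request asynchronous write-back of dirty pages in the range. */
        msync(p, st.st_size, MS_ASYNC);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }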

Some systems use blocking io with a dedicated thread pool to wait on the block. This way the main threads can move on to other useful work rather than sitting stalled on a page fault. Golang's runtime does something like this for you automatically.

Databases for the most part do their own scheduling, and may even use async I/O. This is because they know far better what they're actually doing than the kernel can guess. Another reason is that the database's buffer cache is usually tightly coupled with the consistency control algorithm, so again, you don't want the kernel guessing, you want total control. Databases often use direct I/O and take on the burden of scheduling everything themselves precisely for this reason.
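
As a rough illustration (not any particular database's code), direct I/O pushes the alignment and caching burden onto the application; "table.dat" and the 4 KiB block size here are assumptions:

    /* Sketch only: a direct-I/O read that bypasses the kernel page cache.
     * O_DIRECT generally requires buffer, offset and length to be aligned
     * to the logical block size; 4096 here is an assumption. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;
        const size_t len   = 16 * 4096;

        int fd = open("table.dat", O_RDONLY | O_DIRECT); /* placeholder file */
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, align, len) != 0) return 1;

        /* The application, not the kernel, now decides what to cache,
         * when to prefetch, and when to evict. */
        ssize_t n = pread(fd, buf, len, 0);
        if (n < 0) perror("pread");

        free(buf);
        close(fd);
        return 0;
    }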

LMDB uses mmap, but should not really be considered a database. It's just an embedded CoW btree, though a fine one.

Andy Pavlo's openly published course materials are a great resource for learning about database internals. He used to be emphatically against mmap for databases, but in the last couple of years has moderated that position somewhat.

io_uring is looking like it will be the best overall solution moving forward, at least on Linux.


LMDB is an embedded database engine, the same as Berkeley DB, gdbm, ndbm, etc. (except that gdbm/ndbm/etc. don't support multithreading, ACID transactions, and various other things that are essential to DB operation). You use them to build other data management systems. Thus, there was a Berkeley DB backend for MySQL, just as there was a Berkeley DB backend for OpenLDAP. As storage engines go though, LMDB outclasses all the others for compactness, robustness, and efficiency.

The eternal debate over app-level caching is always based on the myth that the DB application knows better about the data access patterns than the underlying kernel ever can. This may be true for some data structures, but is irrelevant in the modern computing landscape. The fact is that applications today are running on multi-tenant VMs/containers/etc and only the hypervisor and kernels have a comprehensive view of the resource demands of the entire system. Back in the days when Oracle servers ran on dedicated hardware, it was practical for the DBMS to take all decision making away from the OS, but that is not how computing is done today.

Today your apps are sharing resources with countless other unknown apps, and any cache/memory optimizations your app may attempt are futile and instantly nullified by the actions of other unrelated apps in other unrelated VMs/containers.

Also, as I mentioned above, the data structures make a difference. LMDB relies solely on OS page caches because only the underlying virtual memory manager really knows what's going on in the system, and also, B+trees are inherently the most cache friendly. Your app-level data access patterns just aren't going to outperform them. The tree structure inherently optimizes for LRU cache mgmt; the tree root is always hit on every access, so it is always most-often-and-recently-used. As accesses navigate down the tree to the leaf nodes, each node is successively less frequently used. The net result is that the root and interior branches naturally stay hot in a cache, and the leaf nodes naturally age out fastest, resulting in optimal cache hit rates for any access pattern.

In a modern multi-tenant environment there is no scenario where an app has enough knowledge to make optimal resource mgmt decisions and outsmart the underlying OS.


Thanks for the explanation.

> io_uring is looking like it will be the best overall solution moving forward, at least on Linux.

But io_uring means duplicating memory, right? There will be both my in-memory representation and the kernel cache for the file, and they will be fighting each other (or at least consuming more memory than necessary).


You can use io_uring with O_DIRECT, with the usual alignment restrictions. io_uring itself is carefully designed to eliminate copying of even the request/completion event structures.
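
A minimal sketch of that combination using liburing (assumed installed; "table.dat" is a placeholder file): one aligned read is submitted to the ring, the page cache is bypassed via O_DIRECT, and the completion is reaped in place:

    /* Sketch only, assuming liburing is available: one O_DIRECT read is
     * submitted through the ring and its completion reaped in place, so
     * neither the data nor the request/completion structures are copied
     * through intermediate buffers. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

        int fd = open("table.dat", O_RDONLY | O_DIRECT);
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1; /* O_DIRECT alignment */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);  /* 4 KiB at offset 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }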


Thank you for the pointer, this seems very interesting. Do you know of an example of open source code using io_uring and O_DIRECT to bypass the fs cache? What about posix_fadvise?

I'm reading here https://unix.stackexchange.com/questions/6467/use-of-o-direc... that some people don't think it's a good idea to bypass the fs cache like that. Perhaps mmap could have been salvageable if it had a better error handling API.


I don't know an example of that sadly.

O_DIRECT is definitely doing things the hard way. There's a very real risk you won't be as smart as the kernel, even knowing your app's patterns well. Linus has famously ranted against this for a long time. But as mentioned above, with databases, taking on the complexity is often warranted.

I've been tinkering off and on with a database design, and one of the things I want to explore is a hybrid approach where I still have an on-heap buffer cache, but misses to that cache get filled out of mmap. The buffer cache is a lock-free hash table, greatly simplified by using copy-on-write pages as a fundamental part of the design (you don't have to worry about racing updates or duplication races during resizing, other than as performance degradations). It's still not finished and totally untested, but the way I imagine it working is that the buffer cache is smaller than what you'd typically reserve for a database, and the OS can manage its page cache as a sort of L2 cache.


Thanks for sharing those ideas. This sounds interesting. Is it on github?


Nope but I'm trying to herd my own cats to get a prototype shipped and up. It'll get shared here when I do for sure.


Nice to hear, I hope it does well!


Yes, I understood that to mean mmap().

> is mmaping a file and using it directly from memory what databases should do today

Some do; for example, SQLite: https://www.sqlite.org/mmap.html (the page has pros and cons, and explains how SQLite does memory-mapped I/O).
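
For reference, turning that on from C is a one-line pragma per connection (sketch only; "example.db" and the 256 MiB limit are arbitrary):

    /* Sketch only: enabling SQLite's optional memory-mapped I/O from C.
     * "example.db" and the 256 MiB limit are arbitrary; a limit of 0
     * (the usual default) disables mmap entirely. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;

        char *err = NULL;
        if (sqlite3_exec(db, "PRAGMA mmap_size=268435456;",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "pragma failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }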


I thought the fact that mongodb data files were essentially mmapped was used as a critique against its durability as a primary database. Is that not true?

Or was that not a relevant critique?


A database like mongodb is a very different animal from an HTTP cache. The mongodb authors likely read this article (it's from 2006) when they were designing the first version of their database, and due to their almost complete unfamiliarity with what they were doing, didn't understand that these two things are not alike.

Databases have different classes of objects (think index vs. data, though there are more than just these two), different types of access (e.g. full table scan vs. single record select) and a whole lot more complexity than I can cover here. You don't really ever want to page out your indices, but paging out data is probably fine. You likely want to avoid replacing your entire cache for a full table scan of a big table. You probably want to have some concept of ensuring a single user doesn't starve out all your other 1000+ users. Etc etc.

In other words, you can't get away with just opting out of memory management for a database - you need to treat different things differently. I understand that a few years in, the mongo devs eventually caught up to 1990s era mysql and realised this, leading to a new DB engine that was a bit less bad.


Welp, I don't know!

From the link in the other comment here, SQLite disables mmap by default because Linux provides a poor API for I/O errors (they are delivered as a signal instead).


It's a valid critique, but a cache has no durability requirements.


MongoDB released the WiredTiger storage engine in 2015, which replaced the MMAP storage engine. So whatever you read about MMAP is obsolete.

https://www.mongodb.com/presentations/a-technical-introducti...


Yep I know that. I’m saying mmap used to be a critique.

I left the place I introduced mongo to, a couple of months before they bought WiredTiger.


So I guess a bigger question is, how do you reconcile old thinking with new?

You can’t really program everything to use big blocks of virtual memory that are file backed. And you can’t program everything to be cache oblivious.

So is the only real solution to implement naive solutions until they’re slow and test them until they’re not?

That would be sort of a sad way to write software, but perhaps it is the only true way, since it’s the most widely applied and pragmatic practice.

Further, if all of this really is true, which I’m sure it is, then APIs and languages have not caught up, not developers.

Because we’re all still writing software with what’s provided to us, and last time I checked, we didn’t have access to black box caching implementation details, or direct access to L1, L2, L3, etc.


Ohh, one of the hard questions :-)

Personally, I think the cache-oblivious thing was oversold: all the algorithms I have seen are horribly complex and thus only an improvement if your N is so big as to overshadow the constant terms in O(...) that we usually ignore. Patenting them, and expecting to become stinking filthy rich on the licensing fees, was another good way to prevent them from being used.

I think the best we can do is probably still to make sure we know what we are trying to do, and what we are not going to do, and then architect accordingly.

That is of course really just a fancy way of saying "it depends", but I'm OK with that: Magic silver bullets are not a thing.



This post does seem to pop up here every other year, doesn't it?


He also has a post on why Varnish doesn't support SSL/TLS:

https://varnish-cache.org/docs/4.0/phk/ssl.html

Which is somewhat interesting now, as Varnish has lost a lot of ground to other caching proxies that did choose to implement SSL/TLS.


On the other hand: count the CVEs.

An increasingly popular high-rel setup has two different SSL/TLS handlers in front of Varnish, each using a different SSL/TLS implementation.

That way a "ohh shit CVE" against either of those two implementations allow you to turn those of, and keep your site running.

If we bolted any particular TLS/SSL implementation into Varnish, you'd be down when that one got hit.


Yes, I'm not arguing that he's (edit: you're) wrong. Just that the decision seems to be causing people to choose other solutions, I suspect because managing one thing at least seems easier than two.


"he" in this case being me :-)

I think the threshold question will always be "Does software X make my life better?" and if it does not, you should ditch it if you can.

There is always a huge bias in reporting: People are eager to tell you why they started using your software, but they always forget the "exit-interview" when they drop it again.

The reason I hear most often for people dropping Varnish is that they have cleaned up the mess of legacy web-services, or at least transitioned it all to the New Fantastic Platform.

Other people drop Varnish for other versions of "this is now surplus to requirements" and I am totally fine with that: I don't want people to run Varnish if it doesn't make their life better.


How are you measuring this lost ground? Varnish is still extremely popular, and designs which separate TLS from caching are also still popular.


Mostly anecdotal, since there's no real definitive way to tell.

There are two spaces I see.

- Caching within the webserver, as the caching there has improved over time. The caches in Nginx, Apache, LiteSpeed, etc, perform much better than they did in the past.

- Caching at a load balancer tier. Nginx again (though in a LB context), Haproxy, etc.

Both spaces seem to have less talk about Varnish than they used to, and more about other platforms.

You can see a plateau, then a (slight) decline, starting for Varnish around mid-2018, here:

https://trends.builtwith.com/Web-Server/Varnish

Compare to:

LiteSpeed: https://trends.builtwith.com/Web-Server/LiteSpeed

Nginx: https://trends.builtwith.com/Web-Server/nginx

(Though these kinds of surveys are tricky, as they depend on outward facing headers that don't always exist, or don't always tell you enough...like Nginx doesn't always imply the cache is in use)

I also wonder what percentage hit Varnish might take if Fastly moves away. I'm sure what they regret is not Varnish specifically, but exposing Varnish VCL directly to end users.

Edit: Yes, I agree that sites like "builtwith" are flawed, and mentioned some of the reasons above. And, I didn't mean for this comment to sound like a criticism...just an observation. I noticed builtwith's chart has a similar plateau + slight decline for Apache, starting also near mid-2018.


Varnish has a pretty big market-share as "the intelligent HTTP-router" which can be used to sort traffic to piles of legacy webservers etc, and also be the central clearing-house for detecting and fixing trouble.

Surveys such as builtwith, despite their hard work, can often not "see through the sandwich" and spot if there is a Varnish in it.

Also, I don't know about you, but from a "I want the world to keep working" reliability point of view, I do not like it when any single piece of software, FOSS or non-FOSS, becomes too dominant.

See for instance how dysfunctional the GCC-monopoly was until LLVM gave them competition.

So taking builtwith at their numbers, I'm actually fine with Varnish having "stagnated" at a market share of one fifth of the world's top 10K websites: that keeps me humble about my code quality, but does not keep me awake at night.


> Though these kinds of surveys are tricky, as they depend on outward facing headers that don't always exist

It's not just “don't always exist”: those headers are actively recommended against by various security guidelines, so with many large sites you can only infer what they're using from other characteristics.

This is also the kind of environment where I see some movement against Varnish: internal TLS requirements increase the cost of managing two services instead of one, and if you're increasingly using something like an external CDN the level of benefit from Varnish's cache declines somewhat even though the powerful request routing and manipulation features are still appealing.

I've been generally wondering what it would take to be able to flip the model to something like Cloudflare's Argo Tunnel feature, where you could secure internal communications by having your various web services make an _outbound_ connection to the Varnish box over which all of the requests would be tunneled, so you only need to manage one certificate there rather than one for every service/container in a complex application.


Varnish does support SSL officially in their enterprise offering: https://docs.varnish-software.com/varnish-cache-plus/feature...


Would be nice if the headline included the year this was written. (2005? 2006?)


Does anyone else think the MP3 player joke is kind of weird?


Yes - this was written in 2006 and that joke really hasn't aged well at all.

Dang maybe worth sticking a (2006) on the end of this?


Eh. The article itself is good enough for me to ignore one silly penile joke that almost doesn't even make sense.


> Varnish allocate some virtual memory, it tells the operating system to back this memory with space from a disk file.

What's the special instruction to do this? Is this a low level C / syscall thing or do languages like Go have a disk-backed map implementation already?


I guess they are referring to mmap (https://man7.org/linux/man-pages/man2/mmap.2.html). In Go you can use the mmap syscall directly (low level, not supported on all targets/OSs). I believe there are a few libraries that offer wrappers around the syscall (or emulate it if it's not available).


This sounds like a backwards description of mmap. This is probably what they're using on the backend. I'm not sure if Windows has this as a feature, but any unixy system will.
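
A minimal sketch of the idea (not Varnish's actual code; the file name and size are placeholders): ask the OS to back a region of virtual memory with a file, then treat it as ordinary memory and let the kernel do the paging:

    /* Sketch only: backing a region of virtual memory with a file.
     * "storage.bin" and the 64 MiB size are placeholders; the kernel
     * pages the region in and out of the file as it sees fit. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t size = 64 * 1024 * 1024;

        int fd = open("storage.bin", O_RDWR | O_CREAT, 0600);
        if (fd < 0) return 1;
        if (ftruncate(fd, size) < 0) return 1;  /* size the backing file */

        void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) return 1;

        /* Use it like ordinary memory; write-back is the kernel's problem. */
        memcpy(mem, "hello", 5);

        munmap(mem, size);
        close(fd);
        return 0;
    }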


mmap on Windows is CreateFileMapping + MapViewOfFile


I find it mildly amusing that the author spends the first half of the article mocking squid for handling disk cache manually when varnish just hands the work to the OS instead, and then the second half of the article explaining how and why varnish does all its own memory management for processing requests instead of just letting the OS handle it.


No, the first half explains why varnish lets the OS manage virtual memory, then explains why varnish doesn't rely on the user space memory allocator that comes with the language. Two completely different layers and different things.


This



