
> None of the things people care about really get much out of "unified memory". GPUs need a lot of memory bandwidth, but CPUs generally don't and it's rare to find something which is memory bandwidth bound on a CPU that doesn't run better on a GPU to begin with. Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.

the bottleneck in lots of database workloads is memory bandwidth. for example, hash join performance with a build side table that doesn't fit in L2 cache. if you analyze this workload with perf, assuming you have a well written hash join implementation, you will see something like 0.1 instructions per cycle, and the memory bandwidth will be completely maxed out.
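to make the access pattern concrete, here's a toy hash join sketch in python (names and data are illustrative, not from any real engine; real implementations do this in C++ over columnar data). the point is the probe phase: every lookup is a data-dependent random access into the build table, and once that table outgrows L2, nearly every probe is a cache miss waiting on DRAM.

```python
# Toy hash join sketch: build a hash table on the smaller input, then
# probe it with the larger one. The probe loop's random accesses into a
# build table bigger than cache are what saturate memory bandwidth.
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # build phase: key -> list of matching build-side rows
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # probe phase: each lookup is a data-dependent random access
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            out.append({**match, **row})
    return out

custs = [{"cust_id": 10, "name": "a"}, {"cust_id": 30, "name": "b"}]
orders = [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 20}]
result = hash_join(custs, orders, "cust_id", "cust_id")
print(result)  # one matched, merged row for order 1
```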

similarly, while there have been some attempts at GPU accelerated databases, they have mostly failed exactly because the cost of moving data from the CPU to the GPU is too high to be worth it.

i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases. i also wish they would offer local NVMe drives with reasonable bandwidth - the current offerings are terrible (https://databasearchitects.blogspot.com/2024/02/ssds-have-be...)


> the bottleneck in lots of database workloads is memory bandwidth.

It can be, depending on the operation and the system, but database workloads also tend to run on servers that have significantly more memory bandwidth.

> i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases.

There are x64 systems with that. Socket SP5 (Epyc) has ~600GB/s per socket and allows two-socket systems, Intel has systems with up to 8 sockets. Apple Silicon maxes out at ~800GB/s (M3 Ultra) with 28-32 cores (20-24 P-cores) and one "socket". If you drop a pair of 8-core CPUs in a dual socket x64 system you would have ~1200GB/s and 16 cores (if you're trying to maximize memory bandwidth per core).
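Working out the per-core numbers from the rough figures above (these are ballpark values as quoted, not spec-sheet exact):

```python
# Back-of-the-envelope GB/s per core for the configurations discussed.
configs = {
    "M3 Ultra":              (800, 28),   # ~800 GB/s, 28 cores (lower-end config)
    "dual SP5, 8-core CPUs": (1200, 16),  # ~600 GB/s/socket x 2 sockets
    "dual SP5, 128-core":    (1200, 256), # same bandwidth, far more cores
}
for name, (gbs, cores) in configs.items():
    print(f"{name}: ~{gbs / cores:.0f} GB/s per core")
```

which is why the low-core-count dual-socket config wins on bandwidth per core, and why cloud providers filling those sockets with 128-core parts ends up with only a few GB/s per core.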

The "problem" is that system would take up the same amount of rack space as the same system configured with 128-core CPUs or similar, so most of the cloud providers will use the higher core count systems for virtual servers, and then they have the same memory bandwidth per socket and correspondingly less per core. You could probably find one that offers the thing you want if you look around (maybe Hetzner dedicated servers?) but you can expect it to be more expensive per core for the same reason.


it's kind of dumb that postgres uses a nested loop join instead of a hash join there. hash join almost always has the best worst-case behavior, and without stats it should be the default choice.
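a back-of-the-envelope illustration of the worst-case gap, ignoring constants and hashing cost (a toy cost model, not postgres's actual planner math):

```python
# Rough operation counts for joining inputs of n and m rows.
def nested_loop_ops(n, m):
    return n * m   # every probe row compared against every inner row

def hash_join_ops(n, m):
    return n + m   # one pass to build the table, one pass to probe it

n = m = 1_000_000
print(nested_loop_ops(n, m))  # 1,000,000,000,000 comparisons
print(hash_join_ops(n, m))    # 2,000,000 operations
```

a bad stats-free guess costs the hash join a constant factor; the same guess costs the nested loop join a factor of n.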


we're not really close, for two reasons:

1) programming takes a long time, and it only makes sense to take the time to do it if you're making a bunch of copies of something. this is something that could be improved with better software and ux - if cad programs made it easy to just drag and drop joints from a joint library into your model then this would be a different story. a hardware+software solution could also work here, something like a cnc version of https://www.woodpeck.com/multi-router-group.html where the software makes it easy to scale the templates to your work piece.

2) setup takes a long time on the affordable machines. every time you change bits you have to recalibrate. positioning the work piece on the table and clamping/taping it down takes a lot of time. if you have to flip the work piece over then that takes even longer and positioning is even more critical, and programming is more complicated as well. regardless of whether your designs require cutting on one or both sides, you have to program tabs into your design so the router doesn't cut all the way through (or else the piece will move and the router will screw it up), and then you have to go back and cut the pieces out the rest of the way manually and trim off the tabs with a flush trim router bit. the high end production quality machines mitigate a lot of these issues, but now you are talking about a machine that costs at least $100,000 and takes up a whole room.



i'm a hobbyist woodworker with more money than time. i have a pretty basic 3-axis cnc and i thought it would save me time, but it really doesn't. the only thing i actually use it for is cutting out router templates, and even that would be done better with a laser cutter (although a good laser cutter costs a lot more than my cnc).

i could see how a machine big enough for 4x8 sheets with an automatic tool changer, a vacuum table, and all the automatic calibration gizmos might be a time saver for a production shop, but if you're building something that's a one-off or you don't have all the setup automation goodies (which are $$$$$) then setup and programming usually end up taking longer than doing the work the old fashioned way.

for tenon cutting like in the bed rail example you gave, i have a hard time imagining any situation where cnc is going to be more efficient than a domino xl.


I find CNC is a time-saver for one-offs when the work is complex enough that it'd be difficult-to-impossible to do by hand, eg complex curving cuts, engraving/pockets, etc.

I actually saw an unusually straightforward example of this last year - a group of friends and I were making instances of Tyler Gibson's 1-sheet portable bike rack design (it's great, check it out: https://www.thetylergibson.com/building-a-better-portable-bi... )

One group of two-ish people used jigsaws to manually cut the pieces, and I used a Shopbot 4'x8' CNC router. Very roughly, it took about twice as many man-hours to make one by hand vs. by CNC, and the result was less clean. CNC could have done even better: due to warping of the sheet, it failed to cut all the way through in places, and I had to do a cleanup pass with the jigsaw. And once the upfront cost of generating the toolpaths etc. was paid, repeat builds would have improved the ratio further.


4x8 CNCs with a vacuum table really aren't faster. Even the watercooled CNCs I've used are still too slow for joinery. All the furniture shops I've worked in have been dominated by the Domino for most joinery tasks.


Speaking of Indeed, the first employee (https://en.wikipedia.org/wiki/Chris_Lamprecht) was a felon and a great engineer.


shouldn't have any effect, the new amd hardware is zen 4 and this only affects zen 2


JIT compilation has the opportunity to do profile-guided optimization at runtime. JIT compilation is also simpler when distributing an application to non-identical servers, as it can optimize for the exact hardware it is running on.
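A toy sketch of the profile-guided part: interpret generically at first, then switch to a version specialized for the behavior actually observed at runtime. (Illustrative only; real JITs like HotSpot or V8 do this at the machine-code level, and the names here are made up.)

```python
# Toy runtime specialization: count observed argument types, and once a
# type is "hot", dispatch to a fast path compiled for that type.
def make_jit(generic_fn, specializations, threshold=2):
    profile = {}  # observed argument type -> call count
    def call(x):
        t = type(x)
        profile[t] = profile.get(t, 0) + 1
        if profile[t] >= threshold and t in specializations:
            return specializations[t](x)   # fast path for the hot type
        return generic_fn(x)               # slow generic path
    return call

double = make_jit(
    generic_fn=lambda x: x + x,
    specializations={int: lambda x: x << 1},  # int-only fast path
)
print(double(21), double(21), double(21))  # 42 42 42
```

An ahead-of-time compiler has to pick one code path for all inputs; the JIT gets to wait and see which path the program actually takes.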


take a look at http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf - the overhead of coordinating multiple writers often makes multi-writer databases slower than single-writer databases. remember, everything has to be serialized when it goes to the write ahead log, so as long as you can do the database updates as fast as you can write to the log then concurrent writers are of no benefit.
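a toy sketch of the single-writer pattern (names are illustrative, not from the paper): many producers enqueue updates, one writer thread applies them and batches the log appends, so concurrent writers add nothing as long as the single writer keeps up with the WAL.

```python
# Single-writer group commit sketch: one thread drains the queue,
# applies whatever has accumulated, and does one WAL append per batch.
import queue
import threading

def run_single_writer(updates, wal):
    q = queue.Queue()
    db = {}
    def writer():
        done = False
        while not done:
            batch = [q.get()]              # block for the first op
            while not q.empty():           # then drain: merged group commit
                batch.append(q.get_nowait())
            done = None in batch           # None is the shutdown sentinel
            ops = [op for op in batch if op is not None]
            for key, val in ops:
                db[key] = val
            if ops:
                wal.append(ops)            # one log append per batch
    t = threading.Thread(target=writer)
    t.start()
    for u in updates:
        q.put(u)
    q.put(None)
    t.join()
    return db

wal = []
db = run_single_writer([("a", 1), ("b", 2), ("a", 3)], wal)
print(db)  # {'a': 3, 'b': 2}
```

the `get_nowait` drain is safe here because there is exactly one consumer - which is the whole point.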


This is another cool example of a toy database that is again very small:

> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

It is not surprising that when your database basically fits in RAM, serializing on one writer is worth doing, because it just plainly reduces contention. You basically gain nothing in a DB engine from multi-writer transactions when this is the case. A large part of a write (the vast majority of write latency) in many systems with a large database comes from reading the index up to the point where you plan to write. If that tree is in RAM, there is no work here, and you instead incur overhead on consistency of that tree by having multiple writers.

I'm not suggesting that these results are useless. They are useful for people whose databases are small because they are meaningfully better than RocksDB/LevelDB which implicitly assume that your database is a *lot* bigger than RAM.


> RocksDB/LevelDB which implicitly assume that your database is a lot bigger than RAM.

Where are you getting that assumption from? LevelDB was built to be used in Google Chrome, not for multi-TB DBs. RocksDB was optimized specifically for in-memory workloads.


I worked with the Bigtable folks at Google. LevelDB's design is ripped straight from Bigtable, which was designed with that assumption in mind. I'm also pretty sure it was not designed specifically for Google Chrome's use case - it was written to be a general key-value storage engine based on Bigtable, and Google Chrome was the first customer.

RocksDB is Facebook's offshoot of LevelDB, basically keeping the core architecture of the storage engine (but multithreading it), and is used internally at Facebook as the backing store for many of their database systems. I have never heard from anyone that RocksDB was optimized for in-memory workloads at all, and I think most benchmarks can conclusively say the opposite: both of those DB engines are pretty bad for workloads that fit in memory.


I think we've gone off on a tangent. At any rate, both LevelDB and RocksDB are still single-writer, so whatever the original point was, it seems to have gotten lost along the way.


I've used RocksDB for an in-memory K/V store of ~600GB in size and it worked really well. Not saying it's the best choice out there but it did the job very well for us. And in particular because our dataset was always growing and we needed the option to fallback to disk if needed, RocksDB worked very well.

Was a PITA to optimise though; tons of options and little insight into which ones work.


I am using the same rough model, and I'm using it on a 1.5 TB db running on a Raspberry Pi very successfully.

Pretty much all storage libraries written in the past couple of decades use a single writer. Note that single writer doesn't mean single transaction. Merging transactions is easy and highly profitable, after all.


there are only a handful of instructions that do interesting things beyond parallel versions of basic arithmetic and bitwise operations. https://branchfree.org/2019/05/29/why-ice-lake-is-important-... provides a good overview of them.
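one of the "interesting" operations that article highlights is compress (e.g. VPCOMPRESSD), which packs the selected lanes to the front of the register in a single instruction. a scalar model of what it computes (the hardware does this across all lanes at once, branchlessly):

```python
# Scalar model of AVX-512's compress operation: keep the elements whose
# mask bit is set, packed contiguously to the front. One instruction's
# worth of work, and the core primitive of branchless filtering.
def compress(values, mask):
    return [v for v, keep in zip(values, mask) if keep]

print(compress([10, 20, 30, 40], [1, 0, 1, 1]))  # [10, 30, 40]
```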


AVX and AVX2 are pretty awful because of lane-crossing limitations, but AVX512 is actually really nice and feels like a real programmer designed it rather than an electrical engineer.


FWIW, Michael Abrash [1] was at Intel when Larrabee (the AVX512 predecessor) was being developed and apparently [2] he contributed to the ISA design.

[1] https://en.wikipedia.org/wiki/Michael_Abrash [2] https://www.anandtech.com/show/2580/9


Yeah — my favorite instructions he added were `fmad233` and `faddsets`; the former instruction essentially bootstraps the line-equation for the mask-generation for rasterization, and the latter lets you 'step' the intersection. You could plumb the valid mask through and get the logical intersection "for free". This let us compute the covering mask in 9 + N instructions for N+1 4x4 tiles. We optimized tile load-store to work in 16x16 chunks, so valid mask generation came to just 24 cycles. It was my argument that using Boustrophedon order and just blasting the tile (rather than quad-tree descent like he designed) is what convinced him to let me work with RAD & do the non-polygon path for LRB.


This is not just in your head.

Most Intel ISA extensions come from either customers asking for specific instructions, or from Intel engineers (from the hardware side) proposing reasonable extensions to what already exists.

LRBni, which eventually morphed into AVX-512, was developed by a team mostly consisting of programmers without long ties to Intel hw side, as a greenfield project to make an entirely new vector ISA that should be good from the standpoint of a programmer. I strongly feel that they have succeeded, and AVX-512 is transformative when compared to all previous Intel vector extensions.

The downside is that as they had much less input and restraint from the hw side, it's kind of expensive to implement, especially in small cores. Which directly led to its current market position.

