Absolutely. I’ll go further and say that you must specifically design the database schema knowing that it’s going to be used this way. Your pages need to be packed with the data you need and nothing you don’t. Spatial locality matters enormously, since every “seek” is an additional S3 request, whereas in a traditional DB it matters much less. Wide tables with lots of columns that might not be used in a query are a bad idea here.
Here’s an in-the-weeds tip for anyone attempting the same: your tables should all be WITHOUT ROWID tables. Otherwise SQLite sprays rows all over the place based on its internal rowids, ruining locality when you attempt to read rows that you thought would be consecutive based on the primary key.
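For concreteness, a minimal sketch of the difference (table and column names invented):

    -- With WITHOUT ROWID, the primary key IS the b-tree key: rows are
    -- physically stored in email order.
    CREATE TABLE users (
        email TEXT PRIMARY KEY,
        name  TEXT
    ) WITHOUT ROWID;

Drop the WITHOUT ROWID clause and the rows live in a hidden table ordered by rowid, with email demoted to a separate unique index, so a range scan over the primary key hops between unrelated pages.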
A few days ago, I tried to use the linked library (sql.js-httpvfs) for a graph network visualization, which went about as well as you'd expect given the poor spatial locality. Do you have any tips for optimizing spatial locality with more complex queries? Can you manually cluster data for some given properties in SQLite?
For my project I ended up just exporting the graph edges as JSON, but I'm curious if it would still be possible to make work.
In a WITHOUT ROWID table, you have control over the order of the rows. Make an id column your primary key and assign the ids so that rows accessed together end up next to each other in the file. This is how I manually cluster the data.
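Roughly, with invented names:

    -- Ids are assigned by the application, not autoincremented, in
    -- whatever order matches the access pattern; the b-tree stores
    -- rows in id order, so consecutive ids share pages.
    CREATE TABLE nodes (
        id    INTEGER PRIMARY KEY,
        label TEXT
    ) WITHOUT ROWID;

If rows are usually read a group at a time, give each group a contiguous id range.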
Aside from that, I use an entity-attribute-value model. This ensures that all the tables are narrow. Set your primary key (again, with WITHOUT ROWID) to put all the values for the same attribute next to each other. That way, when you query for a particular attribute, you'll get pages packed full with nothing but that attribute's values in the order of the IDs for the corresponding entities (which you manually clustered).
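Sketched out (again with invented names):

    -- One row per (attribute, entity). Keying on attr_id first means
    -- all values for one attribute form a contiguous run of pages,
    -- ordered by the manually clustered entity ids.
    CREATE TABLE eav (
        attr_id   INTEGER NOT NULL,
        entity_id INTEGER NOT NULL,
        value,
        PRIMARY KEY (attr_id, entity_id)
    ) WITHOUT ROWID;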
It's worth repeating one more time: you must use WITHOUT ROWID. SQLite tables otherwise don't work the way you'd expect from experience with other DBs: in an ordinary rowid table, the declared primary key is really just a unique secondary index, and the rows themselves are stored in rowid order.
Thanks for the info! In my case there's not really one primary key that would guarantee good clustering for my query, so I guess there's not much that can be done to optimize here.
I'm trying to find all the ancestors of a node in a DAG, so the optimal clustering would vary depending on the node I'm querying for.
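The query is essentially this shape (schema simplified to an invented edges(child, parent) table):

    -- Walk upward from a start node, collecting every ancestor.
    -- Each recursive step is a separate b-tree probe, so which pages
    -- get touched depends on which node you start from.
    WITH RECURSIVE ancestors(id) AS (
        SELECT parent FROM edges WHERE child = :start
        UNION
        SELECT e.parent FROM edges e JOIN ancestors a ON e.child = a.id
    )
    SELECT id FROM ancestors;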
> I’ll go further and say that you must specifically design the database schema knowing that it’s going to be used this way.
If it ever turns out that you do need features from a standard RDBMS after all, you are going to regret not using Postgres in the first place, because re-engineering all of that will be vastly more expensive than what it would have cost to just "do it right" from the start.
So it seems that Cloud SQLite is basically a hyper-optimization that only makes sense if you are completely, totally, 100% certain beyond any reasonable doubt that you will never need anything more than that.
I can’t reveal too many internal details but this data lived in the RDBMS for years. Its access patterns are well understood. That’s exactly when you start cost optimizing. If this didn’t work out we’d just move back to the old DB schema that we were already using and pay for a bigger server. If we wanted, we could keep the schema as-is and just move it into SQL Server. That would work just fine, too. No re-engineering required.
Don’t know how else to say “we were not born yesterday; we thought of that” politely here. This definitely isn’t something to have your junior devs work on, nor is it appropriate for most DB usage, but that’s different than it not having any use. It’s a relatively straightforward solution to a niche problem.