It depends on what you're doing... Just for reference, here is a small showcase of the capabilities that I've trained on a 13-billion-parameter Llama 2 fine-tune (done with QLoRA).
Back at my old job in ~2016, we built a cheap homegrown data warehouse via Postgres, SQLite and Lambda.
Basically, it worked like this:
- All of our data lived in compressed SQLite DBs on S3.
- Upon receiving a query, Postgres would use a custom foreign data wrapper (FDW) we built.
- This FDW would forward the query to a web service.
- This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service.
- This web service would re-issue lambdas as needed and return the results to the FDW.
- Postgres (hosted on a memory-optimized EC2 instance) would aggregate.
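A minimal local sketch of that fan-out pattern, with threads standing in for Lambdas and made-up file/table names (the real thing fetched each SQLite file from S3 inside a Lambda):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

def query_partition(path, sql):
    """What each Lambda did: open its SQLite partition and run the query."""
    with sqlite3.connect(path) as conn:
        return conn.execute(sql).fetchall()

def fan_out(paths, sql):
    """What the web service did: one worker per partition, then concatenate
    the partial results for Postgres to aggregate."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        partials = pool.map(lambda p: query_partition(p, sql), paths)
    return [row for part in partials for row in part]

# Build two tiny daily partitions, mirroring the time-series layout.
tmp = tempfile.mkdtemp()
paths = []
for day, values in [("2016-01-01", [1, 2]), ("2016-01-02", [3, 4])]:
    path = os.path.join(tmp, f"events-{day}.db")
    with sqlite3.connect(path) as conn:
        conn.execute("CREATE TABLE events (day TEXT, value INT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)",
                         [(day, v) for v in values])
    paths.append(path)

rows = fan_out(paths, "SELECT day, SUM(value) FROM events GROUP BY day")
# Postgres (via the FDW) would do the final aggregation over these rows.
```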
It was straight magic. Separated compute + storage with basically zero fixed cost and better performance than Redshift or Vertica. All of our data was time-series data, so it was extraordinarily easy to partition.
It was also considerably cheaper than Athena. Athena charges ~$5/TB scanned (a price that hasn't changed to this day), so most of our queries would have easily cost >$100 there, and we were running thousands of queries per hour.
I still think, to this day, that the inevitable open-source solution for DWs might look like this: insert your data as SQLite or DuckDB files into a bucket, pop in a Postgres extension, create an FDW, and `terraform apply` the lambdas + API gateway. It'll be harder for non-time-series data, but you could probably build something that handles other partitioning schemes.
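The time-series part is what makes the pruning trivial: a query's time range maps directly to the list of partition files to fan out to. A sketch, assuming a made-up one-file-per-day key layout like `dw/events/2016-01-02.db`:

```python
from datetime import date, timedelta

def partition_keys(start, end, prefix="dw/events"):
    """Bucket keys for every daily partition overlapping [start, end].
    Only these files need a Lambda; everything else is skipped."""
    day, keys = start, []
    while day <= end:
        keys.append(f"{prefix}/{day.isoformat()}.db")
        day += timedelta(days=1)
    return keys

keys = partition_keys(date(2016, 1, 1), date(2016, 1, 3))
# -> ['dw/events/2016-01-01.db', 'dw/events/2016-01-02.db', 'dw/events/2016-01-03.db']
```

For non-time-series data you'd need some other routing function from query predicates to partition keys, which is the harder part.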
https://old.reddit.com/r/LocalLLaMA/comments/186qq92/comment...
Edit: Embed some of the content instead.
Inkbot can create knowledge graphs. The structure returned is proper YAML, and I got much better results with my fine-tune than with GPT-4.
https://huggingface.co/Tostino/Inkbot-13B-8k-0.2
Simple prompt: https://gist.github.com/Tostino/c3541f3a01d420e771f66c62014e...
Complex prompt: https://gist.github.com/Tostino/44bbc6a6321df5df23ba5b400a01...
It also does chunked summarization.
Here is an example of chunking:
Part 1: chunked summarization - https://gist.github.com/Tostino/cacb1cecdf2eb7386baf565d157f...
Part 2: summary-of-summaries - https://gist.github.com/Tostino/81eeee9781e519044950332b4e64...
Here is an example of a single-shot document that fits entirely within context: https://gist.github.com/Tostino/4ba4e7e7988348134a7256fd1cbb...