Why not use Parquet files + AWS Athena?

electroly · on July 6, 2023

The ability to use an index to seek directly to a handful of consecutive rows without processing the whole file was very important for our use case. Athena doesn't support indexing like this; it only has partitioning on a single column. It has to scan whole partitions every time. Both S3 Select and Athena are more useful when you want to aggregate massive data sets, but that's not what we're doing. We want to jump in and pull out rows from the middle of big data sets with reasonably low latency, not aggregate the whole thing.
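To make the access pattern above concrete, here is a minimal sketch of an index seek, not the actual implementation. It assumes fixed-width rows sorted by key and an in-memory key/offset index; the `RECORD` layout, `build`, and `read_range` are hypothetical names made up for illustration.

```python
import bisect
import io
import struct

# Hypothetical row layout (an assumption): int64 key, float64 value.
RECORD = struct.Struct("<qd")

def build(rows):
    """Write rows (already sorted by key) to a buffer; return (buffer, keys, offsets)."""
    buf = io.BytesIO()
    keys, offsets = [], []
    for key, value in rows:
        keys.append(key)
        offsets.append(buf.tell())  # byte offset where this row starts
        buf.write(RECORD.pack(key, value))
    return buf, keys, offsets

def read_range(buf, keys, offsets, start_key, count):
    """Seek to the first row with key >= start_key and read up to `count` rows."""
    i = bisect.bisect_left(keys, start_key)  # binary search in the index
    if i == len(keys):
        return []
    buf.seek(offsets[i])  # jump straight to the row, no scan of earlier data
    out = []
    for _ in range(count):
        chunk = buf.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # reached end of file
        out.append(RECORD.unpack(chunk))
    return out

rows = [(k, k * 0.5) for k in range(0, 1000, 10)]
buf, keys, offsets = build(rows)
result = read_range(buf, keys, offsets, 495, 3)
```

Only the index lookup and the few requested rows are touched, so the cost does not grow with the size of the file the way a partition scan does.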

pid-1 · on July 7, 2023

> it only has partitioning on a single column

You can partition using several columns. But I get your point, it's not optimized for row-level operations in general.
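For illustration, multi-column partitioning usually shows up as a Hive-style `key=value` directory layout in S3. A minimal sketch, where the bucket name, table path, column names, and the `partition_prefix` helper are all made up for the example:

```python
def partition_prefix(table_root, partitions):
    """Build a Hive-style key=value prefix from ordered (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partitions]
    return "/".join([table_root.rstrip("/"), *parts]) + "/"

# Hypothetical table partitioned by year, then month.
prefix = partition_prefix(
    "s3://example-bucket/events",
    [("year", 2023), ("month", "07")],
)
```

A query that filters on both partition columns only has to read objects under that prefix, but everything inside the prefix is still scanned.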

simlevesque · on July 6, 2023

To avoid depending on an AWS product.

pid-1 · on July 7, 2023

Athena = managed Presto + managed Hive Metastore

I use Presto containers to mock Athena for local and CI tests, plus a psql container for the metastore.