specifically I've discovered how to 'trick' mainstream cloud storage and mainstream query engines using mainstream table formats how to read parallel arrays that are stored outside the table without using a classic join and treat them as new columns or schema evolution. It'll work on spark, bigquery etc.
Table1 = {"col1": [1,2,3]}
Table2 = {"epiphany": [1,1,1]}
for i, r in enumerate(Table1["col1"]):
print(r, Table2["epiphany"][i])
He's really happy he found this (Edit: actually it seems like Chang She talked about this while discussing the Lance data format[1]@12:00 in 2024 at a conference calling it "the fourth way") and will represent this in a conference.
Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.