You mean you discovered parallel arrays?

willvarfar · 2026-01-15T06:57:49 1768460269

specifically I've discovered how to 'trick' mainstream cloud storage and mainstream query engines using mainstream table formats how to read parallel arrays that are stored outside the table without using a classic join and treat them as new columns or schema evolution. It'll work on spark, bigquery etc.

hahahahhaah · 2026-01-15T08:06:56 1768464416

Whats a good place to see parallel arrays defined. I have no data lake expetience. Know how relational db works.

nurettin · 2026-01-15T08:16:12 1768464972

I mean,

    Table1 = {"col1": [1,2,3]}
    Table2 = {"epiphany": [1,1,1]}
    for i, r in enumerate(Table1["col1"]):
      print(r, Table2["epiphany"][i])

He's really happy he found this (Edit: actually it seems like Chang She talked about this while discussing the Lance data format[1]@12:00 in 2024 at a conference calling it "the fourth way") and will represent this in a conference.

[1] https://youtu.be/9O2pfXkCDmU?si=IheQl6rAiB852elv

willvarfar · 2026-01-15T08:49:06 1768466946

Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.

hahahahhaah · 2026-01-16T05:10:17 1768540217

Why a paper? A repo should do the trick.

hahahahhaah · 2026-01-15T10:24:40 1768472680

That code is for in memory data right? I see no storage access.

What is really happening? Are these streaming off 2 servers and zipped into 1. Is this just columnar storage or something else?