Predicting the future of distributed systems

leventov · 2024-08-30T06:27:26 1724999246

I don't buy this "object storage + Iceberg is all you need for OLAP" hype. If the application has sustained query load, it makes sense to provision servers rather than go pure serverless. If there are already provisioned servers, it makes sense to cache data on them (either in their memory or SSDs) to avoid round-trips to object storage. This is the architecture of the latest breed of OLAP databases such as Databend and Firebolt, as well as the latest iterations of Redshift's and Snowflake's architecture. Also, this is the approach of the newest breed of vector databases, such as Turbopuffer.

For OLAP use cases with real-time data ingestion requirements, object-storage-only approach also leads to write amplification. Therefore, I don't think that architectures like Apache Pinot, Apache Paimon, and Apache Druid are going anywhere.

Another problem with "open table formats" like Iceberg, Hudi, and Delta Lake is their slow innovation speed.

I've recently argued about this at greater length here: https://engineeringideas.substack.com/p/the-future-of-olap-t...

dijksterhuis · 2024-08-30T15:22:42 1725031362

I buy into some of the hype. but only for majority use cases, not for edge case specialisation.

As ever, at FANGMA-whatever scale/use cases, yeah I’d agree with you. But the majority of cases are not FANGMA-whatever scale/use cases.

Basically, it’s good enough for most people. Plus it takes away a bunch of complexity for them.

> If the application has sustained query load

Analytical queries in majority cases are not causing sustained load.

It’s a few dashboards a handful of managers/teams check a couple of times through the day.

Or a few people crunching some ad hoc queries (and hopefully writing the intermediate results somewhere so they don’t have to keep making the same query — I.e. no sustained load problem).

> real-time data ingestion requirements

Most of the time a nightly batch job is good enough. Most Businesses still work on day by day or week by week, and that’s at the high frequency end of things.

> slow innovation speed

Most people don’t want bleeding edge innovative change. They want stability.

Data engineers have enough problems with teams changing source database fields without telling us. We don’t need the tool we’re storing the data with to constantly break too.

jschrf · 2024-08-30T04:44:30 1724993070

I dunno, Postgres seems pretty solid.

random3 · 2024-08-30T15:20:46 1725031246

Postgress may be solid, but not a distributed system and the OP is about distributed systems.

packetlost · 2024-08-30T15:22:04 1725031324

postgres can be a distributed system though? It supports sharding and various replication schemes.

random3 · 2024-08-30T17:00:00 1725037200

I think that's where it transitions from a solid system to a subpar system and it's exactly that part that's relevant when we think of distributed systems.

Postgres is solid and a good choice because most times you don't need more than one database and not dealing with complex enough problems that would require a distributed architecture. In reality, you always do, it's just that you're fine with a non-bullet proof solution. For example any real app will have distributed state that spans across UI and database (think of as simple as React's UI state to more comprehensive state management like MobX and finally a database.

So the distributed aspect of systems is actually pervasive and goes beyond what you'd consider a database. And when we talk about the future of distributed system, it's this aspect that is the most relevant one to start with.

seattleeng · 2024-08-30T05:40:29 1724996429

A new programming model for distributed computing is desperately needed. Something between a full operational system a la Temporal, but without the extreme operational overhead + a sane cooperative runtime like Golang.

I think we're probably too early to build this today. Ray is used at my current job for scaling subroutines in our distributed job system. It's the closest I've seen.

sorentwo · 2024-08-30T13:31:21 1725024681

What you’re describing sounds exactly like Elixir (any Erlang family language really, Elixir is just most ergonomic IMO). Truly seamless cooperative scheduling with transparent distribution between any number of nodes.

Shameless plug, but there’s also Oban, a widely used database backed background job system.

cosmic_quanta · 2024-08-30T11:48:00 1725018480

> A new programming model for distributed computing is desperately needed

The article mentions actor-model frameworks like Akka. Is that not like Ray?

At work we use and maintain something similar called Cloud Haskell (confusingly implemented in a package called distributed-process: https://github.com/haskell-distributed/distributed-process) and I have to say that using it is a breeze.

briankelly · 2024-08-30T20:34:50 1725050090

Oversimplifying it, but if you look as parallelism as a special case of concurrency [0] then ray is a framework for the former and actors are a model for the latter. Actors are lower level and you could build something like Ray on top of it, but if your goal is to process large data in parallel you’d want something like ray. Accordingly ray can’t be used for more general computing.

[0] and it’s an oversimplification because while concurrency can reasoned about conceptually, parallelism implicates implementation realities around computation resources and data locality.

actionfromafar · 2024-08-30T09:36:13 1725010573

Wouldn't you say the Unison programming language is the closest so far?

https://www.unison-lang.org/

anonzzzies · 2024-08-30T08:09:21 1725005361

Ray? Terrible name, so I cannot find anything but a Go debugger? Do you have a link?

leventov · 2024-08-30T08:16:36 1725005796

https://www.ray.io/

jamesblonde · 2024-08-30T05:41:30 1724996490

This was on the front page 3 days ago #deng

https://news.ycombinator.com/item?id=41363499

gslaller · 2024-08-30T10:33:11 1725013991

Having superficial knowledge about data storage/DBMS, perhaps I am just spewing nonsense. But I always imagine that a compaction layer for the long-tail/old data should be a standard, where less accessed(sometimes once a month) is pushed to a s3(but still queryable) like storage.

dijksterhuis · 2024-08-30T15:55:42 1725033342

So with traditional Parquet this is usually handled through “sane” partitioning.

Heavily simplified version — Each partition is a separate file containing a bunch of table rows. And partition splits are determined by the values in those rows.

If you’ve got data with like a date column (sign up date or order date or something), you would partition on a YYYY-MM field you create early on.

Each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read bunch of files from disk or S3. If you only want to look at 2023-12, then you only need to read one file to run the query.

Edit — OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.