
I do C++ backend work in a non-web industry and this entire post is Greek to me. Even though this is targeted at developers, you need a better pitch. I get "we did this 100x faster," but the obvious follow-up question is "how?" The answer seems to be a ton of flow diagrams with way too many nodes that tell me approximately nothing, plus some handwaving about something called PStates, which are defined so broadly (they can be any kind of data structure) that they end up entirely nebulous.

I'm not saying there's nothing here, but I am adjacent to your core audience and I have no idea whether there is after reading your post. I think you are strongly assuming a shared basis where everybody has worked on the same kind of large-scale web app before; I would find it much more useful to have an overview of "this is what you would usually do, here are the problems with it, here is what we do instead," with a side-by-side code comparison of Rama vs. what a newbie is likely to hack together with single-instance Postgres.



In a typical architecture, the DB stores data, and the backend calls the DB to make updates and compile views.

Here, the "views" are defined formally (the P-states), and incrementally, automatically updated when the underlying data changes.

Example problem:

Get a list of accounts that follow account 1306

"Classic architecture":

- Naive approach. Search through all accounts' follow lists for "1306". Super slow, scales terribly with # of accounts.

- Normal approach. Create a "followed by" table, update it whenever an account follows / unfollows / is deleted / is blocked.

Normal sounds good, but add 10x features, or 1000x users, and it gets trickier. You need to make a new table for each feature, and add conditions to the update calls, and they start overlapping... Or you have to split the database up so it scales, but then you have to pay attention to consistency, and watch which order stuff gets updated in.
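
As a rough Java sketch of those two classic options, with in-memory maps standing in for DB tables (all names here are invented for illustration):

  import java.util.*;

  public class FollowerIndex {
      // Source of truth: which accounts each account follows.
      private final Map<Long, Set<Long>> follows = new HashMap<>();
      // Derived "view": which accounts follow each account, maintained on every write.
      private final Map<Long, Set<Long>> followedBy = new HashMap<>();

      // Naive approach: scan every account's follow list looking for the target id.
      // Super slow, scales with the total number of accounts.
      public List<Long> followersOfNaive(long accountId) {
          List<Long> result = new ArrayList<>();
          for (Map.Entry<Long, Set<Long>> e : follows.entrySet()) {
              if (e.getValue().contains(accountId)) result.add(e.getKey());
          }
          return result;
      }

      // "Normal" approach: keep the reverse index up to date on every follow/unfollow,
      // so the read becomes a single lookup.
      public void follow(long follower, long followed) {
          follows.computeIfAbsent(follower, k -> new HashSet<>()).add(followed);
          followedBy.computeIfAbsent(followed, k -> new HashSet<>()).add(follower);
      }

      public void unfollow(long follower, long followed) {
          Set<Long> f = follows.get(follower);
          if (f != null) f.remove(followed);
          Set<Long> fb = followedBy.get(followed);
          if (fb != null) fb.remove(follower);
      }

      public Set<Long> followersOf(long accountId) {
          return followedBy.getOrDefault(accountId, Collections.emptySet());
      }
  }

Every new feature (blocks, deletes, lists, ...) means another hand-maintained index like followedBy, and another place where the write path can drift out of sync, which is where it gets hairy.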

Their solution is separating the "true" data tables from the "view" tables, formally defining the relationship between the two, and creating the "view" tables magically behind the scenes.


I read their post and honestly it’s not really that much different than just materialized views in a regular database plus async jobs to do the long running tasks.

It’s a ridiculous amount of fluff to describe that. Not to mention it’s proprietary, only supports the JVM, and doesn’t integrate with the tons of tooling designed around RDBMSs unless you stream everything to them, defeating the purpose.

What really irks me is that they go on and on bragging about the low LoC count and literally show nothing complete. They should’ve held off on this post and released it simultaneously with the code.


We are very open in the post that the core concepts are not new:

  Individually, none of these concepts are new. I’m sure you’ve seen them all before. You may be tempted to dismiss Rama’s programming model as just a combination of event sourcing and materialized views. But what Rama does is integrate and generalize these concepts to such an extent that you can build entire backends end-to-end without any of the impedance mismatches or complexity that characterize and overwhelm existing systems.

Indexes as arbitrary data structures that you shape to perfectly meet your use cases, a powerful computation API that's like a "distributed programming language", and everything being so integrated make a world of difference.

I understand the desire to see all the code, and that's coming in two weeks. That said, the code in the post isn't trivial, as it shows almost the complete implementations of two major parts of Mastodon: the social graph and timeline fanout.

Next week you'll be able to play with Rama when we release a build of it, and the documentation will help with that.


> But what Rama does is integrate and generalize these concepts to such an extent that you can build entire backends end-to-end without any of the impedance mismatches or complexity

Every time I hear this the reality turns out to be that building anything with this tech is like building something on top of SAP.

But I’m also just allergic to any post that says ‘look how amazing’ in general, so I’m a bit prejudiced.


After reading through the post a bit more, I’m inclined to believe it’s not hot air, but I think most of the innovation here is in the management layer, not the ease of application development.

Just looking at the first example tells me that there are a million ways someone who doesn’t know what they’re doing can mess this up.

If the author of the platform implements some service on their own platform it’s always going to seem simple.


The difference is that the materialized-view logic lives naturally in the application code; there's no step where they go out of the DB to do computations and then reinsert.

Once SQL materialized views aren't enough, you might do this by replicating your database into Kafka, implementing logic in Flink or something, and reinserting into the same DB/Elasticsearch/etc. Very common architecture. (Writ small, could also use a queue processor like RabbitMQ.)
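
For concreteness, a minimal sketch of that pattern using Kafka Streams rather than Flink (topic names and the CDC setup are made up, and serialization is hand-waved to strings):

  import java.util.Properties;
  import org.apache.kafka.common.serialization.Serdes;
  import org.apache.kafka.streams.KafkaStreams;
  import org.apache.kafka.streams.StreamsBuilder;
  import org.apache.kafka.streams.StreamsConfig;
  import org.apache.kafka.streams.kstream.KStream;
  import org.apache.kafka.streams.kstream.KTable;
  import org.apache.kafka.streams.kstream.Produced;

  public class FollowerCountTopology {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put(StreamsConfig.APPLICATION_ID_CONFIG, "follower-count-view");
          props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
          props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

          StreamsBuilder builder = new StreamsBuilder();
          // "follow-events": key = followed account id, value = follower account id,
          // produced by change-data-capture from the primary database.
          KStream<String, String> follows = builder.stream("follow-events");

          // The incrementally maintained "view": follower count per account.
          KTable<String, Long> followerCounts = follows.groupByKey().count();

          // Stream updates back out so another consumer can reinsert them into the DB / Elasticsearch.
          followerCounts.toStream()
                  .to("follower-counts", Produced.with(Serdes.String(), Serdes.Long()));

          new KafkaStreams(builder.build(), props).start();
      }
  }

Every box in that pipeline is deployed, monitored, and kept consistent separately, which is the complexity Rama claims to collapse.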

Their approach is, apparently, to instead make all of these first-class elements of the same ecosystem: not by "putting it all in the database", but by putting the database into the application code. Which seems wild, but it colocates data, transformation, and view.

Seems like it would open up a lot of cans of worms, but if you solve those, sounds great.


You can do all of this with https://materialize.com, and you don’t need to write it in Java. Just connect it to a Postgres instance and start creating materialised views using SQL. These views then auto-update, so much so that you can create a view for the top 10 of something and let it sit there as the list updates. Otherwise just use normal SELECT statements from your views using any Postgres client.
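
For example (Java only to stay consistent with the rest of the thread; Materialize speaks the Postgres wire protocol, so the stock Postgres JDBC driver or psql works, and the connection details and table here are placeholders):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class Top10View {
      public static void main(String[] args) throws Exception {
          // Placeholder URL/credentials; the standard Postgres JDBC driver works against Materialize.
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://localhost:6875/materialize", "materialize", "");
               Statement stmt = conn.createStatement()) {

              // Incrementally maintained view over a hypothetical "follows" source.
              stmt.execute(
                  "CREATE MATERIALIZED VIEW follower_counts AS " +
                  "SELECT followed_id, count(*) AS followers " +
                  "FROM follows GROUP BY followed_id");

              // Plain SELECTs see the continuously updated result; no manual refresh step.
              try (ResultSet rs = stmt.executeQuery(
                      "SELECT followed_id, followers FROM follower_counts " +
                      "ORDER BY followers DESC LIMIT 10")) {
                  while (rs.next()) {
                      System.out.println(rs.getLong("followed_id") + " -> " + rs.getLong("followers"));
                  }
              }
          }
      }
  }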


IIUIC, the most significant difference from a materialized view is that the Rama infrastructure recomputes only the changed data by checking the relationship between fields, while a traditional materialized view recomputes the whole table?


Isn't Materialize performing symbolic differentiation of SQL queries?


Incremental view maintenance is the database equivalent of "recompute only the changed data by checking the relationship between fields."

Oracle has decent support for incrementally updated materialized views, and Redshift has some too. Materialize.com is an entire Snowflake-like platform built around incrementally maintained materialized views.
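
In code terms, the difference is roughly this (a toy sketch, not any particular database's implementation):

  import java.util.HashMap;
  import java.util.Map;

  // Toy illustration of incremental view maintenance vs. a full recompute.
  public class FollowerCountsView {
      private final Map<Long, Long> counts = new HashMap<>();

      // What a plain (non-incremental) materialized view refresh does: rescan the base data.
      public void refreshFromScratch(Iterable<long[]> allFollowEdges) { // each edge = {follower, followed}
          counts.clear();
          for (long[] edge : allFollowEdges) counts.merge(edge[1], 1L, Long::sum);
      }

      // Incremental maintenance: apply only the delta implied by one change.
      public void onFollow(long follower, long followed)   { counts.merge(followed, 1L, Long::sum); }
      public void onUnfollow(long follower, long followed) { counts.merge(followed, -1L, Long::sum); }

      public long followersOf(long accountId) { return counts.getOrDefault(accountId, 0L); }
  }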


> I read their post and honestly it’s not really that much different than just materialized views in a regular database plus async jobs to do the long running tasks.

How about you go and implement a Mastodon server to their level of feature parity, and tell us how much effort and how many lines of code it takes?

I really don't appreciate this kind of fluffy, insubstantial, overly dismissive non-content on HN.


This is all armchair for me, but I think they have containers and sharding built in as well, which is the other half of the puzzle when it comes to scaling.


Yes, but there are plenty of NewSQL databases that support views and offer all of that too: Yugabyte, Cockroach, TiDB, and that’s just off the top of my head and open source. If we count proprietary ones, then you have Fauna, Cloud Spanner, and more I’m sure.


I’m getting Noria[1] / Materialize / Readyset vibes from this, or perhaps even Samza[2] ones. (Incidentally, I’d appreciate it if anyone could elaborate on the differences between the two.) Explicit inspiration? Parallel evolution?

[1] https://news.ycombinator.com/item?id=29615085

[2] https://martin.kleppmann.com/2015/03/04/turning-the-database...


So... at a high level, early React for data? In other words, letting a framework manage update dependency graph tracking, and then cascading updates through its graph in an optimized manner to enhance performance?

Obviously, with tons of implementation difficulties and details, and not actual graph structures, but as a top level analogy.


Not at all, especially because React doesn’t do much dependency tracking on its own and is built for predictable UI updates and not performance.

To be honest, any parallel with frontend here is meaningless; reactivity and all the concepts at play existed long before JS and browsers came along. It’s easier to explain from first principles.


I think that’s probably not the case for many new developers who haven’t been exposed to anything other than React. Of course "React for data" is entirely misleading, but it may give a decent idea if you don’t have an hour to spend on an explanation.


In other words, it’s the Dark Souls of application backends, but entirely different.


Only if your expectation is to be constantly frustrated, and eventually die, after which you have to do it all again.


This is a problem very similar to the one solved by Differential Dataflow and implemented by https://github.com/MaterializeInc/materialize

Have you considered using that?


Nathan Marz created Apache Storm, coauthored the book "Big Data", and founded an early real-time infrastructure team at Twitter. It's likely the 'curse of knowledge' of working on this specific problem for so long is responsible for the unique and/or unfamiliar style of communication here.

EDIT: Specifics


> Whereas Twitter stores home timelines in a dedicated in-memory database, in Rama they’re stored in-memory in the same processes executing the ETL for timeline fanout. So instead of having to do network operations, serialization, and deserialization, the reads and writes to home timelines in our implementation are literally just in-memory operations on a hash map. This is dramatically simpler and more efficient than operating a separate in-memory database. The timelines themselves are stored like this:

> To minimize memory usage and GC pressure, we use a ring buffer and Java primitives to represent each home timeline. The buffer contains pairs of author ID and status ID. The author ID is stored along with the status ID since it is static information that will never change, and materializing it means that information doesn’t need to be looked up at query time. The home timeline stores the most recent 600 statuses, so the buffer size is 1,200 to accommodate each author ID and status ID pair. The size is fixed since storing full timelines would require a prohibitive amount of memory (the number of statuses times the average number of followers).

> Each user utilizes about 10kb of memory to represent their home timeline. For a Twitter-scale deployment of 500M users, that requires about 4.7TB of memory total around the cluster, which is easily achievable.
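
For the curious, the structure they describe is roughly this shape; the naming and details are my own sketch, not their code:

  // Fixed-size primitive ring buffer for one user's home timeline, as described above.
  public class HomeTimelineBuffer {
      private static final int MAX_STATUSES = 600;
      // Flat array of (authorId, statusId) pairs: 1,200 longs per timeline.
      private final long[] buf = new long[MAX_STATUSES * 2];
      private int writePos = 0; // next slot to write, 0..599
      private int size = 0;     // statuses currently stored

      // Fanout appends the newest status; once full, the oldest entry is overwritten.
      public void add(long authorId, long statusId) {
          buf[writePos * 2] = authorId;
          buf[writePos * 2 + 1] = statusId;
          writePos = (writePos + 1) % MAX_STATUSES;
          if (size < MAX_STATUSES) size++;
      }

      // i = 0 is the newest status; the author is materialized alongside, so no lookup at query time.
      public long authorId(int i) { return buf[slot(i) * 2]; }
      public long statusId(int i) { return buf[slot(i) * 2 + 1]; }

      public int size() { return size; }

      private int slot(int i) {
          if (i < 0 || i >= size) throw new IndexOutOfBoundsException();
          return Math.floorMod(writePos - 1 - i, MAX_STATUSES);
      }
  }

1,200 longs is 9.6KB, which lines up with the "about 10kb" per user they quote.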

Isn't this where the most difficult (expensive) part is, and Rama has little to do with it? It appears the other parts don't have to be Rama either.


We're storing those in-memory within the Rama modules materializing the home timelines, and the query topologies that refresh home timelines for lost partitions are colocated with that. This is dramatically simpler than operating a separate in-memory database, and Rama has everything to do with that.


It appears simpler and better without Rama.

> So instead of having to do network operations, serialization, and deserialization, the reads and writes to home timelines in our implementation are literally just in-memory operations on a hash map. This is dramatically simpler and more efficient than operating a separate in-memory database.


Agreed, just reading through half of it I have no idea what Rama is.


... Maybe the post isn't targeted at your audience at all? How are "C++" and "non-web work" adjacent to web work and web-language audiences?


He's a developer and curious about the subject. Since it's a blog post, not a scientific paper, the fact that he did not understand could be a communication failure. I think he's being helpful.


OP did not specify what their industry actually is. I've been doing "web work" for 17 years and I'm sharing their concern: where's the TL;DR for this? If this somehow can make me 100x as productive, how about starting with a "hello world" example that shows me how it's different from pip install django, etc?


Here in video form: Microservices https://www.youtube.com/watch?v=y8OnoxKotPQ



