Deleting large amounts of data in PostgreSQL is expensive. First you need an index on a column to select the expired data, then you actually need to delete the rows, which is heavy, slow, and creates a lot of garbage.
Creating and deleting a lot of data on PG is a pain because of MVCC and vacuum. One useful trick is to partition data into separate tables and to truncate/drop entire tables; drop/truncate is instant and reclaims space immediately.
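To make the contrast concrete, a minimal sketch (table and column names are made up for illustration):

    -- Naive approach on a plain table: a bulk DELETE leaves dead tuples behind for vacuum.
    DELETE FROM events WHERE created_at < now() - interval '28 days';

    -- Partitioned approach: one child table per day, dropped wholesale when it expires.
    CREATE TABLE events (
        user_id    bigint      NOT NULL,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_05_01 PARTITION OF events
        FOR VALUES FROM ('2024-05-01') TO ('2024-05-02');

    -- Dropping an expired day is instant and returns the disk space immediately.
    DROP TABLE events_2024_05_01;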
You can do the equivalent by adding a timestamp column with an index on your table and adding “where now - timestamp <= TTL” to your queries. (Or some computationally easier form that doesn’t require math in the query.)
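A sketch of that index-friendly form, assuming a hypothetical `events` table: compute the cutoff on the right-hand side so the comparison is a plain range scan on the indexed column.

    CREATE INDEX ON events (created_at);

    -- The cutoff is evaluated once per query, so the planner can use the index on created_at.
    SELECT * FROM events
    WHERE created_at >= now() - interval '28 days';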
TTL on systems like Cassandra is pretty ugly and deleting data is hard as you scale no matter how you do it. I don’t think Postgres would be able to implement a TTL that is magically better than the rest.
Deleting whole partitions is a generally useful strategy. It's like the difference between single inserts and batch inserts (often a huge performance difference, and much lower IO).
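Roughly the same idea as the insert analogy (made-up table, just to illustrate the batching effect):

    -- Row at a time: in autocommit mode, each statement is its own transaction and commit.
    INSERT INTO events (user_id, created_at) VALUES (1, now());
    INSERT INTO events (user_id, created_at) VALUES (2, now());

    -- Batched: one statement, one transaction, far less per-row overhead.
    INSERT INTO events (user_id, created_at) VALUES
        (1, now()),
        (2, now()),
        (3, now());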
Since you mentioned Cassandra and TTL, I'll mention ClickHouse: it has very nice TTL options. Splitting into smaller partitions and using "ttl_only_drop_parts=1" has proven itself in production with big data ingestion rates.
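Roughly what that looks like in ClickHouse (table and column names are made up); with ttl_only_drop_parts the TTL expires whole parts instead of rewriting them row by row:

    CREATE TABLE events
    (
        ts      DateTime,
        user_id UInt64,
        payload String
    )
    ENGINE = MergeTree
    PARTITION BY toDate(ts)
    ORDER BY (user_id, ts)
    TTL ts + INTERVAL 30 DAY
    SETTINGS ttl_only_drop_parts = 1;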
Last but not least, I almost always prefer Postgres for data storage needs; one can trust it to be safe and fast enough. Only some specific situations warrant other solutions, but it's a long way until that point (if ever), and it's better not to optimize too early.
> Creating and deleting a lot of data on PG is a pain because of MVCC and vacuum. One useful trick is to partition data into separate tables and to truncate/drop entire tables; drop/truncate is instant and reclaims space immediately.
In this case, the requirement is that user data must only be kept for a certain time and not longer.
If that time is a property of the record's creation and we're allowed to keep the data with an error of up to a day, I guess it's easy: we partition the table on a daily basis and drop any partition that is more than 28 days old. Sometimes a record will be closer to 29 days old when it's dropped, but we've accepted that, and it's easy enough to write queries so that it's unavailable to the application once it's more than exactly 28*24*60*60 seconds old, if that's our constraint.
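For example, a sketch of that query-side cutoff (reusing a hypothetical `events` table partitioned daily by `created_at`): the application reads through a view, so rows past the hard limit vanish immediately even though their partition is only dropped later.

    CREATE VIEW visible_events AS
    SELECT * FROM events
    WHERE created_at >= now() - interval '28 days';

    -- Meanwhile, a daily job drops partitions older than 28 days to reclaim the space.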
If the requirement is to keep it for some period after the last use of a certain kind, we'd need to move data from one partition to another to keep it. For instance, if we can keep data for 28 days after the user last logs in, we can't just drop the partition - unless we've moved the user's data each day they log in.
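(A sketch of what that movement could look like, with made-up names: since Postgres 11, updating the partition key of a row in a declaratively partitioned table moves the row to the matching partition automatically.)

    -- Partitioned by the "keep-alive" timestamp rather than by creation time.
    CREATE TABLE user_data (
        user_id    bigint NOT NULL,
        last_login date   NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (last_login);

    -- On login, bumping last_login relocates the row into the current partition,
    -- so dropping partitions older than 28 days never removes an active user.
    UPDATE user_data SET last_login = current_date WHERE user_id = 42;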
If we have that kind of a constraint, where data lifespan is based on properties that change over the lifetime of the data, is partitioning + drop actually a useful approach? The drop will still be instant, I guess, but it's the movement of data over its lifetime that concerns me here.
You’d probably already have indices by user then. There are also other ways to store the data, for example not as a time series of user actions but as another data structure. It just depends on your requirements and data model.
The gist of the original article is asking whether you could reduce tech stack complexity and use a single set of technologies for more use cases, allowing you to understand the tools you are using better.
Also, note that a traditional database may or may not be the right tool for the job - there are different storage/behavior needs for writing lots of data, processing lots of data and serving the results.
> You’d probably already have indices by user then.
Do you mean "you'd probably already have indices by user then, so you won't be able to take advantage of quick drops"?
> There are also other ways to store the data, for example not as a time series of user actions but as another data structure. It just depends on your requirements and data model.
I suppose I want to be a bit more specific. Till now, if I wanted to deal with this issue, I would just have used a cron job and some indexes and taken on the load. But what is an example of a nice way of dealing with this? You get to make up plausible requirements and data model details, perhaps drawing on particular requirements you had when you faced a similar issue.
> The gist of the original article is asking whether you could reduce tech stack complexity and use a single set of technologies for more use cases, allowing you to understand the tools you are using better.
The legitimacy of a question in a free-form conversation doesn't depend on the original inspiration from several iterations ago. But even if it did, the question here is exactly about understanding the tools better and whether we can use one tool instead of two, so by your summary it's perfectly on topic.
> Also, note that a traditional database may or may not be the right tool for the job - there are different storage/behavior needs for writing lots of data, processing lots of data and serving the results.
The subquestion here is precisely if we can get away with just using Postgres for data with mandatory lifespan requirements that vary over the lifetime of the data.
Extra tools come at a cost - that is the presumption of this article and the thread it has spawned.
If we have to use Postgres, we need to assess the cost of doing this in Postgres before we can decide whether or not to pay the cost of other tools. Waving in the general direction of other tools isn't helpful; it probably takes as much work to enumerate candidates as it does to calculate the cost of doing it in Postgres.
Sure, there are different design patterns for different data sizes. This also adds querying complexity, so it just depends on what’s needed.
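One made-up example of such a pattern: instead of keeping the full time series of user actions around for retention purposes, keep a compact per-user state row keyed by last activity, so expiry becomes a small indexed delete on a much smaller table (and the raw events can live in short-lived daily partitions).

    -- One row per user instead of one row per action.
    CREATE TABLE user_state (
        user_id    bigint PRIMARY KEY,
        last_login timestamptz NOT NULL,
        summary    jsonb
    );
    CREATE INDEX ON user_state (last_login);

    -- Expiring users 28 days after their last login is now a cheap, indexed delete.
    DELETE FROM user_state WHERE last_login < now() - interval '28 days';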
Also, most applications have predictable peak and low periods of load (e.g. users concentrated in a given set of time zones), and the low periods make for good times to run otherwise disruptive maintenance, etc.
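For instance, assuming the pg_cron extension is installed and some hypothetical drop_old_partitions() procedure does the cleanup, the disruptive work can be pinned to the quiet hours:

    -- Run the (hypothetical) cleanup every night at 03:00 server time, when load is lowest.
    SELECT cron.schedule('drop-old-partitions', '0 3 * * *', 'CALL drop_old_partitions()');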