Hacker News

Actually, the broader lesson is to assume the worst in your application layer and try to remediate/verify wherever possible.

If you look at his articles: Redis, PostgreSQL, Cassandra, Elasticsearch etc. all had data consistency errors. And none of those vendors were making any such claims.

It's pretty sobering to say the least.




Um, this is the postgres article:

https://aphyr.com/posts/282-call-me-maybe-postgres

There were no acknowledged writes lost. The only unacked-but-successful writes resulted from a connection failure while a commit ack was in flight. That doesn't qualify as a data-consistency error; it means the client has to check whether the data is present after reconnecting.

But in no case would the client reconnect to find that acknowledged-as-committed records were missing or stale. In no case would the client find that data reported as rolled back was actually committed. This is very, very different from what is seen with MongoDB.
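To make that concrete, here's a minimal sketch of the reconnect-and-verify pattern that the indeterminate-commit case forces on clients. This is plain Python with a toy stand-in for a database, not any real client library; all names (FlakyDB, safe_insert) are illustrative.

```python
import uuid

class FlakyDB:
    """Toy stand-in for a database whose commit ack can be lost in
    flight. The commit itself may still have succeeded server-side."""
    def __init__(self):
        self.rows = {}

    def insert(self, key, value, drop_ack=False):
        self.rows[key] = value  # the commit is applied server-side...
        if drop_ack:            # ...but the ack never reaches the client
            raise ConnectionError("connection dropped before ack")

    def exists(self, key):
        return key in self.rows

def safe_insert(db, value):
    """Write under a client-generated key; on a dropped ack, verify
    instead of blindly retrying (a blind retry could double-apply
    a non-idempotent write)."""
    key = str(uuid.uuid4())
    try:
        db.insert(key, value, drop_ack=True)  # simulate the failure
    except ConnectionError:
        # Outcome is indeterminate: reconnect and check whether
        # the write actually landed before writing again.
        if not db.exists(key):
            db.insert(key, value)
    return key
```

The point is the shape of the recovery: a failure response only means "unknown", not "rolled back", so the client has to resolve the ambiguity itself.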


I don't think the results of Aphyr's MongoDB and postgres experiments are directly comparable. In the OP, MongoDB was run in a 5 node replicated configuration. In the post you reference, the experiment was run against a single postgres node.

Furthermore, the postgres experiment only checked that no writes were lost. As Aphyr acknowledges, MongoDB did not lose any writes with "majority" write concern. The postgres experiment did not verify a linearizable history of reads, which is what the bulk of the OP is about.

I'd like to see a similar experiment run against a replicated postgres configuration with auto-failover.


Now try it again with PostgreSQL's built-in sharding or replication functionality... oh, wait.


The parent post correctly pointed out that including Postgres in that list is misleading at least.


I was talking about data consistency more broadly. "However, two writes (215 and 218) succeeded, even though they threw an exception claiming that a failure occurred". This obviously isn't ideal behaviour from a developer's perspective, but it isn't necessarily incorrect.

My broader point, which I think you missed, was that you need to assume the worst from your database in the application layer.


I read your post as FUD regarding Postgres.

That article has various issues; for example, calling the Postgres commit protocol a special case of two-phase commit is not really correct. Postgres does have 2PC: http://www.postgresql.org/docs/9.2/static/sql-prepare-transa... but that was not tested.

The described behavior is "expected" and "understood". "You should assume the worst from your database" is not something I would ever say about a DB with ACID semantics.


I wasn't spreading FUD about anything. Per the articles, there were issues with every database's expected behaviour. My point, again, was that you should expect and manage failure in your application layer. That's what sensible architecture looks like.

And ACID does NOT guarantee that you will not lose data. It is a theory, not an implementation. I have lost data with both Oracle and Teradata due to bugs.


Well, you said that Postgres had data consistency errors (referencing the Aphyr articles). This is not true (at least regarding that article).

The Aphyr article about Postgres could be renamed call-me-maybe-acid-db-over-the-network and remain the same.


You misunderstand the results. PostgreSQL behaves as expected, and indeed, the only way it can behave. That's the Two Generals' Problem (http://en.wikipedia.org/wiki/Two_Generals%27_Problem). There is no way to solve it. PostgreSQL does as well as is theoretically possible.

I understand you lost data with Oracle and Teradata.

1. Most big corporate vendors do not have very well-designed systems. (1) The sale is made at a business level. (2) Most Oracle customers are not very technical companies, and have employees of mixed quality. As a result, there is little pressure to build a robust, correct product rather than one which meets a feature checklist. In addition, Oracle doesn't really recruit smart people (I know people who work there). It's just not very robust compared to something like PostgreSQL, which began as a project led by Stonebraker, a legendary computer science professor and entrepreneur.

2. Still, more likely, the reason you lost data is that you didn't know what you were doing. Terms like eventual consistency, ACID, and the Two Generals' Problem are not just abstract. They have strict, formal meanings, and you need to understand what they do and do not guarantee. Otherwise, you will lose data again.


> the reason you lost data is because you didn't know what you were doing

Missed this reply and thought it was funny. I work for one of the world's largest retailers and we are one of both Oracle's and Teradata's most loved customers. We have 4 Teradata DBAs provided BY Teradata amongst a team of 20 SQL Developers. We aren't messing around.

What YOU don't seem to understand is that bugs in your database can cause data loss. ACID or Strong Consistency will not save you.


That's not exactly the sort of team where you'd expect MIT Ph.Ds to work. It's precisely big teams of mediocre people who run into issues based on not knowing exactly what the database is doing and how it's supposed to work that lead to data corruption due to misuse. It's almost always a boring problem (e.g. retail business software). It's almost always a clunky commercial "enterprise" solution (e.g. Oracle). It's almost always a big team. I'm not sure if I even need to go into application engineers -- you put your best and brightest into the core product, and solutions typically get those who can't quite cut it there.

Regardless, your comment was about PostgreSQL, not Oracle. Oracle is a giant piece of software written by a corporation with over 100,000 employees. Something like that is bound to have bugs, and it has bugs indeed. Data corruption with Oracle certainly happens. PostgreSQL is written by a small, ultra-elite team. It's a much smaller codebase, so an expert developer can understand how the whole system works. There's a big difference in robustness between the two.

Of course database bugs can cause corruption. I've certainly had MongoDB eat my data. But the level of robustness of different databases varies widely. There are many databases which are essentially bug-free. If you're losing data with PostgreSQL, odds are you're the one losing the data, not PostgreSQL.


Actually, just use Riak. Those people know distributed databases.

https://aphyr.com/posts/285-call-me-maybe-riak

Now I hear they support consistency as well.


I think at one point the employees of Basho knew how to write distributed DBs - Riak is the most advanced AP DB from a distributed systems theory perspective. However, in recent months, their CEO, CTO and Chief Architect have left, as well as many of their prominent engineers. Worryingly, the new CTO seems content to make inane comments about "Data Gravity" [0].

[0] - http://www.kdnuggets.com/2015/03/interview-dave-mccrory-bash...


I think Riak's theory is fine, but theory isn't enough. And they may have succumbed to the Osborne effect with Riak 2.0.

Here's what I mean. Think of all the nice things you expect to come out of Riak's theoretical basis -- bulletproof distributed writes, for example.

Well, the default last-write-wins writes aren't bulletproof. They clearly fail Jepsen [1]. And if you turned off last-write-wins, you'd have to handle siblings -- and how you were supposed to handle them without introducing inconsistency was left quite unspecified. Riak clients gave you no help there.
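For illustration, here's a toy model of how last-write-wins can silently discard an acknowledged concurrent write when replicas merge. This is plain Python with illustrative names, and a counter stands in for wall-clock timestamps; it's a sketch of the failure mode, not Riak's actual code path.

```python
import itertools

clock = itertools.count()  # stand-in for wall-clock timestamps

def lww_write(replica, key, value):
    """Each write is tagged with a timestamp; the highest tag wins."""
    replica[key] = (next(clock), value)

# Two clients write concurrently to different replicas; both are acked.
replica_a, replica_b = {}, {}
lww_write(replica_a, "k", "from-client-1")
lww_write(replica_b, "k", "from-client-2")

# On merge/anti-entropy, last-write-wins keeps only the later tag:
merged = {"k": max(replica_a["k"], replica_b["k"])}
# "from-client-1" is gone, even though that write was acknowledged.
```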

Or, as the Jepsen article says, you could use CRDTs. Before 2.0, Riak had exactly one CRDT, the counter.
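As a sketch of why CRDTs sidestep the sibling problem, here's a minimal grow-only counter (G-counter), the general design a CRDT counter like Riak's is based on: each node increments only its own slot, and siblings merge by per-node maximum. Plain Python, illustrative names.

```python
def increment(counter, node_id, amount=1):
    """A node only ever bumps its own slot in the map."""
    counter = dict(counter)
    counter[node_id] = counter.get(node_id, 0) + amount
    return counter

def merge(a, b):
    """Sibling merge is deterministic: per-node maximum. Merge order
    never matters, so no concurrent increment can be lost."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    """The observed count is the sum over all nodes' slots."""
    return sum(counter.values())
```

Because merge is commutative, associative, and idempotent, siblings resolve without guessing -- which is exactly what last-write-wins can't give you.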

Something that might be appealing to some is built-in MapReduce. On the forums, people would warn you not to actually use it unless you didn't actually want availability after all. I don't know if they ever sorted that out.

Another supposedly nice thing was Riak Search. A distributed DB with full-text search out of the box -- that sounds great, right? But there were two things called Riak Search, and the first one just plain didn't work. They deprecated it before it had a replacement, but the replacement was supposed to come in 2.0.

So, Osborne effect. When people gradually discovered that Riak 1.x was bad, and the response was "but Riak 2.0 will be great!", that's a great reason not to use 1.x. 2.0 took a very, very long time, enough time for customers and potential customers, including us, to find other solutions.

[1] https://aphyr.com/posts/285-call-me-maybe-riak


Ya, you have to handle siblings if you don't use LWW. If you would rather Riak execute a pre-defined merge strategy, use Riak's CRDT features.

The original Riak Search in 1.x was replaced with integrated Solr in Riak 2.x, and it works.

I'd love to hear more about your use case. Contact info in profile. I'm this username in all the usual suspects.

Disclaimer: I work at Basho.


FWIW I did write a bunch of documentation on sibling resolution in Riak: http://docs.basho.com/riak/latest/dev/using/conflict-resolut...


So it looks like the conflict resolution strategies being recommended are "pick one arbitrarily", or in an advanced section, "keep the longer list"?

Not that it matters to me anymore, but it does sound like having all data in CRDTs is the only way to pass Jepsen. (Or to have your data be immutable, in which case Jepsen doesn't apply.)
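For what it's worth, a union-style merge is the usual safe resolution for siblings holding add-only set data. A minimal sketch in plain Python (illustrative name); unlike "pick one arbitrarily" or "keep the longer list", it discards no concurrent write, though a plain union can't handle deletes without tombstones.

```python
def resolve_siblings(siblings):
    """Union-merge sibling values that are sets. Safe for add-only
    data: no concurrently-added element is ever discarded. A plain
    union resurrects deleted items, which is why proper delete
    support needs tombstones (an OR-set style CRDT)."""
    merged = set()
    for s in siblings:
        merged |= s
    return merged
```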


Indeed. We're evaluating Riak CS and Swift. In a vacuum, I'd choose Riak CS any day. Knowing all of the drama at Basho, we're in a holding pattern at best and leaning (reluctantly) towards Swift.

Basho needs to sell damn fast or just call it a day and open source the enterprise version.


This is kind of surprising to hear. I had no idea.

Well in that case have you heard about LeoFS? Wonder if it overlaps with any of the Riak CS features for you.

http://leo-project.net/leofs/


The drama has died down. Drama is generally isolated to pitched battles amongst engineers for feature implementations ;)

Contact info in profile, this username at all the usual suspects.

Disclaimer: I work for Basho.


Drama aside, Riak and Swift are not in the same league.


Riak CS is. I said Riak CS, not Riak.


That was a year ago and they seem to be doing fine since.


I had a bad experience with riak v2. I think something went quite wrong in the process when going from v1 to v2, especially around Riak Search.


Riak Search in v1.x and Riak Search in v2.x are completely different. Riak 2.x tightly integrates Solr.

If you wanna talk about it I'm this username at all the usual online places.

Disclaimer: I work for Basho


oooh, that is scary.


We now offer a strong consistency option.

I'm this username in the usual places online if you wanna talk about it.

Disclaimer: I work for Basho


Apache Solr's Jepsen tests have had very good results.



