
You could also shut down definitively, for example if the system is corrupted and cannot restart after a partition. There is no inconsistent history in this case, just complete unavailability (you could also claim to be CP).

You can also be partly available and partly inconsistent (2PC with heuristic resolution). Here you're not AP nor CP.

Partition intolerance (CA) is a specification. It says that any network partition is a serious issue for the system, one that may break one or more invariants (e.g. atomicity in an SQL database).


> You could also shut down definitively, for example if the system is corrupted and cannot restart after a partition. There is no inconsistent history in this case, just complete unavailability (you could also claim to be CP).

This is precisely CP.

> You can also be partly available and partly inconsistent (2PC with heuristic resolution). Here you're not AP nor CP.

This is basically what the AP systems do, with various ways to manage inconsistency. Dynamo-style EC is but one.

> Partition intolerance (CA) is a specification. It says that any network partition is a serious issue for the system, one that may break one or more invariants (e.g. atomicity in an SQL database).

If you can break C in the face of a partition, then you're not CA, are you?

CA is not meaningful. CAP is about choosing between availability and consistency in the face of partitions, which are essentially unavoidable in any non-trivial multi-node system. There are maybe some interesting things to say about latency, but I suspect there isn't much that hasn't been said.

Side note: Mike Stonebraker posited in 2010 that partitions on a network are rare. I'm not going to outright call him wrong, but for the purposes of someone building a distributed system to be run by others on non-appliance hardware, you're going to run into plenty of partition events on a LAN. VoltDB changed the way our product behaved in version 3.0 to aggressively kill nodes if there is any potential for split brain. We originally intended this feature for the cloud only, but too many users with their own hardware were hitting partitions in surprising configurations.
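To make that behaviour concrete, here is a minimal sketch of a 'kill the minority side' rule of the kind described; the class and method names are illustrative, not VoltDB's actual API:

```java
// A node keeps serving only while it can see a strict majority of the
// cluster; otherwise it shuts itself down, so a split brain can never
// accept writes on both sides of a partition.
final class PartitionGuard {
    private final int clusterSize; // total number of nodes in the cluster

    PartitionGuard(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** True if this node may keep serving; false means kill this node. */
    boolean mayContinue(int reachableNodes) {
        // With 2 * reachableNodes <= clusterSize, another partition of
        // equal or greater size could also be alive and serving.
        return 2 * reachableNodes > clusterSize;
    }
}
```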


> This is precisely CP.

Hum. I would prefer to reserve 'precisely CP' for a system that shuts down one of the partitions, not one that cannot restart after a partition. Even if, formally, you can use both labels (i.e. CP/AP).

> VoltDB changed the way our product behaved in version 3.0 to aggressively kill nodes if there is any potential for split brains. We originally intended this feature for the cloud only,

Quite interesting.


> isn't the entire optimization redundant anyway?

With ZooKeeper the ephemeral znode is linked to a session, and if you restart the process it will have a different session, so you also need to manage the znode deletion explicitly in this case. It's possible of course; I have a slight preference for the 'ultimate line in the script' option, as it does not rely on the restart, but both options are likely good.
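A minimal sketch of that session behaviour, assuming the standard Apache ZooKeeper Java client (connection string, path and timeout are illustrative):

```java
import org.apache.zookeeper.*;

// An ephemeral znode is tied to the *session* that created it. A restarted
// process gets a new session id, so the stale znode lingers until the old
// session times out -- unless it is deleted explicitly before exiting.
public class EphemeralZnodeSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});
        zk.create("/locks/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // ... do work while holding the znode ...
        // The 'ultimate line in the script' option: delete before exiting
        // instead of relying on a restart plus the session timeout.
        zk.delete("/locks/worker-1", -1); // -1 matches any version
        zk.close();
    }
}
```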


This post looks at the application as a whole, not only the datastore. But for an eventually consistent big data store, AP is not possible. See http://blog.thislongrun.com/2015/04/cap-availability-high-av...


> I frankly think the final "update" to the article should be ignored.

Coda updated his post just because he was wrong on his definition of partition. Stonebraker pointed out: "You simply failover to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years. Therefore, considering a node failure as a partition results in an obviously inappropriate CAP theorem conclusion."


(cross-posting from the blog)

> Second [...] In that case, why use 2PC at all?

Heuristic decisions have been in the XA standard since its first version (1991). YMMV, but they are very often used (to say the least) in 2PC production systems. See for example how Mark Little describes them: http://planet.jboss.org/post/2pc_or_3pc. They are not really presented as an optional thing. It shows traditional databases are not that 'CP at all costs' when it comes to managing failure.
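To make the mechanism concrete: in the standard javax.transaction.xa API, a resource manager that has made a heuristic decision reports it as an XAException during the second phase. A minimal sketch (the surrounding method is mine, not part of the standard):

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Second phase of 2PC against one branch. If the resource manager has
// already made a heuristic decision, commit() throws an XAException whose
// errorCode says what actually happened to the branch.
void commitBranch(XAResource res, Xid xid) {
    try {
        res.commit(xid, /* onePhase= */ false);
    } catch (XAException e) {
        switch (e.errorCode) {
            case XAException.XA_HEURCOM:  // heuristically committed
            case XAException.XA_HEURRB:   // heuristically rolled back
            case XAException.XA_HEURMIX:  // partly committed, partly rolled back
            case XAException.XA_HEURHAZ:  // outcome unknown (hazard)
                // Atomicity is already broken here: log the damage for
                // reconciliation, then tell the RM to forget the branch.
                try { res.forget(xid); } catch (XAException ignored) { }
                break;
            default:
                throw new RuntimeException(e);
        }
    }
}
```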

> Your conclusion is correct [...] the resulting algorithm is not C or A

Yeah... I see 2PC as not partition-tolerant, since a partition breaks ACID atomicity. Once partition intolerance is accepted, CA fits well: 2PC is consistent & available until there is a partition. Saying '2PC is not consistent and not available but is partition-tolerant' is not technically false, but it's a much less accurate description.

> Third [...] datacenters full of commodity servers [...] redundant NICs and cables, located next to each other [...] It isn't hard to believe that you could build a system, like that, in which you could add algorithms that were CA

I just totally agree with you here. CAP as a categorization tool is used for all types of distributed systems, but there is a huge difference between an application built on commodity hardware running in a multi-DC config and a 2-node system running on redundant hardware in a single rack. Typically, 2PC has historically been used for very specific applications: few nodes, a limited number of operations between the nodes, expensive redundant hardware all over the place, limited scale-out needs (if any). Not your typical big data system.


Thanks for posting this to HN (I'm the post author).

There are two posts: 1) the one you linked, which looks at the application as a whole, not really at the datastore alone; 2) another on the big data stores themselves, and more generally on the different definitions of 'availability'. It's here: http://blog.thislongrun.com/2015/04/cap-availability-high-av...

I will update the posts to make this clearer. As well, it's a series and these posts are the most complex of the series :-). Many points are more detailed in the previous posts.


There is a kind of opposition between 'CAP as a theorem' and 'CAP as a tool to think about distributed systems'. The theorem does not leave much room for anything other than black and white. But in the (great) paper you mention, there is a lot about "what is the future for distributed systems"; that's more CAP-as-a-tool.

In the post (and in the blog) I stick to the theorem. It's not 'right' or 'wrong', it's a choice. I made this choice because a lot of people are deciding their trade-offs with CAP-as-a-theorem, while actually CAP-as-a-theorem cannot be applied to the problem they're working on.


'Atomic consistency' is the term used in the CAP proof. YMMV, but I've seen it used more often than 'linearizability' (it's easier to pronounce...). Agreed, there is no confusion when you use linearizability.


Yes, I agree. It's not 'eventually consistent' but it's nearly available-in-CAP.

There are some elements on the CAP definition traps in the first post: http://thislongrun.blogspot.com/2015/03/comparing-eventually.... I also plan to do another post on this subject.


Thanks for the feedback.

> if the developer is not up to understanding the locking model of a traditional DB I'd be pretty surprised if they were up to working with an eventually consistent system

I agree. As well, most 'NoSQL' systems don't throw everything away. Typically, some of them are strongly consistent. The ones based on Dynamo claim to have 'tunable consistency', i.e. the choice between strong & weak consistency is up to the user.

We hear a lot of "ACID simply works" and "NoSQL is available". The blog is basically about saying things are not that simple, and that includes "isolation in ACID isn't that simple".


First, I've made one sinful over-simplification in my own post, in conflating NoSQL systems with eventual consistency. While that's usually the case, it's absolutely not an intrinsic property: my apologies!

True SERIALIZABLE-level ACID does pretty much simply just work - and if you're using Postgres the performance hit really isn't too bad. Of course you're chucking away replication then, so whether it's suitable for your needs may vary rather!

Dynamo-based systems have 'tunable consistency' but that's almost always over one key: multi-key operations are usually inconsistent. That being the case, they're pretty much only 'easy to use' for applications with a very simple data model: my experience is that most applications of any real complexity will at some point want to do some kind of multi-key operation. That being the case, you're probably on the hook for a pretty expensive programmer.
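For readers unfamiliar with the per-key quorum arithmetic being referenced: with N replicas, W write acknowledgements and R read replies, R + W > N makes reads see the latest write on that one key, and nothing coordinates across keys. A minimal sketch (the type is illustrative, not any real client's API):

```java
// Dynamo-style per-operation quorum settings for a single key.
record QuorumConfig(int n, int w, int r) {
    // Strong consistency *for this one key*: every read quorum overlaps
    // the last successful write quorum. Multi-key operations get no such
    // guarantee from this mechanism.
    boolean stronglyConsistentPerKey() {
        return r + w > n;
    }
}

// e.g. new QuorumConfig(3, 2, 2).stronglyConsistentPerKey() == true,
// while new QuorumConfig(3, 1, 1) trades that away for lower latency.
```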

I'm vaguely aware that this doesn't strictly apply to Cassandra, which has a limited notion of transactions - last I checked they didn't work very well at all ( https://aphyr.com/posts/294-call-me-maybe-cassandra/ ), but that may well have changed.

I do appreciate your blog post in general - I think there's an awful lot of oversimplifying of this stuff out there. Part of the problem is that high speed, concurrent, distributed data storage is a topic that is, at its heart, pretty damn complicated. Unfortunately,


Well, I've never used Postgres, but from the v9.4 documentation it does not look that different from the other DB engines: 'Read Committed is the default isolation'; 'applications using this level must be prepared to retry transactions due to serialization failures' (plus the one you mentioned already: 'Serializable transaction isolation level has not yet been added to Hot Standby replication targets').

Not exactly what I like to call 'simply works' ;-)

But I didn't want to say that a NoSQL database is always better than a traditional one. Just that isolation is complex on traditional systems when dealing with volume & concurrency. And, typically, transactions across tables or even rows are difficult/impossible for a distributed database, as those rows can be on different nodes (the 'multi-key operations' you mentioned).


Postgres is quite different in one respect: it has truly serializable snapshot isolation, at an acceptable performance cost (generally a single-digit percentage). Other DB systems are either not truly serializable, or have lock-based schemes that are sometimes more difficult to work with for web apps.

> 'applications using this level must be prepared to retry transactions due to serialization failures'

True of any serializable system that supports concurrent access, AFAIK - not quite sure that's a fair criticism :-).
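For what it's worth, here is a minimal sketch of that retry loop over JDBC against Postgres (the helper class is mine, not a real library; PostgreSQL reports serialization failures as SQLState 40001):

```java
import java.sql.Connection;
import java.sql.SQLException;

// Run a transaction at SERIALIZABLE, retrying on serialization failures.
public final class SerializableRetry {
    public interface TxWork<T> {
        T run(Connection conn) throws SQLException;
    }

    public static <T> T run(Connection conn, TxWork<T> work, int maxAttempts)
            throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        for (int attempt = 1; ; attempt++) {
            try {
                T result = work.run(conn);
                conn.commit();
                return result;
            } catch (SQLException e) {
                conn.rollback();
                // 40001 = serialization_failure in PostgreSQL
                if (!"40001".equals(e.getSQLState()) || attempt >= maxAttempts) {
                    throw e;
                }
                // Otherwise loop and re-run the whole transaction.
            }
        }
    }
}
```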

------

> And, typically, transactions between tables or even rows are difficult/impossible for a distributed database

Depends what you mean by 'distributed', really - Oracle RAC is very much distributed, and supports normal transactional behaviour. On the other hand you won't get that working across a large geographical area.

I accept that understanding the impact of isolation levels can be complex - I'm just very much of the opinion that you'll take a lot more pain trying to maintain consistency in a typical NoSQL system.


I can agree with these facts. I just give them a different weight than you do: I don't like the uncertainty of the 'serialization failures': they depend on a workload that can be difficult to predict, especially if you're a software vendor. YMMV :-) Thanks for the constructive feedback.


Thanks for a good read/conversation!

