When is "ACID" ACID? Rarely (bailis.org)
63 points by pbailis on Jan 22, 2013 | 31 comments



While I appreciate how thorough the article is, it's a bit of a strawman. Pretty much everyone who makes use of a relational database in a professional capacity has to be aware of what transaction isolation level they're using, make their own choice about what to use, and then do things like acquire explicit update locks or do optimistic locking in order to ensure data integrity. But that doesn't mean that the ACID properties are useless merely because you have to do that; it might mean you have to think about a few things more than you'd like to, but it's still a different world than trying to mimic ACID properties in a NoSQL database, and there are still fairly hard guarantees about things like consistency that you get with other isolation levels. For example, with read committed or snapshot isolation, I still have transactionality and can be sure that if I issue a sequence of 10 updates in a single transaction and then commit it, any other single query is either seeing the results of all 10 or of none of them. That's an important guarantee in many situations, and it's a guarantee that I can use to make decisions about how I structure my application logic.

The author of the post basically seems to treat any isolation level below serializability as some sort of sham perpetrated on the development community, and that's not the case: they're still useful, and they're still something that you can use to build the sorts of application guarantees you want. The mere fact that pretty much every database vendor gives you a choice as to what isolation level to use should be a pretty obvious clue that there's no one-size-fits-all answer there, so harping on the non-existence of a serializable isolation level is somewhat missing the point.


I think it is extremely hard to reason about data integrity properties in non-ACID systems.

Consider a banking application that uses linked accounts such that transactions fail if the combined balance of linked accounts is below zero. If I implement this in Oracle using Snapshot Isolation (the highest isolation level Oracle offers) in the obvious manner, I'll get silent constraint violations: transactions that should fail will succeed because MVCC can't stop two competing transactions debiting the same account pair (i.e., I have two linked accounts A & B with $100 in each and I launch two simultaneous transactions to transfer $150 away from both: afterwards, I'll end up with -$50 in both or some garbage data).

Now, you can fix this if you recognise the problem in advance, by doing a select-for-update or by changing your schema to materialize the constraint (say, by creating a linked-balance table that holds the combined balance for all pairs of linked accounts).
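The write-skew anomaly described above can be sketched as a toy simulation (plain Python standing in for a snapshot-isolation database; the model and names are hypothetical, not Oracle's actual implementation):

```python
# Toy model of write skew under snapshot isolation. Each transaction reads
# from a private snapshot taken at start, checks the invariant, and writes
# back only the rows it modified. Writes to *different* rows don't conflict,
# so both commits succeed even though the invariant is violated.

balances = {"A": 100, "B": 100}

def transfer(snapshot, debit_account, amount):
    """Debit `amount` if the combined snapshot balance stays non-negative.
    Returns the row this transaction wants to write, or None if it aborts."""
    if snapshot["A"] + snapshot["B"] - amount >= 0:
        return {debit_account: snapshot[debit_account] - amount}
    return None

# Both transactions start from the same snapshot...
snap1 = dict(balances)
snap2 = dict(balances)

# ...each sees A + B = 200, so each check passes individually.
write1 = transfer(snap1, "A", 150)
write2 = transfer(snap2, "B", 150)

# Different rows were written, so a first-committer-wins check sees no
# write-write conflict, and both commits are applied.
balances.update(write1)
balances.update(write2)

print(balances)  # {'A': -50, 'B': -50}: the invariant A + B >= 0 is violated
```

A select-for-update on both rows, or a single materialized linked-balance row, would force the two transactions to touch a common row and therefore conflict.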

But it is really hard to even notice the problem, especially if you've got a few dozen tables with multiple applications writing to your database (RDBMS advocates insist that this is a good thing). And there are no automated methods for determining when this will be a problem: you just get silent data corruption or silent constraint violations in your extremely expensive "ACID-compliant" (but not really) Oracle database.


While I agree that it's a hard problem, I'm not sure I'd agree that it's a hard problem to notice. It's such a common problem for anyone dealing with a relational database that I (and pretty much every engineer I work with) would recognize that sort of a problem immediately, in the same way that I always have app-level threading concerns in the back of my head when I'm working in a language like Java. I just take those sorts of things as a given, and my experience working with other engineers is that anyone else who's had to deal much with SQL has had to think about it as well. You develop a sense of when you need to worry about race conditions and when you don't.

Pretty much every ORM solution has some mechanism for dealing with this sort of a problem, as well. For example, the ORM we've built internally uses optimistic locking in most cases, which at least simplifies the problem space that our engineers have to worry about (i.e. they only have to worry about conflicts that can happen due to updates to two different rows, but any conflicts due to an update to the same row are automatically detected).
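The optimistic-locking pattern mentioned above can be sketched minimally, assuming a hypothetical in-memory row store with a version column (not any particular ORM's API):

```python
# Optimistic locking sketch: each row carries a version number, and an
# update only applies if the version is unchanged since the row was read.
# A stale write raises instead of silently clobbering the newer data.

class StaleRowError(Exception):
    pass

rows = {1: {"balance": 100, "version": 7}}

def read(row_id):
    return dict(rows[row_id])  # copy, including the version we read

def update(row_id, new_balance, expected_version):
    row = rows[row_id]
    if row["version"] != expected_version:
        raise StaleRowError("row changed since it was read; retry")
    row["balance"] = new_balance
    row["version"] += 1

# Two clients read the same row...
r1 = read(1)
r2 = read(1)

update(1, r1["balance"] - 30, r1["version"])      # first writer wins
try:
    update(1, r2["balance"] - 50, r2["version"])  # second writer is stale
except StaleRowError:
    print("conflict detected")  # caller would re-read and retry
```

This catches same-row conflicts automatically; conflicts across different rows still need explicit handling, as the comment notes.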

I'm also firmly in the "having multiple applications writing to the same database is sheer madness" camp, due to these (and other) problems.

Again, I'm not saying that these things aren't annoying to think about, just that I think lots of people are used to thinking about them at this point, and it's a pretty well-understood and often well-communicated problem space with a set of known solution patterns, many of which are well-supported by ORM frameworks.


"MVCC can't stop two competing transactions debiting the same account pair (i.e., I have two linked accounts A & B with $100 in each and I launch two simultaneous transactions to transfer $150 away from both: afterwards, I'll end up with -$50 in both or some garbage data)."

That isn't true. In a pure MVCC world you could detect such issues, specifically that two transactions were opened and both tried to update the same row. This would generate a conflict, and one transaction would simply be aborted. The snapshot-isolation/MVCC literature describes this issue.

However what InnoDB/mysql and Clustrix actually do is use MVCC for read-write conflicts and two-phase-locking for write-write conflicts which trivially fixes your proposed problem.
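The distinction can be sketched with a simplified first-committer-wins model (hypothetical, not InnoDB's or Clustrix's actual implementation): same-row write-write conflicts are caught, but the different-row write-skew case from the example above slips through.

```python
# First-committer-wins conflict detection: a commit is rejected if any row
# it wrote was committed by another transaction after this transaction's
# snapshot was taken. Writes to untouched rows sail through.

committed_version = {"A": 1, "B": 1}  # row -> version at last commit

class WriteConflict(Exception):
    pass

def try_commit(snapshot_versions, written_rows):
    for row in written_rows:
        if committed_version[row] != snapshot_versions[row]:
            raise WriteConflict(row)
    for row in written_rows:
        committed_version[row] += 1

# Both transactions snapshot the same versions.
snap = dict(committed_version)

try_commit(snap, {"A"})          # T1 commits its update to row A
try:
    try_commit(snap, {"A"})      # T2 also wrote row A: conflict, T2 aborts
except WriteConflict:
    print("same-row conflict detected")

try_commit(snap, {"B"})          # a write to a *different* row slips through
```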


To clarify the example, the two simultaneous transactions are updating different rows: one updates account A's row while the other updates account B's. There is no way for MVCC to recognize this conflict: it will happily corrupt your data.


Sorry, I'm not following. If you have 2 rows storing how much money you have in 1 account, then your data isn't normalized and serializability isn't going to solve your problem. If you could be (even) more explicit, I might be able to answer.


I think we're conflating Snapshot Isolation and MVCC (e.g., Snapshot-isolation/MVCC). MVCC is a general concurrency control mechanism, not an isolation level. Coupling MVCC and Snapshot Isolation is like saying "using locks provides serializability," which is not true in general--it depends on how you use the locks.

That said, Snapshot Isolation can lead to anomalies like Example 1.2: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142... If you perform additional synchronization (e.g., 2PL, as you describe), then you can get serializability.


Yes, you're totally correct. I was thinking of SI.


The idea is that each account has a balance. Say I have a checking and savings account: there's one row for each and the bank will kindly let either of them get a negative balance provided the sum of both balances is positive.

You may not care for this schema design, but I think most people would be surprised that Oracle silently corrupts data and violates constraints here.


I'm not sure that people who use relational databases would be surprised that there can be data-level race conditions. For example, if you write code like:

1) User requests transfer in the UI

2) Query the DB to see if the transfer is legal

3) If so, update the database

you have to think about race conditions even if the database implements serializability. First of all, you have to have enough of a clue to do steps 2 and 3 inside the same database transaction, or else the isolation level doesn't help you in the least. Everyone who deals with an RDBMS has to worry about transaction boundaries all the time anyway to avoid race conditions, so it's not too much of a leap to assume that people who are used to worrying about transaction boundaries will have to have some idea of what kind of problems those transaction boundaries will prevent for them. Most databases in production systems don't do full-on serializability as an isolation level (and if they do, it's too expensive to turn on), so pretty much every engineer's real-world experience is that if they do a query against table A, then an update against table B based on that data, that they can have a race condition and they need to lock A or materialize potential conflicts.
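The check-then-act race in steps 1-3 above can be replayed deterministically (plain Python, no database; the interleaving is forced by hand):

```python
# Both requests run the "is it legal?" check before either applies its
# update, so both succeed and the balance goes negative. Wrapping the check
# and the update in one serializable transaction (or one lock) prevents
# exactly this interleaving.

balance = 100

def check(amount):
    return balance >= amount          # step 2: query the balance

def apply(amount):
    global balance
    balance -= amount                 # step 3: update

# Interleaving: check1, check2, apply1, apply2
ok1 = check(80)
ok2 = check(80)
if ok1:
    apply(80)
if ok2:
    apply(80)

print(balance)  # -60: both checks saw 100, so both updates ran
```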

I guess it's possible that my personal experience or the engineers that I work with are not a representative sample, but if you're working with sensitive issues like money transfers you tend to be really, really super-paranoid about it, and there's no way that something like that is overlooked by a halfway competent team.


there are still fairly hard guarantees about things like consistency that you get with other isolation levels

What do you mean by consistency? I agree that there are many ways to ensure that application integrity constraints are not violated--without using serializability. My point is that, without ACID serializability, you'll have to do some extra work to ensure them in general [1].

The author of the post basically seems to treat any isolation level below serializability as some sort of sham perpetrated on the development community, and that's not the case

Weak isolation has a long tradition spanning decades [2] and is hardly a "sham." It's well known that weak isolation doesn't guarantee the ACID properties as traditionally defined. My point is that many databases don't provide ACID as they promise.

It's still a different world than trying to mimic ACID properties in a NoSQL database

In terms of current offerings, I'm sympathetic to this viewpoint, but you might be surprised how cheaply we can get properties like the transactional atomicity you mention.

In general, I'm curious how easily anyone's able to take their application-level requirements and map them down onto an isolation level like "read committed," especially given how awkwardly they're defined [3] and how many corner cases there are.

[1] e.g., http://www.bailis.org/blog/when-is-acid-acid-rarely/#arbitra...

[2] "Granularity of Locks and Degrees of Consistency in a Shared Data Base," Jim Gray et al., 1976 http://diaswww.epfl.ch/courses/adms07/papers/GrayLocks.pdf

[3] e.g., http://www.bailis.org/blog/when-is-acid-acid-rarely/#weak-no...


Fundamentally, what matters is the set of invariants you want to preserve, and it's usually the case that some of them can be preserved for you by the database and some can't and have to be dealt with at the application level. So by "consistency" I mean "some invariants that I care about will be preserved," but that doesn't mean all such invariants are.

For example, if I'm writing a process that merely queries a table to pull back the user's account balance in a single query, read committed isolation might be enough for that particular use case to have "consistency": I know I'm always seeing some consistent balance, even if it might not reflect transactions that are currently in flight (thus giving me a different answer if I run the query again in 2 seconds). That's still a better consistency guarantee than a dirty-read (read uncommitted) isolation level (or effectively no isolation), so it's still useful.

If I'm doing an actual update to account balances, however, that level of consistency is obviously no longer good enough. If all updates hit the same rows in the same tables, then snapshot isolation might be good enough to avoid problems. And if that's not good enough, say because conflicting updates can span different rows or tables, then I can acquire explicit update locks and such. Even in that case, though, I'll need to worry about application-level invariants (like "a person can only withdraw up to $300 per day").

So my point is that even without serializable isolation the database can still guarantee some invariants for me, even if it can't guarantee all of them, and that the database can never really guarantee preservation of all of the invariants that matter to me no matter how strong its guarantees, so I always have to think about what I'm going to handle at the application level anyway.

In the case of my company (which makes applications for insurance companies), we do have to think about those sorts of things, but again, we have to think of a ton of things anyway, and the division of labor between the app tier and the database tier is always something we have to worry about. We do things like build optimistic concurrency into our ORM layer to make most common cases easier to think about, and we have pretty well-defined transaction lifecycles, but for the most complicated cases we have to think about what the potential for race conditions in the database would be, just like we have to think about them at the application level, and then we have to decide how to handle them. Again, even a "true" ACID database wouldn't prevent us from having to do that work, because many of the invariants we want to preserve in the data aren't expressible in the database anyway.


Clustrix says it provides repeatable read because it complies with the ANSI spec. In truth, we actually provide something closer to snapshot isolation by default. Snapshot isolation, in summary, guarantees that all reads made in a transaction see a consistent state of the database; a transaction will only fail to commit if two transactions' updates conflict. This is a strong isolation level that does have a set of anomalies, but most companies don't care about them. Microsoft put out a great paper about snapshot isolation: http://research.microsoft.com/pubs/69541/tr-95-51.pdf

As the post points out, weaker isolation levels allow for more concurrency which translates to more scalability, which is why we chose our isolation level.


Huh? You ought not be depending on the default isolation settings. If your workload needs serializability, set it. Most don't. CAP teaches us that ACID is a spectrum in a distributed system.


Did you read the post? First, it covers both default isolation and maximum available. Second, note that many databases (like Oracle 11g) don't actually give you serializable semantics at ANY setting. Third, most of these products are not distributed systems. Fourth, ACID is not a spectrum for distributed systems; it is basically impossible to apply without sacrificing availability. But that's perhaps not a big deal if our non-distributed DBs don't provide real ACID anyway.

The real issue here is that the database world is a cargo cult where ignorant people scream ACID to denigrate new technologies without noticing that most production databases aren't running with anything close to ACID and that major database vendors can't even support ACID.


I did read the post.

It does cover both default and maximum. I didn't dispute that. I called out the notion that anyone ought to be depending on the defaults in the first place, or that SERIALIZABLE as a default was a good choice.

Yes, many don't support SERIALIZABLE. Didn't contradict that either.

As to whether many of these are or aren't distributed systems:

Ingres - has replication.

Aerospike - distributed/fault tolerant/blah blah

Persistit - nope. appears to be a library.

Clustrix - clustered.

Greenplum - this is shared-nothing clustered postgres.

DB2 for zOS - i have no idea. let's call this one not distributed, for giggles.

Informix - same

MySQL - lots of replication and HA options

MemSQL - replicated

MSSQL - replication and federated query modes

Nuodb - cloud database management? looks distributed to me.

Oracle - dont they have RAC ?

Berkeley (x2) - dont know. probably not.

PostgreSQL - a few replication options

HANA - no idea. lets call it in your favor.

ScaleDB - clustered.

Volt - shared nothing clustering

That's a little over half, by my count. Certainly close to most.

"The real issue here is that the database world is a cargo cult where ignorant people scream ACID to denigrate new technologies without noticing that most production databases aren't running with anything close to ACID and that major database vendors can't even support ACID."

Some can't. Some do. I'm not screaming. My main message is this: Don't depend on defaults. They differ from vendor to vendor. Understand your workload and use the APPROPRIATE isolation for it.

( edited for formatting and clarity )


I called out the notion that anyone ought to be depending on the defaults in the first place

Regardless of whether you should depend on them, many many people do. Heck, many people don't even understand that there's a choice to be made: after all, everyone knows that Oracle is ACID compliant, right?

Yes, many don't support SERIALIZABLE. Didn't contradict that either.

Sorry, I was confused by the bit about "If your workload needs serializability, set it" since that's physically impossible on Oracle 11g.

Just because a DB has a replication package available (like MySQL) does not mean that it is a distributed system. And the file backed DBs (like Berkeley) are definitely not distributed. Sure, there are some extremely expensive massively parallel DBs in use (like Volt), but the number of deployments for those systems is a drop in the bucket compared with single-node MySQL/Postgres/Oracle/SQLServer/DB2 instances.


Good points on both sides. One thing I'll point out is that many clustering, HA, and multi-master replication solutions either rely on a single master (what I think Michael means when he says they're not distributed) or don't provide serializability.

Things get harder when you're distributed. For example, if you cluster with a sharded master-slave configuration, then, to serialize transactions that span shards/partitions, you'll need to do 2 phase commit or similar between masters for writes and, for most read/write transactions, make sure you don't read from slaves. If you cluster via master-master/active-active, then, for serializability, you'll need locking or some other concurrency control across masters for each shard/partition. Both setups are definitely do-able (if not highly available) but require non-trivial engineering.
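The cross-shard two-phase commit described above can be sketched roughly as follows (hypothetical in-process participants; a real implementation also needs durable logging, timeouts, and crash recovery):

```python
# Minimal two-phase commit sketch. Phase 1 collects votes from every shard
# the transaction touched; phase 2 applies the unanimous decision. Any "no"
# vote aborts everywhere.

class Participant:
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote. A real system would force-write a prepare record
        # to its log before voting yes, so the vote survives a crash.
        self.prepared = True
        return True

    def commit(self):
        assert self.prepared
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: everyone must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2: the decision is commit; deliver it to every shard.
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

shards = [Participant("shard-1"), Participant("shard-2")]
outcome = two_phase_commit(shards)
print(outcome)  # committed
```

Note that the coordinator is a blocking single point of failure between the two phases, which is part of why these setups trade away availability.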


"Regardless of whether you should depend on them, many many people do. Heck, many people don't even understand that there's a choice to be made: after all, everyone knows that Oracle is ACID compliant, right?"

Ignorance is not an excuse. It just isn't.

"Just because a DB has a replication package available (like MySQL) does not mean that it is a distributed system. And the file backed DBs (like Berkeley) are definitely not distributed. Sure, there are some extremely expensive massively parallel DBs in use (like Volt), but the number of deployments for those systems is a drop in the bucket compared with single-node MySQL/Postgres/Oracle/SQLServer/DB2 instances."

I do not understand what point you are trying to make here. Are these systems de facto non-distributed systems merely because of deployment counts? Is there an objective criterion here I should be aware of?


Ignorance is not an excuse. It just isn't.

I'm not interested in excusing anything, I'm interested in understanding the real world. And what I see is that a lot of people who insist on the absolute need for ACID don't really understand it because they're using non-ACID technology right now.

I do not understand what point you are trying to make here.

Consider MySQL with asynchronous replication to a slave. In a weak sense, that is a distributed system because the remote slave is on a different machine (and probably very distant) from the master. But the distributed bit here doesn't interfere with correctly implementing ACID: MySQL with async replication operates identically to single node MySQL: it just transmits the binary logs to a slave server. The database system itself is a single-node service whose state gets replicated at transaction boundaries.

In contrast, distributed shared-nothing databases like Volt have to work really hard to maintain consistency: there is no single node in those systems that does all the work and gets replicated: multiple nodes have to cooperate in order to get anything done.


I would amend that MemSQL is replicated as well as shared-nothing clustered


People scream ACID at NoSQL systems because they fail the easier A, C, or D in various forms, not the harder I part. Having A, C, and D goes a long way in ensuring data correctness.


Unlike "C" and "A" in "CAP," "AC&D" (specifically, "C") can't be easily separated from "I".

Serializability ("I") ensures that database consistency, or maintenance of integrity constraints ("C"), is not violated. While it's possible to get consistency ("C") without serializability (which would give up traditional "I" in favor of a weaker form of isolation), it's often difficult [see http://www.bailis.org/blog/when-is-acid-acid-rarely/#arbitra...].


Things like MongoDB fail at A and D, which is what the parent was probably thinking of.


It also fails at C for concurrent writes


Sure, but A and D are the particularly embarrassing ones.


I think read committed and repeatable read are "pretty close" to ACID. Certainly compared to a straight-up NoSQL system that has close to none of A, C, I, or D.


When I first read this headline, I thought it had to do with purity of a certain Swiss-made variety that left its mark on San Francisco (and surprisingly, Palo Alto---e.g. Jerry Garcia and Ken Kesey). My mistake.


I should have known better, especially in the context of this website. Due to a recent conversation with a friend about ph balance and the acidity of various foods, I thought it was a discussion of the chemical properties of various substances. Extraneous context affecting my initial impression.


I think it's great that someone is taking a detailed look at distinct isolation levels for different systems. It's also worth looking at distinct levels for atomicity, consistency, and durability. One could then define an overall "pH" level for systems.

I would take exception to Bailis' statement that "it is impossible to simultaneously achieve high availability and serializability." At FoundationDB, we do exactly that.

Stephen Pimentel

foundationdb.com


I think it's rather clear that he's talking about high availability in the CAP sense, which is precisely the kind of availability FoundationDB (rightly) doesn't claim to achieve (http://foundationdb.com/#CAP).

BTW, the images are failing to load on that page for me.



