I've been migrating data from a NoSQL database into PostgreSQL for the past few weeks. The data now occupies about half the space on disk, and queries are faster by a factor of 100, especially the advanced ones. Note, however, that I have a pretty good understanding of SQL, so I know how to optimize the queries.
One powerful aspect of SQL is that the database can optimize the query. Some NoSQL databases force you to write joins manually in code; unless you really know what you are doing and spend hours on it, the query will be slow. This is exactly what I see with many of our queries. Relational databases are not the answer to everything, but I would prefer to start with one rather than a NoSQL variant. You can always migrate when things begin to get too slow (i.e., when you get mad success, which you probably never do anyway).
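To make the manual-join point concrete, here is a hypothetical Python sketch (invented names and data) of the kind of join you end up hand-rolling against a key-value store, where no planner helps you:

```python
# Two "tables" fetched from a hypothetical KV store as plain dicts/lists.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 2, "total": 15},
    {"user_id": 1, "total": 20},
]

# Hand-rolled "join": aggregate orders per user, then map user ids
# to names. You must choose the strategy yourself; an SQL engine
# would pick one (hash join, index lookup, ...) for the equivalent
# SELECT ... JOIN ... GROUP BY.
totals = {}
for order in orders:
    totals.setdefault(order["user_id"], 0)
    totals[order["user_id"]] += order["total"]

report = {users[uid]["name"]: t for uid, t in totals.items()}
print(report)  # {'alice': 50, 'bob': 15}
```

A naive version of this (a nested loop over both collections) is exactly the slow query the comment above warns about; the aggregation dict is the "hours of tuning" part you get for free from an optimizer.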
> Some NoSQL databases force you to write joins manually in code. Unless you really know what you are doing and spend hours on it, the query will be slow.
Most (?) encourage denormalization instead of joins-in-code.
> This is exactly what I see with many of our queries. Relational databases are not the answer to everything, but I would prefer to start with one rather than a NoSQL variant.
For most application domains, I agree -- but there are plenty of companies that have essentially outgrown the relational approach already.
That's like saying "I live in a capitalist nation and capitalism is not what it's cracked up to be."
NoSQL is an ideology, and anyone you hear who uses that term is an ideologue. Please name the "NoSQL" you have been using and name the specific problems you have had with it so we can discuss matters more concretely.
Just as RDBMS users prefer one above all else, so do "NoSQL" users. I have seen my fair share of them and I thought a good chunk of them were "crap" (where "crap" means "does not fit my mental model of the solution", mainly because I am too naive to study the field better and appreciate what the tools are capable of, so I dismiss anything that doesn't entirely model CL's Elephant).
Last week I rolled out my own Memcached "replacement" in Common Lisp, and I was spitting in bradfitz's general direction, that loony bloat monger! Yeah, my "memcached" replacement is an in-process memoizer for all the DB-accessing functions. It worked until the server instance was overloaded and the load balancer forked a second instance: then I had a problem. I wish I could suck back all the spit I have sent in brad's general direction.
> NoSQL is an ideology, and anyone you hear who uses that term is an ideologue.
You're sounding a little ideologue-ish yourself. :)
NoSQL is a term of limited use since there is such a huge variety of non-relational database technologies, but that doesn't mean using it makes you an ideologue any more than someone using the term "dynamic languages," which is similarly broad.
btw, it's ironic that the author of the story was complaining about the lack of querying language, when, in fact, "NoSQL" stands for "No Structured Query Language" :-D
I've had an unusual problem with couchdb. A bad query can flatten your host. The engine uses every bit of performance in pursuit of its answer, which might be the workload equivalent of a series of stupid open joins in SQL. Your bash consoles become unresponsive. The same mistake with a traditional relational database causes less damage because those databases are less effective at harvesting the power of the host.
There's a training obstacle. I know how SQL works and how to make SQL perform well almost by instinct. NoSQL requires me to relearn data querying. The lack of standardisation across offerings means that the knowledge I acquire will be less transferable than what I've accumulated from Oracle/Postgresql/Openbase/Mysql/Sqlite.
I don't think that the people who use CouchDB (or similar non-traditional data stores) are really worried about "transferable knowledge". They're just interested in solving a problem they think the NoSQL group of technologies can help them with.
You should think of CouchDB as the Web 2.0 equivalent of Microsoft Access. It's good for running a simple web app, because you can store the HTML, JS, and the data all in CouchDB (I think). So one use case may be a simple billing system for your local surf shop that is sure as hell not going to store a lot of data or handle a lot of load. I got this impression from a Damien Katz talk. (I think the use case they use is a recipe app.)
> you can store the HTML, JS and the data all in CouchDB
I think you're talking of CouchApp, which allows you to create apps like that on top of CouchDB. Of course, if that was the maximum of what one was trying to do, one could get similar functionality from an RDBMS, some simple routing rules, and a CGI script that served content straight from the DB.
I guess my point is that there's not really anything intrinsic to CouchDB that makes this possible: it's a pretty short and inviting conceptual leap from "this DB does everything through an HTTP interface" to "I can run my entire application from this DB". My worry (and I'm not accusing the parent of this) is that that misses out on all of the other things that CouchDB offers (especially the ability to merge DBs that have been disconnected for long periods, which often gets lost behind the HTTP interface and MapReduce hand-waving).
The difference being that if you want to answer a question like "what hour of the day is my store busiest?", you would have to write a MapReduce function in Erlang for CouchDB, whereas anyone with minimal knowledge of SQL can figure out that query.
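For comparison, the SQL side of that busiest-hour question really is a one-line GROUP BY. A hypothetical sketch using SQLite (invented schema and data):

```python
import sqlite3

# Hypothetical schema: one row per sale, with a timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?)",
    [("2009-11-09 09:15:00",),
     ("2009-11-09 14:05:00",),
     ("2009-11-10 14:40:00",)],
)

# "What hour of the day is my store busiest?" -- one GROUP BY,
# no view function to write, deploy, and rebuild.
row = conn.execute(
    "SELECT strftime('%H', sold_at) AS hour, COUNT(*) AS n "
    "FROM sales GROUP BY hour ORDER BY n DESC LIMIT 1"
).fetchone()
print(row)  # ('14', 2)
```

The equivalent in a map/reduce store means writing a map function emitting the hour and a reduce function counting emits, then waiting for the view to build.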
You've got something that is inherently an order of magnitude slower than other solutions. Why would you bother to try to scale it? It's not the right tool for the job.
It will soon. We have an 8-node cluster running in PROD right now. A Dynamo-like layer provides consistent hashing, partitioning, read/write quorum constants and the like.
One of the complaints people have about the NoSQL label is that it attempts to collect, under one name, a huge variety of databases that don't fit the traditional label. The problem with that is illustrated in this article, which does a great job of describing problems that some of the NoSQL alternatives experience -- but the problems here don't apply nearly as strongly to document-oriented databases, or to graph databases. I think the title's overstating the argument, in other words.
Now if it were "the dark side of column-oriented databases," Bob's your uncle.
These are real problems. With Redis I tried to mitigate some of them by providing an API that allows you to retrieve the full state of the database using only the available commands (without having to mess with the binary dump), including, for instance, reading the TTL of an expiring key. This makes it possible to write scripts that convert a full database into a CSV file, and restore the CSV file back into the DB, using just the net API.
Of course, with key/value stores, changing the structure of the database is harder than with SQL databases. Still, designing a good data layout when building your application can go a long way toward making it simple to add features without resorting to scripts that rebuild the data in the key/value store.
I agree with the data-export point. @antirez, if we could clone you and distribute you to other KV projects, all KV datastores would skyrocket in popularity. The first and foremost issue in KV popularity is the ability to try it quickly. If there were a small PHP script that converted an SQL dump into a KV datastore (and vice versa), many more people would try them, knowing that if it didn't work out they could easily opt out. I hope you'll continue to be thoughtful toward those of us who are not C++ or Java wizards. Thanks!
I'm one of the lead programmers of Keyspace, which is a key-value store. The main strength of Keyspace is the consistent replication, but right now I'll tell you what my experiences are building a web app (real-time search) using Keyspace. In other words, what happened when I drank my own kool-aid.
1. Schemaless. This is a half-truth. In reality there is always some kind of schema, otherwise you'd get chaos. What works for us in Keyspace, where everything is a string, is to use "/" as a separator convention. So we have keys like:
"/user/id:001" => <data>
More complicated example:
"/user/id:001/view/keyword:keyspace/"
2. Listing. Since listing happens on a prefix basis in Keyspace (give me everything that starts with "/user/id:001"), if you just want to get a list of some properties of the user, you have to create the list explicitly, otherwise you'll get all the keys under the user. Eg.
3. If you need to list by property X, you need to create a subtree for it, as above. This sucks if you figure this out at a later point, because you have to programmatically parse your existing data and create it, besides adding it to the online code for new items. That's why RDBMSs are so good.
4. Backups. No good backup strategy at this point, hot or cold.
..
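A minimal in-memory sketch of the prefix-listing behaviour described in points 1 and 2 (hypothetical; a real Keyspace client does this server-side against the replicated store):

```python
# Stand-in for a prefix-organized key-value store using the "/"
# separator convention described above (invented keys and values).
store = {
    "/user/id:001": "alice",
    "/user/id:001/view/keyword:keyspace": "3",
    "/user/id:002": "bob",
}

def listkeys(prefix):
    """Return every key starting with prefix, the way a
    prefix-based listing does."""
    return sorted(k for k in store if k.startswith(prefix))

# Note the gotcha from point 2: asking for one user returns
# *everything* stored under that user, views included.
print(listkeys("/user/id:001"))
```

This is why you end up maintaining explicit sublists if you only want certain properties back: the store itself only understands prefixes.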
So why should anyone put up with this? You have to realize there's lots of hype right now, in reality KV stores should only be used by a small fraction of DB users. If an RDBMS is Java/C++, then a KV store is (can be) Assembly. You have much more control over what's going on. (You can of course get the same control by having a giant key-value table in your RDBMS.) You can put a cache layer in your stack more easily.
So you should think of a good KV store as something you use in certain high-performance cases, and even then, the KV store should offer some plus over having a giant key-value table in MySQL/PostgreSQL. Usually, that plus is some kind of distribution or replication that usually works. In the case of Keyspace, it's replication.
The other big plus, as I learned, is that programming in the KV model is much more natural than having embedded SQL statements in your code. A typical segment of my (Python) code is:
users = conn.listkeys("/user/id:%s/views/" % user_id)
for (...):
This is much nicer than having SQL statements, and since you're using a KV store, it can't get more complicated than this. As I said, I assume you're doing high-performance stuff, in which case some crappy ORM layer is out of the question anyway.
Personally I think there's a fairly large cultural disconnect between shops running open-source stacks and shops running Oracle, so such a comparison would not influence user choices much. But it would be very interesting for an implementor like me: if Oracle is faster, which is likely at this point, it would set a performance point to beat.
As a final fun fact, Keyspace is currently running on top of BerkeleyDB, which is an Oracle property =) But not for much longer; I'm very unsatisfied with BDB, as it doesn't live up to its image of being an industrial-strength B-tree store (it sucks).
What do you dislike about Berkeley DB? I've been using it for a little while for a pet project, and it seems decent. I'd like it to support range-based join cursors, but that's not a major complaint. Granted, I haven't yet had to deal with database recovery, so I don't know how it stacks up there.
Also -- what do you plan to replace it with? I know that AllegroCache started out based on BDB, and ended up not using it. The Franz guys said performance wasn't up to par, and I always wished for more details.
Personally I think there's a fairly large cultural disconnect between shops running open-source stacks and shops running Oracle...
Quite possibly. Our company does use a lot of open source in the Java & Perl layers, but we're pretty tightly locked in to Oracle on the backend. I think it may be true that the sheer vastness of Oracle tempts you into believing you can solve every problem the Oracle way, which is why IOTs came to mind. I'd love to experiment with other options myself, but it's difficult once everything you do is predicated on it.
It's just a marketing/catchy term for the trend to 'next-generation' (read: 'different') database engines. Since there is (in almost all cases) no SQL (the language) involved, it is a pretty appropriate name.
A better analogy IMO is Python, Ruby et al calling themselves "dynamic languages" to distinguish themselves from the popular static languages of the time.
But they called themselves dynamic, not "NoC++" or "NoJava". That is, it was described by its intrinsic properties instead of as not "a specific instance of a something that implements what they don't like."
SQL is a way to interact with a relational database, but it is not itself a database. But "NoRDBMS" still isn't as good as saying what the technology is - there are an infinite number of things that are not relational databases.
The point of the NoSQL movement is not to advocate a specific technology (e.g. Java, CouchDB, etc), but merely to encourage people to consider NoSQL databases for the 5-10% of cases when SQL works poorly.
Then it would probably be more useful to focus on the problems that the specific databases solve, rather than talking about them collectively in such a general manner. It makes it hard to focus on their strengths.
I would love to read a really detailed comparison of CouchDB (for example) to RDBMSs for common uses, by someone who thoroughly understands the relational model and is very familiar with (say) Postgres or Oracle. I like Postgres and SQLite, but I'm curious about CouchDB - it's a different kind of tool, and it probably has several complementary use cases.
NoSQL solves many of the problems that SQL has (with trade-offs, of course). The big ones:
Performance: Key value stores are fast. You lose features like queries by anything other than index and/or ACID guarantees. (Tokyo Cabinet or Redis).
Data structures: storing non-flat data (trees, graphs) in a SQL database == FAIL. Building a DB around a non-flat data structure lets you avoid doing O(N) queries to traverse a structure. (See Neo4j for graphs or CouchDB/Mongo for trees).
Yes, thank you. Details about the trade-offs involved would be better. My objection is that collectively dealing with them as "the NoSQL databases" tends to obscure that e.g. the key/value stores have very different design and performance trade-offs than CouchDB does, for example. Redis runs entirely in memory, which has big implications for both speed and data set size. And so on.
I should have said "trade-offs" to begin with...that was the heart of what I was getting at when I said "strengths".
You can store a tree using foreign keys for the children, but traversing the tree to level m requires m joins. Finding children at an arbitrary level requires O(tree depth) queries.
You can use preorder tree traversal which consists of mapping nodes to pairs of numbers representing intervals on the real line. Then "x is descended from y" is equivalent to "x's interval is contained in y's interval", and this at least lets you find the children of x. Going up a tree is harder. Inserting into the tree is tricky as well.
Another hack is to store on each node a string containing its ancestors. I.e., "/2/4/27/35" indicates a node that can be reached from the root (node 2) by following the path 2->4->27->35. Then the regex search "2/4/27/35/.*" finds all descendants of node 35. This one is utter fail.
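That materialized-path hack can be sketched against SQLite, with a LIKE pattern standing in for the regex (hypothetical schema and data):

```python
import sqlite3

# Each row stores the node id plus its full ancestor path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(2, "/2"), (4, "/2/4"), (27, "/2/4/27"),
     (35, "/2/4/27/35"), (99, "/2/4/27/35/99")],
)

# All descendants of node 35: every path that extends node 35's path.
rows = conn.execute(
    "SELECT id FROM nodes WHERE path LIKE '/2/4/27/35/%'"
).fetchall()
print([r[0] for r in rows])  # [99]
```

The "utter fail" parts: the pattern scan can't use a normal index well, and moving a subtree means rewriting the path string of every descendant.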
However, I freely admit I'm not a database expert. Is there some non-ugly solution I'm unaware of?
From what I can tell, it seems the "dark side" mentioned only really affects Voldemort and Cassandra, out of all the databases that fall into the "NoSQL" category?
Ad-hoc querying affects all distributed databases, which includes dynomite, hbase, and hypertable as well as voldemort and cassandra. Also all key-oriented databases, which include Tokyo Tyrant and redis. And some others like CouchDB (where computing a new view via map-reduce is basically equivalent to doing a query via hadoop against another system, so I don't consider that real ad-hoc query support).
Really the list of databases that do provide ad-hoc queries is much shorter. Neo4j (graph db), mongodb (single-master "document-oriented" db), the xml databases, probably some other minor projects.
TFA also mentions data export as a "dark side" but for projects that lack this it's really just a "nobody's needed it enough to write it yet" thing. So that's less interesting IMO than talking about ad hoc queries which is usually more of an architectural limitation.
Well, ad-hoc querying can be implemented in Hypertable with HQL; it's just not a priority at the moment, so the features are a bit lacking. I mean, it's not an inherent 'dark side' of distributed databases. OTOH, I've done cap shell with HQL and grep/sort/whatnot a few times on a few TB of data, and it's not exactly painful :)
These have nothing to do with the concept, just the current state of the tools. There is no reason that a non-relational DB without a declarative query language can't have good support for data dumps, ad hoc queries and ad hoc "data fixing".
Do you mean because they offer poor support, or because there's usually large volumes of data involved, or both?
Cassandra, for instance, provides an easy snapshot API, and you can back up the snapshots using whatever infrastructure you'd normally use to back up TB of data. (If you don't have such an infrastructure, yeah, that's a problem, but not Cassandra's fault. :)
Edit: I should point out that since Cassandra supports multiple data center replication already, I'm having trouble picturing a scenario where you want to do anything w/ the snapshots besides just leave them on the Cassandra nodes themselves (plan extra HDD capacity as necessary depending on how long you want to keep them). But some such scenarios probably do exist.
Replication != backup: if you (or someone else) accidentally or maliciously screw up your data through the API, the screw-ups are replicated as well, especially with Cassandra, where there is no built-in data versioning.
Backups are essential when you're doing major upgrades or data migrations/mangling that can fail.
OTOH, data export is not an inherent problem of distributed databases either. If enough people want the feature, it can be built without too much fuss.
Obviously. The context was using snapshots as backups -- you can keep them around indefinitely, space permitting, and if you're using Cassandra's multi-datacenter features they're automatically "remote" as well.
> especially with Cassandra, where there is no built-in data versioning
At the risk of belaboring the obvious, versioning != backup, either. :)
It seemed to me like the article was more about particular "NoSQL" databases than it was about the genre as a whole. In other words, I don't think the author's pain is endemic to this admittedly broad class of data stores, just the few that do exist.
How so? In most schema-free DBs the missing alter-table feature is not needed. Maybe you meant the quotes to be in a different place? Missing "alter table problem", perhaps?
You won't get O(1), since schema-free DBs are not organized around columns/fields the way a traditional DB is all about rows and columns, but just about every one I am aware of lets you create an index on a particular field, so at least you get O(n). Given the infrequent use of this operation, I am happy to leave it in the "O(n) is good enough" category in return for the other benefits and efficiencies that a schema-free DB offers.
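The O(n) part is the one-time scan to build the secondary index over existing records; a hypothetical sketch (invented data, dicts standing in for a schema-free store):

```python
# A schema-free store: opaque documents keyed by id (invented data).
records = {
    "doc:1": {"city": "Oslo"},
    "doc:2": {"city": "Bergen"},
    "doc:3": {"city": "Oslo"},
}

# Building a secondary index on "city" is one O(n) pass over all
# existing records; new writes then maintain it incrementally.
index = {}  # city -> list of record keys
for key, doc in records.items():
    index.setdefault(doc["city"], []).append(key)

print(sorted(index["Oslo"]))  # ['doc:1', 'doc:3']
```

After the pass, lookups by city are cheap; the cost is paid once, which is why "O(n) is good enough" for an operation you run rarely.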
Well, you can get both with Hypertable, where you can alter a table to add/drop column families in O(1) time, while still having schema-free benefits with column qualifiers.
Hmm... there are different query APIs to all the NoSQL variants?
What stops you from doing a one-liner from shell (or a short throw-away program) instead of a SQL query/update? (Well, except for the power of SQL compared to the APIs, of course.)
Disclaimer: I don't know enough about these newfangled things and their tradeoffs to have a serious opinion. They do look intriguing. :-)