This is terrific. I spent all of September looking at almost every NoSQL DB in town, but nothing could handle the loads we were looking at. Then I found the Postgres JSON data types, but was sad to see they were missing the one thing that would make such a data type worthwhile -- indexing. Now we have it!
No. You can't create an index on a json column that will be usable for arbitrary containment queries, which is the awesome thing jsonb provides.
What you can already do, however, is create a functional index if you always search for the same elements in your JSON bodies.
This will then run at the exact same speed as any other btree index (which is still faster than the gin indexes for jsonb), but at a considerable loss of flexibility (or a ton of storage usage).
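For illustration, a functional (expression) index looks roughly like this -- the "events" table and its "body" column are made up here, and the point is that the expression in the index has to match the expression in your queries:

    -- Hypothetical table with a json column "body"; you always filter on the
    -- same key, so an expression index on that key works well:
    CREATE INDEX events_user_id_idx ON events ((body ->> 'user_id'));

    -- Only queries using the exact same expression can use this btree index:
    SELECT * FROM events WHERE body ->> 'user_id' = '42';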
When you're working with the JSON data type you need to create a number of indexes to get acceptable performance.
Typically one per natural key in the document set (present in and unique to each document) for queries that use that key, and then trigram indexes on fields where you're doing partial string matches.
If you're going to be doing random queries against the entire contents of the document, you'll probably want to add a full text index.
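Roughly like this, as a sketch -- the "products" table and its json column "doc" are invented names, and pg_trgm is a contrib extension you have to install first:

    -- Trigram index for partial string matches on a specific field:
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX products_name_trgm_idx
        ON products USING gin ((doc ->> 'name') gin_trgm_ops);
    SELECT * FROM products WHERE doc ->> 'name' LIKE '%retina%';

    -- Full text index over the whole document cast to text:
    CREATE INDEX products_fts_idx
        ON products USING gin (to_tsvector('english', doc::text));
    SELECT * FROM products
     WHERE to_tsvector('english', doc::text) @@ to_tsquery('english', 'retina');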
But assume you have a json field with, let's say, product attributes that might vary greatly between products based on the category they are in.
Hard drives, for example, might have a "capacity" field, whereas monitors might have a "screen size" field. Also assume that you have tens of thousands of products, hundreds of them with completely different attribute sets, and thousands of distinct attributes across all products.
Now you want to add a search feature that allows searching for products based on these attributes (like "give me all products where screen_size = 37 inches").
If you don't have any traditional indexed columns you can use to limit the number of rows, that query will either cause a sequential scan over all products, or you will have to create a functional index over the screen_size field in your JSON document, which then only covers screen_size but not, say, disk_size.
This means that you will have to create thousands of indexes (one for every custom attribute), which also means you lose a lot of flexibility: adding a new custom attribute now means that you have to add another index (or you'll suddenly have to deal with unexpected sequential scans).
Using jsonb and a gin index, you need exactly one index to find arbitrary key/value combinations in your json document, giving you the flexibility of trivially adding new attributes by just storing them in the json document.
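To make the contrast concrete, a rough sketch (again with an invented "products" table, this time with a jsonb column "attrs"):

    -- One GIN index covers arbitrary key/value lookups:
    CREATE INDEX products_attrs_idx ON products USING gin (attrs);

    -- Containment query ("all products where screen_size = 37 inches");
    -- this can use the single GIN index, no per-attribute index required:
    SELECT * FROM products WHERE attrs @> '{"screen_size": "37 inches"}';

    -- Whereas the functional-index approach above would need one of these
    -- per attribute:
    CREATE INDEX products_screen_size_idx
        ON products ((attrs ->> 'screen_size'));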
I think I ran your exact application (large retailer catalog). We just used Solr as a sort of external index. Worked like a charm, though it obviously introduced additional complexity.
Yeah. I'm also pondering the use of Elasticsearch in my case.
But the additional complexity is scaring me a bit, especially as the main strength of the product in question is not showing most of the products it knows about, so none of the Elasticsearch results could be displayed directly, but would have to go through additional processing.
Being able to do all of this directly in Postgres will be much nicer, so I'm really looking forward to 9.4.
For any kind of product search I'd go for an external index (I'm strongly biased towards ES, but Solr is certainly on par feature-wise). Postgres is my go-to relational database and I do love its json/hstore features, but it does not offer stemming, aggregation, proper weighting, or other analyzer-based features.
Adding ES as a secondary datastore just for search is relatively easy in terms of technical complexity, especially if you already have json data. The more complex part is deciding on a relevancy model, but you'd have to do that for every technology you choose.
Over the last few years, September was always early Christmas for me, because that's when we would get the new release. And despite this being a database, their x.y.0 releases were always rock solid (with the exception of 9.2.0 and an issue with IN queries and indexes), so I'm usually upgrading very early.
This year, it looks like the release is a bit late, partly caused by some on-disk storage issues for jsonb, so of course I'd like them to spend all the time they need, but I'm still very much looking forward to playing with jsonb, as it will provide a nice way of solving some issues I'm having.
I'm not using the beta releases for anything bigger than quickly checking whether clients still mostly work, because updating between beta releases is a PITA: it requires a full dump and restore most of the time.
The database I'd like to use jsonb with is 600 GB in size, and restoring a dump takes 2-6 hours depending on the machine.
I'm patiently waiting for this release like I might wait for a good new film or a game. Don't think I was ever that 'excited' (if I can use that word) by a database point release.
I needed a document database/RDBMS hybrid combo for a piece of work I want to do, and was going to default to MongoDB but once I heard of 9.4 I decided to wait and see how it pans out.
Ingestion of high volumes of data was approximately 2.1 times faster in Postgres
MongoDB consumed 33% more disk space
Data inserts took almost 3 times longer in MongoDB
Data selection took more than 2.5 times longer in MongoDB than in Postgres
Now if only I had a baked-in solution in PostgreSQL for a replica set, with hot master failover and command forwarding, like I get with MongoDB.
Quirky bolt-on solutions, and those requiring a contract with EnterpriseDB or others need not apply.
I really don't mean for this to be as snarky as it sounds... I'd really love these features in PostgreSQL along with PLv8, and I'd rather use that than MongoDB... unfortunately, MongoDB having a good in-the-box solution for this is a bigger deciding factor in my use case than its lack of joins.
Failures are just a property of distributed systems: some fail in ways that are better for you, some fail in ways that are worse, but sooner or later they all fail. Which one you choose is a tradeoff.
MongoDB was tested in a replicated setup, PG was not; that's a pretty fundamental difference. Losing acked writes in master-slave replication with failover would be expected with a sufficiently large replication lag, even for PG.
MongoDB is about the last database I'd pick for any given scenario, but I acknowledge that there are people who have a use case where it might be a good fit because the tradeoffs work out for them.
Many people here are comparing this feature of Postgres with MongoDB or other document-oriented NoSQL dbs. I think the comparison is just wrong and unfair.
Even if I'm amazed by the performance of Postgres for this particular task (considering also that it is a relatively new feature), I don't think performance is the reason why people are using NoSQL dbs. The problem that NoSQL dbs are helping with is scaling. I don't see this as a priority for an RDBMS such as Postgres. Take for instance MongoDB (just because it was named in many of the other comments here, but I guess the same applies to Couch or others): it's relatively simple to deploy a cluster with automatic sharding, replication, failover, etc., because these are all built-in features.
"NoSQL" encompases too wide a sphere to say something like the problem that NoSQL dbs are helping with is scaling.
Some NoSQL solutions are focused on extreme scale (Cassandara, HBase etc).
Some NoSQL solutions are focused on extreme performance (Redis).
Some NoSQL solutions are focused on developer ergonomics (MongoDB).
Postgres is an interesting option. It ticks the "good enough" box for just about everything, but doesn't ever seem to be leading the pack in any of the "NoSQL" categories. That's probably fine, since it does lead the "Open Source SQL Database" field, and that's a pretty important field to lead.
SQL DBs scale just as well as NoSQL and in the exact same ways: read-only clones, sharding, federation, etc. YouTube runs on MySQL, in case you're worried about scaling. It may be a tad easier with something like Mongo, but it's not hard with MySQL.
IMO, NoSQL is attractive mostly because it's schemaless. You need to change up your model all the time, or handle arbitrary datasets but still query them? NoSQL may be the right tool for the job.
Edit: or should I say: we will have it soon?