This is terrific. I spent all of September looking at almost every NoSQL DB in town, but nothing could handle the loads we were looking at. Then I found the Postgres JSON data types, but was sad to see they were missing the one thing that would make such a data type worthwhile -- indexing. Now we have it!
No. You can't create an index on a json column that will be usable for arbitrary containment queries, which is the awesome thing jsonb provides.
What you can already do, however, is create a functional index if you always search for the same elements in your JSON bodies.
This will then run at the exact same speed as any other btree index (which is still faster than the gin indexes for jsonb), but at a considerable loss of flexibility (or a ton of storage usage).
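For illustration, a functional (expression) index looks roughly like this -- the "events" table and its "body" column are made up here, and the point is that the expression in the index has to match the expression in your queries:

    -- Hypothetical table with a json column "body"; you always filter on the
    -- same key, so an expression index on that key works well:
    CREATE INDEX events_user_id_idx ON events ((body ->> 'user_id'));

    -- Only queries using the exact same expression can use this btree index:
    SELECT * FROM events WHERE body ->> 'user_id' = '42';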
When you're working with the JSON data type you need to create a number of indexes to get acceptable performance.
Typically one per natural key in the document set (present in and unique to each document) for queries that use that key, and then trigram indexes on fields where you're doing partial string matches.
If you're going to be doing random queries against the entire contents of the document, you'll probably want to add a full text index.
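Roughly like this, as a sketch -- the "products" table and its json column "doc" are invented names, and pg_trgm is a contrib extension you have to install first:

    -- Trigram index for partial string matches on a specific field:
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX products_name_trgm_idx
        ON products USING gin ((doc ->> 'name') gin_trgm_ops);
    SELECT * FROM products WHERE doc ->> 'name' LIKE '%retina%';

    -- Full text index over the whole document cast to text:
    CREATE INDEX products_fts_idx
        ON products USING gin (to_tsvector('english', doc::text));
    SELECT * FROM products
     WHERE to_tsvector('english', doc::text) @@ to_tsquery('english', 'retina');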
But assume you have a json field with, let's say, product attributes that might vary greatly between products based on the category they are in.
Hard drives, for example, might have a "capacity" field, whereas monitors might have a "screen size" field. Also assume that you have tens of thousands of products, hundreds of them with completely different attribute sets, and thousands of distinct attributes across all products.
Now you want to add a search feature that allows searching for products based on these attributes (like "give me all products where screen_size = 37 inches").
If you don't have any traditional indexed columns you can use to limit the number of rows, that query will either cause a sequential scan over all products, or you will have to create a functional index over the screen_size field in your JSON document, which then only covers screen_size but not, say, disk_size.
This means that you will have to create thousands of indexes (one for every custom attribute), which also means you lose a lot of flexibility: adding a new custom attribute now means that you have to add another index (or you'll suddenly have to deal with unexpected sequential scans).
Using jsonb and a gin index, you need exactly one index to find arbitrary key/value combinations in your json document, giving you the flexibility of trivially adding new attributes by just storing them in the json document.
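To make the contrast concrete, a rough sketch (again with an invented "products" table, this time with a jsonb column "attrs"):

    -- One GIN index covers arbitrary key/value lookups:
    CREATE INDEX products_attrs_idx ON products USING gin (attrs);

    -- Containment query ("all products where screen_size = 37 inches");
    -- this can use the single GIN index, no per-attribute index required:
    SELECT * FROM products WHERE attrs @> '{"screen_size": "37 inches"}';

    -- Whereas the functional-index approach above would need one of these
    -- per attribute:
    CREATE INDEX products_screen_size_idx
        ON products ((attrs ->> 'screen_size'));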
I think I ran your exact application (large retailer catalog). We just used Solr as a sort of external index. Worked like a charm, though it obviously introduced additional complexity.
Yeah. I'm also pondering the use of Elasticsearch in my case.
But the additional complexity is scaring me a bit, especially as the main strength of the product in question is not showing most of the products it knows about, so none of the Elasticsearch results could be displayed directly, but would have to go through additional processing.
Being able to do all of this directly in Postgres will be much nicer, so I'm really looking forward to 9.4.
For any kind of product search I'd go for an external index (I'm strongly biased towards ES, but Solr is certainly on par feature-wise). Postgres is my go-to relational database and I do love its json/hstore features, but it does not offer stemming, aggregation, proper weighting, or other analyzer-based features.
Adding ES as a secondary datastore just for search is relatively easy in terms of technical complexity, especially if you already have json data. The more complex part is deciding on a relevancy model, but you'd have to do that for every technology you choose.
Over the last few years, September was always early Christmas for me, because that's when we would get the new release. And despite this being a database, their x.y.0 releases were always rock solid (with the exception of 9.2.0 and an issue with IN queries and indexes), so I'm usually upgrading very early.
This year, it looks like the release is a bit late, partly caused by some on-disk storage issues for jsonb, so of course I'd like them to spend all the time they need, but I'm still very much looking forward to playing with jsonb, as it will provide a nice way of solving some issues I'm having.
I'm not using the beta releases for anything bigger than quickly checking whether clients still mostly work, because updating between beta releases is a PITA: it requires a full dump and restore most of the time.
The database I'd like to use jsonb with is 600 GB in size, and restoring a dump takes 2-6 hours depending on the machine.
I'm patiently waiting for this release like I might wait for a good new film or a game. Don't think I was ever that 'excited' (if I can use that word) by a database point release.
I needed a document database/RDBMS hybrid combo for a piece of work I want to do, and was going to default to MongoDB but once I heard of 9.4 I decided to wait and see how it pans out.
Ingestion of high volumes of data was approximately 2.1 times faster in Postgres
MongoDB consumed 33% more disk space
Data inserts took almost 3 times longer in MongoDB
Data selection took more than 2.5 times longer in MongoDB than in Postgres
Now if only I had a baked-in solution in PostgreSQL for a replica set, with hot master failover and command forwarding, like I get with MongoDB.
Quirky bolt-on solutions, and those requiring a contract with EnterpriseDB or others need not apply.
I really don't mean for this to be as snarky as it sounds... I'd really love these features in PostgreSQL along with PLv8, and I'd rather use that than MongoDB... unfortunately, MongoDB having a good in-the-box solution for this is a bigger deciding factor in my use case than its lack of joins.
Failures are just a property of distributed systems: some fail in ways that are better for you, some fail in ways that are worse, but sooner or later they all fail. Which one you choose is a tradeoff.
MongoDB was tested in a replicated setup, PG was not; that's a pretty fundamental difference. Losing acked writes in master-slave replication with failover would be expected with a sufficiently large replication lag, even for PG.
MongoDB is about the last database I'd pick for any given scenario, but I acknowledge that there are people who have a use case where it might be a good fit because the tradeoffs work out for them.
Many people here are comparing this feature of Postgres with MongoDB or other document-oriented NoSQL dbs. I think the comparison is just wrong and unfair.
Even if I'm amazed by the performance of Postgres for this particular task (considering also that it is a relatively new feature), I don't think performance is the reason why people are using NoSQL dbs. The problem that NoSQL dbs are helping with is scaling. I don't see this as a priority for an RDBMS such as Postgres. Take for instance MongoDB (just because it was named in many of the other comments here, but I guess the same applies to Couch or others): it's relatively simple to deploy a cluster with automatic sharding, replication, failover, etc., because these are all built-in features.
"NoSQL" encompases too wide a sphere to say something like the problem that NoSQL dbs are helping with is scaling.
Some NoSQL solutions are focused on extreme scale (Cassandara, HBase etc).
Some NoSQL solutions are focused on extreme performance (Redis).
Some NoSQL solutions are focused on developer ergonomics (MongoDB).
Postgres is an interesting option. It ticks the "good enough" box for just about everything, but doesn't ever seem to be leading the pack in any of the "NoSQL" categories. That's probably fine, since it does lead the "Open Source SQL Database" field, and that's a pretty important field to lead.
SQL DBs scale just as well as NoSQL and in the exact same ways: read-only clones, sharding, federation, etc. YouTube runs on MySQL, in case you're worried about scaling. It may be a tad easier with something like Mongo, but it's not hard with MySQL.
IMO, NoSQL is attractive mostly because it's schemaless. You need to change up your model all the time, or handle arbitrary datasets but still query them? NoSQL may be the right tool for the job.
Edit: or should I say: we will have it soon?