(Mostly) like clockwork: A new year, a new release. And like every year before, we find a beautiful collection of new stuff to play with.
Even better this time around: it looks as if the next release of Ubuntu will get 9.1 packaged, which spares me from packaging it manually or using a PPA.
The new features each release introduces are too sweet to skip just because a distribution is lagging. And ever since I began using PostgreSQL back in the 7.1 days, I have _never_ experienced a bug that really affected me. Not a byte of data has ever been lost, and it has never once crashed on me due to circumstances beyond my control (cough, free disk space, cough).
Congratulations to everybody responsible for yet another awesome release!
Yes. I am a fanboy. Sorry.
For a trivial, synthetic write benchmark that I usually use to benchmark hardware and/or config changes, I'm seeing slight slowdowns for non-concurrent loads, and solid improvements for concurrent loads.
For 1-2 clients, it's ~8% slower.
For 4 clients, 7.2% faster.
8 clients, 15% faster.
16 clients, 16.4% faster.
32 clients, 11% faster.
64 clients, 10% faster. Aggregate performance starts to level off here, so I stopped.
These just cycle super-simple INSERTs/DELETEs against the same table; the column data is a 1K string, a 100-byte string, and a concatenation of the pid and the current iteration count. No indexes or primary keys. Each client is just a fork that performs 10,000 INSERTs, then 10,000 DELETEs, in a loop of 10,000 iterations.
For the record, that's around 6,729 writes per second with 32 clients. If I set synchronous_commit = OFF in each client before running the benchmark, it's 27,157/sec. Then, if I reduce the first column size to 100 bytes, it's 50,592/sec. Impressive. I'm sure the synchronous_commit improvement would be much more drastic on disks without BBU write caches.
Database server is a 4-core Nehalem-based Xeon with 16GB RAM and a SAS disk array. PostgreSQL configuration has been decently tuned and full write durability is retained all the way down to the disks.
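For concreteness, the workload described above is roughly of this shape; the table and column names here are my own placeholders, and the forking/looping harness itself lives in the client script:
-- hypothetical table matching the description: no indexes, no primary key
CREATE TABLE bench_scratch (big_col text, small_col text, tag text);
-- each client loops over statements shaped like these
INSERT INTO bench_scratch VALUES (repeat('x', 1024), repeat('y', 100), '12345-1');
DELETE FROM bench_scratch WHERE tag = '12345-1';
-- the asynchronous-commit variant mentioned above, issued once per client session
SET synchronous_commit = off;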
It seems that in general this decision should be made more frequently than it is, at least for new projects. Why don't more people select Postgres? Is it comfort level with MySQL? Tools and extensions?
> Why don't more people select Postgres? Is it comfort level with MySQL?
* Cheap hosting (let alone free or ISP-provided hosting) does not offer Postgres
* Most introductory tutorials, especially in the PHP world (but also Ruby, I believe), use MySQL
* AMP packages/installers (are there any out there that use Postgres as the database?) and a history of punishingly hard installation
Following that, a combination of habit (MySQL works for me, why would I care?) and BLUB (what could it even bring to the table?), as well as drawbacks/issues whose fixes are recent (replication, master/slave, ...)
I have an unusual reason, but a strong one: I primarily use relational DBs for large-scale analysis of frozen snapshots of data, rather than transactional loads. PostgreSQL has MVCC features built into it at the most fundamental levels where they cannot be disabled, and as such it's not suitable for the types of queries that I frequently run against MySQL. In the most extreme (and trivial) case, you can't run a "SELECT COUNT(*)" on a table in PostgreSQL without a full sequential scan of your data, which can be a huge expense when your row counts are in the tens or hundreds of millions. MySQL, in contrast, can return a cached answer instantly, which isn't possible under PostgreSQL. Less trivially, MySQL is still drastically faster for aggregates on low-cardinality fields.
Yes, I know you could set something up with triggers, but the point is that it needs to be simple enough for frequent, ad-hoc usage. Usually I need to know how big a table is right after I've made it, and I might well drop it five minutes later. I'm not going to set up an elaborate network of metadata tables when MySQL will just do it for me for free.
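(For reference, the sort of trigger setup being alluded to might look roughly like the following; the table, function, and trigger names are made up, the counter row has to be seeded with the current count first, and every writer serializes on that one row, which is exactly the kind of ceremony I don't want for ad-hoc work.)
-- hypothetical counter table kept current by a trigger
CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);
CREATE OR REPLACE FUNCTION bump_row_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE row_counts SET n = n + 1 WHERE table_name = TG_TABLE_NAME;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE row_counts SET n = n - 1 WHERE table_name = TG_TABLE_NAME;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER big_table_count AFTER INSERT OR DELETE ON big_table
  FOR EACH ROW EXECUTE PROCEDURE bump_row_count();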
Not trying to hate on Postgres, by the way. I fully get that it's superior in most regards, and I use it frequently just for the much stronger support of user-defined functions. But MySQL does still have a few tricks left in it.
If I remember correctly, it is only with MyISAM that you get an instant count(*), while InnoDB, just like PostgreSQL, has to look at every row of the table (or every row of the index, since InnoDB also supports index-only scans). This is due to both PostgreSQL and InnoDB being MVCC (multi-version concurrency control) database engines.
And MyISAM is only fast in very specialized workloads with low write and read/write concurrency.
For something like developer testing against frozen snapshots of data, you don't have the CRUD aspects, so turn fsync off (fsync = off in postgresql.conf). You'll note a large speed increase when you're not waiting on the disk to confirm transactions.
Nearly all of the introductory PHP tutorials use MySQL, so many beginning web developers automatically go to MySQL because it is the only thing they have played around with.
Finally! True serializability! Never let anyone (including the docs for older postgres versions) tell you that predicate locking is too hard to implement.
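Concretely, that means a transaction that asks for SERIALIZABLE in 9.1 now gets genuinely serializable behavior (SSI) instead of plain snapshot isolation; a minimal sketch:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- reads and writes that could previously interleave into snapshot anomalies
-- now either serialize cleanly or fail with a serialization error to be retried
COMMIT;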
Kevin Grittner will be talking on Serialization this Friday, September 16th, 11:30 a.m. – 12:30 p.m. His talk will be recorded, so those who aren't in Chicago can see/hear it later next week.
http://postgresopen.org/2011/schedule/presentations/61/
The only thing I feel is wrong with this (really: the /only thing/, which is fricken awesome... I love PostgreSQL from the bottom of my heart, and have been using it for almost all of my database needs since the late 90s) is "per-column collation": collation is not a problem "per-column", it is a problem "per-index", which means you really want it "per-operator class".
Here's the use case: you have a website, and you have users using it in English and French. With per-column collation, you are being encouraged to have two fields, one english_name, and one french_name, that have /the same content/, but are defined using a different collation, so that the ordering condition on them becomes language-dependent.
The effect that has is actually terrible: it means that the size of your row (and yes, this may end up in TOAST, but there is still a massive penalty to going that route) ends up becoming ginormous, and the size of your row will just get larger the more languages you want to support as first-class citizens in your app.
Instead, what you /want/ is to just have an index english_ordered and an index french_ordered, and you want to be able to select which index you use for any specific query. If you "do it right", you'd also want to be able to support ordering the data using German collation, but it would "just be irritatingly slower".
Now, if you don't use PostgreSQL much, this may seem like a pipe dream of extra standards and complex interactions ("how will you specify that?!", etc.). However, it turns out there is already a feature that does 99% of this: "operator classes", which is how PostgreSQL lets you define custom collations for user-defined types.
Only, PostgreSQL operator classes are slightly more general than that, as you can specify an operator class to be used when performing order operations for your index; and, even more importantly: they are already being used to work around a specific case of operator-specific collation.
Here's the example: let's say that your database is set up for UTF-8 collation, and yet you have this one field you want to do a "prefix-match" on: WHERE a LIKE 'B%'. The problem with this is that you cannot use a Unicode collation to index this search: it might be that 'B' and 'b' and even 'Q' all index "exactly the same" for purposes of this collation (and there are some other corner cases with the other mapping direction as well).
So, to get index performance for this field, without changing your entire database to collate using "C" collation (which works out to "binary ordering"), you have a few choices, with one of them being to create an index that uses the special "operator class" called text_pattern_ops ("text" in this case as the field is likely a "text" field: there is also varchar_pattern_ops, etc.).
Once specified in your index, PostgreSQL knows to use it for purposes of the aforementioned LIKE clause. You specify this while making your index by specifying the operator class after the column.
CREATE INDEX my_index ON my_table (a text_pattern_ops);
The next piece of the puzzle is that an ORDER BY clause can take a USING parameter to pass it a custom operator, and you can always (obviously) use a custom operator for purposes of comparison. So, you now are in the position where you should be able to do this:
CREATE INDEX my_index_en ON my_table (a english_collation_ops);
CREATE INDEX my_index_fr ON my_table (a french_collation_ops);
SELECT * FROM my_table
WHERE english_collation_less(a, 'Bob')
ORDER BY a USING english_collation_less;
So, really, the only thing that needs to be specified is the ability to have "parameterized operator classes": as in, we really need a "meta operator class" that itself takes an argument, the string name of the collation, and then returns an operator class. With this one general technique defined, we not only drastically increase PostgreSQL's user-defined type abilities, but we also better solve this whole class of collation problem.
(Unfortunately, I suck at e-mail, or I'd get on the PostgreSQL mailing list and try to argue for this in a more well-defined way; maybe someone else who cares will eventually see it and become this feature's champion; or, of course, come up with an even better solution than mine ;P.)
"With per-column collation, you are being advocated to have two fields, one english_name, and one french_name, that have /the same content/, but are defined using a different collation, so that the ordering condition on them becomes language-dependent."
I don't think that's the intention. Think of column collation as the "default" for that column, which you can override with the COLLATE clause. So if most of your ordering/range queries are based on American English, set the column collation to "en_US"; when you have a query that requires French ordering, use the COLLATE clause next to the column reference, and the ordering will change to French.
I agree that SQL makes all of these things more confusing than they need to be, but supporting the standard for this concept has some value.
"CREATE INDEX my_index ON my_table (a english_collation_ops);"
I'm not sure if you're aware of this or not, but CREATE INDEX accepts a COLLATE clause in addition to the operator class. So the collation is kind of an extra parameter to the operator class (or something like that).
So, it's not a question of capability, because I think the capability you want is readily there. However, if you are saying that it's more confusing than it needs to be, then I agree.
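A minimal sketch of both points, assuming the en_US and fr_FR locales are available (table and column names are mine):
-- column default collation
CREATE TABLE people (name text COLLATE "en_US");
-- uses the column's default en_US ordering
SELECT name FROM people ORDER BY name;
-- per-query override
SELECT name FROM people ORDER BY name COLLATE "fr_FR";
-- an index built under an explicit collation
CREATE INDEX people_name_fr_idx ON people (name COLLATE "fr_FR");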
(I make this a reply, as it is kind of a separate thought; also, I just came up with this in the shower, and a ton of people have already read the other comment and might not notice a new edit even if they cared about the post. ;P)
As some people may balk at the "english_collation_less" operator usage (after all: using < is so much more convenient, especially if you have multiple of these operators strewn throughout every single one of your statements), all we really need to specify is a single new data type: collated_text. This, combined with a "client_collation" session variable, would allow clients to use < and ORDER BY on fields of all types, and get the behavior they would expect. It is only while constructing an index that you would need to specify the collation using a more complex syntax.
CREATE TABLE my_table (a collated_text);
CREATE INDEX my_index ON my_table (a collation_ops('en_US'));
SET client_collation = 'en_US';
SELECT * FROM my_table WHERE a < 'Bob' ORDER BY a;
(Given that they already seemed willing to add special syntax for per-column collation, if you did not believe in the dream of "per-operator class parameters" (or "parameterized operator classes", or "meta operator classes", or whatever), you could imagine "collation_ops('en_US')" being replaced by an index-specific syntax COLLATED BY 'en_US'. Just saying there are tons of options here, and they all seem better than forcing me to have 7x as much data in each of my rows, all identical.)
For my multilingual content needs I work with a main table, which has only the non-translatable data items, and a subtable, which contains the translatable content, one row per language per row of the main table. Then I do a join to get a single full 'object' in the language of choice. And per the Postgres docs I can choose the collation at query time. Thus the following query would do the trick, if I'm correct:
SELECT * FROM main_table JOIN sub_table ON main_id = sub_id WHERE sub_table.language = 'fr_FR' ORDER BY my_translatable_col COLLATE "fr_FR.utf8";
No bloated tables, just a join and the correct collation.
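(Roughly, the schema implied by that query; the column types are my guesses, and the names match the query above:)
CREATE TABLE main_table (
  main_id serial PRIMARY KEY
  -- ... non-translatable columns ...
);
CREATE TABLE sub_table (
  sub_id   integer REFERENCES main_table (main_id),
  language text NOT NULL,
  my_translatable_col text,
  PRIMARY KEY (sub_id, language)
  -- ... other translatable columns ...
);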
Most content is not "translated", it just needs to be differently collated. If you have a company directory, you have the names of every user in the company in a table, and you need to display that information in a collation based on the locale of the viewing user. Having to have a new table for every language that contains the same data as the main table is just pointless overhead. Also, while you can choose the collation at query time, the point of per-column collation is to let you have an index over that collation. Can you please demonstrate how the use case of "company directory" would be cleanly and efficiently implemented using per-column collation?
EDIT: I've been looking more into this COLLATE keyword that they have added, and I'm actually somewhat curious to see if I can make this work (where the optimizer manages to choose the right index) by something like the following, in which case I'm going to be seriously happy... ;P.
CREATE TABLE my_table (a text);
CREATE INDEX my_index ON my_table ((a COLLATE "en_US"));
CREATE INDEX my_index ON my_table ((a COLLATE "de_DE"));
SELECT * FROM my_table ORDER BY a COLLATE "de_DE";
Simply out of curiosity for this same topic, do you happen to know of a good resource for finding out even just some of the less trivial differences that this solves? I'm sure it does but off hand I don't know them (I'm not all that multilingual).
I understand it'll bring in glyph orderings that don't exist in en_US or whatever you've got the default set to, such as 'Ç' in French, among others.
I do not have a good resource, however, I know a few off the top of my head: 1) characters with modifiers, like umlauts, sometimes collate the same, and sometimes collate differently; 2) multiple characters may collate as a single character, such as "ll" (I just did a search to verify that this was the case in Spanish, and found the Collation page on Wikipedia, which you might find interesting); and 3) different locales may choose to collate numbers using different algorithms (in English we usually expect "1,000" to sort after "200", but if "," is a decimal point, then you might not).
Should I as a web app developer at a startup be looking for an RDBMS beyond PostgreSQL, probably a commercial one?
I have somewhat of a database background, so I see the obvious advantages of PostgreSQL over MySQL. In particular, things like more procedural language support, a better query optimizer, better concurrency control, etc. (Though things like Amazon RDS are compelling from a deployment perspective.)
I fundamentally believe that using an RDBMS, rather than a NoSQL data store, is the right approach for rapid development of web apps. (Though I primarily mean this as an attack on, e.g., MongoDB, since I think Redis is great, just not a replacement for an RDBMS.)
However, I have almost no experience with the more advanced end of the RDBMS spectrum, primarily because they tend to cost money (for the real, non-free versions). Should I be learning/looking at DB2? Should I be learning/looking at Oracle?
Or do the additional features of these more advanced RDBMS options require such specialized scenarios (a bank, a big enterprise) or such specialized hardware (weird clustered setups) that MySQL/PostgreSQL will always be just as good?
A short, probably biased and not 100% precise answer would be: no, you don't need to look for a commercial RDBMS.
The real answer is: always evaluate options. Make sure the solution you choose supports everything you need and that you will be able to learn how to use it or hire people who can. Without knowing what your requirements are it's impossible to say, but in the vast majority of cases, PostgreSQL will be as good as Oracle, MS SQL Server or DB2.
If your evaluation indicates that you need to pay for one of these, double and triple check it... and only if you're 100% sure that's the case, shell out for a commercial RDBMS.
I think that's probably true, though I do wonder from a purely engineering standpoint what features Oracle/DB2/MS have at this point that PostgreSQL does not. Special index types? Query hints? Suggestions for physical layout on table creation?
Speaking only to the differences between PostgreSQL and SQL Server:
1. "Live" clustered indexes
2. Query Parallelism
3. A richer API for data partitioning and _arguably_ a better query optimizer for partitioned data. 9.1 sees some wonderful improvements to the optimizer in this regard and I'll be poking at those very features in the coming months.
SQL Server has query hints, but I strongly agree with the PostgreSQL team's stance on them; I'm glad they avoid them.
As far as special index types: If you need a specialized index type, and nobody supports it, you're dramatically, drastically, fantastically more likely to implement it/find someone to implement it in PostgreSQL than any commercial engine.
Query hints are not likely to happen, at least not in the way that they do in other databases. There have been a number of discussions about it, and you can see some of the aftermath at http://wiki.postgresql.org/wiki/OptimizerHintsDiscussion
As a PostgreSQL fan myself, the main thing I miss in PostgreSQL (as opposed to DB2 or Oracle) is nice and simple hot backups with point-in-time recovery. It is certainly possible, but (at least in 9.0) it requires a non-trivial understanding of the WAL archiving process to set it up and run it smoothly and safely. Given how important backups are, I still tend to stay away from messing with WAL archiving and just use SQL dumps (pg_dump) for backups for the time being. I really miss the simplicity of dealing with transaction logs and rolling forward in DB2.
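For anyone curious what that setup involves, here is the rough shape of it in 9.0 terms (paths are placeholders, and this is exactly the part I'd rather not get wrong):
-- in postgresql.conf:
--   wal_level = archive
--   archive_mode = on
--   archive_command = 'cp %p /mnt/backup/wal/%f'
-- a base backup is then bracketed by these calls while the data directory is copied
SELECT pg_start_backup('nightly');
-- ... file-level copy of the data directory to backup storage ...
SELECT pg_stop_backup();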
In spatial queries, find the 5 restaurants closest to me. In full text search, find the 5 words closest to "fuschia" so you can spell check, or try to search for what you __think__ the user meant to type.
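With 9.1's KNN-GiST support, both of those can be written directly with the distance operator; a sketch, assuming hypothetical restaurants and words tables:
-- 5 nearest restaurants to a given point, served by a GiST index on the point column
CREATE INDEX restaurants_location_idx ON restaurants USING gist (location);
SELECT name FROM restaurants ORDER BY location <-> point '(41.88,-87.63)' LIMIT 5;
-- 5 closest words to a (possibly misspelled) term, using pg_trgm's trigram distance
CREATE EXTENSION pg_trgm;
CREATE INDEX words_word_trgm_idx ON words USING gist (word gist_trgm_ops);
SELECT word FROM words ORDER BY word <-> 'fuschia' LIMIT 5;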
Nice, thanks for all the examples and explanations!
Anyone know if it's possible to define a custom distance metric for use with this? We don't currently use full-text or spatial indices, but I can think of some cool things we could do with a generalised notion of "distance".
Yes, it is possible. Postgres supports custom data types and operators on those data types, and (using the GiST subsystem) you can create indexes using those custom operators.
Unfortunately, there is a caveat: this will more than likely require writing custom C code. I've never tried to implement GiST indexing for a new data type, so I don't know just how much effort would be required.
We have a table which essentially represents a shallow graph (rows in the table can have links to one or more other rows in the table). It would probably be interesting if we could define a distance metric in terms of those links: "find all rows reachable in 3 links from this starting row".
This probably isn't ideally done by this sort of index, as it would be a pretty weird sort of distance metric, if it's even a valid one (most rows aren't connected, meaning the "distance" between most rows would be infinite, and our query would really be "WHERE distance < infinity"). But it's good to know there's some kind of support for it!
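(For what it's worth, that particular question doesn't need a distance index at all; a recursive CTE already expresses it. A sketch, assuming a hypothetical links(src_id, dst_id) table and starting row 42:)
WITH RECURSIVE reachable(id, depth) AS (
    SELECT dst_id, 1 FROM links WHERE src_id = 42
  UNION
    SELECT l.dst_id, r.depth + 1
    FROM links l JOIN reachable r ON l.src_id = r.id
    WHERE r.depth < 3
)
SELECT DISTINCT id FROM reachable;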
Ah, right. For the kNN/GiST support in Postgres, distance has to be defined in such a way that it is computable given only two data values; furthermore, I'm pretty sure it has to be both symmetric and satisfy the triangle inequality in order for the index to produce correct results.