Sometimes you must `EXPLAIN ANALYZE` expensive queries in production, sadly. Postgres behavior (even on copies that aren't bit-for-bit identical) can differ under load.
The biggest way I have seen this be true is with fragmented tables/indexes - same data but organized more compactly can change planner behavior.
Article actually touches on another way that can be true - if your `work_mem` is drastically different the planner may prefer not to use an index, for example, and there are similar settings. Even with identical settings PG may choose different plans; it can be a bit temperamental in my experience, so sometimes you have to run the non-performant query off-peak in production to understand performance.
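As an illustration (the table and query here are hypothetical), you can watch a plan flip just by changing `work_mem` for the session:

```sql
-- With a tiny work_mem the planner tends to avoid hash joins and
-- in-memory sorts; with a large one it may switch strategies entirely.
SET work_mem = '64kB';
EXPLAIN SELECT * FROM orders ORDER BY created_at;  -- hypothetical table

SET work_mem = '256MB';
EXPLAIN SELECT * FROM orders ORDER BY created_at;

RESET work_mem;
```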
The article and the comments here don't make it clear why running it in production shouldn't be done. If slow_query is already running in production, why would running EXPLAIN ANALYZE slow_query be bad?
Is the overhead of running EXPLAIN ANALYZE so much worse than running slow_query itself?
No, it's really not, and that's why I say it must sometimes be done. Certainly if you're running tens or hundreds of copies of the query per minute, one more won't hurt (much).
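Worth noting: most of `EXPLAIN ANALYZE`'s overhead is per-node timing instrumentation, which can be expensive on some platforms. You can turn that part off and still get actual row counts (query shown is hypothetical):

```sql
-- TIMING OFF skips per-node timers, the most expensive instrumentation;
-- the query still executes and real row counts are still reported.
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS)
SELECT count(*) FROM orders;  -- hypothetical query
```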
The real problem you run into is when the query in question is doing something pathologically bad - locking a bunch of tables and then grinding away for an hour, which effectively is a denial of service attack.
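One way to bound the blast radius (a sketch; the query and tables are hypothetical stand-ins for `slow_query`) is to wrap the run in a transaction with local timeouts, then roll back:

```sql
BEGIN;
-- SET LOCAL applies only inside this transaction.
SET LOCAL statement_timeout = '30s';  -- abort if it grinds away too long
SET LOCAL lock_timeout = '1s';        -- bail out rather than queue on locks
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders o JOIN order_items i ON i.order_id = o.id;
ROLLBACK;  -- discard any writes the statement may have made
```

The `lock_timeout` is the important part for the pathological case: instead of sitting in a lock queue and blocking everyone behind you, the statement fails fast.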
One thing to consider is that the person who needs to run `EXPLAIN ANALYZE` may not have any access whatsoever to the production database. Also, there may be no process in place to get someone to run it on prod on their behalf. Finally, if there is a DBA on production, they may just say no.
I don't _think_ the query planner takes "current load" into account.
If you have:
- Same resources (CPU/memory)
- Same settings (e.g. work_mem, amongst others)
- Same dataset
- Same/similar statistics (gathered with ANALYZE or the autovacuum)
you should get the same results. If I'm wrong, please somebody correct me!
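For checking those preconditions across two servers, the standard catalog views are enough (table name is hypothetical):

```sql
-- Compare non-default settings between the two servers.
SELECT name, setting
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;

-- Check when statistics were last gathered for a given table.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';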
You might be right. Certainly if you run, e.g., a fully cloned server on the same disk image and hardware, it will behave the same. I'm thinking of times when memory pressure from other queries caused caches to get evicted, things like that. It's not really the planner; it's the measured performance, from lock contention, what data happens to be in memory, and so on.
Generally indexes are cheap if built concurrently, so I often build all the plausible indexes (for relatively modestly sized data types, load, and tables) and look at the planner statistics in production to validate which indexes are useful for the query load. That only works if you have a read-heavy usage pattern. For write-heavy patterns it can sometimes be better not to have an index at all (one of the things he alludes to in the article about missing foreign key indexes: indexes come with a write cost, so on a write-only table, foreign key indexes are a bad idea), but as with everything, real timing data is the key.
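The "look at the planner statistics" step can be done with the standard statistics views. For example, to find indexes the planner never (or rarely) uses, which are paying the write cost for nothing:

```sql
-- Least-used indexes first, biggest first within each idx_scan count;
-- low idx_scan + large size = a candidate for dropping.
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
```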
For new table design in Serious Business Databases I do two or three steps: first an offline replica with the same data and settings (though never 100% identical), then usually a transaction on a logical replica or on main production (Postgres supports transactional DDL, which is a big help, but altering a table inside a transaction can block tons of other transactions).
It's important to carefully construct your alter table / create table / create index statements to ensure that everything is concurrent and won't block, of course, and there are plenty of pitfalls there, but definitely doable.
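A sketch of the non-blocking pattern (table, index, and constraint names are hypothetical):

```sql
-- Build the index without holding a long exclusive lock on the table.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS order_items_order_id_idx
    ON order_items (order_id);

-- Add a foreign key without a long validation lock: NOT VALID makes the
-- ALTER itself fast, and VALIDATE later takes only a weaker lock.
ALTER TABLE order_items
    ADD CONSTRAINT order_items_order_id_fkey
    FOREIGN KEY (order_id) REFERENCES orders (id) NOT VALID;

ALTER TABLE order_items VALIDATE CONSTRAINT order_items_order_id_fkey;
```

The `NOT VALID` / `VALIDATE` split is one of the classic ways to avoid the "alter blocks everything" pitfall mentioned above.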