I really wish the postgres query planner would gain the ability to replan a query midway through execution...
Frequently the most pathological queries (i.e. the dreadfully slow ones) happen because the query planner lacked some knowledge of the data distribution and couldn't accurately estimate the cost of a candidate plan. This can easily have a 1000x impact on execution time (i.e. 1s rather than 1ms).
You will never have 100% accurate table stats - there is always some odd joint distribution you will not capture.
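Extended statistics can capture some of those joint distributions, but only the ones you know to declare up front - a quick illustration with made-up table and column names:

    -- Tell ANALYZE to gather correlation info for this column pair.
    -- Any combination you didn't anticipate is still invisible to the planner.
    CREATE STATISTICS addr_city_zip (dependencies, ndistinct)
        ON city, zip FROM addresses;
    ANALYZE addresses;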
So instead, allow the query to start, and if progress isn't as fast as the planner expects, feed current progress info back to the planner (pages scanned, tuples matching), and replan with that new data. If the updated plan shows it's quicker to discard current results and restart with a new approach, do that.
Unfortunately, postgres streams query results (i.e. the first rows of a query are sent back to the client before the query is done), which means that significant infrastructural changes would be needed to allow a midway change of plan. Any new plan would need to keep track of which results had already been sent to the client so they aren't resent. Postgres also allows a client to, midway through a query, ask for the query to reverse direction and return previously sent rows again in reverse order. That adds a lot of complexity.
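For context, the reverse-direction case is the scrollable-cursor machinery, something like this (table name invented):

    BEGIN;
    DECLARE c SCROLL CURSOR FOR SELECT * FROM big_table ORDER BY id;
    FETCH FORWARD 100 FROM c;   -- rows stream to the client before the query finishes
    FETCH BACKWARD 50 FROM c;   -- the client can walk back over rows already returned
    COMMIT;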
(blog author and Postgres committer here) I personally think this would be nice to have. However, I think the part about sending tuples to the client is even more tricky than you've implied above. It's worse because a new plan does not even guarantee that the same tuples are returned. e.g. if you wrote: SELECT * FROM table LIMIT 10, there's no ORDER BY, so which tuples are returned is non-deterministic. It may be easier to do something like queue up X tuples and only start sending them out once the queue is full. At that point we can say it's too late to replan and we're locked into the current plan. People can set X to whatever they want to extend the window in which the plan can change, at the expense of more memory and higher latency to the first tuple.
Or it could work in a way where the planner has access to data about previous runs of each query and uses that data to change plans that proved bad during execution. That way the first execution would be slow, but the planner could self-learn and do better next time.
SQL Server has a bunch of similar features in its query optimizer https://learn.microsoft.com/en-us/sql/relational-databases/p....
I'm not sure Postgres has the infrastructure to do that, though, because it doesn't have a shared plan cache, for example.
Also, many queries might be so slow they never complete, and therefore never populate the cache. (Think of the queries run by a data scientist with 50 JOINs.)
I'm sure there are reasons the implementation might not be easy, but conceptually this seems fixable. You just need a lower bound of "how bad can this plan be?" and that doesn't require completing the query, just observing that it's been running for a long time and/or used unreasonable amounts of resources.
Also, is the problem with the 50 join query that the planner screws it up, or that it's fundamentally doing too much?
You'd still need analyze to gather table statistics to have the planner produce plans prior to getting any feedback from the executor. So, before getting feedback, the quality of the plans needn't be worse than they are today.
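For example, the existing statistics machinery would still be doing the heavy lifting before any feedback arrives (table/column names here are made up):

    -- Gather a larger sample and more MCVs for a column the planner keeps
    -- misjudging, then refresh the stats.
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
    ANALYZE orders;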
For many queries, even setting X=1 would probably have big benefits. If it takes far longer than expected to find the first result, it's probably time for a new plan.
Implementing only the X=1 case would also dramatically simplify the design of such a feature. Limiting it to read only queries would also make everything much simpler.
I agree. The primary area where bad estimates bite us is when some path is estimated to return 1 row. When we join to that, a Nested Loop looks like a great option. What could be faster than joining to 1 row?! It just does not go well when 1 row turns into more than 1. We'd realise that the 1-row estimate was wrong by the time we got to row 2, so queuing 1 row would likely cover the majority of cases.
I don't know how common queries with ORDER BY vs no ORDER BY are, but it sounds like a first implementation that only works when an ORDER BY is present would be easier and still useful? Or do you think that's not common enough to justify the effort?
You have to remember that just because the query has an ORDER BY, it does not mean the rows come out in a deterministic order. There'd need to be at least one ORDER BY column that provably contains unique values. Of course, you could check for that, but I don't think that's the end of the complexity. Things like SKIP LOCKED skip over rows which we can't immediately lock. If on the first pass we couldn't lock the lowest-ordered row and output the 2nd instead, then aborted and replanned, and on the retry the 1st row was no longer locked, we'd output the rows in the wrong order. It's probably possible to figure out all these cases and not do it when there's some hazard, but it sounds very tricky and bug-prone to me.
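To spell out the SKIP LOCKED hazard (job-queue style example, names made up):

    -- Which row comes back depends on what other sessions hold locks on at
    -- that moment, so a replanned retry could legitimately return a different
    -- "first" row even with a unique ORDER BY.
    SELECT * FROM jobs
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1;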
However, at least some ORDER BY queries will have a final sorting step before the client sees the first row. (If you're sorting on a column with an index, it may be able to produce the data in sorted order; if not, though, the final sort seems unavoidable.)
Which means that the capacity to buffer up data until it's all produced is present. But it might still be awkward to make it the default.
I think the first step to making improvements in this area is to have the planner err on the side of caution more often. Today it's quite happy to join using a Nested Loop when it thinks the outer side of the join contains a single row. We have various ways in the planner of judging how certain we are that the 1-row estimate will hold true during execution. An equality condition on a column with a unique index is, for example, a way we can be certain of getting <= 1 row. If, for example, the selectivity estimate concludes that 1 row will match for some WHERE clause containing several columns with independent statistics, then the certainty level goes down. It seems silly not to swap the join order and switch to a hash join for this. In the best case, we've had to build a hash table that stores just 1 row. Probing that won't be very expensive and could even be optimised further to skip hashing if we continually probe the same bucket for N probes. The cost of additional rows over the estimated 1 row then scales roughly linearly, rather than the quadratic scaling we'd have gotten with a Nested Loop. I imagine a setting which controls how much risk the planner is willing to take would allow users to maintain the status quo of the current costing model. I'd imagine not many would want that, however.
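In the meantime, the blunt tool people reach for when a 1-row estimate burns them is to disable the risky plan shape for the session (illustrative schema, not a recommendation):

    SET enable_nestloop = off;   -- heavy-handed: affects every join in the session
    EXPLAIN ANALYZE
    SELECT *
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    WHERE o.status = 'pending' AND o.region = 'EU';
    RESET enable_nestloop;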
> It may be easier to do something like queue up X tuples and only start sending them out once the queue is full. At that point we can say it's too late to replan and we're locked into the current plan. People can set X to whatever they want to extend the window in which the plan can change, at the expense of more memory and higher latency to the first tuple.
I think another way to think about it is to allow 'long planning' queries. I.e. where it is allowed to spend a second, or maybe a few seconds choosing the best plan. That may involve collecting more statistics or running a query for a little bit.
I've considered things like this before but not had time to take it much beyond that. The idea was that the planner could run with all expensive optimisations disabled on a first pass, then re-run with the more expensive optimisations enabled if the estimated total cost of the plan came out above some threshold. It does seem pretty silly to worry about producing a plan in a millisecond for, say, an OLAP query that's going to take 6 hours to complete. On the other hand, we don't want to slow down the planner too much for a query that executes in 0.1 milliseconds.
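The closest things we have today are the settings that cap join-search effort, which hint at what an "optimisation level" might generalise (values here are just examples):

    SET join_collapse_limit = 12;  -- consider more join orders in the exhaustive search
    SET from_collapse_limit = 12;
    SET geqo_threshold = 16;       -- defer the genetic-search fallback to larger queries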
There'd be a few hurdles to get over before we could get such a feature. The planner currently has a habit of making changes to the parsed query, so we'd either need to not do that, or make a copy of it before modifying it. The former would be best.
I don't think it's so much a matter of "expensive" as "unknowable".
For instance, if you have a left join (and let's say it can't use an index for whatever reason) - the optimal plan will probably be different if almost every row has matching rows in the join table than if only a few do.
I think the join search would remain at the same level of exhaustiveness for all levels of optimisation. I imagined we'd maybe want to disable optimisations that apply more rarely or are most expensive to discover when the planner "optimisation level" was set to lower settings. I suppose that would be things like LEFT JOIN removals and self-join removals. However, PostgreSQL does not have very many expensive optimisations that rarely help, so having an optimisation level might be more of a way of introducing more new optimisations that help fewer queries. Because we don't have a plan cache, there's been a focus on keeping the planner lean and having it not go to too much effort to optimise queries that are poorly written, for example.
Exactly. Web services of the '90s would let a user run a query which returns, say, 1 million rows, but view them in pages of perhaps 10 rows with next page/previous page links. The query runs only once, and the million-row result set is never created in full - a bit more of it is generated each time the user clicks next page. And if the user clicks previous page, the query needs to be reversible to regenerate previous results.
Very few people use that functionality today to my knowledge. Keeping a query 'open' uses hundreds of megabytes of RAM on the server, so most applications want the query done and ended right away.
If the sort order isn't fully determined by a query, can the query plan influence the result order? If so, what you're suggesting might be nearly impossible. The new query wouldn't be able to just skip the first N results, it would have to match each individual row against a dictionary of previously sent ones.
> and if progress isn't as fast as the planner expects, feed current progress info back to the planner (pages scanned, tuples matching), and replan with that new data.
That would require keeping track of those stats in every query execution. That has a price that may or may not be worth it.
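There's a small existing example of that trade-off: per-node timing in EXPLAIN ANALYZE is costly enough on some platforms that it got its own switch (the query here is just a placeholder):

    EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM big_table;
    EXPLAIN (ANALYZE, TIMING ON)  SELECT count(*) FROM big_table;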
And yes, you could make that behavior an option, but, for better or for worse, PostgreSQL tends to be opposed to having queries indicate how it should do its work.
Alternatively, associate some confidence value with the statistics and make conservative choices when confidence is low and/or the expected difference is small.
Sometimes a sequential scan is faster than index lookups for each item. But the sequential scan is a risky gamble, whereas the index lookups have robust performance characteristics. It's not always clear which choice is the conservative one, but often it is.
I think it should always be clear which option would scale better when there are more rows than the estimated row count. We should always know this because we already cost for N rows, so it's possible to cost for N+1 rows and use the additional cost to calculate how each plan choice will scale when faced with more rows than expected.
What I think could potentially be done is to allow threshold-based alternate plans. For a pseudo example, “if subquery A returns 8 records or fewer, use this plan for subquery B, else use that plan.” It’s an explicit admission that the query planner doesn’t have enough information to make a good decision up front, but that the decision can easily be made at a later point in time, in the middle of execution.
Yes, I think so too. There is some element of this idea in the current version of PostgreSQL. However, it does not go as far as deferring the decision until execution; it's for choosing the cheapest version of a subplan once the plan has been generated for the next query level up. See fix_alternative_subplan() in setrefs.c. Likely it would be possible to expand that and have the finished plan contain the alternatives, switching between them according to which one is cheaper for the number of rows previous executions have seen. Maybe also tagging on some detail about the crossover row count where one becomes cheaper than the other, so that the executor can choose without having to think too hard about it, would be a good idea.
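For anyone curious, the query shape that produces those alternative subplans today looks roughly like this (made-up schema; the EXISTS needs to be somewhere it can't be flattened into a join, e.g. under an OR):

    -- The planner builds both a per-row subplan and a hashed subplan for the
    -- EXISTS, and fix_alternative_subplan() later keeps whichever looks cheaper.
    EXPLAIN
    SELECT *
    FROM orders o
    WHERE o.status = 'flagged'
       OR EXISTS (SELECT 1 FROM refunds r WHERE r.order_id = o.id);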