davidrowley's comments | Hacker News

The Oversized-Attribute Storage Technique. https://www.postgresql.org/docs/17/storage-toast.html
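
As a rough illustration (table and column names here are made up), the per-column TOAST strategy can be tweaked with ALTER TABLE:

    -- EXTERNAL stores large values out-of-line but uncompressed, which can make
    -- substring access on big text values cheaper; EXTENDED (the default) also compresses
    ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;

    -- MAIN prefers compression and keeps the value in the main table where possible
    ALTER TABLE documents ALTER COLUMN summary SET STORAGE MAIN;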


> Also PG has no true clustered indexes all tables are heaps which is something most use all the time in MSSQL, usually your primary key is also set as the clustered index so that the table IS the index and any lookup on the key has no indirection.

This is true, but I believe if you have an index-organised table then subsequent indexes would have to reference the primary key. With PostgreSQL, indexes can effectively point to the record in the heap by using the block number and item pointer in the block. This should mean, in theory, index lookups are faster in PostgreSQL than non-clustered index lookups in SQL Server.
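
As a small illustration (hypothetical table, and the output shown in the comments is only indicative), the heap location an index entry points at is visible through the ctid system column:

    -- ctid is (block number, item pointer within the block); a PostgreSQL index
    -- entry stores this pair directly, so a lookup can go straight to the heap page
    SELECT ctid, id FROM accounts LIMIT 3;
    --  ctid  | id
    -- -------+----
    --  (0,1) |  1
    --  (0,2) |  2
    --  (0,3) |  3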

I was wondering, is there a concept of Bitmap Index Scans in SQL Server?

PostgreSQL is able to effectively bitwise-AND and bitwise-OR bitmap index scan results from multiple indexes to obtain the subset of tuple identifiers (ctids) or blocks (in lossy mode) that should be scanned in the heap. Effectively, that allows single-column indexes to be used when the query has equality conditions on multiple columns that are each indexed individually. Does SQL Server allow this for heap tables? In theory, supporting both allows DBAs to choose, but I wonder how well each is optimised. It may lead to surprises if certain query plan shapes are no longer possible when someone switches to an IOT.
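
A minimal sketch of the PostgreSQL side of that, with a hypothetical table; the plan shape in the comment is the typical one, not output from a real run:

    CREATE INDEX ON orders (customer_id);
    CREATE INDEX ON orders (status);

    EXPLAIN SELECT * FROM orders
    WHERE customer_id = 42 AND status = 'shipped';
    -- Typically produces a BitmapAnd over the two single-column indexes:
    --   Bitmap Heap Scan on orders
    --     -> BitmapAnd
    --          -> Bitmap Index Scan on orders_customer_id_idx
    --          -> Bitmap Index Scan on orders_status_idx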


>This is true, but I believe if you have an index-organised table then subsequent indexes would have to reference the primary key. With PostgreSQL, indexes can effectively point to the record in the heap by using the block number and item pointer in the block. This should mean, in theory, index lookups are faster in PostgreSQL than non-clustered index lookups in SQL Server.

Yes, this is a tradeoff: if you primarily access by the clustered index it's faster; if you access by a secondary index it's slightly slower.

Clustered indexes work very well for tables that have a single access pattern; they can also save significant space and I/O, since the data is not duplicated across both a heap and an index.

Having the choice is great: PostgreSQL forces heaps and SQLite forces clustered tables, while MSSQL lets you do either based on your design.

>I was wondering, is there a concept of Bitmap Index Scans in SQL Server?

Yes you can see the description here: https://learn.microsoft.com/en-us/sql/relational-databases/s...


Yeah, this is similar to some semi-baked ideas I was talking about in https://www.postgresql.org/message-id/CAApHDvo2sMPF9m=i+YPPU...

I think it should always be clear which option would scale better for rows beyond the estimated row count. We should always know this because we already cost for N rows, so it's possible to cost for N+1 rows and use the additional costs to calculate how the plan choice will scale when faced with more rows than expected.


Yes, I think so too. There is some element of this idea in the current version of PostgreSQL; however, it does not go as far as deferring the decision until execution. It's for choosing the cheapest version of a subplan once the plan has been generated for the next query level up. See fix_alternative_subplan() in setrefs.c. It would likely be possible to expand that and have the finished plan contain the alternatives, switching between them according to which one is cheaper for the number of rows seen in previous executions. It might also be a good idea to tag on the crossover point, i.e. the row count at which one plan becomes cheaper than the other, so that the executor can choose without having to think too hard about it.


> If the sort order isn't fully determined by a query, can the query plan influence the result order?

Yes. When running a query, PostgreSQL won't make any effort to provide a stable order of rows beyond what's specified in the ORDER BY.
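
For example (hypothetical table), rows that tie under the ORDER BY can come back in any order, so a unique tiebreaker column is needed for a fully stable result:

    -- Rows sharing the same created_date may come back in any order,
    -- and that order can change between plans or executions
    SELECT * FROM events ORDER BY created_date;

    -- Adding a unique column as a tiebreaker makes the output order deterministic
    SELECT * FROM events ORDER BY created_date, id;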


There's some information about why that does not happen in https://www.postgresql.org/message-id/20211104234742.ao2qzqf...

In particular:

> The immediate goal is to be able to generate JITed code/LLVM-IR that doesn't contain any absolute pointer values. If the generated code doesn't change regardless of any of the other contents of ExprEvalStep, we can still cache the JIT optimization / code emission steps - which are the expensive bits.

A colleague is working on getting this patch into shape. So we might see some caching work get done after the relative pointer work is in.


I've considered things like this before but haven't had time to take it much beyond that. The idea was that the planner could run with all expensive optimisations disabled on the first pass, then re-run with the more expensive optimisations enabled if the estimated total cost of the plan was above some threshold. It does seem pretty silly to worry about producing a plan in a millisecond for, say, an OLAP query that's going to take 6 hours to complete. On the other hand, we don't want to slow down the planner too much for a query that executes in 0.1 milliseconds.

There'd be a few hurdles to get over before we could get such a feature. The planner currently has a habit of making changes to the parsed query, so we'd either need to not do that, or make a copy of it before modifying it. The former would be best.
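
There's no staged re-planning like that today, but for what it's worth PostgreSQL does already expose a few settings that trade planning effort for plan quality, e.g.:

    -- Lower values reduce the join-order search space (defaults are 8)
    SET join_collapse_limit = 4;
    SET from_collapse_limit = 4;

    -- For very large join problems, the genetic (heuristic) join search
    -- kicks in once the number of relations reaches geqo_threshold (default 12)
    SET geqo = on;
    SET geqo_threshold = 12;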


I don't think it's so much a matter of "expensive" as "unknowable".

For instance, if you have a left join (and let's say it can't use an index for whatever reason) - the optimal plan will probably be different if almost every row has a matching row(s) in the join table than if only a few do.


I think the join search would remain at the same level of exhaustiveness for all levels of optimisation. I imagined we'd maybe want to disable optimisations that apply more rarely or are most expensive to discover when the planner "optimisation level" was set to lower settings. I suppose that would be things like LEFT JOIN removals and self-join removals. However, PostgreSQL does not have very many expensive optimisations that rarely help, so having an optimisation level might be more of a way of introducing more new optimisations that help fewer queries. Because we don't have a plan cache, there's been a focus on keeping the planner lean and having it not go to too much effort to optimise queries that are poorly written, for example.


(blog author and Postgres committer here) I personally think this would be nice to have. However, I think the part about sending tuples to the client is even more tricky than you've implied above. It's worse because a new plan does not even guarantee that the same tuples are returned, e.g. if you wrote SELECT * FROM table LIMIT 10, there's no ORDER BY, so which tuples are returned is non-deterministic. It may be easier to do something like queue up X tuples and just start sending those out when the queue is full. When the queue is full, we can say that it's too late to replan and we're locked into the current plan. People can set X to what they want to increase the time that the plan can change at the expense of using more memory and higher latency to get the first tuple.


Or it could work in a way that the planner has access to data about previous runs of each query, and it can use this data to change plans that were proven bad during execution. This way, the first execution would be slow, but the planner could self-learn and do better next time. SQL Server has a bunch of similar features in its query optimizer: https://learn.microsoft.com/en-us/sql/relational-databases/p....

I'm not sure Postgres has infrastructure to do that, though, because it doesn't have shared plan cache, for example.


Also, many queries might be so slow they never complete, and therefore never populate the cache (think of those queries run by a data scientist with 50 JOINs).


I'm sure there are reasons the implementation might not be easy, but conceptually this seems fixable. You just need a lower bound of "how bad can this plan be?" and that doesn't require completing the query, just observing that it's been running for a long time and/or used unreasonable amounts of resources.

Also, is the problem with the 50 join query that the planner screws it up, or that it's fundamentally doing too much?


You'd still need ANALYZE to gather table statistics so the planner can produce plans prior to getting any feedback from the executor. So, before getting feedback, the quality of the plans needn't be worse than it is today.


> 50 JOINS

And no indexes.


For many queries, even setting X=1 would probably have big benefits. If it takes far longer than expected to find the first result, it's probably time for a new plan.

Implementing only the X=1 case would also dramatically simplify the design of such a feature. Limiting it to read only queries would also make everything much simpler.


I agree. The primary area where bad estimates bite us is when we estimate that some path will return 1 row. When we join to that, a Nested Loop looks like a great option. What could be faster than joining to 1 row?! It just does not go well when 1 row turns into more than 1. We'd realise that the 1-row estimate was wrong by the time we got to row 2, so queuing 1 row would likely cover the majority of cases.
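
A sketch of the sort of case where that happens, using a hypothetical table whose columns are strongly correlated; extended statistics are one existing mitigation:

    -- Per-column statistics treat city and postcode as independent,
    -- so the combined selectivity estimate can collapse to 1 row
    EXPLAIN SELECT * FROM addresses
    WHERE city = 'Wellington' AND postcode = '6011';

    -- Extended statistics let the planner see the dependency between the columns
    CREATE STATISTICS addresses_city_postcode (dependencies) ON city, postcode FROM addresses;
    ANALYZE addresses;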


I have no knowledge of how common queries with ORDER BY vs. no ORDER BY are, but it sounds like a first implementation which only works if an ORDER BY is present would be easier and still useful? Or do you think that's not common enough to justify the effort?


You have to remember that just because the query has an ORDER BY, it does not mean the rows come out in a deterministic order. There'd need to be at least one ORDER BY column that provably contains unique values. Of course, you could check for that, but then I don't think that's the end of the complexity. Things like SKIP LOCKED skip over rows which we can't immediately lock. If on the first attempt we couldn't lock the lowest-ordered row and output the 2nd instead, then aborted and replanned, and on the next attempt the 1st row was no longer locked, we'd output the rows in the wrong order. It's probably possible to figure out all these cases and not do it when there's some hazard, but it sounds very tricky and bug-prone to me.


However, at least some ORDER BY queries will have a final sorting step before the client sees the first row. (If you're sorting on a column with an index, it may be able to produce the data in sorted order; if not, though, the final sort seems unavoidable.)

Which means that the capacity to buffer up data until it's all produced is present. But it might still be awkward to make it the default.


I think the first step to making improvements in this area is to have the planner err on the side of caution more often. Today it's quite happy to join using a Nested Loop when it thinks the outer side of the join contains a single row. We have various ways in the planner of judging how certain we might be that the 1-row estimate will hold true during execution. An equality condition on a column with a unique index is, for example, a way we could be certain of getting <= 1 row. If, on the other hand, the selectivity estimate concludes that 1 row will match for some WHERE clause containing several columns with independent statistics, then the certainty level goes down. It seems silly not to swap the join order and switch to a Hash Join for this. Best case, we have to build a hash table to store 1 row. Probing that won't be very expensive, and it could even be optimised further to skip hashing if we continually probe the same bucket for N probes. The cost of additional rows over the estimated 1 row then scales much more linearly, rather than the quadratic scaling we'd have gotten with a Nested Loop. I imagine a setting which controls how much risk the planner is willing to take would allow users to maintain the status quo of the current costing model. I'd imagine not many would want that, however.
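
For anyone wanting to see that difference on a suspect query today, the planner's enable_* settings make it easy to compare the two join strategies by hand (an experiment, not a production setting; table names here are made up):

    -- enable_nestloop = off only discourages Nested Loop (it adds a large cost penalty),
    -- which is enough to see how a Hash Join behaves when the "1 row" side is really larger
    SET enable_nestloop = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.* FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE c.region = 'APAC';
    RESET enable_nestloop;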


SELECT x, random() FROM foo ORDER BY x;

SELECT foo.x, bar.y FROM foo, bar WHERE bar.z > random() ORDER BY foo.x;

SELECT x FROM foo ORDER BY x; -- Meanwhile, concurrent updates are happening to foo, so re-running the query gets a different result set.

"We choose to do this thing and the others, not because they are easy, but because we ask 'how hard can it be?'"


> It may be easier to do something like queue up X tuples and just start sending those out when the queue is full. When the queue is full, we can say that it's too late to replan and we're locked into the current plan. People can set X to what they want to increase the time that the plan can change at the expense of using more memory and higher latency to get the first tuple.

Maybe this would warrant new SQL syntax, like

    select * from table limit 10 queue X


> One part of this I found rather frustrating is the JIT in newer Postgres versions. The heuristics on when to use it appear not robust at all to me.

(Author of the blog here and Postgres committer.) I very much agree that the code that decides whether JIT should be used needs work. For PG16, it only takes into account the estimated total cost of the plan and does not take into account how many expressions need to be compiled. It's quite fast to compile a few expressions, but if you're querying a partitioned table with hundreds of partitions and the plan contains all those partitions, then the JIT compiler has a lot of work to do, yet the number of expressions is not considered. A colleague and I do have some code to improve this. Unsure if it'll make PG17 at this stage.
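
In the meantime, the behaviour can be inspected and tuned per session: EXPLAIN (ANALYZE) reports the JIT work that was done, and the cost threshold or the feature itself can be adjusted (the values below are only illustrative):

    -- When JIT fires, the plan output ends with a "JIT:" section showing the number of
    -- functions compiled and time spent on generation, inlining, optimization and emission
    EXPLAIN (ANALYZE) SELECT count(*) FROM measurements;

    -- Raise the threshold so only genuinely expensive plans are compiled,
    -- or disable JIT entirely for the session
    SET jit_above_cost = 500000;   -- default is 100000
    SET jit = off;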


If it's not ready for everyone, it probably shouldn't have been made the default, don't you think?


Perhaps, but it might be harsh to say it was the wrong decision at the time it was made, as partitioned tables are far more optimised now than when JIT was first worked on. It seems to me that most of the people who have issues with slow JIT times are hitting them with partitioned tables, where JIT is slow due to having to compile large numbers of expressions. However, maybe this is the place for me to find out that's not always the case. The JIT costing is likely to get an overhaul soon, and if all goes to plan, JIT will be considered per plan node rather than per plan.


FWIW, I've seen planning+query times triple with JIT on, with no partitioned tables involved. It just takes forever to JIT sometimes. (This was with Postgres 13 and 14, IIRC.)

Update: I checked some old IRC logs, and found a query that took 1476 ms without JIT and 8754 ms with JIT on. And that is execution time, not planning time!


Was there an indication of the number of functions compiled? There is work ongoing in this area, so feedback on this topic is very welcome on the PostgreSQL mailing lists.


Not that I can remember, no. I don't have the details anymore; I delegated it to others to report to -perform, and then they never did. :-)


Just to clarify. I'm the author of the blog. I work for Microsoft in the Postgres open-source team. All the work mentioned in the blog is in PostgreSQL 16, which is open-source.


Answers it nicely, thank you!


thanks for the work that your team does!

