Nathaniel-Han/End-to-End-CardEst-Benchmark

Sub-plan Query

Hap-Hugh opened this issue · 8 comments

Hi,

Thanks for presenting your codes and such great instructions.

I have a fundamental question about the sub-plan query. In the paper, you claim that:

Every time the planner needs a cardinality estimation of a sub-plan query, the modified function “calc_joinrel_size_estimate” will immediately capture it. Then, we call each CardEst method to estimate the cardinalities of the sub-plan queries and inject the estimations back into PostgreSQL.

I understand this pipeline that the corresponding CardEst will be collected and injected when Postgres is asking. However, in this repository, the generation of sub-plan queries is offline (submit the query itself to Postgres and record all related sub-queries). So it looks like this implementation is different from your paper's method. I think the sub-plan query space will change after you inject the CardEst into Postgres.

I am not very familiar with the generation of sub-queries. So please correct me if I am wrong. Look forward to your reply.

Hi,
this is how modern cbo-based optimizers work: Firstly they enumerate the whole plan space and secondly pick the best execution plan. The CardEst(and CostEst) will only affect the second step. Besides, for a certain version of Postgres(with a specific implementation) and for a specific query, the plan space is fixed. Therefore, we can first send the a set of test queries, dump all the sub_plan queries, and let different CardEst models give their estimates for these sub_plan queries.
And I should mention this strategy is only a hack way for CardEst evaluation.

Thanks for your reply.

I am still confused if the Postgres will generate the whole plan space. I run a simple test based on the STATS dataset, using the same codes, and follow all the instructions.

I firstly set:
print_single_tbl_queries=true; set print_sub_queries=true;

Then let Postgres run:
Explain SELECT COUNT(*) FROM votes as v, posts as p, badges as b, users as u WHERE p.Id = v.PostId AND u.Id = p.OwnerUserId AND u.Id = b.UserId;

In the output .txt file, I only found the following sub-plan queries:
(p, v) (b, p) (u, p) (u, b) (b, (v, p)) (u, (v, p)) (u, (p, b)) (u, (v, p, b))

So it looks like Postgres doesn't enumerate all of the sub-plan queries, however, it only selects the sub-query based on the cardinality estimation before. E.g., (u, (v, p, b)) is the only 4-table-join in the output and Postgres select it because of the previous CardEst results. While in fact, the real plan space should be much larger than these 8 :(

Please feel free to tell me if there is anything I am missing or if I didn't do it in the correct way. I am also wondering how can we get the whole plan (sub-plan query) space from Postgres.

When I talk with "whole plan space", I mean what a specific Postgres will generate for a query. It is not the theoretical whole plan space. That's why for some queries, even true cardinality can't guarantee the optimal performance. About the phenomenon you mentioned, I am wondering what version of Postgres you are using.

I am using Postgres 13.1 right now. I believe this is the same one mentioned in this repo.

Hi @Hap-Hugh ,
Thanks for your interest in our work. I have checked your case and following is our explanation:
1.First of all, in your case, Postgres do generate the theoretically whole plan space. The four tables in your query are not joined with the star format, therefore there are no join relationship between some tables such as (v,b) and (v,u). As a remind, Postgres can automatically detect both explicit join relationships(such (p,v), (u,p), (u,b) in your example) and implicit join relationships(such as (b,p) in your example). Please check EquivalenceClass in the kernel if you are interested how PG handle with implicit join.
2. In the architecture of modern query optimizers(including PG), the generation of sub_plan queries are definitely not dependent on the results of CardEst. The classical strategy is to first generate the theoretically whole plan space and then perform DP algorithm to pick the cheapest one. In most cases(If not all), the results of CardEst will definitely not affect the enumeration of possible plans. Please note, one output of our sub_plan queries only represents one request of CardEst, and not a true sub_plan in the plan space. You can see if we have three tables A,B,C, there will be three paths ((A,B),C), (A,(B,C)) and ((A,C), B). All three paths need a CardEst of (A,B,C). However, the kernel will only trigger one CardEst request of (A,B,C). You can confirm this by checking the function build_join_rel. It can show you how the kernel handle a true enumerated (sub) plan. And the DP algorithm is performed in make_rel_from_joinlist.

Thanks, Yuxing. This makes a lot more sense to me.

Really appreciate your answers.

By the way, I saw another comment from Ziniu previously. Just want to make sure that the off-line method has no problem now and we don't need to change the implementation to on-the-fly for testing the CardEst. Is that right?

@Hap-Hugh Hi Haibo, please ignore my previous comments as I did not write this code base. Sorry for the misleading message (already deleted).

Got it! Thank you, guys!

We could close this issue now. Cheers.