ankane/dexter

Dexter to suggest to create or to create an extended statistics entry about a functional dependency

sfkeller opened this issue · 2 comments

This is perhaps out-of-scope but still worth mentioning as a nice feature:

Optimization assumes that column are independent, so a query plan can go wrong not because of a missing index, but because the optimizer doesn't know about functional dependencies (FD) between columns.

Real world datasets often have FD between columns because of denormalization or just because the data is like this.

PostgreSQL offers CREATE STATISTICS on functional dependencies and there are attempts to detect those, like fdtool (with paper) or this tool. Actually, this potentially hard problem to find FDs in a table/entity becomes linear, if applied only to given slow queries - as Dexter does!

Expected behaviour:

Looking e.g. at this test database from ATP tennis tour given a slow query involving tourney_id, tourney_name in the WHERE clause, Dexter could suggest to do a CREATE STATISTICS on tourney_id, tourney_name like this:

create statistics atp_matches_2019_tourney_id_tourney_name (dependencies) 
	on tourney_id, tourney_name 
	from atp_matches_2019;

analyze atp_matches_2019; -- update extended statistics

select * from pg_statistic_ext; -- now shows the entry of "atp_matches_2019_tourney_id_tourney_name"

Thus, hoping that the planner next time considers this dependency.

BTW: CREATE STATISTICS currently also supports ndistinct, which enables n-distinct statistics, and mcv which enables most-common values lists.

Hey @sfkeller, thanks for the suggestion and the links! If there's an extension to add hypothetical statistics, I think Dexter could use the same approach it uses for indexes. I'm not sure how difficult hypothetical statistics are, but may be worth suggesting to HypoPG.

I'll suggest hypothetical statistics to HypoPG (see HypoPG/hypopg#49
) - and eventually come back to this issue, if there's any response over at HypoPG.

-- Stefan