Correct Partition Handling [feature/partitions]
Closed this issue · 9 comments
Hello,
I love your tool and already used it quite frequently. However, in a current use case, a "bug" came to our mind.
When sampling a database, the data of partitions seem to be dumped in a wrong way. This results in the same data being present in the parent table and in the respective partition, which, in our scenario with constraints, results in duplicate key errors.
One possibility would be to detect if the table to dump is a parent of a partition and, if yes, skip the data-dump. Hence, all data would be in the children partitions and no constraint problems shall occur.
Keep up with your cool program and thanks for your efforts.
Hello! Sorry for the delay in responding.
I have not used partitions. Would it be possible for you to construct a small standalone schema that can reproduce the issue?
Hey,
sure, I would be very happy to help you. I would even make a PR or something like that but I am not into perl :)
Give me some time over the weekend.
Have a nice weekend
Here we go:
CREATE SCHEMA sample_partition_test;
CREATE TABLE sample_partition_test.parent (
id int PRIMARY KEY
) PARTITION BY RANGE(id);
CREATE TABLE sample_partition_test.child PARTITION OF sample_partition_test.parent FOR values from (0) to (100);
INSERT INTO sample_partition_test.parent VALUES (10);
If you then pg_sample the respective database the following will be in the dump:
COPY "sample_partition_test"."child" FROM stdin;
10
\.
COPY "sample_partition_test"."parent" FROM stdin;
10
\.
This results in a primary key constraint violation error, because basically, what this did, is insert duplicate data.
One possibility how one could handle this kind of stuff is quite easy i think: One could check if the table which data is about to be dumped is a parent table of a partition and, if yes, omit dumping of data. Hence, data would be dumped only into the child partitions.
You can easily check parent-children relationships using this SQL:
SELECT
inhrelid::regclass::text AS child,
inhparent::regclass::text AS parent
FROM pg_catalog.pg_inherits
Hence, with filtering of results or an added WHERE condition you could check for the occurrence of a table in the "parent" column accordingly. One culprit here is that tables that are in the public schema miss the .
naming schema in the columns and are only listed asWe currently address this bug by postprocessing the dump and basically doing what i tried to describe above. But this is basically too much effort and normally the wrong place to address this problem.
If i can be of any more help just write a comment, I will try to keep an eye on the issue
Fantastic @Zyntogz, thank you!
I discovered that declarative partitioning was added in Pg 10. In reading on it, seems like it has evolved from inheritance. So I'm wondering if we can devise a fix for inheritance that would work for declarative partitioning as well.
I wrote this test:
CREATE TABLE partition_parent (
id int PRIMARY KEY
);
CREATE TABLE partition_child1 (CHECK (id BETWEEN 0 AND 10))
INHERITS (partition_parent);
CREATE TABLE partition_child2 (CHECK (id BETWEEN 10 AND 20))
INHERITS (partition_parent);
INSERT INTO partition_child1 VALUES (5);
INSERT INTO partition_child2 VALUES (15);
And with pg_sample on that, you do end up with duplicate id values.
try2=# select * from partition_parent ;
id
----
5
15
5
15
(4 rows)
try2=# select * from only partition_parent ;
id
----
5
15
(2 rows)
For the test suite, it would be nice if we could that raise an exception. I suppose we could just count duplicate rows in the parent within our tests but that feels a tad unsatisfying. Any thoughts? I guess the declarative partitioning is much better at enforcing the constraints.
Hey @Zyntogz, please try out the feature/partitions branch. All I did was quality all SELECTs as SELECT FROM ONLY. Wondering if that if sufficient. It fixes the above inheritance case.
nvm, I just tested it with pg 13 and the problem still exists
ok, give it try now please.
Checked on this, even with a simple multi-layered partitioned table like this:
CREATE SCHEMA sample_partition_test;
CREATE TABLE sample_partition_test.parent (
id int PRIMARY KEY
) PARTITION BY RANGE(id);
CREATE TABLE sample_partition_test.child PARTITION OF sample_partition_test.parent FOR values from (0) to (100) partition by range(id);
CREATE TABLE sample_partition_test.child_2 PARTITION OF sample_partition_test.child FOR values from (0) to (50);
INSERT INTO sample_partition_test.parent VALUES (10);
seems to work like a charm, great! What was the approach you used for solving this? Really can't derive this from the Code :D
On release 1.12