mla/pg_sample

Correct Partition Handling [feature/partitions]

Closed this issue · 9 comments

Hello,

I love your tool and already used it quite frequently. However, in a current use case, a "bug" came to our mind.

When sampling a database, the data of partitions seem to be dumped in a wrong way. This results in the same data being present in the parent table and in the respective partition, which, in our scenario with constraints, results in duplicate key errors.

One possibility would be to detect if the table to dump is a parent of a partition and, if yes, skip the data-dump. Hence, all data would be in the children partitions and no constraint problems shall occur.

Keep up with your cool program and thanks for your efforts.

mla commented

Hello! Sorry for the delay in responding.

I have not used partitions. Would it be possible for you to construct a small standalone schema that can reproduce the issue?

Hey,

sure, I would be very happy to help you. I would even make a PR or something like that but I am not into perl :)

Give me some time over the weekend.

Have a nice weekend

Here we go:

CREATE SCHEMA sample_partition_test;

CREATE TABLE sample_partition_test.parent (
	id int PRIMARY KEY
) PARTITION BY RANGE(id);

CREATE TABLE sample_partition_test.child PARTITION OF sample_partition_test.parent FOR values from (0) to (100);

INSERT INTO sample_partition_test.parent VALUES (10);

If you then pg_sample the respective database the following will be in the dump:

COPY "sample_partition_test"."child" FROM stdin;
10
\.

COPY "sample_partition_test"."parent" FROM stdin;
10
\.

This results in a primary key constraint violation error, because basically, what this did, is insert duplicate data.
One possibility how one could handle this kind of stuff is quite easy i think: One could check if the table which data is about to be dumped is a parent table of a partition and, if yes, omit dumping of data. Hence, data would be dumped only into the child partitions.

You can easily check parent-children relationships using this SQL:

SELECT 
	inhrelid::regclass::text AS child,
	inhparent::regclass::text AS parent
FROM pg_catalog.pg_inherits

Hence, with filtering of results or an added WHERE condition you could check for the occurrence of a table in the "parent" column accordingly. One culprit here is that tables that are in the public schema miss the .

naming schema in the columns and are only listed as
instead, hence here additional attention could be required. Example: "public.test" occurs as "test", while "test.test" occurs as "test.test" in the columns of the above query.

We currently address this bug by postprocessing the dump and basically doing what i tried to describe above. But this is basically too much effort and normally the wrong place to address this problem.

If i can be of any more help just write a comment, I will try to keep an eye on the issue

mla commented

Fantastic @Zyntogz, thank you!

I discovered that declarative partitioning was added in Pg 10. In reading on it, seems like it has evolved from inheritance. So I'm wondering if we can devise a fix for inheritance that would work for declarative partitioning as well.

I wrote this test:

CREATE TABLE partition_parent (
  id int PRIMARY KEY
);

CREATE TABLE partition_child1 (CHECK (id BETWEEN 0 AND 10))
  INHERITS (partition_parent);

CREATE TABLE partition_child2 (CHECK (id BETWEEN 10 AND 20))
  INHERITS (partition_parent);

INSERT INTO partition_child1 VALUES (5);

INSERT INTO partition_child2 VALUES (15);

And with pg_sample on that, you do end up with duplicate id values.

try2=# select * from partition_parent ;
 id 
----
  5
 15
  5
 15
(4 rows)

try2=# select * from only partition_parent ;
 id 
----
  5
 15
(2 rows)

For the test suite, it would be nice if we could that raise an exception. I suppose we could just count duplicate rows in the parent within our tests but that feels a tad unsatisfying. Any thoughts? I guess the declarative partitioning is much better at enforcing the constraints.

mla commented

Hey @Zyntogz, please try out the feature/partitions branch. All I did was quality all SELECTs as SELECT FROM ONLY. Wondering if that if sufficient. It fixes the above inheritance case.

mla commented

nvm, I just tested it with pg 13 and the problem still exists

mla commented

ok, give it try now please.

Checked on this, even with a simple multi-layered partitioned table like this:

CREATE SCHEMA sample_partition_test;

CREATE TABLE sample_partition_test.parent (
	id int PRIMARY KEY
) PARTITION BY RANGE(id);

CREATE TABLE sample_partition_test.child PARTITION OF sample_partition_test.parent FOR values from (0) to (100) partition by range(id);

CREATE TABLE sample_partition_test.child_2 PARTITION OF sample_partition_test.child FOR values from (0) to (50);

INSERT INTO sample_partition_test.parent VALUES (10);

seems to work like a charm, great! What was the approach you used for solving this? Really can't derive this from the Code :D

mla commented

On release 1.12