Smile-SA/gdpr-dump

Allowed memory size exhausted during dump

DanieliMi opened this issue ยท 22 comments

Preconditions

GdprDump Version: 2.2.0

PHP Version: 7.4.25

Database Version: 10.3.31-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04

Steps to reproduce

  1. Shopware 6 database
  2. Config:
    gdpr-dump.yml
database:
    host: '%env(DATABASE_HOST)%'
    port: '%env(DATABASE_PORT)%'
    user: '%env(DATABASE_USER)%'
    password: '%env(DATABASE_PASSWORD)%'
    name: '%env(DATABASE_NAME)%'

dump:
    hex_blob: true
    skip_definer: true

tables:
    acl_user_role:
        truncate: true
    cart:
        truncate: true
    customer:
        truncate: true
    customer_address:
        truncate: true
    customer_recovery:
        truncate: true
    dead_recovery:
        truncate: true
    enqueue:
        truncate: true
    google_shopping_account:
        truncate: true
    google_shopping_merchant_account:
        truncate: true
    integration:
        truncate: true
    log_entry:
        truncate: true
    message_queue_stats:
        truncate: true
    newsletter_recipient:
        truncate: true
    order:
        truncate: true
    order_address:
        truncate: true
    order_customer:
        truncate: true
    product_review:
        truncate: true
    refresh_token:
        truncate: true
    scheduled_task:
        truncate: true
    user:
        truncate: true
    user_recovery:
        truncate: true
    version:
        truncate: true
    version_commit:
        truncate: true
    version_commit_data:
        truncate: true
  1. Create dump: gdpr-dump gdpr-dump.yaml | sed 's$VALUES ($VALUES\n($g' | sed 's$),($),\n($g' > sql/master.sql

Expected result

A dump is created.

Actual result

PHP Fatal error

PHP Fatal error:  Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in phar:///usr/local/bin/gdpr-dump/src/Database/TableDependencyResolver.php on line 2
Fatal error: Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in phar:///usr/local/bin/gdpr-dump/src/Database/TableDependencyResolver.php on line 2

I'll try to clone the repo and debug into the TableDependencyResolver. It seems to me there might be an cyclic dependencies.

When I remove

    user:
        truncate: true

I am getting the following error:

SQLSTATE[21000]: Cardinality violation: 1241 Operand should contain 1 column(s)
SET autocommit=0;

When removing

    order:
        truncate: true

as well it works just fine. I am a bit confused in how I can figure out what the problem is with those tables.

amenk commented

Debugging this together with Daniel...

It is possible with

SELECT *
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_NAME = 'user'
AND CONSTRAINT_SCHEMA = 'your_database_goes_here'
GROUP BY TABLE_NAME

to find out the forein keys.

But adding all the tables returned here (and the same for orders) also does not help.

amenk commented

The problem is in this recursive function.

If we add

|| isset($resolved[$dependencyTable][$tableName])

to

if ($dependencyTable === $tableName) {

the resolver seems to finish but another problem appears.

Current assumption is that it is not detecting cyclic dependencies which are like A -> B -> A but only A -> A.

Because we have

  • sales_channel_domain -> sales_channel

and also

  • sales_channel -> sales_channel_domain

@guvra Question: What is the purpose of the foreign key resolving? If we just do nothing in the fuction (return $resolved;) at the beginning a dump is created, but we do not know how consistent it is.

We created text file of the foreign-key-relations for the graphviz tool and ran the nodes echod in that function through sort and uniq, then visualized with dot

This is a schema of the dependencies:
dependencies

When trying to avoid adding a dependency that has been already resolve like this:

            // Detect cyclic dependencies
            if ($dependencyTable === $tableName || isset($resolved[$dependencyTable][$tableName])) {
                continue;
            }

We're getting the following error:

Fatal error: Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in vendor/doctrine/dbal/lib/Doctrine/DBAL/Query/Expression/CompositeExpression.php on line 76
PHP Fatal error:  Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in vendor/doctrine/dbal/lib/Doctrine/DBAL/Query/Expression/CompositeExpression.php on line 76
guvra commented

Hi,

When gdpr-dump removes rows (truncate: true or limit: xx), it also removes all related rows from other tables.

Table dependencies are resolved in the class TableDependencyResolver.
The rows are then filtered out in the class TableFilterExtension.

Cyclic dependencies are detected to avoid infinite loops.
However, only cyclic dependencies on the same table are detected (table with a foreign key on itself).

The situation where two tables depend on each other is not currently detected by the dependency resolver.

So to solve this problem, you need to either update the table dependency resolver, or disable the table filter extension (there's not config for that yet, it has to be done manually in the code).

The cardinality violation error seems to be a different issue though.

amenk commented

@guvra Thanks for the explenation, that seems like a pretty cool feature.

We think we fixed the first issue with || isset($resolved[$dependencyTable][$tableName]) and will make a pull request.
@DanieliMi Will dig into the Cardinality Problem in CompositeExpression. It does not seem to happen, when there are no dependencies resolved so it seems to be slightly realted at least.

I've pushed a WIP commit. I did not look into the followup issue mentioned in #57 (comment).

Fatal error: Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in vendor/doctrine/dbal/lib/Doctrine/DBAL/Query/Expression/CompositeExpression.php on line 76
PHP Fatal error:  Allowed memory size of 4294967296 bytes exhausted (tried to allocate 20480 bytes) in vendor/doctrine/dbal/lib/Doctrine/DBAL/Query/Expression/CompositeExpression.php on line 76

Is due to a cyclic dependency within \Smile\GdprDump\Dumper\Mysqldump\TableFilterExtension::addDependentFilter

In the case of the cycle $subQuery->getMaxResults() (ref) return null because the sql of $subQuery is null.

If I change the condition to > 0 I am running into an SQL Eror:

SQLSTATE[21000]: Cardinality violation: 1241 Operand should contain 1 column(s)
guvra commented

It seems there's a mismatch between $dependency->getForeignColumns() and $dependency->getLocalColumns() somewhere in your database.

You could easily check this by adding var_dumps in the addDependentFilter function (or using xdebug).

I couldn't find any mismatch but I found cases with two foreign keys between two tables like this:

table               localColumns                     foreignTable foreignColumns
product_translation [product_id, product_version_id] product      [id, version_id]

I think there are in total 3 issues here:

  1. The dependency cycle in which leads to the memory exhaustion. I did manage to reproduce it in the unit test and fixed it accordingly.
  2. Another dependency cycle. This time in \Smile\GdprDump\Dumper\Mysqldump\TableFilterExtension::addDependentFilter. @amenk and I believe we have a solution for this as well but it is a bit more tricky since I am not understanding the whole process here and I did not manage to write a failing test for it.
  3. The Cardinality violation issue. I think this is due to the two foreign keys for two tables. \Smile\GdprDump\Dumper\Mysqldump\TableFilterExtension::getColumnsSql is used to get the columns for the select in the sub query and to get the columns used as left expression in comparison. The method adds parentheses around the columns which is no problem in the comparison and is no problem when it's one column in the select but if you wrap two columns in parentheses you will produce the Cardinality violation error. Example:
SELECT (`id`) FROM `products`; #works
SELECT (`id`, `name`) FROM `products`; #error

@guvra I will provide a PR with solutions for all 3 cases. The first case should be clear with the test but as I said I did not manage to create a failing test (I tried to create two tables that reference each other like A -> B, B -> A) maybe you have an idea? For the third case I do not have time to create a test right now but it should be fairly straight forward. I think you know the code the best and might have an idea whether our solutions make sense or are completely off.

guvra commented

Just to make sure I understood your problems, could you test if the commit I just added fixes all your issues?
I looked at your commit and I'm not sure why you need to add a stopping condition in addDependentFilter.

@guvra I can dump without any error on the branch fix-multi-column-fk and I compared a dump from your branch with #58 and they are identical:

$ diff our_master.sql master.sql 
6c6
< -- Date: Fri, 05 Nov 2021 09:13:39 +0100
---
> -- Date: Fri, 05 Nov 2021 09:14:45 +0100
8858c8858
< -- Dump completed on: Fri, 05 Nov 2021 09:13:39 +0100
---
> -- Dump completed on: Fri, 05 Nov 2021 09:14:45 +0100

I wondered if changing $subQuery->getMaxResults() !== 0 to $subQuery->getMaxResults() > 0 would be enough. I tried that yesterday but got some major differences in the created dumps. Maybe I had some other changes with influences at the time I tried.

guvra commented

@DanieliMi OK thanks for testing, so this commit solves all the issues you had?

If that's the case I can add a few functional tests and create a new release that includes this commit.

@DanieliMi OK thanks for testing, so this commit solves all the issues you had?

Yes it does.

If that's the case I can add a few functional tests and create a new release that includes this commit.

That would be amazing.

guvra commented

@DanieliMi I'm reopening this, after adding more functional tests, I realized there is a regression if I replace !== 0 with > 0 in the table filter extension.

guvra commented

It seems that getMaxResults returns null when the query matches all the rows of the table.
So the if statement must be evaluated to true if the function returns null.

Changing the condition to > 0 was not a real fix.

guvra commented

Fixed in commit 0291602 on master branch

@DanieliMi could you test this fix?
It should fix all issues without introducing any regression.
I added a lot of unit / functional tests to make sure that cyclic dependencies are properly detected, and that filters are also properly applied.

Hi @guvra I tested your commit and I did not run into any issues. The issues are solved. Thank you very much for your effort!

guvra commented

@DanieliMi OK thanks, I've released version 2.2.1 ๐Ÿ‘

Hopefully it will be easier to fix this kind of error in the future, there are a lot more unit/functional tests, and the complex functions are now documented (in the PHPDoc of the function, e.g. getDependencies in TableDependencyResolver).