rheinwerk-verlag/pganonymize

Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...


... due to the use of raw parallelization.

To prove it, run the following:

import parmap
from pganonymizer.providers import FakeProvider

# Provider backed by Faker's unique proxy
provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    # Parallel run: parmap spreads the calls over worker processes
    parallel_values = parmap.map(provider.alter_value, base_values)
    # Serial run: everything happens in the current process
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


if __name__ == '__main__':
    pvals, svals = gen_values()
    # Verify uniqueness: both counts should equal qty
    print(len(set(pvals)), len(set(svals)))

On my machine, when I run it I get the following values:

$ python test.py
16 100

If it worked, the output should have been 100 100.

hkage commented

Hi, thanks for reporting this issue. I haven't had the time yet to take a look into it.

Disabling the parallel evaluation by setting pm_parallel to False, like you did in your fork, would also disable the whole parallelization during the import, if I understand the parmap documentation correctly?

It seems to me that parmap is used to parallelize just the application of the anonymizations to each row, by splitting the work across multiple threads/processes, and nothing else; but maybe I'm missing something?
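
To illustrate the mechanism, here is a minimal sketch (independent of pganonymizer; it assumes only that faker is installed) showing that each worker process keeps its own copy of Faker's uniqueness bookkeeping, so duplicates appear as soon as the work is split across processes:

import multiprocessing as mp

from faker import Faker

fake = Faker()


def unique_name(_):
    # fake.unique only tracks values generated in *this* process;
    # workers can't see each other's history of used values.
    return fake.unique.user_name()


if __name__ == '__main__':
    with mp.Pool(4) as pool:
        names = pool.map(unique_name, range(100))
    print(len(set(names)), 'unique out of', len(names))

On fork-based platforms the workers even inherit the parent's RNG state at fork time, so they tend to produce overlapping sequences, which would explain a result as low as 16 unique values out of 100.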

Same issue here: unique is unusable with the parallelization enabled, and after reading the code I come to the same conclusions as @azazel75 about the parmap usage.

How to move forward on this issue?

* add an option to disable parallelization (e.g. `--no-parallel`, `--no-optimization`, etc.); see the sketch after this list
* remove the parmap usage
  * do we need a benchmark to know the impact?
* remove or limit support of `unique` (e.g. a warning in the documentation/code)
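
As a hedged sketch of the first option: a hypothetical `--no-parallel` flag could be wired straight through parmap's pm_parallel switch. The flag name and the run_providers helper below are illustrative, not the actual pganonymize CLI or API:

import argparse

import parmap


def run_providers(alter_value, values, parallel=True):
    # pm_parallel=False makes parmap run serially in the current process,
    # so a single Faker instance sees every value it has handed out.
    return parmap.map(alter_value, values, pm_parallel=parallel)


def build_parser():
    parser = argparse.ArgumentParser(prog='pganonymize')
    parser.add_argument(
        '--no-parallel',
        dest='parallel',
        action='store_false',
        help='Disable multiprocessing (needed for reliable fake.unique.* providers)',
    )
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    print('parallel anonymization:', args.parallel)

Routing both modes through pm_parallel keeps a single code path: serial mode runs the exact same logic in the current process, where one Faker instance sees every generated value.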


Sorry for the late response. I didn't have the time to check whether this is an issue of parmap itself or only of the combination with faker. The parallelization was added as a contribution and a feature request for the anonymizer, so I guess removing it would not be an option for users who have to anonymize massive amounts of data.

If it turns out that uniqueness is not guaranteed with parmap, I would tend to disable it with an option flag or remove the support for unique.

But I am open to help (maybe there is a fix to get uniqueness with parmap?) and/or other suggestions.

> Sorry for the late response. I didn't have the time to check whether this is an issue of parmap itself or only of the combination with faker. The parallelization was added as a contribution and a feature request for the anonymizer, so I guess removing it would not be an option for users who have to anonymize massive amounts of data.

It's not an issue of parmap itself, but of the combination with faker.
To keep values unique, Faker needs to track which values have already been used, and the same Faker instance needs to be used throughout. But when doing multiprocessing with parmap, the Faker instance is recreated in each process.
Even if we pass a Faker instance as a parmap argument, the unique list won't be updated across processes.
To make unique work with multiple processes, we would need to share state between the processes, which is not trivial (and may impact performance).
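
For the record, a rough sketch of what sharing that state could look like, using a Manager-backed dict as a shared "used values" set. All names here are illustrative, nothing below is pganonymizer code, and the extra locking and IPC hints at the performance cost mentioned above:

import multiprocessing as mp

from faker import Faker

fake = Faker()


def init_worker():
    # Re-seed per process: with fork(), workers would otherwise inherit
    # identical RNG state and all generate the same candidate sequence.
    fake.seed_instance(mp.current_process().pid)


def unique_user_name(args, max_tries=1000):
    used, lock = args
    for _ in range(max_tries):
        candidate = fake.user_name()
        # The lock serializes check-and-insert, so two workers
        # can never claim the same value.
        with lock:
            if candidate not in used:
                used[candidate] = True
                return candidate
    raise RuntimeError('no unused value found')


if __name__ == '__main__':
    with mp.Manager() as manager:
        used = manager.dict()   # shared "set" of used values
        lock = manager.Lock()
        with mp.Pool(4, initializer=init_worker) as pool:
            names = pool.map(unique_user_name, [(used, lock)] * 100)
    print(len(set(names)), 'unique out of', len(names))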

> If it turns out that uniqueness is not guaranteed with parmap, I would tend to disable it with an option flag or remove the support for unique.

+1 for an option to enable/disable multiprocessing.
I know it is a breaking change, but what do you think about disabling multiprocessing by default and adding an option to enable it (with a warning about `unique`)? IMHO multiprocessing is a more specific use case than the `unique` feature.

Anyway, tell me your preference and I'll implement the flag.