The `UniformSynthesizer` should follow the sdtypes in metadata (not the data's dtypes)

Question

npatki opened this issue a year ago · 0 comments

The UniformSynthesizer is expected to uniformly (randomly) create data within the observed ranges or categories.

For numerical or datetime data, it should learn the min and max values during fit. Then, during sample, it can create random, uniformly distributed data within that range.
For categorical or boolean data, it should learn the possible categories during fit. Then, during sample, it can randomly select categories with equal probability (i.e. make it uniform).
For any other sdtype (such as id, pii, etc.), it can simply use the `RegexGenerator` or `AnonymizedFaker` to generate values from scratch (no learning or uniform sampling expected).
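The three rules above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the actual `UniformSynthesizer` code: `fit_column` and `sample_column` are hypothetical helpers, columns are plain Python lists, and the fallback branch only returns placeholder strings where the real synthesizer would hand off to `RegexGenerator` or `AnonymizedFaker`.

```python
import random
from datetime import datetime, timedelta


def fit_column(values, sdtype):
    """Learn only what uniform sampling needs for one column (hypothetical helper)."""
    if sdtype in ("numerical", "datetime"):
        # Range-like sdtypes: remember the observed bounds.
        return {"sdtype": sdtype, "min": min(values), "max": max(values)}
    if sdtype in ("categorical", "boolean"):
        # Category-like sdtypes: remember each distinct value once, in first-seen order.
        return {"sdtype": sdtype, "categories": list(dict.fromkeys(values))}
    # Any other sdtype (id, pii, ...): nothing is learned from the data.
    return {"sdtype": sdtype}


def sample_column(params, num_rows, rng):
    """Draw num_rows values according to what fit_column learned."""
    sdtype = params["sdtype"]
    if sdtype == "numerical":
        return [rng.uniform(params["min"], params["max"]) for _ in range(num_rows)]
    if sdtype == "datetime":
        span = (params["max"] - params["min"]).total_seconds()
        return [
            params["min"] + timedelta(seconds=rng.uniform(0, span))
            for _ in range(num_rows)
        ]
    if sdtype in ("categorical", "boolean"):
        # rng.choice gives every learned category the same probability.
        return [rng.choice(params["categories"]) for _ in range(num_rows)]
    # Placeholder only: stands in for values generated from scratch.
    return [f"generated-{i}" for i in range(num_rows)]


rng = random.Random(0)
ages = sample_column(fit_column([18, 42, 65], "numerical"), 5, rng)
plans = sample_column(fit_column(["free", "pro", "free"], "categorical"), 5, rng)
```

The key point is that the `sdtype` argument decides the branch, not the Python type of the values.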

How does this synthesizer know which type is which? It should use the provided metadata as the source of truth.

However, rather than reading the sdtypes from the metadata, the current code just lets the RDT guess them from the dataframe. See this line.

The automatically-detected RDT config is not guaranteed to be correct. For example:

The RDT will detect any integers as being numerical, but they may actually be categorical sdtypes or IDs
The RDT will detect any strings as being categorical, but they may actually be datetimes, PII or ID types
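To make the mismatch concrete, here is a small sketch. The `guess_sdtype_from_values` function is a deliberately crude stand-in written for this example (it is not RDT's detection logic), and the column names and declared sdtypes are made up.

```python
def guess_sdtype_from_values(values):
    """Crude value-based guess; a stand-in for automatic detection, not RDT's logic."""
    # bool is a subclass of int in Python, so check it first.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Hypothetical columns and the sdtypes a user might declare for them in metadata.
columns = {
    "zip_code": [10001, 94105, 60601],
    "signup_date": ["2021-01-05", "2021-02-11", "2021-03-20"],
    "user_id": ["U-001", "U-002", "U-003"],
    "age": [31, 45, 27],
}
declared = {
    "zip_code": "categorical",
    "signup_date": "datetime",
    "user_id": "id",
    "age": "numerical",
}

# Columns where the value-based guess disagrees with the declared sdtype.
mismatches = {
    name: (guess_sdtype_from_values(values), declared[name])
    for name, values in columns.items()
    if guess_sdtype_from_values(values) != declared[name]
}
```

In this made-up table, three of the four columns would be handled with the wrong strategy if only the values were inspected.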

Instead of calling `detect_initial_config`, the synthesizer should parse the metadata and use each column's sdtype to decide what to do.
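One possible shape for that, as a sketch only: the metadata is assumed here to be a plain dict of the form `{"columns": {name: {"sdtype": ...}}}`, and `plan_from_metadata` plus the strategy labels are hypothetical names chosen for illustration, not existing SDV or RDT functions.

```python
RANGE_SDTYPES = {"numerical", "datetime"}
CHOICE_SDTYPES = {"categorical", "boolean"}


def plan_from_metadata(metadata, column_names):
    """Map each column to a sampling strategy using only the declared sdtype."""
    plan = {}
    for name in column_names:
        if name not in metadata["columns"]:
            # Fail loudly rather than fall back to guessing from the data.
            raise KeyError(f"Column {name!r} is missing from the metadata")
        sdtype = metadata["columns"][name]["sdtype"]
        if sdtype in RANGE_SDTYPES:
            plan[name] = "uniform_range"    # learn min/max, sample uniformly
        elif sdtype in CHOICE_SDTYPES:
            plan[name] = "uniform_choice"   # learn categories, pick with equal probability
        else:
            plan[name] = "generate"         # id, pii, etc.: create values from scratch
    return plan


# Assumed dict layout for the metadata; the column names are made up.
metadata = {
    "columns": {
        "zip_code": {"sdtype": "categorical"},
        "signup_date": {"sdtype": "datetime"},
        "user_id": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
    }
}
plan = plan_from_metadata(metadata, ["zip_code", "signup_date", "user_id", "age"])
```

Raising on a column that is absent from the metadata is a design choice in this sketch; it keeps the metadata as the only source of sdtype information.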