The `UniformSynthesizer` should follow the sdtypes in metadata (not the data's dtypes)
npatki opened this issue · 0 comments
Environment Details
- SDGym version: 0.6.0 (latest)
What is expected
The `UniformSynthesizer` is expected to uniformly (randomly) create data within the observed ranges or categories.
- For `numerical` or `datetime` data, it should learn the min and max values during fit. Then during sample, it can create random, uniform data in the range.
- For `categorical` or `boolean` data, it should learn the possible categories during fit. Then during sample, it can randomly select categories with equal probability (i.e. make it uniform).
- For any other sdtype (such as `id`, `pii`, etc.), it can simply use the `RegexGenerator` or `AnonymizedFaker` to generate values from scratch (no learning or uniform sampling expected).
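To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not the SDGym implementation: `fit_column` and `sample_column` are hypothetical helper names, columns are plain lists, and the `id`/`pii` case is left out because it would be delegated to the existing generators.

```python
import random
from datetime import datetime


def fit_column(values, sdtype):
    """Learn only what uniform sampling needs for one column (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if sdtype in ("numerical", "datetime"):
        # Range-based sdtypes: remember the observed min and max.
        return {"sdtype": sdtype, "min": min(observed), "max": max(observed)}
    if sdtype in ("categorical", "boolean"):
        # Category-based sdtypes: remember the distinct values.
        # Sorting by repr keeps the learned order deterministic.
        return {"sdtype": sdtype, "categories": sorted(set(observed), key=repr)}
    raise ValueError(f"sdtype {sdtype!r} is generated from scratch, not learned")


def sample_column(params, num_rows, rng):
    """Draw uniform values from the learned range or categories."""
    if params["sdtype"] == "numerical":
        # Note: this returns floats even if the observed values were integers.
        return [rng.uniform(params["min"], params["max"]) for _ in range(num_rows)]
    if params["sdtype"] == "datetime":
        span = params["max"] - params["min"]  # a timedelta
        return [params["min"] + span * rng.random() for _ in range(num_rows)]
    # categorical / boolean: every category has equal probability
    return [rng.choice(params["categories"]) for _ in range(num_rows)]
```

A call would look like `sample_column(fit_column([3, 10, 7], "numerical"), 5, random.Random(0))`. The key point is that the `sdtype` argument comes from the caller, not from inspecting the values.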
How does this synthesizer know which type is which? It should use the provided `metadata` as the source of truth.
What is actually observed
Rather than using the metadata to understand the sdtypes, the code just allows the RDT to guess based on the dataframe. See this line.
The automatically-detected RDT config is not guaranteed to be correct. For example:
- The RDT will detect any integers as being numerical, but they may actually be categorical sdtypes or IDs
- The RDT will detect any strings as being categorical, but they may actually be datetimes, PII or ID types
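The two cases above can be illustrated with a toy example. The `guess_sdtype_from_values` function below is a hypothetical stand-in for dtype-based detection, not RDT's actual logic, and the column names and declared sdtypes are made up for illustration.

```python
def guess_sdtype_from_values(values):
    """Hypothetical stand-in for dtype-based guessing; NOT RDT's real detection."""
    # bool is a subclass of int in Python, so check it first.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Made-up columns: integers and strings whose true meaning only the metadata knows.
columns = {
    "user_id": [1001, 1002, 1003],
    "rating": [1, 5, 3],
    "signup_date": ["2021-01-05", "2021-02-11", "2021-03-20"],
}
declared_sdtypes = {"user_id": "id", "rating": "categorical", "signup_date": "datetime"}

# Collect every column where the guess disagrees with the declared sdtype.
mismatches = {
    name: (guess_sdtype_from_values(values), declared_sdtypes[name])
    for name, values in columns.items()
    if guess_sdtype_from_values(values) != declared_sdtypes[name]
}
```

In this toy setup all three columns end up in `mismatches`: the integer columns are guessed as `numerical` and the date strings as `categorical`.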
Instead of `detect_initial_config`, the synthesizer should be parsing the metadata and using the sdtype to decide what to do.
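One possible shape for the metadata-driven approach is sketched below. It assumes the metadata is available as a dictionary of the form `{"columns": {name: {"sdtype": ...}}}`; the function name `plan_from_metadata` and the strategy labels are hypothetical and only show the dispatch, not how SDGym would wire it in.

```python
# sdtypes grouped by the behavior the issue asks for
UNIFORM_RANGE_SDTYPES = {"numerical", "datetime"}
UNIFORM_CHOICE_SDTYPES = {"categorical", "boolean"}


def plan_from_metadata(metadata):
    """Map each column to a strategy using only the declared sdtype (hypothetical helper)."""
    plan = {}
    for name, column in metadata["columns"].items():
        sdtype = column["sdtype"]
        if sdtype in UNIFORM_RANGE_SDTYPES:
            plan[name] = "uniform_range"  # learn min/max, sample uniformly
        elif sdtype in UNIFORM_CHOICE_SDTYPES:
            plan[name] = "uniform_choice"  # learn categories, pick with equal probability
        else:
            plan[name] = "generate_from_scratch"  # e.g. id, pii: no learning
    return plan


# Made-up metadata for illustration (assumed dictionary layout).
example_metadata = {
    "columns": {
        "user_id": {"sdtype": "id"},
        "rating": {"sdtype": "categorical"},
        "signup_date": {"sdtype": "datetime"},
        "amount": {"sdtype": "numerical"},
    }
}
example_plan = plan_from_metadata(example_metadata)
```

Because the plan never looks at the dataframe, an integer `rating` column stays categorical and an integer `user_id` column is never treated as a numeric range.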