sdv-dev/SDV

Hash ID values are not detected or generated properly, and are not randomized.

Closed this issue · 4 comments

Hello,

I am trying to generate mock CSV files using real data from an existing CSV. My use case involves continuously generating these CSV files, which I later ingest into another system. Each generated CSV needs to be unique while still adhering to the patterns and structure of the original data.
Here is my code:

import pandas as pd
import numpy as np
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

np.random.seed(12)
data = pd.read_csv('customer_data.csv', sep=';')

metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test')

synthesizer = GaussianCopulaSynthesizer(metadata)

synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)

synthetic_data.to_csv('synthetic_data.csv', index=False, sep=';')

Parts of the CSV contain columns with hash-like values. For example:

TRANSID
004560009F78964B55AC1EEFA2EA073A7E21BF43
005040009F78964B55AC1EDFA2EA2758C8B2C075
005040009F78964B55AC1EDFA2EA2758C8B2C075

The problem I am facing is as follows:

  1. Hash Generation:
  • The values generated for TRANSID are not actual hashes. Instead, I get values like:
  • sdv-pii-y3j8g, sdv-pii-efvwa, etc.
  2. Reproducibility Issue:
  • The generated values for TRANSID are always the same across executions. For instance, I consistently get:
  • sdv-pii-y3j8g, sdv-pii-efvwa, etc.

I have tried several approaches, including the suggestions in this ticket, but nothing has worked so far. Additionally, I attempted to update the column with a custom regex_format like this:

metadata.update_column(
    column_name='TRANSID',
    sdtype='id',
    regex_format='[A-Fa-f0-9]{40}')

While this approach produces hash-like values, they are still identical across executions and look like this:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAd
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

My Questions:

  1. Is it possible to work with hash-like values in SDV, ensuring they follow the correct format (e.g., [A-Fa-f0-9]{40})?
  2. If yes, can SDV detect correlations between repeated hashes in the original dataset (as these hashes often represent IDs) and generate mock data with repeated hashes in the appropriate contexts?

Thank you for your help!

Hi @Ilevy80 👋

I have a few questions to better understand your requirements for synthetic data.

  • Do you want your synthetic data to mirror the exact same values in your real data (which would follow the hash pattern) or do you want new values that follow the hash pattern?
  • Can you expand more on what you mean by "correlations between repeated hashes in the original dataset"? Would you like the synthetic data to mirror the same frequencies of hash ID values in your real data? Or correlations between rows belonging to a specific hash ID and other columns? Or something else entirely?

When using the SDV, the sdtypes you assign and, potentially, the pre-processing transformers the SDV uses for each column play a significant role in how synthetic values are generated. Depending on your requirements, I can provide more directed guidance on both of these!

Re: reproducibility (your 2nd question)

Which parts of the code are being re-run each time? If you want different synthetic data from the same synthesizer, then we recommend running fit() once, optionally saving the synthesizer object to disk, and then only re-running sample() each time you want more synthetic data.

If you run fit() and then sample() on every run, we don't guarantee that different synthetic data will be generated. Your best bet is to re-run only the sampling part of your code. This is easiest to see with columns assigned the id sdtype, where the SDV generates entirely new values each time (compared to the categorical sdtype, where the SDV only uses values that already exist in your real data).
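The seeding behavior is easiest to see outside of the SDV. Below is a minimal sketch using the standard library's random.Random as a stand-in for the synthesizer's internal random state (this is not SDV code, just an illustration of the principle):

```python
import random

# Re-seeding on every run (analogous to re-running the whole script,
# including fit()) replays exactly the same stream of draws:
run_a = random.Random(12).sample(range(1000), 5)
run_b = random.Random(12).sample(range(1000), 5)
assert run_a == run_b  # identical "synthetic data" on every run

# Reusing one generator (analogous to fitting once, then calling sample()
# repeatedly in the same session) advances its state between calls:
rng = random.Random(12)
batch_1 = rng.sample(range(1000), 5)
batch_2 = rng.sample(range(1000), 5)
assert batch_1 != batch_2  # new draws on each call
```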

Hi @srinify.

Thank you for your help!

Regarding reproducibility: As you mentioned, if I load a synthesizer and re-run sample() multiple times within the same session, each execution will generate different data. However, if I load the synthesizer and execute sample() just once, it will always generate the same data.

Let me explain my use case: A cron job triggers my script to generate new data. The script loads the synthesizer from disk and calls sample(). In other words, every time the script is executed, the synthesizer is loaded, and sample() is called. This results in the same exact data being generated each time.

Interestingly, if I call sample() a second time within the same script execution, the data generated on the second call is different from the first. It seems like the random seed (or an equivalent system) is always set to the same value for the first sample() execution after loading the synthesizer.

Regarding your questions:

  1. I am looking for new values to follow the same hash patterns.
  2. I'll try to answer your second question by explaining more about my use case :)
    The reason I need the synthetic data to maintain patterns similar to the original dataset is that this data is used in an ETL process I am trying to simulate. Specifically:

  • Correlations Between Rows and Columns:
    Some hash ID values repeat in specific patterns within the dataset and are tied to relationships in other columns. For example, a transaction ID may repeat across multiple rows and needs to correlate with the same context ID and root context ID. Similarly, the number of rows sharing the same transaction ID affects how the transformation logic in the ETL process behaves.

  • Mirroring Frequencies:
    I would like the synthetic data to mirror the frequency of hash ID repetitions and their clustering in the original dataset. For example, if a transaction ID in the real data appears 10 times within a cluster, the synthetic data should also repeat some IDs within clusters of similar size and spread. This ensures the ETL process receives data with relationships and distributions that resemble the original.

  • Maintaining Relationships for ETL Testing:
    If the synthetic data is too random (e.g., completely unique or unrelated hashes), the transformation stage will not process it correctly because the relationships and dependencies between rows (based on shared hashes) will be broken.
Again, thank you so much for your help!

Hi @Ilevy80

Regarding reproducibility

If you load the same synthesizer object from disk and call sample() each time you generate synthetic data, then you will likely get the same synthetic data because the object's randomization seed never changes. I would recommend saving the synthesizer object back to disk after you run sample() so the randomization state is updated for the next time you load it from disk. This way, the randomization state is different each time you run sample(), and so is the synthetic data!
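The load-sample-save cycle can be sketched with just the standard library, using random.Random and pickle as stand-ins for the synthesizer object (the file name here is hypothetical; with the SDV itself, the analogous calls are the synthesizer's load() and save() methods):

```python
import os
import pickle
import random

STATE_PATH = 'rng_state.pkl'  # hypothetical path, like 'synthesizer.pkl'

def sample_batch(n=5):
    # Load the saved generator if one exists (a later cron run),
    # otherwise create a freshly seeded one (the first run).
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH, 'rb') as f:
            rng = pickle.load(f)
    else:
        rng = random.Random(12)

    values = rng.sample(range(1000), n)

    # Save the generator *after* sampling, so the advanced state is what
    # the next run loads -- making the next batch different from this one.
    with open(STATE_PATH, 'wb') as f:
        pickle.dump(rng, f)
    return values

run_1 = sample_batch()  # first cron invocation
run_2 = sample_batch()  # next invocation: loads the advanced state
assert run_1 != run_2   # different data on each run

os.remove(STATE_PATH)   # cleanup for this demo
```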

Regarding your use case

Out of curiosity, what's the motivation to use synthetic data here? Because you want the data used for the ETL simulation to closely resemble the real data while still getting new values for the TRANS_ID column, you might be better served by a pseudo-anonymization approach.

Using our RDT library, which is also what the SDV uses to pre-process your real data and post-process your synthetic data, you could replace specific columns (e.g. the TRANS_ID column) with new values while retaining the other columns (e.g. CONTEXT_ID and ROOT_CONTEXT_ID). With this approach, all of your other patterns & business rules will be adhered to because they're in the original data! If you have sensitive columns (e.g. social security numbers, addresses, etc.), you can assign specific transformers for those columns to get new values.
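To make the idea concrete, here is a standard-library sketch of pseudo-anonymization (not the RDT API itself; the column names and the 40-character hex format come from the discussion above, and the toy rows are made up):

```python
import secrets

# Toy rows standing in for the real CSV; note that TRANS_ID repeats.
rows = [
    {'TRANS_ID': 'AAA1', 'CONTEXT_ID': 'c1'},
    {'TRANS_ID': 'AAA1', 'CONTEXT_ID': 'c1'},
    {'TRANS_ID': 'BBB2', 'CONTEXT_ID': 'c2'},
]

def pseudo_anonymize(rows, column='TRANS_ID'):
    """Replace each distinct value in `column` with a fresh 40-char hex
    string, reusing the same replacement for repeats so the repetition
    structure (and its correlation with other columns) is preserved."""
    mapping = {}
    out = []
    for row in rows:
        original = row[column]
        if original not in mapping:
            mapping[original] = secrets.token_hex(20).upper()  # 40 hex chars
        out.append({**row, column: mapping[original]})
    return out

anonymized = pseudo_anonymize(rows)
assert anonymized[0]['TRANS_ID'] == anonymized[1]['TRANS_ID']  # repeats kept
assert anonymized[0]['TRANS_ID'] != 'AAA1'                     # new value
assert len(anonymized[2]['TRANS_ID']) == 40                    # hash-like
```

The other columns pass through untouched, so every cross-column relationship in the real data survives, which is the same property the RDT-based approach gives you.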

One reason to use a synthetic data approach, however, is if you want to increase the quantity of data available for your ETL simulation but still retain the patterns from your real data (e.g. real data has 1k rows but you want to use 20k rows for simulation). Let me know if that's the case!

Hi @Ilevy80, are you still working on this project? Did @srinify's comment about saving the synthesizer work for your randomness needs?

I'm closing this issue off since it has been inactive for a few weeks. But please feel free to reply below if there is more to discuss -- we can always re-open.