We will cover synthetic data generation using:
- LLM
- Synthetic data vault
- LLM (pre-trained large language model): Here we leverage the `distilGPT2` model and the `be_great` package to generate synthetic tabular data.
GReaT stands for Generation of Realistic Tabular data with pretrained Transformer-based language models. According to its authors, it performs better than CopulaGAN and CTGAN.
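To make the idea concrete, here is a minimal sketch of how a table row can be turned into text for a language model, followed by a typical `be_great` call. The helpers `encode_row`, `decode_row`, and `generate_with_great` are hypothetical names introduced for illustration, and the "column is value" encoding is a simplified version of the textual encoding described by the GReaT authors, not the library's actual code. The `GReaT(llm=..., epochs=..., batch_size=...)`, `fit`, and `sample(n_samples=...)` calls follow the package README; the epoch and batch-size values are assumptions.

```python
import random


def encode_row(row, rng):
    """Serialize one table row as text, e.g. 'age is 39, income is 52000'."""
    items = list(row.items())
    rng.shuffle(items)  # random feature order, so no column order is baked in
    return ", ".join(f"{col} is {val}" for col, val in items)


def decode_row(text):
    """Parse 'col is value' clauses back into a dict of strings.

    Simplification: assumes values contain neither ', ' nor ' is '.
    """
    row = {}
    for clause in text.split(", "):
        col, _, val = clause.partition(" is ")
        row[col] = val
    return row


def generate_with_great(df, n_samples=100, epochs=50, batch_size=32):
    """Fine-tune distilGPT2 on a pandas DataFrame and sample synthetic rows."""
    # Imported lazily so the helpers above work without the package installed.
    from be_great import GReaT  # requires `pip install be-great`

    model = GReaT(llm="distilgpt2", epochs=epochs, batch_size=batch_size)
    model.fit(df)
    return model.sample(n_samples=n_samples)


# Illustrative row (made-up values)
row = {"age": 39, "income": 52000, "city": "Pune"}
sentence = encode_row(row, random.Random(0))
print(sentence)
print(decode_row(sentence))
```

Calling `generate_with_great(df)` on a real DataFrame returns a new DataFrame of synthetic rows. Fine-tuning can be slow without a GPU, so a small number of epochs is a reasonable starting point for a first run.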
- Synthetic data vault: This library provides options to generate synthetic data using multiple algorithms.
2.1 GaussianCopula: Helps model complex relationships between variables
2.2 CTGAN (Conditional Tabular Generative Adversarial Network)
2.3 PAR (Probabilistic AutoRegressive model, for time-series or sequence data)
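The single-table algorithms above can be swapped behind one small helper. This is a sketch under assumptions: it targets the SDV 1.x API (`SingleTableMetadata`, `GaussianCopulaSynthesizer`, `CTGANSynthesizer`), which may differ in other releases, and `SYNTHESIZERS`, `load_synthesizer_class`, and `generate_with_sdv` are hypothetical names introduced here.

```python
import importlib

# Assumed SDV 1.x module paths for the single-table synthesizers.
SYNTHESIZERS = {
    "gaussian_copula": ("sdv.single_table", "GaussianCopulaSynthesizer"),
    "ctgan": ("sdv.single_table", "CTGANSynthesizer"),
}
# PAR is not listed here: in SDV 1.x it is sdv.sequential.PARSynthesizer,
# needs a sequence key in the metadata, and is sampled with
# sample(num_sequences=...) rather than sample(num_rows=...).


def load_synthesizer_class(name):
    """Return the synthesizer class registered under `name`."""
    if name not in SYNTHESIZERS:
        raise ValueError(f"Unknown synthesizer: {name!r}")
    module_name, class_name = SYNTHESIZERS[name]
    module = importlib.import_module(module_name)  # requires `pip install sdv`
    return getattr(module, class_name)


def generate_with_sdv(df, name="gaussian_copula", num_rows=100):
    """Fit the chosen synthesizer on a pandas DataFrame and sample new rows."""
    from sdv.metadata import SingleTableMetadata

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(df)  # infer column types from the data

    synthesizer = load_synthesizer_class(name)(metadata)
    synthesizer.fit(df)
    return synthesizer.sample(num_rows=num_rows)
```

For example, `generate_with_sdv(df, name="ctgan", num_rows=500)` would train CTGAN on `df` and return 500 synthetic rows. GaussianCopula usually trains much faster than CTGAN, which makes it a convenient baseline.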