/be_great

A novel approach for synthesizing tabular data using pretrained large language models



Generation of Realistic Tabular data
with pretrained Transformer-based language models

     

The GReaT framework uses pretrained Transformer-based large language models to synthesize realistic tabular data. New samples can be generated with a few lines of code through a small API. Please see our publication for more details.
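To give an intuition for how a language model can consume a table at all, the sketch below serializes a row into a sentence of "column is value" clauses in random column order, in the spirit of the textual encoding described in the publication. It is an illustration only, not the library's internal code: `encode_row` and `decode_row` are hypothetical helper names, and the sample values are made up.

```python
import random


def encode_row(row, rng=None):
    """Serialize one table row as "<column> is <value>" clauses joined by ", ".

    Hypothetical helper for illustration; not part of the be_great API.
    """
    rng = rng or random.Random()
    items = list(row.items())
    rng.shuffle(items)  # shuffle so no fixed column order is baked into the text
    return ", ".join(f"{name} is {value}" for name, value in items)


def decode_row(text):
    """Parse a sentence produced by encode_row back into a dict of strings."""
    row = {}
    for clause in text.split(", "):
        name, _, value = clause.partition(" is ")
        row[name] = value
    return row


# Made-up values using California Housing column names.
row = {"MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.98}
sentence = encode_row(row, rng=random.Random(0))
print(sentence)
print(decode_row(sentence))
```

This toy parser assumes that values never contain ", " or " is "; a real implementation would need to handle such cases and convert the decoded strings back to their column types.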

GReaT Installation

The GReaT framework can be installed with pip. It requires Python 3.9 or later:

pip install be-great

GReaT Quickstart

In the example below, we show how the GReaT approach is used to generate synthetic tabular data for the California Housing dataset.

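from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune the pretrained distilgpt2 model on the table
model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
model.fit(data)

# Draw 100 synthetic rows from the fine-tuned model
synthetic_data = model.sample(n_samples=100)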
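After sampling, a quick sanity check is to compare simple per-column statistics of the real and synthetic tables. The sketch below is one possible way to do this with the standard library only. `compare_column_means` is a hypothetical helper, not part of be_great, and the rows are tiny hand-made stand-ins rather than actual model output.

```python
from statistics import mean


def compare_column_means(real_rows, synthetic_rows):
    """Return {column: (real_mean, synthetic_mean)} for columns present in both.

    Hypothetical helper; assumes each row is a dict of numeric values.
    """
    shared = set(real_rows[0]) & set(synthetic_rows[0])
    report = {}
    for col in sorted(shared):
        real_mean = mean(r[col] for r in real_rows)
        synth_mean = mean(r[col] for r in synthetic_rows)
        report[col] = (real_mean, synth_mean)
    return report


# Hand-made stand-in rows; with pandas, DataFrame.to_dict("records")
# produces a list of dicts in this shape.
real_rows = [
    {"HouseAge": 40.0, "AveRooms": 6.0},
    {"HouseAge": 20.0, "AveRooms": 4.0},
]
synthetic_rows = [
    {"HouseAge": 35.0, "AveRooms": 5.5},
    {"HouseAge": 27.0, "AveRooms": 4.5},
]

report = compare_column_means(real_rows, synthetic_rows)
for col, (real_mean, synth_mean) in report.items():
    print(f"{col}: real={real_mean:.2f} synthetic={synth_mean:.2f}")
```

Matching means are only a coarse signal. Comparing distributions and correlations between columns gives a fuller picture of how realistic the samples are.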


GReaT Citation

If you use GReaT, please link or cite our work:

@inproceedings{borisov2023language,
  title={Language Models are Realistic Tabular Data Generators},
  author={Vadim Borisov and Kathrin Sessler and Tobias Leemann and Martin Pawelczyk and Gjergji Kasneci},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=cEygmQNOeI}
}

GReaT Acknowledgements

We sincerely thank the HuggingFace 🤗 framework.