- Python >= 3.7
- NumPy >= 1.17.4
- PyTorch >= 1.13.0
- tqdm >= 4.40.2
import pandas as pd
from BERTable import BERTable
# Read dataset (no header row)
df = pd.read_csv('dataset.csv', header=None)
# One entry per column; the list is truncated here
column_type = ['numerical', 'categorical', 'numerical', 'numerical', 'categorical', ...]
# BERTable takes the data as a list of rows
df = df.values.tolist()
# Initialization
bertable = BERTable(
    df, column_type,
    embedding_dim=5, n_layers=5, dim_feedforward=100, n_head=5,
    dropout=0.15, ns_exponent=0.75, share_category=False, use_pos=False)
# Start self-supervised pretraining
bertable.fit(
    df,
    max_epochs=3, lr=1e-4,
    lr_weight={'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33},
    loss_clip=[0, 100],
    n_sample=5, mask_rate=0.15, replace_rate=0.8,
    batch_size=256, shuffle=True, num_workers=10)
# Feature extraction
df_t = bertable.transform(df, batch_size=256, num_workers=10)
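Writing `column_type` by hand gets tedious for wide tables. The sketch below guesses a type per column from row-major data such as the output of `df.values.tolist()`. `infer_column_types` is a hypothetical helper and not part of BERTable. It assumes that 'vector' columns hold lists or tuples and that every non-numeric column is categorical; integer-coded categories will be guessed as 'numerical', so review the result before using it.

```python
from numbers import Real


def infer_column_types(rows):
    """Guess a column_type list from row-major data.

    Hypothetical helper, not part of BERTable.
    """
    if not rows:
        return []
    types = []
    for column in zip(*rows):  # transpose rows into columns
        if all(isinstance(v, (list, tuple)) for v in column):
            # Assumption: a 'vector' column stores a list or tuple per cell
            types.append('vector')
        elif all(isinstance(v, Real) and not isinstance(v, bool) for v in column):
            types.append('numerical')
        else:
            types.append('categorical')
    return types


rows = [
    [1.5, 'red', 3, [0.1, 0.2]],
    [2.0, 'blue', 7, [0.3, 0.4]],
]
column_type = infer_column_types(rows)
print(column_type)
```

The guessed list can then be edited by hand, for example to mark an integer ID column as 'categorical'.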
- df (list, required)
The data used for training.
- column_type (list, required)
The type of each column: 'numerical', 'categorical' or 'vector'.
- embedding_dim (int, default: 5)
Embedding dimension.
- n_layers (int, default: 5)
Number of transformer encoder layers.
- dim_feedforward (int, default: 100)
Dimension of the feedforward network in each transformer encoder layer.
- n_head (int, default: 5)
The number of heads in the multi-head attention layers.
- dropout (float, default: 0.15)
The dropout value.
- ns_exponent (float, default: 0.75)
The exponent used to shape the negative sampling distribution.
- share_category (bool, default: False)
If True, categorical values with the same name in different columns are treated as the same object.
- use_pos (bool, default: False)
Whether or not to add positional embedding.
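To illustrate what `ns_exponent` controls, here is a minimal sketch of a negative sampling distribution. It assumes the word2vec convention, where a category is drawn with probability proportional to its count raised to the exponent; the source does not confirm that BERTable computes it exactly this way, and `negative_sampling_probs` is a hypothetical name.

```python
from collections import Counter


def negative_sampling_probs(values, ns_exponent=0.75):
    """Return P(v) proportional to count(v) ** ns_exponent.

    Hypothetical helper following the word2vec convention (an assumption).
    """
    counts = Counter(values)
    weights = {v: c ** ns_exponent for v, c in counts.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


# A skewed categorical column: 'a' is eight times as common as 'b' or 'c'
column = ['a'] * 8 + ['b'] + ['c']
for exponent in (1.0, 0.75, 0.0):
    probs = negative_sampling_probs(column, exponent)
    print(exponent, {v: round(p, 3) for v, p in probs.items()})
```

Under this convention an exponent of 1.0 samples in proportion to raw frequency, 0.0 samples all categories uniformly, and values in between flatten the distribution so that rare categories are drawn more often than their frequency alone would suggest.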
- df (list, required)
The data used for training.
- max_epochs (int, default: 3)
Number of epochs to train.
- lr (float, default: 1e-4)
Learning rate for the optimizer.
- lr_weight (dict, default: {'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33})
Learning rate weight for each data type.
- loss_clip (list, default: [0, 100])
Loss clipping for numerical data.
- n_sample (int, default: 4)
Number of negative samples to use.
- mask_rate (float, default: 0.15)
The masking probability.
- replace_rate (float, default: 0.8)
The replacement probability.
- batch_size (int, default: 32)
The batch size.
- shuffle (bool, default: True)
Whether or not to shuffle data.
- num_workers (int, default: 1)
Number of workers.
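One plausible reading of how `mask_rate` and `replace_rate` interact is sketched below, following the BERT-style convention in a simplified form. It assumes that each cell is selected as a prediction target with probability `mask_rate`, and that a selected cell is swapped for a mask placeholder with probability `replace_rate` and otherwise left unchanged. The source does not spell out these semantics, so `mask_row` and the `MASK` token are hypothetical and only illustrate the idea.

```python
import random

MASK = '<MASK>'  # hypothetical placeholder, not a BERTable constant


def mask_row(row, mask_rate=0.15, replace_rate=0.8, rng=None):
    """Return (masked_row, is_target) for one row of values.

    Assumed semantics: a cell becomes a target with probability mask_rate;
    a target cell is replaced by MASK with probability replace_rate.
    """
    rng = rng or random.Random()
    masked, is_target = [], []
    for value in row:
        selected = rng.random() < mask_rate
        is_target.append(selected)
        if selected and rng.random() < replace_rate:
            masked.append(MASK)
        else:
            masked.append(value)  # unselected, or selected but kept as-is
    return masked, is_target


row = [1.5, 'red', 3, 0.2, 'small']
masked, is_target = mask_row(row, rng=random.Random(0))
print(masked)
print(is_target)
```

Under these assumptions, the training loss would be computed only on cells where `is_target` is True.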
Check the exp folder for the detailed implementation of the experiments.