This work has been accepted at the NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
- Vehicle Claim - Synthetic dataset created using DVI dataset.
- Car Insurance - Kaggle(https://www.kaggle.com/datasets/buntyshah/auto-insurance-claims-data)
- Vehicle Insurance - Github(https://github.com/AnalyticsandDataOracleUserCommunity/MachineLearning)
The code to create the dataset is available here.
The dataset used in the paper is available here.
- Maker - Categorical - The brand of the vehicle.
- GenModel - Categorical - The model of the vehicle.
- Color - Categorical - Colour of the vehicle.
- Reg_Year - Categorical - Year of registration.
- Body_Type - Categorical - E.g. SUV, Convertible.
- Runned_Miles - Numerical - Distance covered by the vehicle.
- Engin_Size - Categorical - Size of engine.
- GearBox - Categorical - Automatic, Manual.
- FuelType - Categorical - Petrol, Diesel.
- Price - Numerical - Price of vehicle.
- Seat_num - Numerical - Number of seats.
- Door_num - Numerical - Number of doors.
- issue - Categorical - Type of damage.
- issue_id - Categorical - Specific damage.
- repair_complexity - Categorical - Difficulty to repair the vehicle.
- repair_hours - Numerical - Time required to finish the job.
- repair_cost - Numerical - Cost of repair.
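For preprocessing, it can help to group the attributes above by type. The following is a minimal sketch only: the column names and their types come from the list above, while the names `CATEGORICAL`, `NUMERICAL` and `split_record` and the sample row are hypothetical and not taken from the repository.

```python
# Column names and types as listed above; the variable names are illustrative.
CATEGORICAL = [
    "Maker", "GenModel", "Color", "Reg_Year", "Body_Type", "Engin_Size",
    "GearBox", "FuelType", "issue", "issue_id", "repair_complexity",
]
NUMERICAL = [
    "Runned_Miles", "Price", "Seat_num", "Door_num",
    "repair_hours", "repair_cost",
]


def split_record(record):
    """Split one row (a dict) into its categorical and numerical parts."""
    categorical = {k: record[k] for k in CATEGORICAL if k in record}
    numerical = {k: float(record[k]) for k in NUMERICAL if k in record}
    return categorical, numerical


# Hypothetical row with made-up values, for demonstration only.
row = {"Maker": "Ford", "GearBox": "Manual", "Price": "10500", "Seat_num": 5}
cat_part, num_part = split_record(row)
print(cat_part)
print(num_part)
```

Keeping the two groups separate makes it straightforward to encode the categorical part and scale the numerical part independently.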
Other attributes are not used for evaluation in this work. The breakdown_date and repair_date attributes were added with the idea of inserting anomalies based on the number of days required to repair the vehicle.
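The number of days required for a repair can be derived from these two attributes. The sketch below assumes both dates are ISO-formatted strings (`YYYY-MM-DD`), which may not match the actual file format. The function names and the 30-day limit are hypothetical and chosen only for illustration.

```python
from datetime import date


def repair_days(breakdown_date, repair_date):
    """Days between breakdown and completed repair (ISO date strings assumed)."""
    start = date.fromisoformat(breakdown_date)
    end = date.fromisoformat(repair_date)
    return (end - start).days


def is_suspicious(breakdown_date, repair_date, max_days=30):
    """Flag claims whose duration is negative or longer than max_days.

    max_days is an arbitrary illustrative limit, not a value from the paper.
    """
    days = repair_days(breakdown_date, repair_date)
    return days < 0 or days > max_days


print(repair_days("2022-03-01", "2022-03-08"))
print(is_suspicious("2022-03-01", "2022-06-01"))
```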
python train.py [-h] [--dataset DATASET] [--data DATA] [--embedding EMBEDDING] [--encoding ENCODING] [--model MODEL] [--numerical NUMERICAL] [--batch_size BATCH_SIZE] [--latent_dim LATENT_DIM] [--num_mixtures NUM_MIXTURES] [--dim_embed DIM_EMBED] [--rsr_dim RSR_DIM] [--epoch EPOCH]
- dataset - Dataset for training ('vehicle_claims', 'car_insurance', 'vehicle_insurance')
- data - Train on normal data only or on mixed data (True = normal data only)
- embedding - Embedding layer if needed (DEFAULT = False)
- encoding - Encoding for categorical features (DEFAULT = 'label_encode' | alternatives: 'one_hot', 'gel_encode')
- numerical - Use only numerical features if True (DEFAULT = False)
- batch_size - (DEFAULT = 32)
- epoch - (DEFAULT = 1)
- latent_dim - Dimension of latent space in autoencoder (DEFAULT = 2)
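To illustrate the difference between the 'label_encode' and 'one_hot' options, here is a generic sketch of the two encodings in plain Python. It is not the repository's implementation, the function names are hypothetical, and 'gel_encode' is not shown.

```python
def label_encode(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each value to a 0/1 vector with one position per distinct category."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if i == code else 0 for i in range(width)] for code in codes]


gearbox = ["Manual", "Automatic", "Manual"]
codes, mapping = label_encode(gearbox)
vectors = one_hot_encode(gearbox)
print(codes)
print(vectors)
```

Label encoding keeps one column per feature but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one column per category.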
DAGMM
- num_mixtures - Number of Gaussian mixture components (DEFAULT = 2)
- dim_embed - Dimension of input to estimation network (DEFAULT = 4 | General case = [latent_dim + 2])
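The general case of latent_dim + 2 is consistent with the DAGMM design, in which the estimation network receives the latent code concatenated with two reconstruction features: relative Euclidean distance and cosine similarity. The sketch below assumes exactly these two features are used. The function name is hypothetical and the repository's implementation may differ.

```python
import math


def estimation_input(x, x_hat, latent):
    """Concatenate the latent code with two reconstruction features."""
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    x_norm = math.sqrt(sum(a * a for a in x))
    x_hat_norm = math.sqrt(sum(b * b for b in x_hat))
    eps = 1e-12  # avoids division by zero for all-zero vectors
    relative_euclidean = diff_norm / (x_norm + eps)
    cosine = sum(a * b for a, b in zip(x, x_hat)) / (x_norm * x_hat_norm + eps)
    return list(latent) + [relative_euclidean, cosine]


# latent_dim = 2, so the estimation network input has 2 + 2 = 4 entries.
z = estimation_input(x=[3.0, 4.0], x_hat=[3.0, 4.0], latent=[0.1, -0.2])
print(len(z))
```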
RSRAE
- rsr_dim - Dimension of RSR layer (DEFAULT = 10 | Should be less than latent_dim)
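As a usage illustration, a training command can be assembled from the options documented above. The helper `build_train_command` is hypothetical. The sketch uses only flags and values that appear in this README and leaves out `--model`, since its accepted values are not listed here.

```python
import shlex


def build_train_command(**options):
    """Turn keyword options into a `python train.py --key value ...` string."""
    parts = ["python", "train.py"]
    for key, value in options.items():
        parts += [f"--{key}", str(value)]
    return shlex.join(parts)


cmd = build_train_command(
    dataset="vehicle_claims",
    encoding="one_hot",
    batch_size=32,
    epoch=1,
    latent_dim=2,
)
print(cmd)
```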
python eval.py [-h] [--dataset DATASET] [--data DATA] [--embedding EMBEDDING] [--encoding ENCODING] [--model MODEL] [--numerical NUMERICAL] [--batch_size BATCH_SIZE] [--latent_dim LATENT_DIM] [--num_mixtures NUM_MIXTURES] [--dim_embed DIM_EMBED] [--rsr_dim RSR_DIM] [--epoch EPOCH] [--threshold THRESHOLD]
python train_som.py [-h] [--dataset DATASET] [--embedding EMBEDDING] [--encoding ENCODING] [--numerical NUMERICAL] [--somsize SOMSIZE] [--somlr SOMLR] [--somsigma SOMSIGMA] [--somiter SOMITER] [--mode MODE] [--threshold THRESHOLD]
- somsize - Size of Self Organizing Map
- somlr - Learning Rate
- somsigma - Sigma for neighbourhood function
- somiter - Number of iterations of SOM
- mode - train or eval (DEFAULT = 'train')
- threshold - (DEFAULT = 50 | Only in eval mode)
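One common way to score anomalies with a SOM is the distance from each sample to its best-matching unit (BMU), followed by a cutoff on that distance. The sketch below assumes the threshold is a percentile over these distances. That interpretation is a guess, since the README does not say how `train_som.py` uses the value. The codebook, samples and function names are hypothetical.

```python
import math


def bmu_distance(sample, codebook):
    """Distance from a sample to its best-matching unit (closest SOM weight)."""
    return min(math.dist(sample, weight) for weight in codebook)


def flag_anomalies(samples, codebook, percentile=50):
    """Flag samples whose BMU distance exceeds the given percentile cutoff."""
    distances = [bmu_distance(s, codebook) for s in samples]
    ranked = sorted(distances)
    # Nearest-rank percentile: the value at position ceil(p / 100 * n).
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    cutoff = ranked[rank - 1]
    return [d > cutoff for d in distances]


# Made-up SOM weights and samples, for demonstration only.
codebook = [(0.0, 0.0), (1.0, 1.0)]
samples = [(0.0, 0.1), (1.0, 0.9), (0.1, 0.0), (5.0, 5.0)]
flags = flag_anomalies(samples, codebook, percentile=75)
print(flags)
```

Only the last sample lies far from every weight vector, so it is the only one flagged under this assumed rule.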
- DVI dataset - https://deepvisualmarketing.github.io/
- RSRAE - https://github.com/marrrcin/rsrlayer-pytorch
- DAGMM - https://github.com/RomainSabathe/dagmm
- SOM - https://github.com/JustGlowing/minisom
- NeuTraL-AD - https://github.com/boschresearch/NeuTraL-AD
- LOE - https://github.com/boschresearch/LatentOE-AD
Please consider citing our work if you find this repository helpful.
@article{
Author = {Ajay Chawda and Stefanie Grimm and Marius Kloft},
Title = {Unsupervised Anomaly detection for Auditing Data and Impact of Categorical Encodings},
Journal = {https://arxiv.org/abs/2210.14056},
Year = {2022},
}
This work is licensed under a Creative Commons Attribution 4.0 International License.