Kaggle VSB Power Line Fault Detection Baseline
Design Ideas
Code Structure
- Keep it efficient and simple
- Use the PyTorch dataloader to do feature extraction. It should support either a typical feature-engineering workflow or a deep learning workflow.
- Adding a new feature extractor only requires inheriting from AbstractExtractor and writing your own process_fn and extract_fn. Before computation, the feature extractor uses the first signal to infer the feature dimension, so you don't need to specify it in advance. Remember to update the feature name, because the feature extractor automatically caches features for you.
- Reproducibility
- Use JSON to configure the parameters of the pipeline. It helps you document your experiments.
- Use a random seed to control data augmentation so that results are reproducible.
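The extractor contract described above can be sketched roughly as follows. This is not the repository's actual class: the constructor arguments, the run method, the .npy cache layout, and the MeanStdExtractor subclass are all assumptions made for illustration, and NumPy stands in for the PyTorch dataloader to keep the example small.

```python
import os
import tempfile
from abc import ABC, abstractmethod

import numpy as np


class AbstractExtractor(ABC):
    """Hypothetical base class; the real signatures in the repo may differ."""

    def __init__(self, feature_name, cache_dir):
        self.feature_name = feature_name  # assumed to double as the cache key
        self.cache_dir = cache_dir

    @abstractmethod
    def process_fn(self, signal):
        """Clean up one raw signal (filtering, de-noising, ...)."""

    @abstractmethod
    def extract_fn(self, processed):
        """Turn one processed signal into a 1-D feature vector."""

    def run(self, signals):
        path = os.path.join(self.cache_dir, self.feature_name + ".npy")
        if os.path.exists(path):
            return np.load(path)  # reuse features cached under this name
        # Infer the feature dimension from the first signal.
        first = np.asarray(self.extract_fn(self.process_fn(signals[0])), dtype=np.float64)
        features = np.empty((len(signals), first.shape[0]))
        features[0] = first
        for i in range(1, len(signals)):
            features[i] = self.extract_fn(self.process_fn(signals[i]))
        np.save(path, features)
        return features


class MeanStdExtractor(AbstractExtractor):
    """Toy extractor: remove the mean, then report spread and peak amplitude."""

    def process_fn(self, signal):
        signal = np.asarray(signal, dtype=np.float64)
        return signal - signal.mean()

    def extract_fn(self, processed):
        return np.array([processed.std(), np.abs(processed).max()])


# Signals may have different lengths; only the feature dimension is fixed.
signals = [np.array([1.0, 2.0, 3.0]), np.array([0.0, 4.0])]
cache_dir = tempfile.mkdtemp()
extractor = MeanStdExtractor("std_peak_v1", cache_dir)
features = extractor.run(signals)
cached = extractor.run(signals)  # second call is served from the cache file
```

Because the cache in this sketch is keyed only by the feature name, changing extract_fn without renaming the feature would silently return stale values, which is why the feature name needs updating.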
Signal Processing
- Remove the sine pattern using a high-pass filter. Based on [1], the fault patterns exist in the high-frequency region, so I removed frequencies under 1000 Hz.
- Apply wavelet de-noising as mentioned in [1] and [2]. Afterwards, peak statistics are computed.
Note: I did not do the sine synchronization. The phase shift is instead used as a form of augmentation.
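A minimal sketch of the high-pass step and the seeded phase-shift augmentation follows. The filter type used in the repository is not stated here, so an FFT mask is swapped in as a simple stand-in. The function names, the toy sample rate, and the use of a circular shift for the phase shift are assumptions. Wavelet de-noising is left out.

```python
import numpy as np


def high_pass_fft(signal, sample_rate_hz, cutoff_hz=1000.0):
    """Zero every frequency bin below cutoff_hz (FFT mask, not the repo's filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def random_phase_shift(signal, rng):
    """Circularly shift the signal by a random offset drawn from rng."""
    offset = int(rng.integers(0, len(signal)))
    return np.roll(signal, offset)


# Toy signal: one 20 ms cycle of a 50 Hz sine plus a weak 5 kHz component.
sample_rate_hz = 400_000  # assumed toy value, not the real sampling rate
t = np.arange(8_000) / sample_rate_hz
mains = 20.0 * np.sin(2 * np.pi * 50 * t)
burst = 0.5 * np.sin(2 * np.pi * 5_000 * t)
filtered = high_pass_fft(mains + burst, sample_rate_hz)

# The same seed gives the same shift, so augmented data is reproducible.
shifted_a = random_phase_shift(filtered, np.random.default_rng(42))
shifted_b = random_phase_shift(filtered, np.random.default_rng(42))
```

In this toy case both tones fall exactly on FFT bins, so the filtered signal matches the 5 kHz component. Real recordings will show some leakage around the cutoff.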
How to run
Full training
- Download and unzip the data into the "input" folder. You should have five files: metadata_test.csv, sample_submission.csv, train.parquet, metadata_train.csv and test.parquet.
- Make sure you use Python 3, install PyTorch by following the instructions on its website, and run pip install -r requirements.txt.
- To train and generate submission files, go to the "src" folder and run python module_train_submit.py --config configs/basic.json.
- Find the generated features in the "caching" folder and the submission file in the "result" folder.
Only generate feature
Run python module_extract_feature --config configs/basic.json --mode=train
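For readers wondering how the --config and --mode flags might be consumed, here is a rough sketch using argparse and json from the standard library. The config keys (seed, cutoff_hz), the "test" mode value, and the helper names are hypothetical. The real contents of configs/basic.json are not shown in this README.

```python
import argparse
import json
import os
import random
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Feature extraction (sketch)")
    parser.add_argument("--config", required=True, help="path to a JSON config")
    # "train" appears in the README; "test" is an assumed second mode.
    parser.add_argument("--mode", choices=["train", "test"], default="train")
    return parser.parse_args(argv)


def load_config(path):
    with open(path) as f:
        return json.load(f)


# Write a throwaway config so the sketch runs on its own.
config_path = os.path.join(tempfile.mkdtemp(), "basic.json")
with open(config_path, "w") as f:
    json.dump({"seed": 42, "cutoff_hz": 1000}, f)  # hypothetical keys

args = parse_args(["--config", config_path, "--mode=train"])
config = load_config(args.config)

# Seeding from the config keeps augmentation reproducible across runs.
random.seed(config["seed"])
first_draw = random.random()
random.seed(config["seed"])
second_draw = random.random()
```

Keeping the seed in the JSON file means each experiment's config documents everything needed to rerun it.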
Debug mode
Set the IS_DEBUG flag in common.py to True. It will use the first 100 signals to validate the whole pipeline.
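The effect of the flag can be pictured with a small sketch. Only IS_DEBUG and the count of 100 come from this README. The constant DEBUG_N_SIGNALS and the helper select_signal_ids are hypothetical names.

```python
IS_DEBUG = True        # in the repository this flag lives in common.py
DEBUG_N_SIGNALS = 100  # hypothetical constant name; the count is from the README


def select_signal_ids(signal_ids, is_debug=IS_DEBUG, limit=DEBUG_N_SIGNALS):
    """Return only the first few signal ids when debugging, else all of them."""
    return signal_ids[:limit] if is_debug else signal_ids


all_ids = list(range(1000))  # toy id list
debug_ids = select_signal_ids(all_ids)
full_ids = select_signal_ids(all_ids, is_debug=False)
```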
Observations
- Performance
- Running the whole pipeline on an 8-core machine with multiprocessing still takes 3 hours. This is expected, as I did not assume the signals have the same length, so some optimizations cannot be applied. Also, the computation is CPU-based and does not leverage GPUs.
- Inaccurate cross validation score
- With Group KFold, the cross-validation score is still much higher than the LB score. This could be related to feature skew in the test data or to label skew. Proper skew detection could be done here to give more confident results.
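For context, a grouped split keeps all signals from one group in the same fold. The sketch below is a hand-rolled stand-in for scikit-learn's GroupKFold, not the code used in this repository. Grouping by id_measurement (so the three phases of one measurement stay together) is an assumption, as are the function name and the shuffling.

```python
import numpy as np


def group_k_fold(groups, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs where no group spans both sides."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_groups)  # seeded shuffle keeps the folds reproducible
    for fold_groups in np.array_split(unique_groups, n_splits):
        val_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


# Toy metadata: 6 measurements, each with 3 phases (18 signals in total).
id_measurement = np.repeat(np.arange(6), 3)
folds = list(group_k_fold(id_measurement, n_splits=3, seed=42))
```

A grouped split only prevents leakage between phases of the same measurement. It does not correct for a train/test distribution shift, which is consistent with the remaining gap to the LB score noted above.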
Plans
- Build a better validation scheme and add skew detection.
- Continue to push the limits of feature engineering.
- Emulate wavelet signal processing structure to design a 1D CNN.
Reference
[1] A Complex Classification Approach of Partial Discharges from Covered Conductors in Real Environment