The script to get data for the blank classifier is through get_blank_classifier_data.py
- This script runs in a python 2 environment present in the Master Server
Command:
sudo python get_blank_classifier_data.py --thresh=500 --email_ids=aman.khullar@oniondev.com --start_date=2019-01-01 --end_date=2019-05-31 --output=/tmp/blank_classifier.xlsx
-
Thresh is the number of items in the training data for each class. In the example above there will be 500 items for Accept class, 500 items for the Blank class
-
Here the Blank class is '0' and Accept class is '1'
-
The data generated from the above command is stored in the 'data' folder by the name
blank_classifier.xlsx
-
With respect to number of data points that should be used to train the classifier, there is no fixed rule but with around 17,000 items, we got a test set accuracy of 98% so probably using around 15,000 in the training set is a good idea.
source activate <environment_name>
NOTE: The requirements.txt has been created to install the dependencies
Download the audios of only the clean data as produced in the previous step and then convert them into wav files which are easier for feature extraction.
Download audio files:
python download_audios.py --in_file="<in_file>.xlsx" --out_folder_mp3="<folder name>"
- Time taken to download 100 audios : 30.27.30 seconds (this step also depends on the stability of internet connection)
- Time taken to download 6000 audios : 1541.53 seconds
Hardware specification: Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, 16GB RAM
Convert mp3 files to wav files:
python convert_to_wav.py --in_folder="<folder created in previous step>" --out_folder="<folder containing wav audios>"
- Time taken to Convert the audios to Wave : 9.36 seconds
- Time taken to Convert 6000 (actually 5964) audios to Wave : 833.69 seconds
The 136 features are extracted for all the audio files using this script. Corresponding feature files in the form of numpy arrays are saved. The audio features are extracted using PyAudioAnalysis library. 4 values (25th percentile, 50th percentile, 75th percentile and 95th percentile) from the 34 features are extracted to get the 136 (34*4) features.
- We use a 50ms framesize with a 25ms stepsize
- Hence, for example if we have an audio of 100ms, there will be 3 frames and for a 1min audio there will be 2399 frames
- There will be 34 features for each frame
- Hence the feature matrix size will be 34*2399 for a 1 minute audio
- After that we take the 25th percentile, 50th percentile, 75th percentile and 95th percentile for each feature across all the frames
- We finally get a 34*4=136 feature vector for any audio file
Extract and store audio features - in both npy files and csv format:
python get_features.py --in_file=data/blank_classifier.xlsx --auds_dir=audio_files_wav/ --out_file=data/blank_classifier_feats.csv --feats_dir=feats_dir
- Time taken to extract features of 50 audio files : 371.02 seconds and for 100 audio files : 741.04 seconds
- Time taken to extract features from 6000 audio files: approx 6 hours
python create_train_test_splits.py --feats_csv=data/blank_classifier_feats.csv --train_csv=data/blank_classifier_feats_train.csv --test_csv=data/blank_classifier_feats_test.csv
This script randomizes the data and keeps 80% of data in the training set and 20% of data in the test set
We are using a Random Forest classifier to label an audio into the 'Blank' vs. 'Accept' class.
python train_blank_classifier.py --feats_file=data/blank_classifier_feats_train.csv --model_name=blank_classifier_31_05_2021.pkl --scaler_name=blank_classifier_scaler.pkl
Get the Confusion Matrix and accuracy of the new model. You should compare this with the previous model on the same dataset and then decide if the new model should be used.
python test_blank_classifier.py --feats_file=data/blank_classifier_feats_test.csv --model_name=models/blank_classifier_31_05_2021.pkl --scaler_name=models/blank_classifier_scaler.pkl