ONNX accuracy discrepancy with respect to PyTorch and calibration error
andreistoian opened this issue · 6 comments
Hello,
I'm trying to compile a sound classification network that uses 1D convolutions with the CPP export. With PyTorch I get 81% accuracy on a small subset of the data, on which I also test and calibrate the N2D2 export.
- When exporting the network in float32 using N2D2, I obtain 67% accuracy when running "run_export". Running the export gives a log ending in:
232.00/343 (67.64%)
233.00/344 (67.73%)
233.00/345 (67.54%)
234.00/346 (67.63%)
235.00/347 (67.72%)
236.00/348 (67.82%)
236.00/349 (67.62%)
Score: 67.62%
- I tried calibrating the network in int8 but I get an assert in DeepNetQuantization.cpp:948
assert(scaling <= 1.0);
It seems the scaling has the value 248.938 (as do all 8 elements of mScalingPerOutput of the RectifierActivation class).
The input WAV files contain floating-point values between -1 and 1 (mostly in the -0.05 to 0.05 range), loaded from FLOAT32 WAV files using the code given in #77.
Here is the code that exports the dataset, computes accuracy for the pytorch model and exports the ONNX.
sound_demo.zip
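For reference, the pipeline is roughly as follows (a minimal sketch, not the exact attached code; soundfile as the WAV reader and the stand-in network below are my own simplifications):

import soundfile as sf   # assumption: any FLOAT32-capable WAV reader works here
import torch
import torch.nn as nn

# Read a FLOAT32 WAV as floats in [-1, 1] (most samples are within [-0.05, 0.05]).
samples, rate = sf.read("example.wav", dtype="float32")
x = torch.from_numpy(samples).view(1, 1, -1)   # (batch, channels, samples) for Conv1d

# Stand-in for the real network (the actual one is in the zip): a tiny 1D-conv classifier.
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, stride=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 12),          # 12 output classes
)
model.eval()

# Export the model to ONNX so it can be loaded by N2D2.
torch.onnx.export(model, x, "sound1d.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)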
Hi,
Regarding the accuracy, what accuracy do you get in N2D2 before the export (with ./n2d2 onnx.ini -test)?
I ran it with:
bin/n2d2 ~sound1d-onnx.ini -test -seed 1 -w /dev/null
and I get
Testing #348 38.40%
Final recognition rate: 38.40% (error rate: 61.60%)
Sensitivity: 55.95% / Specificity: 94.37% / Precision: 44.80%
Accuracy: 89.73% / F1-score: 47.24% / Informedness: 50.32%
What is the recognition rate and how is it different from 'Accuracy'?
Hi,
The accuracy problem comes from a bad label mapping of the output of the network.
The label mapping should be the following:
/down 0
/go 1
/left 2
/no 3
/off 4
/on 5
/right 6
/silence 7
/stop 8
/unknown 9
/up 10
/yes 11
But in fact, since the /silence folder is empty, no image with the label /silence is loaded in the database driver and this label is not created (this is the current behaviour of N2D2, which does not create labels for empty folders). As a result, the following classes are shifted and mapped to the wrong outputs.
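A small sketch of the effect (not the actual N2D2 code, just an illustration of labels being assigned from non-empty folders only, assuming the stimuli live under a dataset/ folder):

import os

expected = ["down", "go", "left", "no", "off", "on",
            "right", "silence", "stop", "unknown", "up", "yes"]

# Sketch of the behaviour: only folders that actually contain stimuli get a label.
non_empty = [c for c in expected
             if os.listdir(os.path.join("dataset", c))]
mapping = {c: i for i, c in enumerate(non_empty)}

# With an empty "silence" folder, "stop" gets label 7 instead of 8, "unknown" 8 instead
# of 9, etc., so the network outputs no longer line up with the training label mapping.
print(mapping)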
Regarding the score metrics, some remarks:
- the recognition rate is simply the number of correctly classified images over the total number of images, regardless of their class;
- the other metrics are in fact averages of the corresponding per-class metrics. The accuracy metric follows the standard definition, which is the number of true positives and true negatives over the total number of images, for each class (see the small sketch below). Please note that accuracy is a really poor metric for unbalanced classes!
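To make the difference concrete, here is a small sketch with made-up labels and predictions:

import numpy as np

labels = np.array([0, 0, 0, 0, 1, 2])   # made-up, unbalanced ground truth
preds  = np.array([0, 0, 0, 1, 1, 1])   # made-up predictions

# Recognition rate: correctly classified images over the total, regardless of class.
recognition_rate = np.mean(preds == labels)              # 4/6 ~= 0.67

# Reported "Accuracy": per-class (TP + TN) / total, averaged over the classes.
per_class = []
for c in np.unique(labels):
    tp = np.sum((preds == c) & (labels == c))
    tn = np.sum((preds != c) & (labels != c))
    per_class.append((tp + tn) / len(labels))
accuracy = np.mean(per_class)                            # ~= 0.78, higher than 0.67

print(recognition_rate, accuracy)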
Finally, the calibration issue is the same as the one explained in issue #80. We are still thinking about possible solutions in this case that would not cause precision loss.
Actually, I just tested the CPP export and it works fine! Using the command: ./n2d2 sound1d-onnx.ini -seed 1 -w /dev/null -test -export CPP -calib -1
The average recall is 80% in INT8 vs. 83% before quantization.
No calibration issue here (which should not happen for a mono-branch network).
I'm sorry but I'm not able to fully reproduce the working behavior:
To fix the 'silence' class issue, I added 60 WAV files containing silence to the directory.
- Initially, I use 100% of the data as the test set:
[database]
Learn=0
Validation=0
Test=1
Depth=1
The PyTorch model has 81% accuracy, while running N2D2 with -test -seed 1 -w /dev/null gives:
Testing database size: 871 images
Notice: stimuli depth is 64F (according to database first stimulus)
[LOG] Stimuli transformations flow (transformations.png)
[LOG] Network graph (sound1d-onnx.ini.png)
Warning: using box for unknown shape cylinder
[LOG] Network SVG graph (sound1d-onnx.ini.svg)
[LOG] Network stats (stats/*)
[LOG] Solvers scheduling (schedule/*)
[LOG] Layer's receptive fields (receptive_fields.log)
[LOG] Labels mapping (*.Target/labels_mapping.log)
[LOG] Labels legend (*.Target/labels_legend.png)
[LOG] Learn frame samples (frames/frame*)
[LOG] Test frame samples (frames/test_frame*)
[10:17.89 4:7.73 7:7.62 5:2.84 9:1.87 ]
Testing #100 93.07%
Testing #200 94.53%
Testing #300 95.02%
Testing #400 86.03%
Testing #500 81.84%
Testing #600 82.70%
Testing #700 79.46%
Testing #800 76.65%
Testing #870 75.43%
Final recognition rate: 75.43% (error rate: 24.57%)
Sensitivity: 83.67% / Specificity: 97.79% / Precision: 72.44%
Accuracy: 95.91% / F1-score: 75.57% / Informedness: 81.46%
I export the model to float32 with: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -nbbits -32 -w /dev/null. When I run 'run_export' (note that I need to change the Makefile to -O0 -g so it does not crash) I get:
649.000000/866 (74.942263%)
650.000000/867 (74.971165%)
651.000000/868 (75.000000%)
651.000000/869 (74.913694%)
651.000000/870 (74.827586%)
652.000000/871 (74.856487%)
Score: 74.856487%
- I then set up a validation set:
[database]
Learn=0
Validation=0.5
Test=0.5
Depth=1
I export to int8 with calibration on the whole validation set: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -calib -1 -w /dev/null. N2D2 takes 312 stimuli for calibration (I guess Nclasses * min(card(class_i))?), and I get, after a long time:
Notice: stimuli depth is 64F (according to database first stimulus)
Remove Dropout...
Fuse BatchNorm with Conv...
export_CPP_int8/stimuli_stats processing 312 stimuli
Fuse Padding...
Cross-layer equalization:
- eq. 35 and 33
- eq. 37 and 35
quant. range delta = 0.491025
export_CPP_int8/stimuli_stats processing 312 stimuli
Calculating calibration data range and histogram...
Calibration data 100/312
Calibration data 200/312
Calibration data 300/312
Quantization (8 bits)...
Quantizing free parameters:
- 17: 1.57456
- 19: 1.57456
- 20: 1.10175
- 22: 1.10175
- 23: 0.77638
- 25: 0.77638
- 26: 0.445755
- 28: 0.445755
- 29: 0.24595
- 31: 0.24595
- 33: 0.107331
- 35: 0.0528017
- 37: 0.0259759
Fuse scaling cells:
Quantizing activations:
- 17: prev=1, act=605.467, bias=1.57456
quant=63.251, global scaling=384.532 -> cell scaling=4.1115e-05
- 20: prev=384.532, act=885.995, bias=1.10175
quant=127, global scaling=804.168 -> cell scaling=0.00376515
- 23: prev=804.168, act=1939.13, bias=0.77638
quant=127, global scaling=2497.66 -> cell scaling=0.00253519
- 26: prev=2497.66, act=3749.77, bias=0.445755
quant=127, global scaling=8412.17 -> cell scaling=0.00233787
- 29: prev=8412.17, act=2682.97, bias=0.24595
quant=127, global scaling=10908.6 -> cell scaling=0.00607205
- 33: prev=10908.6, act=10430.1, bias=0.107331
quant=127, global scaling=97176.4 -> cell scaling=0.000883903
- 35: prev=97176.4, act=2751.02, bias=0.0528017
quant=127, global scaling=52100.9 -> cell scaling=0.0146863
- 37: prev=52100.9, act=3834.95, bias=0.0259759
quant=255, global scaling=147635 -> cell scaling=0.00138393
Fuse scaling cells:
- fuse: 17_rescale_act
- fuse: 20_rescale_act
- fuse: 23_rescale_act
- fuse: 26_rescale_act
- fuse: 29_rescale_act
- fuse: 33_rescale_act
- fuse: 35_rescale_act
- fuse: 37_rescale_act
Scaling approximation [3]:
- 17: 4.1115e-05
SINGLE_SHIFT: 2 ^ [- 14]
- 20: 0.00376515
SINGLE_SHIFT: 2 ^ [- 8]
- 23: 0.00253519
SINGLE_SHIFT: 2 ^ [- 8]
- 26: 0.00233787
SINGLE_SHIFT: 2 ^ [- 8]
- 29: 0.00607205
SINGLE_SHIFT: 2 ^ [- 7]
- 33: 0.000883903
SINGLE_SHIFT: 2 ^ [- 10]
- 35: 0.0146863
SINGLE_SHIFT: 2 ^ [- 6]
- 37: 0.00138393
SINGLE_SHIFT: 2 ^ [- 9]
Inputs quantization
Done!
..................
[3:0.00 4:0.00 1:0.00 0:0.00 2:0.00 ]
Testing #100 8.91%
Testing #200 7.46%
Testing #300 12.96%
Testing #400 11.22%
Testing #500 10.18%
Testing #558 9.48%
Final recognition rate: 9.48% (error rate: 90.52%)
Sensitivity: 12.37% / Specificity: 91.86% / Precision: 11.20%
Accuracy: 84.91% / F1-score: 7.73% / Informedness: 4.23%
Time elapsed: 17281.58 s
When I compile and run 'run_export' I get
Score: 14.625000%
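As an aside, trying to make sense of the quantization log above, the printed values seem consistent with the relations below (just my reading of the numbers, not necessarily what N2D2 actually computes):

import math

# One line of the "Quantizing activations" log, e.g. cell 20:
prev, act, bias, quant = 384.532, 885.995, 1.10175, 127

global_scaling = act / bias                        # ~= 804.17, matches the log
cell_scaling = prev / (global_scaling * quant)     # ~= 0.0037651, matches the log

# The SINGLE_SHIFT approximation then looks like the power of two just above the scaling:
shift = math.floor(-math.log2(cell_scaling))       # 8, i.e. 2^-8, matches the log
print(global_scaling, cell_scaling, shift)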
Please don't forget to delete the export_CPP_int8 folder before running a new export when a change has been made to the dataset partitioning or pre-processing.
The problem was due to faulty stimuli in the dataset and to a data partitioning that did not match the one used with PyTorch.
Considering the issue solved. Closing.