CEA-LIST/N2D2

ONNX accuracy discrepancy with respect to Pytorch and calibration error

andreistoian opened this issue · 6 comments

Hello,

I'm trying to compile a sound classification network that uses 1D convolutions with the CPP export. With PyTorch I get 81% accuracy on a small subset of the data, on which I also test and calibrate the N2D2 export.

  • when exporting the network in float32 using N2D2, I obtain 67% accuracy when running "run_export". run_export produces a log ending in:
232.00/343 (67.64%)
233.00/344 (67.73%)
233.00/345 (67.54%)
234.00/346 (67.63%)
235.00/347 (67.72%)
236.00/348 (67.82%)
236.00/349 (67.62%)


Score: 67.62%
  • I tried calibrating the network in int8, but I hit the assert at DeepNetQuantization.cpp:948 (assert(scaling <= 1.0)). It seems the scaling has the value 248.938 (as do all 8 elements of mScalingPerOutput of the RectifierActivation class).

The input WAV files contain floating-point values between -1 and 1 (mostly in the -0.05 to 0.05 range), loaded from FLOAT32 WAV files using the code given in #77.
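
For reference, a minimal sketch of that kind of loading (this is not the exact code from #77; the soundfile package and the file name are assumptions of mine):

import soundfile as sf

# Read a FLOAT32 WAV file; samples come back as float32 already in [-1, 1],
# so no integer-to-float rescaling is needed.
data, sample_rate = sf.read("example.wav", dtype="float32")
print(data.min(), data.max())  # for these clips, typically within [-0.05, 0.05]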

Here is the code that exports the dataset, computes the accuracy of the PyTorch model, and exports the ONNX model:
sound_demo.zip
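
For context, the ONNX export itself is a standard torch.onnx.export call; a minimal sketch along those lines (the architecture and shapes below are placeholders, not the actual model from sound_demo.zip):

import torch
import torch.nn as nn

# Placeholder 1D-convolution classifier standing in for the real network.
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, stride=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 12),  # 12 keyword classes, including /silence and /unknown
)
model.eval()

# Dummy input shaped like one preprocessed clip: (batch, channels, samples).
dummy = torch.randn(1, 1, 16000)
torch.onnx.export(model, dummy, "sound1d.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)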

Hi,
Regarding the accuracy, what is the accuracy in N2D2 before the export (with ./n2d2 onnx.ini -test)?

I ran it with:

bin/n2d2 ~sound1d-onnx.ini -test -seed 1 -w /dev/null

and I get

Testing #348   38.40% 
Final recognition rate: 38.40%    (error rate: 61.60%)
    Sensitivity: 55.95% / Specificity: 94.37% / Precision: 44.80%
    Accuracy: 89.73% / F1-score: 47.24% / Informedness: 50.32%

What is the recognition rate and how is it different from 'Accuracy'?

Hi,

The accuracy problem comes from a bad mapping of the labels to the network outputs.
The label mapping should be the following:

/down 0
/go 1
/left 2
/no 3
/off 4
/on 5
/right 6
/silence 7
/stop 8
/unknown 9
/up 10
/yes 11

But in fact, since the /silence folder is empty, no image with the label /silence is loaded by the database driver and this label is not created (this is the current behaviour of N2D2, which does not create labels for empty folders). As a result, the following classes are shifted and mapped to the wrong outputs.
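
A quick way to visualise the shift (my own illustration, not N2D2 code): rebuild the mapping from the non-empty folders only and compare it with the expected output order.

expected = ["/down", "/go", "/left", "/no", "/off", "/on",
            "/right", "/silence", "/stop", "/unknown", "/up", "/yes"]

# N2D2 only creates a label for a non-empty folder, so an empty /silence
# folder drops out and every class after it is shifted down by one.
actual = [c for c in expected if c != "/silence"]

for onnx_index, cls in enumerate(expected):
    n2d2_index = actual.index(cls) if cls in actual else None
    print(f"{cls:10s} expected label {onnx_index:2d} -> actual N2D2 label {n2d2_index}")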

Regarding the score metrics, some remarks:

  • the recognition rate is simply the number of correctly classified images over the total number of images, regardless of their class;
  • the other metrics are averages of the corresponding per-class metrics. The accuracy metric follows the standard definition, i.e. the number of true positives and true negatives over the total number of images, computed for each class. Please note that accuracy is a really poor metric for unbalanced classes! A small sketch computing both quantities follows below.
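
To make the two numbers concrete, here is a small sketch (my own, not N2D2's implementation) computing the recognition rate and the per-class averaged accuracy from a confusion matrix:

import numpy as np

def scores(conf):
    """conf[i, j] = number of images of true class i predicted as class j."""
    total = conf.sum()
    # Recognition rate: correctly classified images over all images.
    recognition_rate = np.trace(conf) / total
    # Per-class accuracy: (TP + TN) / total for each class, then averaged.
    per_class_acc = []
    for c in range(conf.shape[0]):
        tp = conf[c, c]
        fp = conf[:, c].sum() - tp
        fn = conf[c, :].sum() - tp
        tn = total - tp - fp - fn
        per_class_acc.append((tp + tn) / total)
    return recognition_rate, float(np.mean(per_class_acc))

With 12 classes, true negatives dominate each per-class count, which is why the reported 'Accuracy' can stay close to 90% even when the recognition rate is much lower.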

Finally, the calibration issue is the same as the one explained in issue #80. We are still thinking about possible solutions in this case that would not cause precision loss.

Actually, I just tested the CPP export and it works fine! Using the command: ./n2d2 sound1d-onnx.ini -seed 1 -w /dev/null -test -export CPP -calib -1
The average recall is 80% in INT8 vs. 83% before quantization.
No calibration issue here (which should not happen for a mono-branch network anyway).

I'm sorry but I'm not able to fully reproduce the working behavior:

To fix the 'silence' class issue, I added 60 WAV files of silence to the directory.
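
Something along these lines produces such files (the paths, duration and the use of soundfile here are illustrative, not necessarily what was actually used):

import numpy as np
import soundfile as sf

# Write 60 one-second FLOAT32 WAV files of pure silence into the
# /silence class folder so that N2D2 creates the label.
for i in range(60):
    silence = np.zeros(16000, dtype=np.float32)  # 1 s at 16 kHz
    sf.write(f"silence/silence_{i:02d}.wav", silence, 16000, subtype="FLOAT")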

  • Initially, I take 100% of the data as the test set.
[database]
Learn=0
Validation=0
Test=1
Depth=1

The PyTorch model has 81% accuracy, while running N2D2 with -test -seed 1 -w /dev/null gives

Testing database size: 871 images
Notice: stimuli depth is 64F (according to database first stimulus)
[LOG] Stimuli transformations flow (transformations.png)
[LOG] Network graph (sound1d-onnx.ini.png)
Warning: using box for unknown shape cylinder
[LOG] Network SVG graph (sound1d-onnx.ini.svg)
[LOG] Network stats (stats/*)
[LOG] Solvers scheduling (schedule/*)
[LOG] Layer's receptive fields (receptive_fields.log)
[LOG] Labels mapping (*.Target/labels_mapping.log)
[LOG] Labels legend (*.Target/labels_legend.png)
[LOG] Learn frame samples (frames/frame*)
[LOG] Test frame samples (frames/test_frame*)
[10:17.89 4:7.73 7:7.62 5:2.84 9:1.87 ]
Testing #100   93.07% 
Testing #200   94.53% 
Testing #300   95.02% 
Testing #400   86.03% 
Testing #500   81.84% 
Testing #600   82.70% 
Testing #700   79.46% 
Testing #800   76.65% 
Testing #870   75.43% 
Final recognition rate: 75.43%    (error rate: 24.57%)
    Sensitivity: 83.67% / Specificity: 97.79% / Precision: 72.44%
    Accuracy: 95.91% / F1-score: 75.57% / Informedness: 81.46%

I export the model to float32: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -nbbits -32 -w /dev/null. When I run 'run_export' (note that I need to change the Makefile to -O0 -g so it does not crash), I get

649.000000/866 (74.942263%)
650.000000/867 (74.971165%)
651.000000/868 (75.000000%)
651.000000/869 (74.913694%)
651.000000/870 (74.827586%)
652.000000/871 (74.856487%)

Score: 74.856487%
  • I then set up a validation set:
[database]
Learn=0
Validation=0.5
Test=0.5
Depth=1

I export to int8 with calibration on the whole validation set: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -calib -1 -w /dev/null. N2D2 takes 312 stimuli for calibration (I guess Nclasses * min(card(class_i))?).
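
(If that guess is right, 312 would correspond, for example, to 12 classes with 26 stimuli in the smallest validation class, since 12 × 26 = 312; the 26 is purely hypothetical, just to check that the formula is plausible.)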

After a long time, I get:

Notice: stimuli depth is 64F (according to database first stimulus)
Remove Dropout...
Fuse BatchNorm with Conv...
export_CPP_int8/stimuli_stats processing 312 stimuli
Fuse Padding...
  Cross-layer equalization:
    - eq. 35 and 33
    - eq. 37 and 35
    quant. range delta = 0.491025
export_CPP_int8/stimuli_stats processing 312 stimuli
Calculating calibration data range and histogram...
Calibration data 100/312
Calibration data 200/312
Calibration data 300/312
Quantization (8 bits)...
  Quantizing free parameters:
  - 17: 1.57456
  - 19: 1.57456
  - 20: 1.10175
  - 22: 1.10175
  - 23: 0.77638
  - 25: 0.77638
  - 26: 0.445755
  - 28: 0.445755
  - 29: 0.24595
  - 31: 0.24595
  - 33: 0.107331
  - 35: 0.0528017
  - 37: 0.0259759
  Fuse scaling cells:
  Quantizing activations:
  - 17: prev=1, act=605.467, bias=1.57456
      quant=63.251, global scaling=384.532 -> cell scaling=4.1115e-05
  - 20: prev=384.532, act=885.995, bias=1.10175
      quant=127, global scaling=804.168 -> cell scaling=0.00376515
  - 23: prev=804.168, act=1939.13, bias=0.77638
      quant=127, global scaling=2497.66 -> cell scaling=0.00253519
  - 26: prev=2497.66, act=3749.77, bias=0.445755
      quant=127, global scaling=8412.17 -> cell scaling=0.00233787
  - 29: prev=8412.17, act=2682.97, bias=0.24595
      quant=127, global scaling=10908.6 -> cell scaling=0.00607205
  - 33: prev=10908.6, act=10430.1, bias=0.107331
      quant=127, global scaling=97176.4 -> cell scaling=0.000883903
  - 35: prev=97176.4, act=2751.02, bias=0.0528017
      quant=127, global scaling=52100.9 -> cell scaling=0.0146863
  - 37: prev=52100.9, act=3834.95, bias=0.0259759
      quant=255, global scaling=147635 -> cell scaling=0.00138393
  Fuse scaling cells:
  - fuse: 17_rescale_act
  - fuse: 20_rescale_act
  - fuse: 23_rescale_act
  - fuse: 26_rescale_act
  - fuse: 29_rescale_act
  - fuse: 33_rescale_act
  - fuse: 35_rescale_act
  - fuse: 37_rescale_act
  Scaling approximation [3]:
  - 17: 4.1115e-05
    SINGLE_SHIFT: 2 ^ [- 14]
  - 20: 0.00376515
    SINGLE_SHIFT: 2 ^ [- 8]
  - 23: 0.00253519
    SINGLE_SHIFT: 2 ^ [- 8]
  - 26: 0.00233787
    SINGLE_SHIFT: 2 ^ [- 8]
  - 29: 0.00607205
    SINGLE_SHIFT: 2 ^ [- 7]
  - 33: 0.000883903
    SINGLE_SHIFT: 2 ^ [- 10]
  - 35: 0.0146863
    SINGLE_SHIFT: 2 ^ [- 6]
  - 37: 0.00138393
    SINGLE_SHIFT: 2 ^ [- 9]
  Inputs quantization
  Done!

..................

[3:0.00 4:0.00 1:0.00 0:0.00 2:0.00 ]
Testing #100   8.91% 
Testing #200   7.46% 
Testing #300   12.96% 
Testing #400   11.22% 
Testing #500   10.18% 
Testing #558   9.48% 
Final recognition rate: 9.48%    (error rate: 90.52%)
    Sensitivity: 12.37% / Specificity: 91.86% / Precision: 11.20%
    Accuracy: 84.91% / F1-score: 7.73% / Informedness: 4.23%

Time elapsed: 17281.58 s
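
As an aside, the SINGLE_SHIFT lines in the quantization log above look consistent with each cell scaling being rounded up to the next power of two, i.e. shift = floor(-log2(scaling)); a small sketch reproducing the logged shifts (my reading of the log, not necessarily N2D2's actual rule):

import math

cell_scalings = {17: 4.1115e-05, 20: 0.00376515, 23: 0.00253519,
                 26: 0.00233787, 29: 0.00607205, 33: 0.000883903,
                 35: 0.0146863, 37: 0.00138393}

for cell, s in cell_scalings.items():
    # Largest shift such that 2**-shift >= s, i.e. s rounded up to the
    # nearest power of two, so the scaling becomes a single right shift.
    shift = math.floor(-math.log2(s))
    print(f"- {cell}: {s}")
    print(f"  SINGLE_SHIFT: 2 ^ [- {shift}]")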

When I compile and run 'run_export', I get


Score: 14.625000%

Please don't forget to delete the export_CPP_int8 folder before running a new export whenever the dataset partitioning or pre-processing has changed.
The problem was due to faulty stimuli in the dataset and to a data partitioning that did not match the one used with PyTorch.
Considering the issue solved. Closing.