Inference benchmark suite for embedding models

Created as part of an article series on embedding models for semantic search.

Running the suite

You need SBT installed to build and run the suite.

The main benchmark code is in EncoderBenchmark. You may need to change the path variable so that it points to the directory containing your ONNX-encoded models.
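
For orientation, here is a minimal sketch of how a JMH benchmark with model and words parameters can be laid out in Scala. It is not the actual EncoderBenchmark source: the class name EncoderBenchmarkSketch, the placeholder path value, the parameter defaults, and the stand-in workload are all assumptions made for illustration. It needs JMH on the classpath, which the sbt-jmh plugin normally provides.

```scala
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

// Illustrative only: the real EncoderBenchmark may be organized differently.
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))   // shown as "avgt" in the JMH report
@OutputTimeUnit(TimeUnit.MILLISECONDS)    // shown as "ms/op" in the JMH report
class EncoderBenchmarkSketch {

  // Every @Param can be overridden from the command line with -p name=value.
  @Param(Array("all-MiniLM-L6-v2", "e5-small-v2"))
  var model: String = _

  @Param(Array("4", "256"))
  var words: Int = _

  // Hypothetical model directory; point it at your own ONNX models.
  val path: String = "/path/to/onnx/models"

  var text: String = _

  @Setup(Level.Trial)
  def setup(): Unit = {
    // A real benchmark would load the model found under s"$path/$model" here.
    // This sketch only prepares an input that is `words` words long.
    text = Seq.fill(words)("word").mkString(" ")
  }

  @Benchmark
  def latency(): Int = {
    // Stand-in workload; the real suite would run one encoder inference.
    // Returning the result keeps the JIT from eliminating the call.
    text.split(' ').length
  }
}
```

JMH runs the latency method once for every combination of model and words, which is why the result tables below contain one row per pair.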

To run the full suite, use the following command:

sbt jmh:run

To run a single model at one input length (-f 1 runs a single JMH fork, and each -p sets a benchmark parameter):

sbt "jmh:run -f 1 -p model=e5-small-v2 -p words=256"

CPU results

On an AMD Ryzen 7 2700 with 8 physical cores:

[info] Benchmark                           (model)  (words)  Mode  Cnt    Score    Error  Units
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2        4  avgt   15    3.542 ±  0.284  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2        8  avgt   15    3.874 ±  0.307  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       16  avgt   15    5.095 ±  0.281  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       32  avgt   15    9.510 ±  0.454  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       64  avgt   15   14.148 ±  0.816  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2      128  avgt   15   24.393 ±  0.994  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2      256  avgt   15   50.041 ±  1.108  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2        4  avgt   15   17.623 ±  0.636  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2        8  avgt   15   19.373 ±  0.827  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       16  avgt   15   26.519 ±  2.227  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       32  avgt   15   47.147 ±  7.444  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       64  avgt   15   70.433 ±  7.199  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2      128  avgt   15  115.497 ±  1.929  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2      256  avgt   15  264.828 ±  9.909  ms/op
[info] EncoderBenchmark.latency        e5-small-v2        4  avgt   15    6.782 ±  1.066  ms/op
[info] EncoderBenchmark.latency        e5-small-v2        8  avgt   15    7.161 ±  0.211  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       16  avgt   15    9.869 ±  0.636  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       32  avgt   15   18.267 ±  0.622  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       64  avgt   15   25.395 ±  0.414  ms/op
[info] EncoderBenchmark.latency        e5-small-v2      128  avgt   15   46.265 ±  0.800  ms/op
[info] EncoderBenchmark.latency        e5-small-v2      256  avgt   15   97.547 ±  2.253  ms/op
[info] EncoderBenchmark.latency         e5-base-v2        4  avgt   15   17.341 ±  0.753  ms/op
[info] EncoderBenchmark.latency         e5-base-v2        8  avgt   15   18.860 ±  0.473  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       16  avgt   15   26.774 ±  2.681  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       32  avgt   15   47.014 ±  7.148  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       64  avgt   15   72.257 ±  8.049  ms/op
[info] EncoderBenchmark.latency         e5-base-v2      128  avgt   15  114.695 ±  1.655  ms/op
[info] EncoderBenchmark.latency         e5-base-v2      256  avgt   15  255.714 ±  8.537  ms/op
[info] EncoderBenchmark.latency        e5-large-v2        4  avgt   15   55.901 ±  1.081  ms/op
[info] EncoderBenchmark.latency        e5-large-v2        8  avgt   15   63.532 ±  3.418  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       16  avgt   15   81.554 ±  2.683  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       32  avgt   15  141.388 ±  2.329  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       64  avgt   15  222.953 ±  5.689  ms/op
[info] EncoderBenchmark.latency        e5-large-v2      128  avgt   15  390.867 ±  8.989  ms/op
[info] EncoderBenchmark.latency        e5-large-v2      256  avgt   15  873.997 ± 25.694  ms/op

GPU results

On an Nvidia RTX 3060 Ti:

[info] Benchmark                           (model)  (words)  Mode  Cnt   Score   Error  Units
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2        4  avgt    5   1.237 ± 0.160  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2        8  avgt    5   1.136 ± 0.094  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       16  avgt    5   1.254 ± 0.062  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       32  avgt    5   1.268 ± 0.053  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2       64  avgt    5   1.293 ± 0.071  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2      128  avgt    5   1.611 ± 0.048  ms/op
[info] EncoderBenchmark.latency   all-MiniLM-L6-v2      256  avgt    5   2.620 ± 0.197  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2        4  avgt    5   2.389 ± 0.034  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2        8  avgt    5   2.435 ± 0.059  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       16  avgt    5   2.675 ± 0.057  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       32  avgt    5   2.926 ± 0.078  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2       64  avgt    5   3.437 ± 0.069  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2      128  avgt    5   4.817 ± 0.125  ms/op
[info] EncoderBenchmark.latency  all-mpnet-base-v2      256  avgt    5   9.410 ± 0.115  ms/op
[info] EncoderBenchmark.latency        e5-small-v2        4  avgt    5   2.130 ± 0.513  ms/op
[info] EncoderBenchmark.latency        e5-small-v2        8  avgt    5   2.037 ± 0.260  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       16  avgt    5   2.163 ± 0.384  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       32  avgt    5   2.319 ± 0.378  ms/op
[info] EncoderBenchmark.latency        e5-small-v2       64  avgt    5   2.292 ± 0.087  ms/op
[info] EncoderBenchmark.latency        e5-small-v2      128  avgt    5   2.828 ± 0.471  ms/op
[info] EncoderBenchmark.latency        e5-small-v2      256  avgt    5   4.663 ± 0.157  ms/op
[info] EncoderBenchmark.latency         e5-base-v2        4  avgt    5   2.392 ± 0.090  ms/op
[info] EncoderBenchmark.latency         e5-base-v2        8  avgt    5   2.427 ± 0.079  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       16  avgt    5   2.569 ± 0.068  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       32  avgt    5   2.826 ± 0.090  ms/op
[info] EncoderBenchmark.latency         e5-base-v2       64  avgt    5   3.414 ± 0.206  ms/op
[info] EncoderBenchmark.latency         e5-base-v2      128  avgt    5   4.916 ± 0.090  ms/op
[info] EncoderBenchmark.latency         e5-base-v2      256  avgt    5   9.050 ± 0.321  ms/op
[info] EncoderBenchmark.latency        e5-large-v2        4  avgt    5   5.594 ± 0.229  ms/op
[info] EncoderBenchmark.latency        e5-large-v2        8  avgt    5   5.669 ± 0.158  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       16  avgt    5   6.232 ± 0.213  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       32  avgt    5   6.398 ± 0.197  ms/op
[info] EncoderBenchmark.latency        e5-large-v2       64  avgt    5   8.579 ± 0.046  ms/op
[info] EncoderBenchmark.latency        e5-large-v2      128  avgt    5  14.359 ± 0.299  ms/op
[info] EncoderBenchmark.latency        e5-large-v2      256  avgt    5  29.497 ± 1.061  ms/op

Quantization

QInt8, no AVX-VNNI

[info] Benchmark                 (gpu)           (model)  (quantized)  (words)  Mode  Cnt    Score    Error  Units
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true        4  avgt   30    5.056 ±  0.328  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true        8  avgt   30    6.188 ±  0.312  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       16  avgt   30    9.475 ±  0.291  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       32  avgt   30   18.114 ±  0.729  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       64  avgt   30   27.633 ±  1.524  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true      128  avgt   30   49.595 ±  1.263  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true      256  avgt   30   96.119 ±  2.950  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false        4  avgt   30    6.774 ±  0.299  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false        8  avgt   30    7.699 ±  0.304  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       16  avgt   30   10.854 ±  0.627  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       32  avgt   30   20.046 ±  0.795  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       64  avgt   30   29.415 ±  1.626  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false      128  avgt   30   51.847 ±  1.471  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false      256  avgt   30  104.012 ±  5.958  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true        4  avgt   30   10.101 ±  0.339  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true        8  avgt   30   12.989 ±  0.462  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       16  avgt   30   20.710 ±  0.729  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       32  avgt   30   41.481 ±  1.322  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       64  avgt   30   68.744 ±  2.724  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true      128  avgt   30  108.107 ±  3.234  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true      256  avgt   30  242.078 ±  7.595  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false        4  avgt   30   18.318 ±  0.627  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false        8  avgt   30   20.901 ±  1.241  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       16  avgt   30   27.631 ±  1.168  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       32  avgt   30   47.793 ±  1.643  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       64  avgt   30   76.249 ±  4.037  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false      128  avgt   30  120.735 ±  4.239  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false      256  avgt   30  255.839 ±  6.584  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true        4  avgt   30   29.147 ±  0.984  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true        8  avgt   30   39.321 ±  1.286  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       16  avgt   30   64.002 ±  2.810  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       32  avgt   30  131.581 ±  4.094  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       64  avgt   30  201.466 ± 12.138  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true      128  avgt   30  353.541 ± 11.863  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true      256  avgt   30  775.755 ± 15.869  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false        4  avgt   30   58.246 ±  1.295  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false        8  avgt   30   65.960 ±  2.660  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       16  avgt   30   86.411 ±  2.908  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       32  avgt   30  150.251 ±  2.868  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       64  avgt   30  227.753 ±  5.515  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false      128  avgt   30  398.923 ± 11.870  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false      256  avgt   30  871.132 ± 20.903  ms/op

QInt8, with AVX-VNNI

[info] Benchmark                 (gpu)           (model)  (quantized)  (words)  Mode  Cnt    Score    Error  Units
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true        4  avgt   30    1.726 ±  0.050  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true        8  avgt   30    2.051 ±  0.057  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       16  avgt   30    2.863 ±  0.094  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       32  avgt   30    5.102 ±  0.158  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true       64  avgt   30    7.426 ±  0.359  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true      128  avgt   30   13.004 ±  0.619  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx         true      256  avgt   30   23.801 ±  0.194  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false        4  avgt   30    3.167 ±  0.387  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false        8  avgt   30    2.894 ±  0.245  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       16  avgt   30    4.182 ±  0.229  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       32  avgt   30    7.297 ±  0.119  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false       64  avgt   30   12.136 ±  1.416  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false      128  avgt   30   22.553 ±  2.545  ms/op
[info] EncoderBenchmark.latency  false  e5-small-v2-onnx        false      256  avgt   30   41.167 ±  3.994  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true        4  avgt   30    3.655 ±  0.531  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true        8  avgt   30    4.202 ±  0.462  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       16  avgt   30    5.572 ±  0.425  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       32  avgt   30    8.767 ±  0.079  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true       64  avgt   30   14.252 ±  1.272  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true      128  avgt   30   23.057 ±  2.247  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true      256  avgt   30   52.650 ±  4.567  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false        4  avgt   30   12.693 ±  1.863  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false        8  avgt   30   12.746 ±  1.826  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       16  avgt   30   14.428 ±  0.092  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       32  avgt   30   23.002 ±  1.478  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false       64  avgt   30   37.776 ±  5.601  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false      128  avgt   30   53.990 ±  1.942  ms/op
[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false      256  avgt   30  150.256 ± 23.328  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true        4  avgt   30   14.164 ±  2.431  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true        8  avgt   30   15.473 ±  2.058  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       16  avgt   30   18.127 ±  2.286  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       32  avgt   30   26.344 ±  2.518  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true       64  avgt   30   32.557 ±  1.190  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true      128  avgt   30   62.558 ±  8.472  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx         true      256  avgt   30  126.842 ± 11.528  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false        4  avgt   30   39.614 ±  3.823  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false        8  avgt   30   43.939 ±  5.708  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       16  avgt   30   52.877 ±  6.923  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       32  avgt   30   90.087 ± 14.461  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false       64  avgt   30  117.197 ± 17.545  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false      128  avgt   30  176.686 ±  4.437  ms/op
[info] EncoderBenchmark.latency  false  e5-large-v2-onnx        false      256  avgt   30  419.281 ± 58.563  ms/op
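
The quantization tables are easier to compare as ratios. The sketch below parses JMH result rows in the layout shown above and divides the non-quantized latency by the QInt8 latency for each model and input length. The object QuantizationSpeedup and its helpers are hypothetical and not part of the suite; the two sample rows are copied from the AVX-VNNI table. It assumes Scala 2.13 or newer (for toIntOption and toDoubleOption).

```scala
object QuantizationSpeedup {
  final case class Row(model: String, quantized: Boolean, words: Int, scoreMs: Double)

  // Parse one "[info] EncoderBenchmark.latency ..." row.
  // Header lines and anything else that does not fit the layout yield None.
  def parse(line: String): Option[Row] =
    line.trim.split("\\s+") match {
      case Array("[info]", _, _, model, quantized, words, _, _, score, "±", _, _) =>
        for {
          w <- words.toIntOption
          s <- score.toDoubleOption
        } yield Row(model, quantized == "true", w, s)
      case _ => None
    }

  // For each (model, words) pair that has both variants,
  // return non-quantized latency divided by quantized latency.
  def speedups(rows: Seq[Row]): Map[(String, Int), Double] =
    rows
      .groupBy(r => (r.model, r.words))
      .toSeq
      .flatMap { case (key, group) =>
        for {
          fp <- group.find(r => !r.quantized)
          q  <- group.find(r => r.quantized)
        } yield key -> (fp.scoreMs / q.scoreMs)
      }
      .toMap

  def main(args: Array[String]): Unit = {
    // Two rows taken from the "QInt8, with AVX-VNNI" table.
    val sample =
      """|[info] EncoderBenchmark.latency  false   e5-base-v2-onnx         true      256  avgt   30   52.650 ±  4.567  ms/op
         |[info] EncoderBenchmark.latency  false   e5-base-v2-onnx        false      256  avgt   30  150.256 ± 23.328  ms/op
         |""".stripMargin

    val rows = sample.linesIterator.flatMap(parse).toSeq
    speedups(rows).foreach { case ((model, words), ratio) =>
      println(s"$model @ $words words: ${ratio}x faster when quantized")
    }
  }
}
```

For the two sample rows the ratio is roughly 2.85. The error columns are ignored here, so treat ratios between rows with wide error bars as approximate.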

License

Apache 2.0