For 10 x 10 matrices

Method MatrixSize Mean Error StdDev Median SpeedUp
SSEDLL 10 0.6794 0.005906 0.005235 0.679 1
VectorSharp 10 0.764 0.013935 0.012353 0.7623 1.124521637
AVX2DLL 10 1.0042 0.020026 0.018732 1.0027 1.478068884
Multiply1dWithTranspose 10 1.186 0.0225 0.0221 - 1.745657933
Multiply1dWithTransposeAndUnrolled 10 1.286 0.0273 0.0373 - 1.892846629
Multiply1d 10 1.331 0.0265 0.0498 - 1.959081543
MultiplyJaggedSharp 10 1.531 0.0306 0.0618 - 2.253458934
Multiply2d 10 2.477 0.0494 0.0607 - 3.645863998
Multiply1dDLLFirstFor 10 3.06 0.0598 0.0948 3.052 4.503974095
Multiply1dWithTransposeAndUnrolledAndParallelDLL 10 3.337 0.0667 0.1838 - 4.911686782
AVX2DLLParallel 10 3.6387 0.095875 0.267261 3.5417 5.355755078
OpenMPParallel 10 3.6543 0.090168 0.258708 3.5584 5.378716515
OpenMPParallel 10 3.68 0.1112 0.3082 3.655 5.416544009
SSEDLLParallel 10 3.6822 0.077172 0.216399 3.6528 5.419782161
Multiply1dSharp 10 5.574 0.0377 0.0352 5.554 8.20429791
Multiply1dWithTransposeAndUnrolledAndParallelSharp 10 5.82 0.1144 0.1272 5.774 8.566382102
VectorSharpParallel 10 5.9281 0.1183 0.26459 5.9272 8.725493082
Multiply1dWithTranspose 10 7.84 0.39191 0.45132 - 11.53959376
Multiply1dWithTransposeAndUnrolled 10 9.208 0.08588 0.07613 - 13.55313512
Multiply1d 10 9.278 0.18502 0.39428 - 13.65616721
Multiply1dDLLSecondFor 10 34.132 0.6811 1.4366 33.929 50.23844569
CUDASecondMultiplyWithoutCopy 10 70.879 3.9748 11.7199 63.806 104.3258758
CUDAFirstMultiplyWithoutCopy 10 72.128 3.2209 9.4968 70.66 106.1642626
CUDASecondMultiply 10 307.212 9.2096 25.9758 295.513 452.1813365
CUDAFirstMultiply 10 329.753 12.5671 36.857 322.609 485.3591404
Multiply1dDLLThirdFor 10 341.777 6.8053 10.1859 339.846 503.0571092
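
The SpeedUp column appears to be each method's Mean divided by the smallest Mean in the same table, so 1 marks the fastest method and larger values mean slower runs. A minimal sketch of that calculation, assuming this definition and using three Mean values from the 10 x 10 table above; the helper name `relative_to_fastest` is hypothetical and not part of the benchmark code:

```python
def relative_to_fastest(means):
    """Divide every mean by the smallest mean in the mapping.

    Assumption: this mirrors how the SpeedUp column was derived
    (ratio to the fastest method, so the fastest row gets 1).
    """
    fastest = min(means.values())
    return {name: mean / fastest for name, mean in means.items()}


# Mean values copied from the 10 x 10 table.
means_10 = {"SSEDLL": 0.6794, "VectorSharp": 0.764, "AVX2DLL": 1.0042}

ratios = relative_to_fastest(means_10)
for name, ratio in ratios.items():
    print(f"{name} {ratio:.6f}")
```

The same ratio can be recomputed for any other table by passing that table's Mean column.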

For 100 x 100 matrices

Method MatrixSize Mean Error StdDev Median SpeedUp
OpenMPParallel 100 90.71 1.7972 3.5475 90.368 1
CUDASecondMultiplyWithoutCopy 100 91.069 4.0229 11.8617 84.786 1.003957667
CUDAFirstMultiplyWithoutCopy 100 101.229 4.7851 14.109 93.281 1.115962959
OpenMPParallel 100 123.5062 1.394218 1.304152 123.7963 1.361549994
AVX2DLLParallel 100 127.5145 1.979355 1.85149 127.4217 1.405738066
SSEDLLParallel 100 151.2455 2.689533 2.515791 150.4583 1.667352001
Multiply1dWithTransposeAndUnrolledAndParallelDLL 100 161.152 3.1758 5.3928 - 1.776562672
VectorSharpParallel 100 167.6437 3.27149 4.99591 166.6681 1.848128101
Multiply1dDLLFirstFor 100 213.945 4.9133 13.6962 209.672 2.358560247
SSEDLL 100 283.0659 5.559858 7.03144 279.1983 3.120558924
VectorSharp 100 291.4436 3.41663 3.028753 291.4584 3.212915886
AVX2DLL 100 308.838 6.140071 9.559351 307.6695 3.404674237
CUDASecondMultiply 100 380.426 7.5718 19.9472 372.659 4.193870577
CUDAFirstMultiply 100 418.775 16.2783 47.7414 405.02 4.616635432
Multiply1dWithTransposeAndUnrolledAndParallelSharp 100 452.535 8.9537 19.2738 447.901 4.988810495
Multiply1dSharp 100 466.95 9.1219 8.5327 466.327 5.147723514
Multiply1dDLLSecondFor 100 496.614 17.0064 47.6879 481.255 5.474743689
Multiply1dWithTransposeAndUnrolled 100 1076.45 12.0103 11.2345 - 11.86696064
MultiplyJaggedSharp 100 1357.60 23.2452 21.7436 - 14.96639841
Multiply1dWithTranspose 100 1383.89 16.8266 14.051 - 15.25624518
Multiply1d 100 1448.66 28.6429 53.0914 - 15.97024584
Multiply2d 100 2338.70 24.3722 21.6053 - 25.78218499
Multiply1dWithTransposeAndUnrolled 100 5052.179 96.05974 94.34351 - 55.69594312
Multiply1dWithTranspose 100 10679.086 238.10122 233.84723 - 117.7277698
Multiply1d 100 11062.934 257.56857 240.9298 - 121.959365
Multiply1dDLLThirdFor 100 34115.24 674.722 1466.79 34069.34 376.091313

For 250 x 250 matrices

Method MatrixSize Mean Error StdDev Median SpeedUp
CUDASecondMultiplyWithoutCopy 250 278.101 5.5423 8.1238 276.07 1
CUDAFirstMultiplyWithoutCopy 250 466.271 9.3062 19.2189 464.37 1.67662468
OpenMPParallel 250 977.838 36.1275 105.9556 940.499 3.516125436
AVX2DLLParallel 250 1277.137 23.474968 21.9585 1275.5604 4.592349542
OpenMPParallel 250 1319.3033 14.792703 13.837104 1323.224 4.743971794
VectorSharpParallel 250 1451.0278 28.88237 33.26098 1446.1795 5.217628847
CUDASecondMultiply 250 1494.41 17.6265 16.4878 1494.41 5.373608869
CUDAFirstMultiply 250 1587.01 32.2718 37.1642 1575.51 5.706606593
SSEDLLParallel 250 1886.6368 35.732465 35.094058 1888.9606 6.783998619
Multiply1dWithTransposeAndUnrolledAndParallelDLL 250 2579.18 125.206 369.1726 - 9.274249284
Multiply1dDLLFirstFor 250 3608.78 119.0656 341.6215 3516.38 12.97652292
AVX2DLL 250 3709.3901 74.124676 76.120585 3702.7734 13.33828393
VectorSharp 250 4023.0501 79.92504 186.82219 4019.5895 14.46614755
Multiply1dDLLSecondFor 250 4113.06 81.2145 160.3095 4085.47 14.78980299
SSEDLL 250 4887.964 96.258633 166.041412 4848.0801 17.57621871
Multiply1dWithTransposeAndUnrolledAndParallelSharp 250 5679.70 106.9903 100.0788 5644.94 20.42315921
Multiply1dSharp 250 6107.17 121.2515 199.2196 6088.18 21.96025545
Multiply1dWithTransposeAndUnrolled 250 16644.40 283.8997 251.6696 - 59.85020191
Multiply1dWithTranspose 250 20767.40 411.7288 364.9868 - 74.6757473
Multiply1d 250 22007.59 388.9259 344.7727 - 79.13524223
MultiplyJaggedSharp 250 23476.33 511.5211 717.08 - 84.41654291
Multiply2d 250 37923.64 556.8028 520.8337 - 136.3664388
Multiply1dWithTransposeAndUnrolled 250 81530.829 1627.87636 3538.86924 - 293.169852
Multiply1d 250 211245.979 3626.30218 3392.04531 - 759.6016519
Multiply1dWithTranspose 250 211646.275 4120.88527 4905.62077 - 761.0410426
Multiply1dDLLThirdFor 250 230850.58 7159.01 20308.92 227076.13 830.0961701

For 500 x 500 matrices

Method MatrixSize Mean Error StdDev Median SpeedUp
CUDASecondMultiplyWithoutCopy 500 1617.57 5.3356 4.4555 1616.60 1
CUDAFirstMultiplyWithoutCopy 500 2982.11 13.5538 12.6783 2977.83 1.843570584
CUDASecondMultiply 500 4885.78 97.2554 213.478 4794.18 3.020438645
CUDAFirstMultiply 500 6329.02 41.373 34.5483 6325.76 3.912665456
OpenMPParallel 500 6909.16 138.3472 405.7483 6831.61 4.271312021
AVX2DLLParallel 500 7626.2035 193.681383 568.033968 7579.8578 4.714596188
OpenMPParallel 500 7767.8566 207.585201 608.81146 7787.6031 4.802167568
VectorSharpParallel 500 10103.3084 170.65822 159.63381 10064.9031 6.245967508
SSEDLLParallel 500 10665.1787 351.561003 1031.067565 10706.7508 6.593321414
Multiply1dWithTransposeAndUnrolledAndParallelDLL 500 17061.96 371.5151 1089.59 - 10.54787636
AVX2DLL 500 28751.9743 557.92099 572.94378 28723.85 17.77476151
Multiply1dDLLFirstFor 500 31504.49 673.7689 1878.20 31006.38 19.4763921
Multiply1dDLLSecondFor 500 31892.80 642.258 1725.38 31841.49 19.71645298
VectorSharp 500 35725.6557 1153.30357 3252.91318 34896.4154 22.08596193
SSEDLL 500 42266.4917 804.890291 790.50989 42298.7542 26.12957295
Multiply1dWithTransposeAndUnrolledAndParallelSharp 500 43081.55 416.9136 389.9812 43112.64 26.63345024
Multiply1dSharp 500 52213.44 956.1917 894.4223 51962.30 32.27887829
Multiply1dWithTransposeAndUnrolled 500 133871.48 1386.78 1297.19 - 82.76070261
Multiply1dWithTranspose 500 168821.07 3231.48 3173.75 - 104.3668947
Multiply1d 500 186925.08 3728.21 8261.46 - 115.5589763
MultiplyJaggedSharp 500 214642.38 4247.82 4891.79 - 132.6940917
Multiply2d 500 382662.14 7518.85 8658.72 - 236.5656079
Multiply1dWithTransposeAndUnrolled 500 621591.97 11740.53246 10982.10135 - 384.274447
Multiply1dDLLThirdFor 500 923522.94 18484.37 40573.64 912753.65 570.9312285
Multiply1dWithTranspose 500 1776699.222 27481.4902 25706.20298 - 1098.373441
Multiply1d 500 1874828.175 36478.4177 43424.96138 - 1159.037753

For 1000 x 1000 matrices

Method MatrixSize Mean Error StdDev Median SpeedUp
CUDASecondMultiplyWithoutCopy 1000 11579.21 19.2134 17.9722 11570.90 1
CUDASecondMultiply 1000 18479.25 173.4195 162.2167 18427.10 1.595898564
CUDAFirstMultiplyWithoutCopy 1000 25541.90 30.1455 28.1981 25530.24 2.205840742
CUDAFirstMultiply 1000 32335.12 159.2015 141.128 32367.82 2.792514241
OpenMPParallel 1000 42849.9545 599.573745 468.107742 42699.2955 3.700592674
AVX2DLLParallel 1000 47126.6893 2643.418823 7710.976781 43572.2417 4.069938538
OpenMPParallel 1000 50522.67 1702.64 5020.27 50222.83 4.363221286
SSEDLLParallel 1000 65991.0517 1309.21226 1607.83023 65553.3 5.699095958
VectorSharpParallel 1000 68496.049 1358.07578 1765.88212 68741.7562 5.915431652
Multiply1dWithTransposeAndUnrolledAndParallelDLL 1000 115897.95 2299.41 4949.72 - 10.00913784
AVX2DLL 1000 253923.7516 4909.293764 7497.010086 253550.85 21.92927358
Multiply1dWithTransposeAndUnrolledAndParallelSharp 1000 327605.85 4459.02 4170.97 327616.40 28.29258074
VectorSharp 1000 362379.0806 20471.89341 59717.47399 341778.4 31.29565449
SSEDLL 1000 389636.52 7713.942335 8883.387464 387519.9 33.64965187
Multiply1dDLLFirstFor 1000 775414.91 21458.89 63272.02 770391.65 66.9661091
Multiply1dDLLSecondFor 1000 1020197.32 26687.63 77425.67 1006458.80 88.10592118
Multiply1dWithTransposeAndUnrolled 1000 1072467.98 18017.41 15971.97 - 92.62010176
Multiply1dWithTranspose 1000 1368079.88 43202.23 44365.51 - 118.1496323
Multiply1d 1000 1771000.48 23732.90 22199.77 - 152.9465195
Multiply1dSharp 1000 2301483.30 45097.46 57033.84 2308407.40 198.7598896
Multiply2d 1000 3966262.18 79468.64 234315.07 - 342.5329369
Multiply1dDLLThirdFor 1000 4218862.42 79396.25 70382.71 4236403.45 364.3479101
Multiply1dWithTransposeAndUnrolled 1000 5368611.738 106780.2314 195253.691 - 463.6421555
MultiplyJaggedSharp 1000 10104715.06 175000.92 163695.97 - 872.659842
Multiply1dWithTranspose 1000 14726615.4 149368.5863 139719.4683 - 1271.814771
Multiply1d 1000 16993129.82 439113.3627 450937.0939 - 1467.554691