openwall/john-packages

Performance regression in latest Docker image?

claudioandre-br opened this issue · 3 comments

I noticed a large performance drop when testing on CircleCI. On review, I can't reproduce the issue locally, but I can in the cloud. The results below are from Microsoft's cloud.

Latest version:

$ docker run -it ghcr.io/openwall/john:latest best --test=10 --format=SHA512crypt
 best --test=10 --format=SHA512crypt
Sorry, AVX512BW is required for this build
Sorry, AVX512F is required for this build
Will use /john/run/john-avx2
Will run 8 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    3280 c/s real, 682 c/s virtual

Previous version:

$ docker run -it ghcr.io/openwall/john:v1.9.0J2_J36.7 best --test=10 --format=SHA512crypt
 best --test=10 --format=SHA512crypt
Sorry, AVX512BW is required for this build
Sorry, AVX512F is required for this build
Will use /john/run/john-avx2
Will run 8 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    3319 c/s real, 687 c/s virtual

Locally:

$ docker run --rm ghcr.io/openwall/john:latest best --test=10 --format=SHA512crypt
best --test=10 --format=SHA512crypt
Sorry, AVX512BW is required for this build
Sorry, AVX512F is required for this build
Will use /john/run/john-avx2
Will run 8 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	2129 c/s real, 286 c/s virtual
$ docker run --rm ghcr.io/openwall/john:v1.9.0J2 best --test=10 --format=SHA512crypt
best --test=10 --format=SHA512crypt
Sorry, AVX512BW is required for this build
Sorry, AVX512F is required for this build
Will use /john/run/john-avx2
Will run 8 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	2100 c/s real, 284 c/s virtual

CircleCI now (this is AWS):

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	5263 c/s real, 2693 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	2793 c/s real, 2793 c/s virtual

A week ago:

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	6718 c/s real, 3362 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3486 c/s real, 3486 c/s virtual

  • Is it AVX512 related? It seems not.
  • The new image uses make strip. Does that matter?
  • Is the cloud particularly busy today?
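
To put the questions above in proportion, here is a minimal Python sketch that computes the relative change between the older and newer "c/s real" figures quoted in this report. The helper name `relative_change` and the run labels are illustrative only; the numbers are copied from the logs above.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# (older c/s, newer c/s) pairs taken from the "Raw: ... c/s real" lines above.
runs = {
    "Microsoft cloud, AVX2, 8 threads": (3319, 3280),
    "local, AVX2, 8 threads": (2100, 2129),
    "CircleCI, AVX512BW, 2 threads": (6718, 5263),
    "CircleCI, AVX512BW, 1 thread": (3486, 2793),
}

for label, (before, after) in runs.items():
    print(f"{label}: {relative_change(before, after):+.1f}%")
```

The Docker runs on the same host differ by roughly one percent, while the two CircleCI runs a week apart differ by about a fifth.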

@claudioandre-br I don't think we should worry about this. CircleCI may well be over-booking CPUs. It and AWS may also use systems with different CPUs and clock rates, and memory bandwidth is shared with other VMs (although most of our usage is not memory-bandwidth bound, so this probably does not explain the large performance difference on CircleCI).
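
One rough way to look for signs of over-booking in the existing logs is to compare the "real" and "virtual" figures. The sketch below assumes that John's "virtual" c/s is based on CPU time summed over all threads while "real" c/s is based on wall-clock time, so their ratio approximates how many CPUs the process actually kept busy. The helper name `busy_cpus` is hypothetical, and the numbers are copied from the logs in this issue.

```python
def busy_cpus(real_cs: float, virtual_cs: float) -> float:
    """Approximate average number of busy CPUs.

    Assumption: virtual c/s = work / CPU seconds, real c/s = work / wall seconds,
    so real / virtual = CPU seconds per wall second.
    """
    return real_cs / virtual_cs


# (label, OpenMP threads, c/s real, c/s virtual) from the logs above.
runs = [
    ("Microsoft cloud, latest", 8, 3280, 682),
    ("local, latest", 8, 2129, 286),
    ("CircleCI now", 2, 5263, 2693),
    ("CircleCI a week ago", 2, 6718, 3362),
]

for label, threads, real, virtual in runs:
    busy = busy_cpus(real, virtual)
    print(f"{label}: {busy:.2f} of {threads} threads busy ({busy / threads:.0%})")
```

A ratio well below the thread count would suggest the threads spent part of the run not executing, although this alone cannot say why.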

Since I can't do anything right now, I'll wait for the next package build/release and test the Docker image again then. If the problem persists, we can think about how to test this.

For now, this is a reminder.

Looks like everything is fine. I now get better than 7,000 c/s (an adequate figure IMO).

Version: 1.9.0-jumbo-1+bleeding-460f98c 2021-12-23 22:32:54 -0300

[...]

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE2 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1190 c/s real, 600 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE2 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	593 c/s real, 594 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSSE3 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1135 c/s real, 571 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSSE3 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	587 c/s real, 587 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1174 c/s real, 589 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	618 c/s real, 618 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1178 c/s real, 589 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	615 c/s real, 615 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1700 c/s real, 851 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	875 c/s real, 874 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3123 c/s real, 1588 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1633 c/s real, 1633 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512F 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	7438 c/s real, 3790 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512F 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3916 c/s real, 3916 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	7723 c/s real, 3865 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3985 c/s real, 3985 c/s virtual
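
The paired 2-thread and single-thread runs above also allow a quick check of OpenMP scaling. The following Python sketch parses benchmark output of this shape and reports the 2-thread speed as a fraction of twice the single-thread speed. The regular expressions are written against the log lines shown in this issue only, and the function names `parse_benchmarks` and `scaling_efficiency` are hypothetical.

```python
import re

# Patterns match the log format shown in this issue; other formats may differ.
BENCH_RE = re.compile(
    r"Benchmarking: .*\[(?P<impl>[^\]]+)\]\.\.\. (?:\((?P<threads>\d+)xOMP\) )?DONE"
)
RAW_RE = re.compile(r"Raw:\s+(?P<real>\d+) c/s real, (?P<virtual>\d+) c/s virtual")


def parse_benchmarks(text):
    """Return a list of (implementation, threads, real c/s, virtual c/s) tuples."""
    results = []
    current = None
    for line in text.splitlines():
        match = BENCH_RE.search(line)
        if match:
            # A missing "(NxOMP)" marker is treated as a single-thread run.
            current = (match.group("impl"), int(match.group("threads") or 1))
            continue
        match = RAW_RE.search(line)
        if match and current:
            results.append(
                (*current, int(match.group("real")), int(match.group("virtual")))
            )
            current = None
    return results


def scaling_efficiency(multi_cs, single_cs, threads):
    """Multi-thread speed as a fraction of ideal linear scaling."""
    return multi_cs / (threads * single_cs)


SAMPLE = """\
Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:\t7723 c/s real, 3865 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:\t3985 c/s real, 3985 c/s virtual
"""

multi, single = parse_benchmarks(SAMPLE)
print(f"{multi[0]}: {scaling_efficiency(multi[2], single[2], multi[1]):.2f}")
```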

Jan 09 (7723 c/s)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 4
microcode	: 0x2006b06
cpu MHz		: 2999.998
cache size	: 25344 KB

Dec 27 (5303 c/s)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 4
microcode	: 0x2006b06
cpu MHz		: 3399.953
cache size	: 25344 KB

Dec 20 (6718 c/s)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 4
microcode	: 0x2006b06
cpu MHz		: 3395.094
cache size	: 25344 KB
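
As a rough cross-check of the three snapshots above, the Python sketch below divides each run's c/s figure by the reported "cpu MHz". The reported value is presumably an instantaneous reading and may not reflect the clock rate during the benchmark, so this is only indicative. The helper name `cs_per_mhz` is hypothetical.

```python
def cs_per_mhz(cs: float, mhz: float) -> float:
    """Benchmark speed normalized by the reported clock rate."""
    return cs / mhz


# (date, c/s, reported cpu MHz) from the three snapshots above.
snapshots = [
    ("Dec 20", 6718, 3395.094),
    ("Dec 27", 5303, 3399.953),
    ("Jan 09", 7723, 2999.998),
]

for date, cs, mhz in snapshots:
    print(f"{date}: {cs_per_mhz(cs, mhz):.2f} c/s per MHz")
```

The CPU model, stepping, and microcode are identical in all three snapshots, and the fastest run reports the lowest clock rate, so the reported frequency alone does not account for the differences.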