StrawberryPerl/Perl-Dist-Strawberry

Performance degradation after strawberry perl 5.32.

Closed this issue · 14 comments

aero commented

Hi,
In a simple performance test, I found symptoms of performance degradation in versions after strawberry perl 5.32.
For example, the results of code from https://www.tek-tips.com/viewthread.cfm?qid=1119203

* strawberry perl 5.32.1.1
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.27 usr +  0.00 sys =  0.27 CPU) @ 187969.92/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.20 usr +  0.00 sys =  0.20 CPU) @ 246305.42/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  187970/s    --  -24%
Kevin 246305/s   31%    --

* strawberry perl 5.38.2.1
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.34 usr +  0.00 sys =  0.34 CPU) @ 145348.84/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.25 usr +  0.00 sys =  0.25 CPU) @ 200000.00/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  145349/s    --  -27%
Kevin 200000/s   38%    --
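
To put a number on the drop, the rates from the two reports above can be compared directly. This is only a small arithmetic sketch; the helper name `slowdown_pct` is made up for illustration, and the figures are copied from the output above (which Benchmark itself flags as coming from too few iterations).

```perl
use strict;
use warnings;

# Rates (iterations/s) copied from the two Benchmark reports above.
my %rate = (
    Fish  => { '5.32.1.1' => 187969.92, '5.38.2.1' => 145348.84 },
    Kevin => { '5.32.1.1' => 246305.42, '5.38.2.1' => 200000.00 },
);

# Hypothetical helper: percentage by which $new is slower than $old.
sub slowdown_pct {
    my ($old, $new) = @_;
    return 100 * (1 - $new / $old);
}

for my $name (sort keys %rate) {
    printf "%-5s %.1f%% slower on 5.38.2.1\n", $name,
        slowdown_pct($rate{$name}{'5.32.1.1'}, $rate{$name}{'5.38.2.1'});
}
```

With these inputs it reports roughly 22.7% for Fish and 18.8% for Kevin.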

[ gcc optimize option comparison from "perl -V"]

* strawberry perl 5.32.1.1
  Compiler:
    cc='gcc'
    ccflags =' -DWIN32 -DWIN64 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-s -O2'
    ...

*strawberry perl 5.38.2.1
   Compiler:
    cc='gcc'
    ccflags =' -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-Os'
    ...
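
The same settings can be read from inside a script rather than picked out of the full perl -V dump. A minimal sketch, assuming only the core Config module; the list of keys is just the ones being compared in this thread.

```perl
use strict;
use warnings;
use Config;

# Print only the build settings that are being compared here.
# %Config is populated from the configuration the running perl was built with.
for my $key (qw(cc gccversion optimize ccflags)) {
    printf "%-11s %s\n", "$key:", $Config{$key} // '(not set)';
}
```

For a single value, `perl -V:optimize` on the command line prints the same information.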

Different gcc optimization options affect this?

Different gcc optimization options affect this?

I don't actually know, but I think there will be little difference between -Os and -O2 optimizations.
With the newer mingw-w64 compilers, there are problems with -O2 optimization on x64 builds, and the GNUmakefile was therefore altered to specify -Os.

However, on 32-bit (x86) builds, there are no problems with -O2 optimization.
I haven't looked at what the SP 32-bit builds specify, but they could (and therefore probably should) optimize to level -O2 if they aren't already doing that.

That's something I could (and therefore probably should) have mentioned earlier ;-)
...... and I certainly would have mentioned it if I had not forgotten all about it.

UPDATE: Comments in the GNUmakefile about this refer to Perl/perl5#20081

I suspect this is due to a change in perl itself. There looks to be a substantial speedup late in the 5.35 series when running on Ubuntu via WSL2 (see results and modified code below). This roughly corresponds with the slowdown on Windows, noting we don't have an SP 5.34 available.

Perhaps an optimisation has been applied that works well under unices but not well under Windows.

@sisyphus - do you have some VS compiled perls spanning these versions? That would help determine if it is mingw related.

Some other examples of windows performance issues are Perl/perl5#21654 and Perl/perl5#21360. These affect different version spans but maybe there are similar root causes.

perlbrew exec perl gh160.pl
perl-5.38.0
==========
          Rate kevin  fish
kevin 455106/s    --   -9%
fish  500262/s   10%    --

perl-5.36.0
==========
          Rate kevin  fish
kevin 465891/s    --   -8%
fish  504119/s    8%    --

perl-5.35.11
==========
          Rate  fish kevin
fish  139910/s    --  -20%
kevin 174831/s   25%    --

perl-5.34.1
==========
          Rate  fish kevin
fish  127064/s    --  -33%
kevin 190860/s   50%    --
use strict;
use warnings;
use Benchmark qw /cmpthese/;

my @x = ("in the", "skipping along");
my @y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");
# Precompile regexes. Note: assigning to $_ inside map also replaces
# the strings in @x with the compiled regexes.
my @regexen = map { $_ = qr/$_/ } @x;


#print kevin();
#print fish();


cmpthese -2, {
    kevin => \&kevin,
    fish  => \&fish,
};


sub kevin {
    my @arr;
    for (@x) {
       foreach my $line (@y) {
          next if ($line !~ /$_/);
          push @arr, "$line\n";
       }
    }
    return @arr;
}

sub fish {
    my @arr;
    Y: foreach my $line (@y) {
      foreach my $re (@regexen) {
        if ( $line =~ $re ) {
          push @arr, "$line\n";
          next Y;
        }
      } 
    }
    return @arr;
}
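
One detail of the script above that is easy to miss: inside map, $_ is an alias to each element of @x, so `$_ = qr/$_/` rewrites @x itself. The sketch below only demonstrates that aliasing behaviour; the names `@x2` and `@regexen2` are made up for the comparison. This may matter when comparing kevin and fish, since kevin then interpolates already-compiled patterns rather than plain strings.

```perl
use strict;
use warnings;

my @x = ("in the", "skipping along");

# Assigning to $_ inside map writes through to the elements of @x.
my @regexen = map { $_ = qr/$_/ } @x;
print "after assigning map: ", (ref $x[0] || 'plain string'), "\n";

my @x2 = ("in the", "skipping along");

# Returning qr/$_/ without assigning leaves the source array untouched.
my @regexen2 = map { qr/$_/ } @x2;
print "after plain map:     ", (ref $x2[0] || 'plain string'), "\n";
```

The first line reports `Regexp` and the second `plain string`.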

Using the script posted (above) by @shawnlaffan.

For perl-5.38.0:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname};print $Config{ccversion};"
5.038000
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench.pl
          Rate kevin  fish
kevin 286903/s    --  -16%
fish  340481/s   19%    --

For perl-5.36.1:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname};print $Config{ccversion};"
5.036001
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench.pl
          Rate kevin  fish
kevin 305177/s    --  -13%
fish  351348/s   15%    --

For perl-5.36.0:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname};print $Config{ccversion};"
5.036000
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench.pl
          Rate kevin  fish
kevin 304766/s    --  -14%
fish  352462/s   16%    --

For perl-5.32.1:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname};print $Config{ccversion};"
5.032001
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench.pl
          Rate kevin  fish
kevin 300280/s    --   -8%
fish  325420/s    8%    --

Using the script linked to by @aero

For perl-5.32.1:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname}; print $Config{ccversion};"
5.032001
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench0.pl
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.20 usr +  0.00 sys =  0.20 CPU) @ 246305.42/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.14 usr +  0.00 sys =  0.14 CPU) @ 354609.93/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  246305/s    --  -31%
Kevin 354610/s   44%    --

For perl-5.38.0:

D:\pscrpt\msvc>perl -MConfig -le "print $]; print $Config{archname}; print $Config{ccversion};"
5.038000
MSWin32-x64-multi-thread
19.33.31630

D:\pscrpt\msvc>perl bench0.pl
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  1 wallclock secs ( 0.20 usr +  0.00 sys =  0.20 CPU) @ 245098.04/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.14 usr +  0.00 sys =  0.14 CPU) @ 357142.86/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  245098/s    --  -31%
Kevin 357143/s   46%    --

Not quite sure what that demonstrates.
@shawnlaffan, @aero, if there's any more (msvc-built) perl versions you'd like results for, please specify.
I'm about to build such a perl-5.34.3 anyway, because I don't have a perl-5.34 built with this (VS 2022) toolset.

Thanks @sisyphus. That suggests there is no meaningful difference between versions when using MSVC so perhaps it is a gcc issue.

You don't happen to have any 5.36 or 5.38 perls built using the gcc-8 that comes with SP 5.32?

aero commented

I have my own build of perl-5.38.0 compiled with the '-s -O2' gcc optimization option (same gcc version, 13.1.0).
(But perl-5.38.2 could not be compiled with the '-s -O2' option; I got a GNUmakefile error.)

[perl 5.38.0 performance comparison according to different options]

* Original strawberry perl 5.38.0 binary (-Os)
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  1 wallclock secs ( 0.34 usr +  0.00 sys =  0.34 CPU) @ 145348.84/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.25 usr +  0.00 sys =  0.25 CPU) @ 200000.00/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  145349/s    --  -27%
Kevin 200000/s   38%    --

* My own build perl 5.38.0 (-s -O2)
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  1 wallclock secs ( 0.27 usr +  0.00 sys =  0.27 CPU) @ 187969.92/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.19 usr +  0.00 sys =  0.19 CPU) @ 267379.68/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  187970/s    --  -30%
Kevin 267380/s   42%    --

It appears that differences in optimization options make a difference in performance.

Thanks @aero.

Re-reading the issue @sisyphus linked to shows this comment in which compilation works with nearly all of the -O2 flags:

OPTIMIZE = -Os -falign-functions -falign-jumps -falign-labels -falign-loops -freorder-blocks -freorder-blocks-algorithm=stc -freorder-blocks-and-partition

Does that compile for you? And if so, what is the performance like?

The optimisation flags might be simplified to:

OPTIMIZE = -O2 -finline-functions -fno-prefetch-loop-arrays

Or maybe even this since inline-functions seems to be a -Os flag:

OPTIMIZE = -O2 -fno-prefetch-loop-arrays

I have my own build of perl-5.38.0 compiled with the '-s -O2' gcc optimization option (same gcc version, 13.1.0).

Could you provide the perl -V output of that build ?

Does that compile for you? And if so, what is the performance like?

I ended up compiling a 5.38.2 with the extra flags. Results below are from a Windows 10 desktop.

Script has been modified to output more info.

The more optimised 5.38.2 is slower than 5.38.0 in the run shown here, and also on average, but the difference across multiple runs is small.

v5.32.1
optimize: -s -O2
          Rate kevin  fish
kevin 237889/s    --  -16%
fish  282288/s   19%    --

v5.38.0
optimize: -Os
          Rate kevin  fish
kevin 218270/s    --  -12%
fish  248032/s   14%    --

v5.38.2
optimize: -Os -falign-functions -falign-jumps -falign-labels -falign-loops -freorder-blocks -freorder-blocks-algorithm=stc -freorder-blocks-and-partition
          Rate kevin  fish
kevin 209478/s    --   -3%
fish  215027/s    3%    --
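
Since the differences here are close to the run-to-run noise, it can help to repeat a timing several times and look at the spread. The sketch below is not the script used above: `rates` is a hypothetical helper, the test sub is a simplified stand-in, and the 0.5 second / 3 run settings are arbitrary. It uses `countit` from the core Benchmark module, which runs the code for at least the requested CPU time and so avoids the "too few iterations" warning.

```perl
use strict;
use warnings;
use Benchmark qw(countit);

# Hypothetical helper: run $code for about $secs CPU seconds, $runs times,
# and return the iterations-per-second rate of each run.
sub rates {
    my ($code, $secs, $runs) = @_;
    my @rates;
    for (1 .. $runs) {
        my $t   = countit($secs, $code);   # Benchmark object for this run
        my $cpu = $t->cpu_p;               # user + system CPU of this process
        push @rates, $t->iters / $cpu if $cpu > 0;
    }
    return @rates;
}

# Simplified stand-in for the kevin/fish subs.
my @y  = ("we walk in the park", "we walk in the dark");
my $re = qr/in the/;
my @r  = rates(sub { my @hits = grep { $_ =~ $re } @y }, 0.5, 3);

my ($min, $max) = (sort { $a <=> $b } @r)[0, -1];
printf "runs: %d  min: %.0f/s  max: %.0f/s  spread: %.1f%%\n",
    scalar @r, $min, $max, 100 * ($max - $min) / $min;
```

If the spread between runs of one perl is about as large as the gap between two builds, the comparison probably needs longer runs or more repetitions.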

aero commented

@sisyphus

I have my own build of perl-5.38.0 compiled with the '-s -O2' gcc optimization option (same gcc version, 13.1.0).

Could you provide the perl -V output of that build ?

mybuild.bat

set IO_COMPRESS_SKIP_STDIN_TESTS=1
set IPC_CMD_SKIP_TESTS=1
gmake -j4 INST_TOP=c:\perl-5.38.0-64bit\perl CCHOME=C:\strawberry-perl-5.38-64bit\c USE_MINGW_ANSI_STDIO=define USE_64_BIT_INT=define OPTIMIZE="-s -O2" man1dir=none man3dir=none html1dir=none html3dir=none INSTALLSITESCRIPT=c:\perl-5.38.0-64bit\perl\site\bin
gmake install

output

C:\perl-5.38.0-64bit\perl\bin>perl -V
Summary of my perl5 (revision 5 version 38 subversion 0) configuration:

  Platform:
    osname=MSWin32
    osvers=10.0.19045.3758
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='gcc'
    ccflags =' -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-s -O2'
    cppflags='-DWIN32'
    ccversion=''
    gccversion='13.1.0'
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='long long'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='g++'
    ldflags ='-s -L"c:\perl-5.38.0-64bit\perl\lib\CORE" -L"C:\strawberry-perl-5.38-64bit\c\lib" -L"C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib" -L"C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0"'
    libpth=C:\strawberry-perl-5.38-64bit\c\lib C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0
    libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=
    so=dll
    useshrplib=true
    libperl=libperl538.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-shared -s -L"c:\perl-5.38.0-64bit\perl\lib\CORE" -L"C:\strawberry-perl-5.38-64bit\c\lib" -L"C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib" -L"C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0"'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
  Built under MSWin32
  Compiled at Dec  9 2023 19:51:21
  @INC:
    C:/perl-5.38.0-64bit/perl/site/lib
    C:/perl-5.38.0-64bit/perl/lib


C:\perl-5.38.0-64bit\perl\bin>perl \temp\bench.pl
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.25 usr +  0.00 sys =  0.25 CPU) @ 200000.00/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.17 usr +  0.00 sys =  0.17 CPU) @ 290697.67/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  200000/s    --  -31%
Kevin 290698/s   45%    --
aero commented

I don't know why it didn't compile before.
I tried compiling perl-5.38.2 again and it worked fine.
Performance improved.

mybuild.bat

set IO_COMPRESS_SKIP_STDIN_TESTS=1
set IPC_CMD_SKIP_TESTS=1
gmake -j4 INST_TOP=c:\perl-5.38.2-64bit\perl CCHOME=C:\strawberry-perl-5.38-64bit\c USE_MINGW_ANSI_STDIO=define USE_64_BIT_INT=define OPTIMIZE="-s -O2" man1dir=none man3dir=none html1dir=none html3dir=none INSTALLSITESCRIPT=c:\perl-5.38.2-64bit\perl\site\bin
gmake install

output

>perl -V
Summary of my perl5 (revision 5 version 38 subversion 2) configuration:

  Platform:
    osname=MSWin32
    osvers=10.0.19045.3758
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='gcc'
    ccflags =' -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-s -O2'
    cppflags='-DWIN32'
    ccversion=''
    gccversion='13.1.0'
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='long long'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='g++'
    ldflags ='-s -L"c:\perl-5.38.2-64bit\perl\lib\CORE" -L"C:\strawberry-perl-5.38-64bit\c\lib" -L"C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib" -L"C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0"'
    libpth=C:\strawberry-perl-5.38-64bit\c\lib C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0
    libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=
    so=dll
    useshrplib=true
    libperl=libperl538.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-shared -s -L"c:\perl-5.38.2-64bit\perl\lib\CORE" -L"C:\strawberry-perl-5.38-64bit\c\lib" -L"C:\strawberry-perl-5.38-64bit\c\x86_64-w64-mingw32\lib" -L"C:\strawberry-perl-5.38-64bit\c\lib\gcc\x86_64-w64-mingw32\13.1.0"'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
  Built under MSWin32
  Compiled at Dec  9 2023 20:23:18
  @INC:
    C:/perl-5.38.2-64bit/perl/site/lib
    C:/perl-5.38.2-64bit/perl/lib

[Benchmark]-----------------------------------------------------------------------------
* This build
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.27 usr +  0.00 sys =  0.27 CPU) @ 187969.92/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  1 wallclock secs ( 0.17 usr +  0.00 sys =  0.17 CPU) @ 290697.67/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  187970/s    --  -35%
Kevin 290698/s   55%    --

* Original strawberry perl 5.38.2
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.34 usr +  0.00 sys =  0.34 CPU) @ 145348.84/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.23 usr +  0.00 sys =  0.23 CPU) @ 213675.21/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  145349/s    --  -32%
Kevin 213675/s   47%    --

mybuild.bat

Thanks, @aero.
I have built perl-5.38.0 with OPTIMIZE="-s -O2" and found that to "work" - though a number of test scripts reported failures during gmake test.
However, I also found things to be very much the same with perl-5.38.2. (What was the error you got ?)

The "-s -O2" optimization does (despite the failing tests) provide the best performance of both of the benchmarking scripts that have been presented here.
However, of course, we cannot advocate that "-s -O2" should become the default until we can solve the issue of those failing tests.
IMO, the performance loss with "-Os" is not excessive - though I admit that it's a little more significant than I anticipated.

@aero, I encourage you to instead concentrate on looking at the current perl-5.39.x devel releases as they are released - with a view to investigating how they might be modified to improve the upcoming perl-5.40.0 release.
Sure, there's something to be learnt from looking back, and seeing problems with what was done in the past - but the real advances will be made by looking at what's coming next.

You could do that using the 5.38.0 toolchain that ships with SP-5.38.0.
However, I think it would be better to switch to a gcc-13.2.0 UCRT toolchain provided by https://winlibs.com.
(There's also a gcc-14.0.0 pre-release that threw up no new issues when I tried it recently.)
UCRT is what Visual Studio toolchains use; it's less troublesome than MSVCRT (especially wrt locales); and I'd be surprised if SP-5.40 is not built using a "UCRT" toolchain.

If you really need a perl that is best optimized to run those benchmarking tests, then you should probably use a static perl (ie built without threads - USE_MULTI=undef, USE_ITHREADS=undef and USE_IMP_SYS=undef).
Such builds disable threads and don't provide the fork() function. They are therefore considered unfit for general usage - so don't expect StrawberryPerl to ever provide them.

As regards these benchmarking scripts on perl-5.38.0 (MSWin32-x64-multi-thread), I found that the best performing configuration was mingw-w64-built with "-s -O2", followed by mingw-w64-built with "-Os", followed by msvc143-built with "-O1 -Zi -GL -fp:precise".
But I didn't think that any of the differences were outstandingly bad or good.

I've started experimenting with UCRT builds. I have a set of external libs but have yet to try building perl (the recent perl releases took priority).

Issues are being tracked under #152

Performance improved.

@aero, I've checked to see how much further improvement you'll see with an unthreaded build.
I have an unthreaded build of 5.38.0 ('-s -O2') and a threaded build of 5.38.0 ('-s -O2')

Threaded:

Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.17 usr +  0.00 sys =  0.17 CPU) @ 290697.67/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.11 usr +  0.00 sys =  0.11 CPU) @ 458715.60/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  290698/s    --  -37%
Kevin 458716/s   58%    --

Unthreaded:

Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.14 usr +  0.00 sys =  0.14 CPU) @ 357142.86/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  1 wallclock secs ( 0.09 usr +  0.00 sys =  0.09 CPU) @ 537634.41/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  357143/s    --  -34%
Kevin 537634/s   51%    --

The other advantage with that unthreaded 5.38.0 build (over that threaded 5.38.0 build) is that it passes all tests.
You should run gmake test on your 5.38.x ('-s -O2') builds, just so you know how horribly broken they are.
For me, repeated runs of gmake test can throw up different failures - ie there are some test scripts that don't fail every time.
If you're interested in trying the unthreaded build, just add USE_MULTI=undef USE_ITHREADS=undef USE_IMP_SYS=undef to your existing args.
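
To confirm which kind of build a given perl.exe is before benchmarking it, the relevant settings can be checked from %Config. A minimal sketch, assuming only the core Config module; the keys are the ones shown in the perl -V output earlier in this thread.

```perl
use strict;
use warnings;
use Config;

# %Config records how the running perl was built.
my $threaded = ($Config{useithreads}     // '') eq 'define';
my $multi    = ($Config{usemultiplicity} // '') eq 'define';

printf "perl %vd: %s, multiplicity %s\n",
    $^V,
    $threaded ? 'threaded' : 'unthreaded',
    $multi    ? 'on'       : 'off';
```

A build made with USE_MULTI=undef USE_ITHREADS=undef USE_IMP_SYS=undef would be expected to report "unthreaded, multiplicity off".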

For the record, the same script on the same machine, using StrawberryPerl-5.38.0 ('-Os') produced:

Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  0 wallclock secs ( 0.19 usr +  0.00 sys =  0.19 CPU) @ 265957.45/s (n=50000)
            (warning: too few iterations for a reliable count)
     Kevin:  0 wallclock secs ( 0.12 usr +  0.00 sys =  0.12 CPU) @ 400000.00/s (n=50000)
            (warning: too few iterations for a reliable count)
          Rate  Fish Kevin
Fish  265957/s    --  -34%
Kevin 400000/s   50%    --

AFAIK, the only difference between 5.38.0 and 5.38.2 is that 5.38.2 includes 2 security fixes.
You should find very little (if any) difference in performance between 5.38.0 and 5.38.2 that were built with the same optimization level.
(Same goes for 5.34.3 and 5.36.3 - all that has changed is the inclusion of the 2 security fixes.)