Recurrent failures of the GitHub Actions performance test
No doubt many have noticed the intermittent failures of the "Performance Monitor" test. These do not appear to have anything to do with MOM6 itself, but rather with the parsing of the perf report output.
python tools/parse_perf.py -f work/p0/opt/perf.data > work/p0/opt/profile.json
Traceback (most recent call last):
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 71, in <module>
main()
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 19, in main
profile = parse_perf_report(args.data)
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 63, in parse_perf_report
period = int(tokens[3])
ValueError: invalid literal for int() with base 10: 'std::char_traits<wchar_t>'
This is a very simple parser that makes a lot of assumptions about what it is reading. In this case, the output is expected to look something like this:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 147K of event 'cycles:u'
# Event count (approx.): 114907202161
#
# Overhead Symbol Period IPC [IPC Coverage]
# ........ .................................................................... ............ ....................
#
12.18% [.] mom_barotropic_mp_btstep_ 13995565662 - -
8.53% [.] diag_manager_mod_mp_diag_send_data_ 9804461284 - -
8.16% [.] mpp_domains_mod_mp_mpp_do_group_update_r8_ 9378762297 - -
7.95% [.] mom_vert_friction_mp_vertvisc_coef_ 9132010079 - -
6.43% [.] mom_hor_visc_mp_horizontal_viscosity_ 7389425771 - -
3.03% [.] mom_eos_wright_mp_int_density_dz_wright_ 3484215718 - -
3.03% [.] mom_continuity_ppm_mp_zonal_flux_layer_ 3477704292 - -
...
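For reference, here is a minimal sketch of the kind of whitespace-split parsing the script assumes (simplified; the real parse_perf.py differs in detail, and parse_symbol_line is a hypothetical name):

def parse_symbol_line(line):
    # Assumed layout: "<overhead>% [.] <symbol> <period> <IPC> <IPC coverage>"
    tokens = line.split()
    symbol = tokens[2]
    # This is the statement that fails: whenever the symbol itself contains
    # spaces, the period is no longer the fourth token.
    period = int(tokens[3])
    return symbol, period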
The reality is that perf output can be hard to predict. The performance metrics are usually unique to the CPU; AMD and Intel chips will rarely have the same metrics. Different platforms may have different headers or even different formats, and simply reading the fourth token of each line may not work at all.
Unfortunately, this is incredibly hard to replicate. I have a branch that tries to catch this and report the output, but I have so far been unable to reproduce the error.
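As a rough illustration (not the actual branch), the catch-and-report logic could look something like this, assuming the symbol lines are parsed one at a time:

import sys

def parse_symbol_line(line):
    tokens = line.split()
    try:
        period = int(tokens[3])
    except (IndexError, ValueError):
        # Report the offending line and its tokens to stderr so the failure
        # can be diagnosed from the CI log, then re-raise.
        print("parse_perf.py: Error extracting symbol count", file=sys.stderr)
        print("line: " + repr(line), file=sys.stderr)
        print("tokens: " + repr(tokens), file=sys.stderr)
        raise
    return tokens[2], period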
We got our first perf error log:
parse_perf.py: Error extracting symbol count
line: ' 0.00% [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() 250062 - - \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::basic_string<char,', 'std::char_traits<char>,', 'std::allocator<char>', '>::~basic_string()', '250062', '-', '-']
This is happening in the C++ library, and the parser handles the template syntax (basic_string<char, ...>) poorly: the spaces inside the template arguments split the symbol across several tokens, so the period is no longer where the parser expects it. The resolution of the destructor at the end does not help either. This could be fixed either by tracking < and > as delimiters, or perhaps by just index-counting from the opposite end of the line.
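A minimal sketch of the second option, assuming every symbol line ends with the Period, IPC and IPC-coverage fields:

def parse_symbol_line(line):
    tokens = line.split()
    # Count from the right: the last three tokens are always Period, IPC and
    # IPC coverage, no matter how many spaces the symbol contains.
    period = int(tokens[-3])
    # Everything between the "[.]" marker and the trailing fields is the symbol.
    symbol = ' '.join(tokens[2:-3])
    return symbol, period

For both of the lines reported here, tokens[-3] is '250062', so the period would be extracted correctly.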
A second error:
line: ' 0.00% [.] std::__cxx11::messages<char>::messages(unsigned long) 250062 - - \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::messages<char>::messages(unsigned', 'long)', '250062', '-', '-']
This one is due to a type (or type cast) in the output: messages<char>::messages(unsigned long), whose argument list contains a space, and which would also require tracking ( and ) as delimiters.
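If the delimiter-tracking route were taken instead, one sketch (a hypothetical helper, and deliberately naive about symbols such as operator< that contain unbalanced brackets) would be to re-join tokens while inside a <...> or (...) group:

def join_symbol_tokens(tokens):
    # Merge tokens that were split inside template or argument lists by
    # tracking bracket depth, so "messages(unsigned long)" stays one token.
    merged = []
    depth = 0
    for tok in tokens:
        if depth > 0:
            merged[-1] += ' ' + tok
        else:
            merged.append(tok)
        depth += tok.count('<') + tok.count('(')
        depth -= tok.count('>') + tok.count(')')
    return merged

After this pass, the period is once again the fourth token of both problem lines.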
C++ is the gift that keeps on giving.
I think that #632 should fix this issue.
Fixed by #632
Unfortunately it is not yet fixed... an error in the prototype leaked into the production script. #664 will fix this one.
This appears to have been fixed (although perhaps even more exotic perf output will strike again one day).