Recurrent failures of the GitHub Actions performance test
No doubt many have noticed the intermittent failures of the "Performance Monitor" test. These do not appear to have anything to do with MOM6 itself, but rather with the parsing of the perf report output.
python tools/parse_perf.py -f work/p0/opt/perf.data > work/p0/opt/profile.json
Traceback (most recent call last):
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 71, in <module>
main()
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 19, in main
profile = parse_perf_report(args.data)
File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 63, in parse_perf_report
period = int(tokens[3])
ValueError: invalid literal for int() with base 10: 'std::char_traits<wchar_t>'
This is a very simple parser that makes a lot of assumptions about what it is reading. In this case, the output is expected to look something like this:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 147K of event 'cycles:u'
# Event count (approx.): 114907202161
#
# Overhead Symbol Period IPC [IPC Coverage]
# ........ .................................................................... ............ ....................
#
12.18% [.] mom_barotropic_mp_btstep_ 13995565662 - -
8.53% [.] diag_manager_mod_mp_diag_send_data_ 9804461284 - -
8.16% [.] mpp_domains_mod_mp_mpp_do_group_update_r8_ 9378762297 - -
7.95% [.] mom_vert_friction_mp_vertvisc_coef_ 9132010079 - -
6.43% [.] mom_hor_visc_mp_horizontal_viscosity_ 7389425771 - -
3.03% [.] mom_eos_wright_mp_int_density_dz_wright_ 3484215718 - -
3.03% [.] mom_continuity_ppm_mp_zonal_flux_layer_ 3477704292 - -
...
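For reference, here is a minimal sketch of the kind of whitespace-split parsing the script assumes (simplified; the real parse_perf.py differs in detail, and parse_symbol_line is a hypothetical name):

def parse_symbol_line(line):
    # Assumed layout: "<overhead>% [.] <symbol> <period> <IPC> <IPC coverage>"
    tokens = line.split()
    symbol = tokens[2]
    # This is the statement that fails: whenever the symbol itself contains
    # spaces, the period is no longer the fourth token.
    period = int(tokens[3])
    return symbol, period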
The reality is that perf output can be hard to predict. The performance metrics are usually unique to the CPU; AMD and Intel chips will rarely have the same metrics. Different platforms may have different headers or even different formats, and simply reading the fourth token of each line may not work at all.
Unfortunately, this is incredibly hard to replicate. I have a branch that tries to catch this and report the output, but I have so far been unable to reproduce the error.
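As a rough illustration (not the actual branch), the catch-and-report logic could look something like this, assuming the symbol lines are parsed one at a time:

import sys

def parse_symbol_line(line):
    tokens = line.split()
    try:
        period = int(tokens[3])
    except (IndexError, ValueError):
        # Report the offending line and its tokens to stderr so the failure
        # can be diagnosed from the CI log, then re-raise.
        print("parse_perf.py: Error extracting symbol count", file=sys.stderr)
        print("line: " + repr(line), file=sys.stderr)
        print("tokens: " + repr(tokens), file=sys.stderr)
        raise
    return tokens[2], period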
We got our first perf error log:
parse_perf.py: Error extracting symbol count
line: ' 0.00% [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() 250062 - - \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::basic_string<char,', 'std::char_traits<char>,', 'std::allocator<char>', '>::~basic_string()', '250062', '-', '-']
This is happening in the C++ library, and the parser handles the template syntax (basic_string<char, ...>) poorly: the spaces inside the template arguments split the symbol across several tokens, so the period is no longer where the parser expects it. The resolution of the destructor at the end does not help either. This could be fixed either by tracking < and > as delimiters, or perhaps by just index-counting from the opposite end of the line.
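A minimal sketch of the second option, assuming every symbol line ends with the Period, IPC and IPC-coverage fields:

def parse_symbol_line(line):
    tokens = line.split()
    # Count from the right: the last three tokens are always Period, IPC and
    # IPC coverage, no matter how many spaces the symbol contains.
    period = int(tokens[-3])
    # Everything between the "[.]" marker and the trailing fields is the symbol.
    symbol = ' '.join(tokens[2:-3])
    return symbol, period

For both of the lines reported here, tokens[-3] is '250062', so the period would be extracted correctly.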
A second error:
line: ' 0.00% [.] std::__cxx11::messages<char>::messages(unsigned long) 250062 - - \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::messages<char>::messages(unsigned', 'long)', '250062', '-', '-']
This one is due to a type (or type cast) in the output: messages<char>::messages(unsigned long), whose argument list contains a space, and which would also require tracking ( and ) as delimiters.
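If the delimiter-tracking route were taken instead, one sketch (a hypothetical helper, and deliberately naive about symbols such as operator< that contain unbalanced brackets) would be to re-join tokens while inside a <...> or (...) group:

def join_symbol_tokens(tokens):
    # Merge tokens that were split inside template or argument lists by
    # tracking bracket depth, so "messages(unsigned long)" stays one token.
    merged = []
    depth = 0
    for tok in tokens:
        if depth > 0:
            merged[-1] += ' ' + tok
        else:
            merged.append(tok)
        depth += tok.count('<') + tok.count('(')
        depth -= tok.count('>') + tok.count(')')
    return merged

After this pass, the period is once again the fourth token of both problem lines.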
C++ is the gift that keeps on giving.
I think that #632 should fix this issue.
Fixed by #632
Unfortunately it is not yet fixed... an error in the prototype leaked into the production script. #664 will fix this one.
This appears to have been fixed (although perhaps even more exotic perf output will strike again one day).