Slurm accounting data not loading to BigQuery
fdmalone opened this issue · 5 comments
Describe the bug
With v1.32.1 of the toolkit I've repeatedly hit an issue where the cluster fails to upload the Slurm accounting data to BigQuery. At some point the data simply stops appearing in the cloud dashboard. Running load_bq.py manually, I found:
Traceback (most recent call last):
File "load_bq.py", line 329, in <module>
main()
File "load_bq.py", line 305, in main
jobs = load_slurm_jobs(start, end)
File "load_bq.py", line 222, in load_slurm_jobs
job_rows = [
File "load_bq.py", line 223, in <listcomp>
make_job_row(job)
File "load_bq.py", line 177, in make_job_row
job_row = {
File "load_bq.py", line 178, in <dictcomp>
field_name: dict.get(converters, field.field_type)(job[field_name])
File "load_bq.py", line 40, in make_datetime
return datetime.strptime(time_string, SLURM_TIME_FORMAT).replace(
File "/usr/lib64/python3.8/_strptime.py", line 568, in _strptime_datetime
tt, fraction, gmtoff_fraction = _strptime(data_string, format)
File "/usr/lib64/python3.8/_strptime.py", line 349, in _strptime
raise ValueError("time data %r does not match format %r" %
ValueError: time data 'None' does not match format '%Y-%m-%dT%H:%M:%S'
which results from 'None' being passed in as the job's start time:
> sacct -j 18 --format=JobID,JobName,State,Start,End,Elapsed,CPUTime
JobID JobName State Start End Elapsed CPUTime
------------ ---------- ---------- ------------------- ------------------- ---------- ----------
18 interacti+ CANCELLED+ None 2024-07-09T17:46:30 00:00:00 00:00:00
My temporary workaround was to modify the script to filter out entries whose start time is None (see the sketch below), and then incrementally backfill the tables, but perhaps there is a better way.
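For reference, this is roughly the filtering I applied. It is only a sketch: the helper name and the field layout are my own guesses from the traceback, not the actual structure of load_bq.py.
from datetime import datetime, timezone

SLURM_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

def make_datetime(time_string):
    # Treat Slurm's literal 'None' / 'Unknown' timestamps as missing
    # instead of letting strptime raise a ValueError.
    if time_string in (None, "None", "Unknown"):
        return None
    return datetime.strptime(time_string, SLURM_TIME_FORMAT).replace(
        tzinfo=timezone.utc
    )

def filter_loadable_jobs(jobs):
    # Drop sacct records whose Start field was never set, e.g. jobs
    # cancelled before they ever started running.
    return [job for job in jobs if job.get("start") not in (None, "None")]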
Steps to reproduce
I've hit this issue on at least three clusters running this version of the toolkit, but I doubt it is easy to reproduce. It's odd that sacct reports the Start time as None at all.
Another issue is that the row insertion is not very robust: a single request can exceed BigQuery's maximum row count or message size. Ideally the inserts should be batched (granted, these limits are unlikely to be hit given the normal upload cadence); I only ran into it while incrementally backfilling a day at a time. A rough sketch of what I mean is below.
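For illustration, batching could look something like this, using the google-cloud-bigquery client; the helper name and chunk size are arbitrary choices of mine, not anything taken from load_bq.py.
from google.cloud import bigquery

def insert_rows_batched(client, table_id, rows, batch_size=500):
    # Send the rows in fixed-size chunks so no single streaming-insert
    # request exceeds BigQuery's per-request row-count / payload limits.
    errors = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        errors.extend(client.insert_rows_json(table_id, chunk))
    return errors

# e.g. insert_rows_batched(bigquery.Client(), "project.dataset.table", job_rows)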
@fdmalone Thank you for reporting and fixing the "None" issue!
Would you mind creating a separate issue for #2989 (comment)?
Closing, it looks like the issue was resolved and has made it to a release.