GoogleCloudPlatform/cluster-toolkit

Slurm accounting data not loading to BigQuery

fdmalone opened this issue · 5 comments

Describe the bug

With v1.32.1 of the toolkit I've had a recurring issue where the cluster fails to upload the Slurm accounting data to BigQuery. At some point the data would simply stop appearing in the Cloud dashboard. Manually running load_bq.py, I found:

Traceback (most recent call last):
  File "load_bq.py", line 329, in <module>
    main()
  File "load_bq.py", line 305, in main
    jobs = load_slurm_jobs(start, end)
  File "load_bq.py", line 222, in load_slurm_jobs
    job_rows = [
  File "load_bq.py", line 223, in <listcomp>
    make_job_row(job)
  File "load_bq.py", line 177, in make_job_row
    job_row = {
  File "load_bq.py", line 178, in <dictcomp>
    field_name: dict.get(converters, field.field_type)(job[field_name])
  File "load_bq.py", line 40, in make_datetime
    return datetime.strptime(time_string, SLURM_TIME_FORMAT).replace(
  File "/usr/lib64/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/lib64/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data 'None' does not match format '%Y-%m-%dT%H:%M:%S' 

which results from the literal string 'None' being passed in as the job's start time:

> sacct -j 18 --format=JobID,JobName,State,Start,End,Elapsed,CPUTime
JobID           JobName      State               Start                 End    Elapsed    CPUTime
------------ ---------- ---------- ------------------- ------------------- ---------- ----------
18           interacti+ CANCELLED+                None 2024-07-09T17:46:30   00:00:00   00:00:00
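
For reference, a guard in the converter itself would also avoid the crash. This is only a minimal sketch: the tzinfo handling and the nullable-column assumption are mine, not necessarily what load_bq.py actually does:

from datetime import datetime, timezone

SLURM_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

def make_datetime(time_string):
    # sacct reports the literal string 'None' for jobs that never started,
    # e.g. ones cancelled while still pending; returning None here (assuming
    # the BigQuery column is nullable) avoids the strptime ValueError.
    if time_string in (None, "None"):
        return None
    # tzinfo=utc is an assumption about what the original .replace(...) did
    return datetime.strptime(time_string, SLURM_TIME_FORMAT).replace(
        tzinfo=timezone.utc
    )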

My temporary workaround was to modify the script to filter out entries with 'None' timestamps (sketched below) and then incrementally backfill the tables, but perhaps there is a better way.
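
Roughly, the filtering looked like this (a minimal sketch; has_valid_times and the job-dict field names are hypothetical, inferred from the traceback):

def has_valid_times(job):
    # Skip jobs for which sacct reports 'None' timestamps, since
    # make_datetime() cannot parse them.
    return all(job.get(f) not in (None, "None") for f in ("start", "end"))

job_rows = [make_job_row(job) for job in jobs if has_valid_times(job)]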

Steps to reproduce

I've hit this issue on at least 3 clusters using this version of the toolkit, but I doubt it is easily reproducible. It's odd that sacct reports the Start time as None.

cc @cboneti, we chatted about this, but I only just got time to dig in a little.

Another issue is that the row insertion is not very robust: a single insert request can exceed BigQuery's limits on row count and payload size. Ideally the inserts should be batched (granted, these limits are unlikely to be hit at the normal upload cadence); I only hit them when incrementally backfilling a day at a time. A sketch of batched inserts follows.
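
A minimal sketch of batched streaming inserts, assuming the google-cloud-bigquery client; the chunk size and table name are placeholders, and load_bq.py may use a different insert path:

from google.cloud import bigquery

CHUNK_SIZE = 500  # assumption: comfortably under the per-request row/size limits

def insert_in_batches(client, table, rows, chunk_size=CHUNK_SIZE):
    # Split the rows into fixed-size chunks so no single streaming-insert
    # request exceeds BigQuery's per-request maximums.
    for i in range(0, len(rows), chunk_size):
        errors = client.insert_rows_json(table, rows[i : i + chunk_size])
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")

client = bigquery.Client()
insert_in_batches(client, "my-project.my_dataset.slurm_jobs", job_rows)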

@fdmalone Thank you for reporting and fixing the "None" issue!
Would you mind creating a separate issue for #2989 (comment)?

Closing; it looks like the issue was resolved and the fix has made it into a release.