desihub/desiBackup

add fuji and guadalupe to desi.json

Closed this issue · 18 comments

Please add the fuji and guadalupe reductions to desi.json and back them up to HPSS. They likely could follow the same structure as everest for how directories are split into individual htar files.

Please also add checksum files to these productions prior to backup to tape.

Acknowledged. I was just going to mention that today in fact.

While working on this, I've turned up a few empty directories. Listed below are the empty directories in fuji that are not in the exposures/ directory. The exposures/ directory has many empty directories.

fuji/healpix/sv1/other/26
fuji/healpix/sv1/bright/115
fuji/healpix/sv1/bright/101
fuji/run/scripts/night/20210606
fuji/run/scripts/night/20210616
fuji/run/scripts/night/20210708
fuji/run/scripts/night/20210626
fuji/run/scripts/night/20210602
fuji/run/scripts/night/20210612
fuji/run/scripts/night/20210530
fuji/run/scripts/night/20210515
fuji/run/scripts/night/20210617
fuji/run/scripts/night/20210604
fuji/run/scripts/night/20210627
fuji/run/scripts/night/20210614
fuji/run/scripts/night/20210706
fuji/run/scripts/night/20210531
fuji/run/scripts/night/20210609
fuji/run/scripts/night/20210620
fuji/run/scripts/night/20210619
fuji/run/scripts/night/20210629
fuji/run/scripts/night/20210605
fuji/run/scripts/night/20210523
fuji/run/scripts/night/20210615
fuji/run/scripts/night/20210704
fuji/run/scripts/night/20210528
fuji/run/scripts/night/20210621
fuji/run/scripts/night/20210607
fuji/run/scripts/night/20210514
fuji/run/scripts/night/20210709
fuji/run/scripts/night/20210603
fuji/run/scripts/night/20210613
fuji/run/scripts/night/20210705
fuji/run/scripts/night/20210516
fuji/run/scripts/night/20210608
fuji/run/scripts/night/20210618
fuji/run/scripts/night/20210628
fuji/run/scripts/night/20210522
fuji/run/scripts/night/20210707
fuji/run/scripts/night/20210601
fuji/run/scripts/night/20210611

guadalupe has fewer empty directories:

guadalupe/exposures/20210606/00091321
guadalupe/exposures/20210616/00094843
guadalupe/exposures/20210708/00097886
guadalupe/exposures/20210626/00096129
guadalupe/exposures/20210602/00090702
guadalupe/exposures/20210612/00093422
guadalupe/exposures/20210530/00090295
guadalupe/exposures/20210515/00088478
guadalupe/exposures/20210617/00095030
guadalupe/exposures/20210604/00091120
guadalupe/exposures/20210627/00096241
guadalupe/exposures/20210614/00094432
guadalupe/exposures/20210521/00089602
guadalupe/exposures/20210706/00097620
guadalupe/exposures/20210531/00090455
guadalupe/exposures/20210610/00093130
guadalupe/exposures/20210609/00093011
guadalupe/exposures/20210620/00095346
guadalupe/exposures/20210619/00095236
guadalupe/exposures/20210629/00096522
guadalupe/exposures/20210605/00091220
guadalupe/exposures/20210523/00089821
guadalupe/exposures/20210615/00094658
guadalupe/exposures/20210518/00089092
guadalupe/exposures/20210704/00097311
guadalupe/exposures/20210528/00090076
guadalupe/exposures/20210621/00095446
guadalupe/exposures/20210607/00091430
guadalupe/exposures/20210514/00088312
guadalupe/exposures/20210709/00097997
guadalupe/exposures/20210603/00091011
guadalupe/exposures/20210613/00093527
guadalupe/exposures/20210705/00097491
guadalupe/exposures/20210529/00090176
guadalupe/exposures/20210516/00088813
guadalupe/exposures/20210608/00092319
guadalupe/exposures/20210618/00095134
guadalupe/exposures/20210628/00096399
guadalupe/exposures/20210522/00089714
guadalupe/exposures/20210707/00097714
guadalupe/exposures/20210601/00090558
guadalupe/exposures/20210611/00093319
guadalupe/exposures/20210517/00088960

Is it safe to remove these? They definitely throw a minor monkey wrench into the backup process.

In addition, there are a handful of empty files:

fuji/tiles/pernight/80733/20210208/qso_qn-1-80733-20210208.fits
fuji/run/dashboard/dashboard.err
guadalupe/run/dashboard/dashboard.err

These are less of an issue for backups, but do we need to keep these?
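For anyone who wants to reproduce lists like the ones above, here is a minimal sketch of one way to generate them. It assumes only the directory layout shown in this thread; the function name `find_empty` and the `skip` argument are hypothetical and are not part of desiBackup.

```python
import os

def find_empty(root, skip=("exposures",)):
    """Return (empty_dirs, empty_files) under root, as paths relative to root.

    Top-level subdirectories named in `skip` are not descended into.
    """
    root = os.fspath(root)
    empty_dirs, empty_files = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory is empty if it holds neither files nor subdirectories.
        if not dirnames and not filenames:
            empty_dirs.append(os.path.relpath(dirpath, root))
        if dirpath == root:
            # Prune in place so os.walk does not descend into skipped directories.
            dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat does not follow symlinks, so broken links raise no error here.
            if os.lstat(full).st_size == 0:
                empty_files.append(os.path.relpath(full, root))
    return sorted(empty_dirs), sorted(empty_files)
```

Calling `find_empty(path_to_fuji)` would list empty directories outside exposures/, while `find_empty(path_to_fuji, skip=())` would include exposures/ as well.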

Thanks for noticing these, @weaverba137. Mentioning @akremin for this to be on his radar as well. I'd like to check what happened with the blank qso_qn-1-80733-20210208.fits and the blank healpix dirs:

This file should not have been blank; if there was nothing to write, there should have been something like the *.misscameras.txt files instead. We'll have to decide how strictly we respect "fuji is frozen":

fuji/tiles/pernight/80733/20210208/qso_qn-1-80733-20210208.fits

Normal workflow should not have left empty healpix dirs behind so I'd like to understand if these are missing data, or leftover cleanup from the first round of running healpix:

fuji/healpix/sv1/other/26
fuji/healpix/sv1/bright/115
fuji/healpix/sv1/bright/101

I'm not worried about blank dashboard.err files. Blank directories under scripts/ and exposures/ are a bit distasteful, but they are a side effect of the current pipeline and will keep appearing unless there is additional coding work, so I'd prefer the backup system to be robust to these cases (whether or not we get that cleanup done for DR1).

When you say you want the backup system to be "robust", are you explicitly saying you want these empty directories saved? Or that the backup system can remove these in order to ensure a clean backup that requires less coding to handle special cases?

I think it would be ok if the backup did not include the blank directories, such that a restoration from the backup also wouldn't have the blank directories and thus be somewhat different from the original.

But I think the backup system should not automatically delete blank directories in the original copy, nor should it require that they not exist in the first place (i.e. requiring that humans remove them first).

The presence of empty directories increases the complexity in terms of computing checksums. I want to at least get a very well-defined specification for how to handle this, so that the complexity does not increase arbitrarily.

  1. Empty directories may exist in the exposures/ directory. We'll just write code to deal with that.
  2. Empty directories that exist outside exposures/ should require human investigation and potential removal.
  3. The empty file fuji/tiles/pernight/80733/20210208/qso_qn-1-80733-20210208.fits should also be investigated.
  4. I don't think anyone cares about the presence or absence of dashboard.err files. So far there hasn't been a lot of movement to document the run/ directory.

Ultimately though, there's a bit of a contradiction here: we want to save these empty directories because of reluctance to change the production, yet I'm writing checksum files into every single directory (that contains files), so no matter what, the production will be dramatically changed.
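To make the "every directory that contains files" rule concrete, here is a minimal sketch. It is not the existing script: the use of SHA-256, the sha256sum-style line format, and the file name `checksum.sha256sum` are all assumptions chosen for illustration.

```python
import hashlib
import os

def write_checksums(root, name="checksum.sha256sum"):
    """Write a sha256sum-style file into every directory under root that holds files.

    `name` is a hypothetical file name. Returns the list of checksum files written.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Never checksum a checksum file left over from a previous run.
        targets = sorted(f for f in filenames if f != name)
        if not targets:
            continue  # empty directory, or one that holds only subdirectories
        lines = []
        for fname in targets:
            digest = hashlib.sha256()
            with open(os.path.join(dirpath, fname), "rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            lines.append(f"{digest.hexdigest()}  {fname}\n")
        out = os.path.join(dirpath, name)
        with open(out, "w") as fh:
            fh.writelines(lines)
        written.append(out)
    return written
```

With this rule, empty directories are skipped without any special casing, but the production still gains one new file per non-empty directory.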

Can we combine requirements (1) and (2) into:

The backup system will ignore empty directories.

i.e. I don't think empty directories should be an "opt-in" that requires special handling for each case. I'd rather make the backup system as agnostic as possible about what it is backing up, with general handling for empty directories wherever they may be. This is independent of whatever effort we put into avoiding blank directories in the future, or any cleanup of past productions.

That contradicts your earlier comments on this same ticket, where you asked @akremin to investigate the empty directories in the healpix directory.

Also, in the case of the healpix directories, it is impossible to create an htar backup where the only input is an empty directory. The existing empty directories would trigger errors.

In the case of exposures directories the empty directories are "buried" within a night directory, and would not trigger errors, because night directories are never empty.

Again, all of this increases code complexity, and I don't feel this complexity is justified.

Finally, we need to address the contradiction:

Ultimately though, there's a bit of a contradiction here: we want to save these empty directories because of reluctance to change the production, yet I'm writing checksum files into every single directory (that contains files), so no matter what, the production will be dramatically changed.

That contradicts your earlier comments on this same ticket, where you asked @akremin to investigate the empty directories in the healpix directory.

Yes, I want to investigate what happened there. But at the same time, I don't think that a backup system should impose constraints like not being able to back up a production because some unexpected empty directories exist.

Also, in the case of the healpix directories, it is impossible to create an htar backup where the only input is an empty directory. The existing empty directories would trigger errors.

Then let's skip over them. I know it's always easy to claim that someone else's work should be a simple fix, but it smells like we should be able to handle this with some if len(os.listdir(dirname)) == 0: continue or try/except blocks or similar. It's ok that we can't/don't make an htar backup of an empty directory, but that shouldn't prevent us from being able to back up all the other directories that do have files.
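A minimal sketch of that skip-over idea, assuming each immediate subdirectory of a parent (for example the pixel groups under healpix/sv1/bright/) becomes one htar input. The function names are hypothetical and not taken from the existing script, and the sketch only decides what to archive; it does not call htar.

```python
import os

def contains_files(path):
    """True if at least one file exists anywhere under path."""
    return any(filenames for _dirpath, _dirnames, filenames in os.walk(path))

def split_for_backup(parent):
    """Partition the immediate subdirectories of parent into (to_archive, skipped).

    A subdirectory is skipped when its whole tree holds no files, which covers
    both a plain empty directory and one that contains only empty directories.
    """
    to_archive, skipped = [], []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if not os.path.isdir(path):
            continue  # loose files in parent are outside the scope of this sketch
        if contains_files(path):
            to_archive.append(name)
        else:
            skipped.append(name)
    return to_archive, skipped
```

Returning the skipped names, rather than silently dropping them, would let the script log them for later investigation without halting the rest of the backup.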

My insistence on this is more because I'm sure that at some point in the future we'll have another unexpected empty directory and that shouldn't cause the production backup to grind to a halt. It feels like the tail wagging the dog if the backup system itself is defining what's allowed (other than readable files).

Again, all of this increases code complexity, and I don't feel this complexity is justified.

Finally, we need to address the contradiction:

Ultimately though, there's a bit of a contradiction here: we want to save these empty directories because of reluctance to change the production, yet I'm writing checksum files into every single directory (that contains files), so no matter what, the production will be dramatically changed.

Yes, the production will technically change by adding checksum files. I think that's fine.

After a live discussion, we will:

  • Continue adding checksum files and creating tape backups for the parts of fuji and guadalupe that do not have any issues with empty directories.
  • Work on the existing script to improve handling of empty directories for both checksum and htar.
  • Work on separating the checksum and backup steps. See desihub/desispec#1763.

Mentioning @akremin for this to be on his radar as well. I'd like to check what happened with the blank qso_qn-1-80733-20210208.fits

This was due to a job timeout, presumably during I/O of the fits file that left it blank.

fuji/healpix/sv1/other/26
fuji/healpix/sv1/bright/115
fuji/healpix/sv1/bright/101

All three of these have job scripts and logs that show they were attempted. Digging into one of them:

fuji/healpix/sv1/bright/101

Looking at the job-level logs (zpix-sv1-bright-10159-*.log), it looks like this failed to generate spectra files on three occasions. The empty directory above was created Feb 22nd, but the last job attempt was Feb 17th, so I can see two possibilities:

  1. Jobs repeatedly failed because there was no valid data to put into a spectra file, so we never intended to have this directory, but some other script autogenerated it.
  2. A fix was made, and fuji/healpix/sv1/bright/101 was removed and replaced, but we forgot to rerun the scripts.

The job-level logs are very sparse, and we don't have the spectra generation logs because the directories don't exist. The next thing to check would be whether any targets actually overlap any of these healpix.
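For reference, a timestamp comparison of this kind can be sketched as below. The function name is hypothetical. One caveat: a directory's modification time changes whenever entries are added or removed, so it is not a true creation time, and the result is only a hint.

```python
import glob
import os

def modified_after_last_log(directory, log_pattern):
    """Compare a directory's mtime with the newest log matching log_pattern.

    Returns True if the directory was modified after the newest log,
    False if not, and None if no log matches the pattern.
    """
    logs = glob.glob(log_pattern)
    if not logs:
        return None
    newest_log = max(os.path.getmtime(p) for p in logs)
    return os.path.getmtime(directory) > newest_log
```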

@akremin, to be clear, those directories are pixel groups, so there could be multiple healpix in each group. Or was there only one in each group, each of which failed?

There are multiple pixels in each group, each with scripts that failed. In every case I looked at, the failure was the same as for the example healpix 10159 mentioned above. It is certainly odd that multiple healpix all failed in each of these three groups, but the above possibilities are still the only things I can imagine.

OK, good to know.

Based on the file modification times, I found a useful exchange on Slack that should explain 101 and 115. They came from a mislabeled tile; once it was properly labeled, its data moved to a different subfolder, but the old pixel group directories must not have been removed.

This doesn't explain 26, but hopefully it has a similar explanation.

Exchange below between Anand and Stephen during the fuji processing:

Anand Raichoor 2 months ago

if useful, for fuji+guadalupe, I find the following 17 folders with no hpixexp*csv file:
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10146
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10147
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10150
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10151
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10152
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10153
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10154
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10155
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10156
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10157
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10158
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/101/10159
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/115/11520
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/115/11521
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/115/11522
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/115/11523
/global/cfs/cdirs/desi/spectro/redux/fuji/healpix/sv1/bright/115/11524

Stephen Bailey 2 months ago

@anand thanks. All of those come from tile 80866 unwisebluebright, which originally was sv1/bright and then became sv1/other. I'll remove those directories.
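As a side note on the pixel groups: the paths listed above are consistent with the group being the healpix number integer-divided by 100 (10159 falls under 101, 11520 under 115). That rule is inferred from these paths only, not taken from the pipeline code, and the function name below is hypothetical.

```python
import os

def healpix_subdir(survey, program, healpix):
    """Build the relative healpix directory, assuming group = healpix // 100."""
    group = healpix // 100  # assumption inferred from the paths in this thread
    return os.path.join("healpix", survey, program, str(group), str(healpix))
```

Under that assumption, `healpix_subdir("sv1", "bright", 10159)` gives the healpix/sv1/bright/101/10159 path discussed earlier.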