NCAR/CUPiD

/glade/scratch


Describe the bug
The /glade/scratch partition is not available and at least one of the notebooks points there.

To Reproduce
cupid-run config.yml

Expected behavior
The notebooks should run successfully; instead, the run fails with the following message:

PermissionError: [Errno 13] Permission denied: '/glade/scratch'

ploomber.exceptions.TaskBuildError: Error when executing task 'ocean_surface'. Partially executed notebook available at /glade/u/home/dbailey/CUPiD/examples/coupled_model/computed_notebooks/quick-run/ocean_surface.ipynb
ploomber.exceptions.TaskBuildError: Error building task "ocean_surface"
===================================================== Summary (1 task) =====================================================
NotebookRunner: ocean_surface -> File('computed_notebook...cean_surface.ipynb')
===================================================== DAG build failed =====================================================

Additional context
There are a number of paths hard coded to /glade/scratch in mom6-tools.
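
For what it's worth, a quick way to confirm which files still reference the old path is to scan the installed package. This is just a sketch; the import name mom6_tools and the set of file extensions scanned are assumptions on my part.

# Sketch: list hard-coded /glade/scratch references in an installed copy of
# mom6-tools (assumes the package imports as mom6_tools).
import importlib.util
from pathlib import Path

spec = importlib.util.find_spec("mom6_tools")
if spec is None or not spec.submodule_search_locations:
    raise SystemExit("mom6_tools is not installed in this environment")

pkg_dir = Path(list(spec.submodule_search_locations)[0])
for path in sorted(pkg_dir.rglob("*")):
    if path.suffix not in {".py", ".yml", ".yaml"}:
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if "/glade/scratch" in line:
            print(f"{path}:{lineno}: {line.strip()}")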

I think the issue is that mom6-tools uses ncar-jobqueue, and the default configuration for that package points to /glade/scratch/. Do you have a ~/.config/dask/ncar-jobqueue.yaml file on glade? If so, there's probably a block like

casper-dav:
  pbs:
    #    project: XXXXXXXX
    name: dask-worker-casper-dav
    cores: 1 # Total number of cores per job
    memory: '10GB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=25GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
    job-extra: []
    env-extra: []
    death-timeout: 60

In that block I've already updated log-directory and local-directory to use /glade/derecho/scratch, but your version may specify /glade/scratch instead. Another place to look is ~/.dask/jobqueue.yaml, where the block is

jobqueue:
  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/mlevy
    log-directory: /glade/derecho/scratch/mlevy
    memory: 10GiB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=1:mem=10GB
    walltime: 01:00:00

and again, I've updated log-directory and local-directory.
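
If it helps, here is a small sketch (not part of CUPiD, just something to run by hand) that reports which of those config files, if any, still contain the literal string /glade/scratch:

# Sketch: flag dask config files that still reference the retired /glade/scratch.
from pathlib import Path

config_dirs = [Path.home() / ".config" / "dask", Path.home() / ".dask"]
for cfg_dir in config_dirs:
    if not cfg_dir.is_dir():
        continue
    for cfg in sorted(cfg_dir.glob("*.y*ml")):
        if "/glade/scratch" in cfg.read_text():
            print(f"stale scratch path in {cfg}")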

Got it. Should I just wipe out that whole directory? When did it get created?

I would just modify those two files (or whichever of them exist) to make sure the path is correct.

(While you're at it, make sure interface is ext instead of ib0; the sketch below covers both changes.)
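
If it's easier than editing by hand, something like this would make both changes in whichever of the two files exist. It's an assumption on my part, not a CUPiD or ncar-jobqueue utility; it uses plain string replacement so the comments in the YAML survive, and it writes a .bak copy before touching anything.

# Sketch: rewrite the stale scratch path and the ib0 interface in place,
# keeping a backup of each file that gets modified.
from pathlib import Path

candidates = [
    Path.home() / ".config" / "dask" / "ncar-jobqueue.yaml",
    Path.home() / ".dask" / "jobqueue.yaml",
]
for cfg in candidates:
    if not cfg.is_file():
        continue
    text = cfg.read_text()
    fixed = text.replace("/glade/scratch", "/glade/derecho/scratch")
    fixed = fixed.replace("interface: ib0", "interface: ext")
    if fixed != text:
        cfg.with_name(cfg.name + ".bak").write_text(text)  # keep the original
        cfg.write_text(fixed)
        print(f"updated {cfg}")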

There is no setting for derecho in these files and there is still a hobart setting. How does it get created? We should wipe this directory out and make sure everyone gets a fresh version.

I'm not sure how it gets created, hence my reluctance to remove it :) I noticed the lack of derecho settings, but CUPiD runs fine on derecho so I don't think it's an issue. Instead of outright deleting it, can you rename it and see if it's recreated (or if CUPiD runs without it)?

Interesting. I deleted the ~/.config/dask directory and it got recreated when I reran cupid-run. More accurately, I also wiped out the computed notebooks and then reran, and the directory was recreated. The ncar-jobqueue.yml file it creates is out of date, so this must be coming from a CISL-provided file somewhere.