NCAR/CUPiD

/glade/scratch


Describe the bug
The /glade/scratch partition is not available and at least one of the notebooks points there.

To Reproduce
cupid-run config.yml

Expected behavior
The notebooks should run successfully; instead, the run fails with the following message:

PermissionError: [Errno 13] Permission denied: '/glade/scratch'

ploomber.exceptions.TaskBuildError: Error when executing task 'ocean_surface'. Partially executed notebook available at /glade/u/home/dbailey/CUPiD/examples/coupled_model/computed_notebooks/quick-run/ocean_surface.ipynb
ploomber.exceptions.TaskBuildError: Error building task "ocean_surface"
===================================================== Summary (1 task) =====================================================
NotebookRunner: ocean_surface -> File('computed_notebook...cean_surface.ipynb')
===================================================== DAG build failed =====================================================

Additional context
There are a number of paths hard coded to /glade/scratch in mom6-tools.
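
For what it's worth, a quick way to confirm which files still reference the old path is to scan the installed package. This is just a sketch; the import name mom6_tools and the set of file extensions scanned are assumptions on my part.

# Sketch: list hard-coded /glade/scratch references in an installed copy of
# mom6-tools (assumes the package imports as mom6_tools).
import importlib.util
from pathlib import Path

spec = importlib.util.find_spec("mom6_tools")
if spec is None or not spec.submodule_search_locations:
    raise SystemExit("mom6_tools is not installed in this environment")

pkg_dir = Path(list(spec.submodule_search_locations)[0])
for path in sorted(pkg_dir.rglob("*")):
    if path.suffix not in {".py", ".yml", ".yaml"}:
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if "/glade/scratch" in line:
            print(f"{path}:{lineno}: {line.strip()}")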

I think the issue is that mom6-tools uses ncar-jobqueue, and the default configuration for that package points to /glade/scratch/. Do you have a ~/.config/dask/ncar-jobqueue.yaml file on glade? If so, there's probably a block like

casper-dav:
  pbs:
    #    project: XXXXXXXX
    name: dask-worker-casper-dav
    cores: 1 # Total number of cores per job
    memory: '10GB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=25GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
    job-extra: []
    env-extra: []
    death-timeout: 60

In that block I've already updated log-directory and local-directory to use /glade/derecho/scratch, but your version may specify /glade/scratch instead. Another place to look is ~/.dask/jobqueue.yaml, where the block is

jobqueue:
  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/mlevy
    log-directory: /glade/derecho/scratch/mlevy
    memory: 10GiB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=1:mem=10GB
    walltime: 01:00:00

and again, I've updated log-directory and local-directory.
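
If it helps, here is a small sketch (not part of CUPiD, just something to run by hand) that reports which of those config files, if any, still contain the literal string /glade/scratch:

# Sketch: flag dask config files that still reference the retired /glade/scratch.
from pathlib import Path

config_dirs = [Path.home() / ".config" / "dask", Path.home() / ".dask"]
for cfg_dir in config_dirs:
    if not cfg_dir.is_dir():
        continue
    for cfg in sorted(cfg_dir.glob("*.y*ml")):
        if "/glade/scratch" in cfg.read_text():
            print(f"stale scratch path in {cfg}")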

Got it. Should I just wipe out that whole directory? When did it get created?

I would just modify those two files (or whichever of them exist) to make sure the path is correct.

(While you're at it, make sure interface is ext instead of ib0; the sketch below covers both changes.)
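
If it's easier than editing by hand, something like this would make both changes in whichever of the two files exist. It's an assumption on my part, not a CUPiD or ncar-jobqueue utility; it uses plain string replacement so the comments in the YAML survive, and it writes a .bak copy before touching anything.

# Sketch: rewrite the stale scratch path and the ib0 interface in place,
# keeping a backup of each file that gets modified.
from pathlib import Path

candidates = [
    Path.home() / ".config" / "dask" / "ncar-jobqueue.yaml",
    Path.home() / ".dask" / "jobqueue.yaml",
]
for cfg in candidates:
    if not cfg.is_file():
        continue
    text = cfg.read_text()
    fixed = text.replace("/glade/scratch", "/glade/derecho/scratch")
    fixed = fixed.replace("interface: ib0", "interface: ext")
    if fixed != text:
        cfg.with_name(cfg.name + ".bak").write_text(text)  # keep the original
        cfg.write_text(fixed)
        print(f"updated {cfg}")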

There is no setting for derecho in these files and there is still a hobart setting. How does it get created? We should wipe this directory out and make sure everyone gets a fresh version.

I'm not sure how it gets created, hence my reluctance to remove it :) I noticed the lack of derecho settings, but CUPiD runs fine on derecho so I don't think it's an issue. Instead of outright deleting it, can you rename it and see if it's recreated (or if CUPiD runs without it)?

Interesting. I deleted the ~/.config/dask directory and it got recreated when I reran cupid-run. More accurately, I also wiped out the computed notebooks and then reran, and the directory was recreated. The ncar-jobqueue.yml file it creates is out of date, so this must be coming from a CISL-provided file somewhere.