/glade/scratch
Opened this issue · 7 comments
Describe the bug
The /glade/scratch partition is not available and at least one of the notebooks points there.
To Reproduce
cupid-run config.yml
Expected behavior
The following message:
PermissionError: [Errno 13] Permission denied: '/glade/scratch'
ploomber.exceptions.TaskBuildError: Error when executing task 'ocean_surface'. Partially executed notebook available at /glade/u/home/dbailey/CUPiD/examples/coupled_model/computed_notebooks/quick-run/ocean_surface.ipynb
ploomber.exceptions.TaskBuildError: Error building task "ocean_surface"
===================================================== Summary (1 task) =====================================================
NotebookRunner: ocean_surface -> File('computed_notebook...cean_surface.ipynb')
===================================================== DAG build failed =====================================================
Additional context
There are a number of paths hard coded to /glade/scratch in mom6-tools.
I think the issue is that mom6-tools uses ncar-jobqueue, and the default configuration for that package points to /glade/scratch/. Do you have a ~/.config/dask/ncar-jobqueue.yaml file on glade? If so, there's probably a block like
casper-dav:
  pbs:
    # project: XXXXXXXX
    name: dask-worker-casper-dav
    cores: 1 # Total number of cores per job
    memory: '10GB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=25GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
    job-extra: []
    env-extra: []
    death-timeout: 60
Note that I've already updated log-directory and local-directory to use /glade/derecho/scratch, but your version may specify /glade/scratch instead. Another place to look is ~/.dask/jobqueue.yaml, where the block is
jobqueue:
  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/mlevy
    log-directory: /glade/derecho/scratch/mlevy
    memory: 10GiB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=1:mem=10GB
    walltime: 01:00:00
and again, I've updated log-directory and local-directory.
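A quick way to see which of these files still reference the retired path is a grep sweep. This is just a sketch; the file locations are the defaults mentioned above, so adjust them if your setup differs:

```shell
# List any remaining references to the retired /glade/scratch path
# in the two dask config files discussed in this thread.
grep -n '/glade/scratch' \
    ~/.config/dask/ncar-jobqueue.yaml \
    ~/.dask/jobqueue.yaml 2>/dev/null || echo "no stale paths found"
```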
Got it. Should I just wipe out that whole directory? When did it get created?
I would just modify those two files (or whichever of them exist) to make sure the path is correct. While you're at it, make sure interface is ext instead of ib0.
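The two suggested edits (path and interface) can be scripted. This is a sketch assuming the default file locations from this thread; it backs up each file before editing in place:

```shell
# Rewrite /glade/scratch to /glade/derecho/scratch and switch the
# dask worker interface from ib0 to ext in whichever configs exist.
for f in ~/.config/dask/ncar-jobqueue.yaml ~/.dask/jobqueue.yaml; do
    [ -f "$f" ] || continue
    cp "$f" "$f.bak"   # keep a backup in case something breaks
    sed -i -e 's|/glade/scratch|/glade/derecho/scratch|g' \
           -e 's/interface: ib0/interface: ext/' "$f"
done
```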
There is no setting for derecho in these files and there is still a hobart setting. How does it get created? We should wipe this directory out and make sure everyone gets a fresh version.
I'm not sure how it gets created, hence my reluctance to remove it :) I noticed the lack of derecho settings, but CUPiD runs fine on derecho so I don't think it's an issue. Instead of outright deleting it, can you rename it and see if it's recreated (or if CUPiD runs without it)?
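The rename-and-rerun check might look like this sketch; the directory is the default dask config location, and cupid-run config.yml is the reproduce step from the top of the thread:

```shell
# Move the config dir aside rather than deleting it, rerun, and see
# whether dask/ncar-jobqueue recreates it.
mv ~/.config/dask ~/.config/dask.renamed
cupid-run config.yml
ls -ld ~/.config/dask 2>/dev/null && echo "recreated" || echo "not recreated"
```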
Interesting. I deleted the ~/.config/dask directory and it got recreated when I reran cupid-run. Or, more accurately, I also wiped out the computed notebooks, and then it recreated this. The ncar-jobqueue.yaml file is out of date. This must be coming from a CISL file somewhere.