/glade/scratch #81

Open
dabail10 opened this issue Mar 8, 2024 · 7 comments

Comments

@dabail10
Collaborator

dabail10 commented Mar 8, 2024

Describe the bug
The /glade/scratch partition is not available and at least one of the notebooks points there.

To Reproduce
cupid-run config.yml

Expected behavior
The following message:

PermissionError: [Errno 13] Permission denied: '/glade/scratch'

ploomber.exceptions.TaskBuildError: Error when executing task 'ocean_surface'. Partially executed notebook available at /glade/u/home/dbailey/CUPiD/examples/coupled_model/computed_notebooks/quick-run/ocean_surface.ipynb
ploomber.exceptions.TaskBuildError: Error building task "ocean_surface"
===================================================== Summary (1 task) =====================================================
NotebookRunner: ocean_surface -> File('computed_notebook...cean_surface.ipynb')
===================================================== DAG build failed =====================================================

Additional context
A number of paths are hard-coded to /glade/scratch in mom6-tools.

@mnlevy1981
Collaborator

I think the issue is that mom6-tools uses ncar-jobqueue, and the default configuration for that package points to /glade/scratch/. Do you have a ~/.config/dask/ncar-jobqueue.yaml file on glade? If so, there's probably a block like

casper-dav:
  pbs:
    #    project: XXXXXXXX
    name: dask-worker-casper-dav
    cores: 1 # Total number of cores per job
    memory: '10GB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=25GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
    job-extra: []
    env-extra: []
    death-timeout: 60

Here I've already updated log-directory and local-directory to use /glade/derecho/scratch, but your version may still specify /glade/scratch. Another place to look is ~/.dask/jobqueue.yaml, where the block is

jobqueue:
  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/mlevy
    log-directory: /glade/derecho/scratch/mlevy
    memory: 10GiB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=1:mem=10GB
    walltime: 01:00:00

and again, I've updated log-directory and local-directory.
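To check both files quickly, something like the following could work. It is only a sketch: it does a plain text scan rather than parsing the YAML, it assumes the two paths named above are the only places to look, and the function name find_stale_lines is made up for illustration.

```python
from pathlib import Path

# The two config files named in this thread; either may be absent.
CANDIDATES = [
    Path.home() / ".config" / "dask" / "ncar-jobqueue.yaml",
    Path.home() / ".dask" / "jobqueue.yaml",
]

# Retired prefix. Note that "/glade/derecho/scratch" does not contain this
# substring, so already-updated lines are not reported.
STALE = "/glade/scratch"


def find_stale_lines(path, stale=STALE):
    """Return (line_number, text) pairs for lines that mention the stale prefix."""
    hits = []
    for number, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if stale in line:
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    for candidate in CANDIDATES:
        if not candidate.is_file():
            print(f"{candidate}: not found")
            continue
        for number, text in find_stale_lines(candidate):
            print(f"{candidate}:{number}: {text}")
```

Any line it prints is a candidate for switching to /glade/derecho/scratch.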

@dabail10
Collaborator Author

dabail10 commented Mar 8, 2024

Got it. Should I just wipe out that whole directory? When did it get created?

@mnlevy1981
Collaborator

I would just modify those two files (or whichever of them exist) to make sure the paths are correct.

@mnlevy1981
Collaborator

(while you're at it, make sure interface is ext instead of ib0)
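A rough way to spot interface entries that still need changing, again as a text scan rather than a real YAML parse. The helper name interface_settings and the sample text are hypothetical; in practice you would pass in the contents of whichever config file exists.

```python
def interface_settings(yaml_text):
    """Return (line_number, value) for every `interface:` key in the text."""
    found = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("interface:"):
            # Drop any trailing comment and surrounding quotes from the value.
            value = stripped.split(":", 1)[1].split("#", 1)[0].strip().strip("'\"")
            found.append((number, value))
    return found


# Made-up sample resembling the pbs block, with the old interface value.
sample = """\
jobqueue:
  pbs:
    cores: 1
    interface: ib0
"""

for number, value in interface_settings(sample):
    status = "ok" if value == "ext" else "change to ext"
    print(f"line {number}: interface={value} ({status})")
# prints: line 4: interface=ib0 (change to ext)
```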

@dabail10
Collaborator Author

dabail10 commented Mar 8, 2024

There is no setting for derecho in these files and there is still a hobart setting. How does it get created? We should wipe this directory out and make sure everyone gets a fresh version.

@mnlevy1981
Collaborator

I'm not sure how it gets created, hence my reluctance to remove it :) I noticed the lack of derecho settings, but CUPiD runs fine on derecho so I don't think it's an issue. Instead of outright deleting it, can you rename it and see if it's recreated (or if CUPiD runs without it)?
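The rename-and-see test could be sketched like this. The helper set_aside and the ".bak" suffix are illustrative choices, not part of CUPiD or dask, and the directory to pass in is assumed to be ~/.config/dask.

```python
from pathlib import Path


def set_aside(directory, suffix=".bak"):
    """Rename a directory out of the way instead of deleting it.

    Returns the backup path, or None if the directory does not exist.
    Refuses to overwrite an existing backup.
    """
    directory = Path(directory)
    if not directory.exists():
        return None
    backup = directory.with_name(directory.name + suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    directory.rename(backup)
    return backup


def was_recreated(directory):
    """After rerunning the workflow, report whether the directory is back."""
    return Path(directory).is_dir()
```

For example, call set_aside(Path.home() / ".config" / "dask"), rerun cupid-run config.yml, then check was_recreated on the same path. The old files stay available in the backup directory for comparison.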

@dabail10
Collaborator Author

dabail10 commented Mar 8, 2024

Interesting. I deleted the ~/.config/dask directory and it got recreated when I reran cupid-run. Or more accurately, I also wiped out the computed notebooks, and then it was recreated. The ncar-jobqueue.yaml file is out of date. This must be coming from a CISL file somewhere.
