pygridtools/gridmap

conda env vars seemingly not copied, causing "Could not import drmaa" error

nick-youngblut opened this issue · 15 comments

I've installed gridmap 0.14.0 via conda, and the map-reduce example works if jobs are run locally. However, if jobs are run on the SGE cluster, then gridmap never detects job completion/failure. I just get an endless stream of:

2020-04-15 14:03:09,524 - Running grid_map
2020-04-15 14:03:09,546 - Setting up JobMonitor on tcp://172.18.3.170:33741
Your job gridmap_job0 has been submitted with id 9400605
Your job gridmap_job1 has been submitted with id 9400606
Your job gridmap_job2 has been submitted with id 9400607
Your job gridmap_job3 has been submitted with id 9400608
2020-04-15 14:03:09,565 - Starting local hearbeat
2020-04-15 14:03:09,569 - Starting ZMQ event loop
2020-04-15 14:03:09,569 - 0 out of 4 jobs completed
2020-04-15 14:03:09,569 - Waiting for message
2020-04-15 14:03:09,571 - Connecting to JobMonitor (tcp://172.18.3.170:33741)
2020-04-15 14:03:09,572 - Sending message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
2020-04-15 14:03:09,573 - Received message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
2020-04-15 14:03:09,573 - Checking if jobs are alive
2020-04-15 14:03:09,573 - Sending reply:
2020-04-15 14:03:09,574 - 0 out of 4 jobs completed
2020-04-15 14:03:09,574 - Waiting for message
2020-04-15 14:03:24,586 - Connecting to JobMonitor (tcp://172.18.3.170:33741)
2020-04-15 14:03:24,587 - Sending message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
2020-04-15 14:03:24,587 - Received message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
2020-04-15 14:03:24,587 - Checking if jobs are alive
2020-04-15 14:03:24,587 - Sending reply:
2020-04-15 14:03:24,588 - 0 out of 4 jobs completed
2020-04-15 14:03:24,588 - Waiting for message
2020-04-15 14:03:39,602 - Connecting to JobMonitor (tcp://172.18.3.170:33741)
2020-04-15 14:03:39,603 - Sending message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
2020-04-15 14:03:39,604 - Received message: {'job_id': -1, 'host_name': 'rick', 'ip_address': '172.18.3.170', 'command': 'heart_beat', 'data': {}}
[...]

All of the gridmap_job*.e* files contain a single line stating: Could not import drmaa. Only local multiprocessing supported.

Shouldn't my conda env be copied to the SGE jobs, since I'm using copy_env=True? I don't see why the SGE jobs can't import drmaa.

The copy_env argument only applies to environment variables on the master side, not the entire Python virtual environment (see here). Does the conda environment already exist on the non-master nodes?
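In other words, a call like the sketch below (assuming the map-reduce example's grid_map entry point and its local/copy_env/quiet keywords) only forwards the submit host's shell variables to each SGE job; the interpreter and installed packages still have to be resolvable on every node:

from gridmap import grid_map

def double(x):
    return 2 * x

# local=True runs the calls in-process (the case that already works);
# with local=False each call is pickled and executed by an SGE job.
# copy_env=True forwards environment variables such as PATH and
# DRMAA_LIBRARY_PATH -- it does not ship the conda environment itself.
results = grid_map(double, [1, 2, 3, 4], local=False, copy_env=True, quiet=False)
print(results)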

Thanks for the quick response! The conda env is accessible from all nodes via our NFS. What do others do for using gridmap with virtual envs? I don't see anything in the docs about this.

Just a shot in the dark, but you may need to have DRMAA_LIBRARY_PATH set if libdrmaa.so is installed in a non-standard location. Does import gridmap work in your conda environment locally?
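For example, a quick sanity check (assuming drmaa-python reads DRMAA_LIBRARY_PATH at import time; the path below is only a placeholder for wherever libdrmaa.so lives on your system):

import os

# Point drmaa-python at the shared library before importing it; this path
# is just an example -- use the actual location on your nodes.
os.environ.setdefault('DRMAA_LIBRARY_PATH',
                      '/usr/lib/gridengine-drmaa/lib/libdrmaa.so.1.0')

import drmaa  # fails here if the library cannot be found or loaded

# Opening and closing a session confirms the library actually works.
with drmaa.Session() as session:
    print('DRMAA implementation:', session.drmaaImplementation)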

There were different versions of libdrmaa.so on the cluster nodes versus our submit host. That has been changed so that all nodes and the submit host now have it installed at /usr/lib/gridengine-drmaa/lib/libdrmaa.so.1.0. This got rid of the Could not import drmaa. Only local multiprocessing supported. error messages for each cluster job. However, when I log in to a node, activate my conda env, and import gridmap, I still get a segmentation fault. Also, the qsub jobs submitted by gridmap seem to be dying immediately. The OS differs slightly between the nodes and the submit host (Ubuntu 18.04.3 vs 18.04.2). Could that be the reason for the seg fault when I try to import gridmap on a cluster node?

Note: importing gridmap via my conda env on the submit host does work without a segmentation fault (unlike trying the same thing on a cluster node).

Our cluster admin updated the nodes to Ubuntu 18.04.3, and I'm still getting the seg fault when trying to import gridmap after activating my gridmap conda env on a cluster node. Any ideas on what else could be different between the cluster nodes and the submit host that could be causing the seg fault?

So import gridmap works on your submit host but with the exact same version of Ubuntu and libdrmaa.so, the same command doesn't work on your cluster nodes? That's pretty strange. Does import drmaa also segfault?

I can import drmaa on the submit host and cluster nodes without a segfault. I just get a segfault when importing gridmap on a cluster node.

Ah, gridmap also tries to import matplotlib since it can send plots in job notification emails, if configured. Does import matplotlib also segfault on the cluster nodes?

Yeah, it's matplotlib that's causing the segfault. Why does gridmap need matplotlib?

As I said in my previous message, it's because it can generate plots of CPU and memory usage and include them in emails, which can be quite useful when your jobs die for unknown reasons.

It's most likely a backend issue. You probably just need to set the MPLBACKEND environment variable appropriately, or perhaps reinstall Qt. Here's an example issue discussing this: matplotlib/matplotlib#9294
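For instance, forcing a non-GUI backend before the import (a minimal sketch, assuming the segfault comes from matplotlib trying to load a Qt/GUI backend on a headless node):

import os

# Select the non-interactive Agg backend before matplotlib is imported,
# so no Qt/GUI libraries are loaded on headless cluster nodes.
os.environ['MPLBACKEND'] = 'Agg'

import matplotlib
print(matplotlib.get_backend())  # confirm which backend was picked up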

Sorry, I missed that. Thanks for the link! I haven't used matplotlib in years, so I have forgotten how much of a pain it can be. It might be worth it to make the matplotlib import only occur if the user wants the plots generated.

I think it's pretty rare for someone to want to use gridmap but not have a working installation of matplotlib in general. Contributions are welcome though!

After looking at the code, isn't the import already conditional:

import logging
import os

# matplotlib is only imported when CREATE_PLOTS is unset or set to
# "true" (any capitalization).
CREATE_PLOTS = 'TRUE' == os.getenv('CREATE_PLOTS', 'True').upper()
if CREATE_PLOTS:
    try:
        import matplotlib
    except ImportError:
        logger = logging.getLogger(__name__)
        logger.warning('Could not import matplotlib. No plots will be created' +
                       ' in debug emails.')

Update: I just used export CREATE_PLOTS="False", and that allowed me to import gridmap on a node.
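The same thing can be done from Python, as long as the variable is set before gridmap is imported (a small sketch based on the snippet above):

import os

# Must be set before the gridmap import, because CREATE_PLOTS is read
# at module import time.
os.environ['CREATE_PLOTS'] = 'False'

import gridmap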

Excellent! I had totally forgotten about that! Glad it works for you now.

Thanks for your help! The map-reduce script works just fine now. It appears that gridmap will keep checking for job completion forever if the cluster jobs segfault right away.