pangeo-data/pangeo

Training ML models in parallel with multiple GPUs

jsadler2 opened this issue · 6 comments

I've been using the Pangeo ML image with a single GPU and that has been great. I'm wondering how we could take it a step further and take advantage of multiple GPUs. This would be really helpful, since many ML workflows involve training multiple independent models with different configurations (different random seeds, hyperparameters, etc.). I know this can be done on HTC systems (I have a colleague who uses Drake to spin off sbatch jobs). How could we do it on the cloud?

Maybe there's already a solution out there. I've done quite a bit of googling, but please point out any solutions that already exist.

@jhamman and I talked a little about this and he said one way to do it (correct me if I'm wrong) would be to configure worker nodes to have GPUs instead of the scheduler node (which is how it is currently done in the ML notebooks). Then Dask could be used to orchestrate farming out the jobs, as sketched below.
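
A minimal sketch of that idea, assuming a Dask cluster whose workers each have one GPU and a hypothetical `train_model` function (the scheduler address, configurations, and worker resource names below are assumptions, not an existing Pangeo setup):

```python
from dask.distributed import Client

def train_model(config):
    """Hypothetical training function: builds and fits one model from one
    configuration, using whichever GPU is visible on the worker it runs on."""
    # ... e.g. set the random seed from config["seed"], build the model,
    # train it, and return a summary metric ...
    return {"config": config, "loss": None}  # placeholder result

# Connect to a Dask cluster whose worker nodes each have a GPU.
client = Client("tcp://scheduler-address:8786")  # address is an assumption

configs = [{"seed": s, "lr": lr} for s in range(4) for lr in (1e-3, 1e-4)]

# Farm out one independent training job per configuration. If the workers
# are started with `--resources GPU=1`, the resources= argument keeps Dask
# from packing more than one training task onto a GPU at a time.
futures = client.map(train_model, configs, resources={"GPU": 1})
results = client.gather(futures)
```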

configure worker nodes to have GPUs instead of the scheduler node

I don't think the scheduler nodes have GPUs right now. Perhaps just the user notebook nodes?

Will there be a need for worker-to-worker communication of GPU memory? I suspect that, since you're training multiple independent models, just having a GPU per worker node will be sufficient.

I don't think the scheduler nodes have GPUs right now. Perhaps just the user notebook nodes?

Yes. I get the different nodes mixed up.

Will there be a need for worker-to-worker communication of GPU memory? I suspect that, since you're training multiple independent models, just having a GPU per worker node will be sufficient.

Yes. I think that just having a GPU per worker node would be sufficient.

So instead of having a GPU on the notebook node, we would have GPUs on the worker nodes.
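
As a rough sketch of what that could look like with the classic dask-kubernetes API (the image name, memory request, GPU count, and cluster size are all assumptions, not the current Pangeo configuration; `nvidia.com/gpu` is the standard Kubernetes resource name for requesting a GPU in a pod):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

# Worker pod template: each worker pod requests one GPU, while the
# notebook/scheduler side needs none.
pod_spec = make_pod_spec(
    image="pangeo/ml-notebook:latest",  # assumed CUDA-enabled worker image
    extra_container_config={
        "resources": {
            "limits": {"memory": "8G", "nvidia.com/gpu": 1},
            "requests": {"memory": "8G", "nvidia.com/gpu": 1},
        }
    },
)

cluster = KubeCluster(pod_template=pod_spec)
cluster.scale(4)          # four workers, one GPU each
client = Client(cluster)  # training tasks then run on the GPU workers
```

With that layout the training code wouldn't need any device-selection logic: each worker sees exactly one GPU, and Dask just schedules one independent training task per worker.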

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale commented

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.