capstone-coal/pycoal

Explore Running Mineral Classification Distributed with AWS


It seems that the typical use case for AWS Machine Learning is training models, whereas our use case is splitting up a long for loop. I've identified the main challenges in using AWS for our use case:

1. Splitting up data
Using, for example, a single Amazon S3 bucket to store an AVIRIS image, we would need to read different portions of that data from multiple Amazon EC2 instances. For example, with 2 EC2 instances available, they would need to communicate with each other so that one instance reads the first half of the file and the other reads the second half. We could synchronize this with a fast, in-memory database like Redis.
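
As a rough illustration (not working infrastructure), here is how each instance could read only its own row range with the Spectral library, assuming the .img/.hdr pair has already been copied down from S3 to local disk. The filename, worker count, and worker index are placeholders I made up:

```python
# Hypothetical sketch: each EC2 worker reads only its own row range of the
# AVIRIS image. Assumes the .img/.hdr pair has already been copied from S3
# to local disk and that Spectral's read_subregion() is available on the
# opened SpyFile. NUM_WORKERS, WORKER_INDEX, and the filename are made up.
import spectral

NUM_WORKERS = 2      # e.g. number of EC2 instances
WORKER_INDEX = 0     # each instance would set its own index (0 or 1)

image = spectral.open_image('aviris_scene.hdr')
rows, cols, bands = image.shape

# Split the image by rows: worker k handles rows [start, end).
rows_per_worker = rows // NUM_WORKERS
start = WORKER_INDEX * rows_per_worker
end = rows if WORKER_INDEX == NUM_WORKERS - 1 else start + rows_per_worker

# Read only this worker's slice of the data cube into memory.
chunk = image.read_subregion((start, end), (0, cols))
print(chunk.shape)   # (end - start, cols, bands)
```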

2. Synchronizing the result
This part seems more challenging. We need a way to save a classification that was produced in multiple parts. From what I can tell, the Spectral library only allows calling "save_classification" with a complete image (SpyFile) or ndarray. The SAM algorithm we currently use saves each classified pixel into an MxN array, so to save the classification to a .img file we would need to combine two (M/2)xN arrays (one per half of the rows) from 2 separate Amazon EC2 instances. I'm thinking we could accomplish this by again saving the whole array in Redis, but my concerns are that we may not have enough memory and that, even if we do, sending each pixel to Redis individually will be too slow. Also, one of the EC2 instances would need to read the whole ndarray at the end when it saves the classified .img file.
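
To address the per-pixel slowness concern, one option is to push each worker's entire classified block to Redis as a single value and only combine on one machine at the end. A minimal sketch, assuming a reachable Redis host and the redis-py client; the host, key names, dtype, and shapes are assumptions on my part:

```python
# Rough sketch: store each worker's classified row block as ONE Redis value
# (raw bytes) rather than one entry per pixel, then have a single combiner
# instance reassemble the full M x N array and write it with Spectral.
# The Redis host, key names, dtype, and shapes are all hypothetical.
import numpy as np
import redis
import spectral.io.envi as envi

r = redis.Redis(host='redis.internal', port=6379)

# --- on each worker, after classifying its row range ---
def publish_chunk(worker_index, classified):
    # classified: a (rows handled by this worker) x N array of class indices
    r.set(f'classified:{worker_index}', classified.astype(np.uint16).tobytes())
    r.set(f'classified:{worker_index}:rows', classified.shape[0])
    r.set(f'classified:{worker_index}:cols', classified.shape[1])

# --- on the single combiner instance ---
def combine_and_save(num_workers, hdr_path, metadata, class_names):
    chunks = []
    for k in range(num_workers):
        shape = (int(r.get(f'classified:{k}:rows')),
                 int(r.get(f'classified:{k}:cols')))
        data = np.frombuffer(r.get(f'classified:{k}'), dtype=np.uint16)
        chunks.append(data.reshape(shape))
    full = np.vstack(chunks)  # back to the full M x N classification
    envi.save_classification(hdr_path, full,
                             class_names=class_names, metadata=metadata)
```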

Maybe AWS already provides some better way for multiple EC2 instances to write to a shared NumPy array? I'll have to look into this more, but so far I haven't found anything outside of utilizing pre-existing libraries like TensorFlow or PyTorch to distribute the training portion (not distributing a for loop, though).

Run Non-Distributed on AWS?
Alternatively, we could simply try using high-memory AWS machines to load the whole image file into memory, which should make classification take significantly less time (although I'm not sure how much less, quantitatively). https://aws.amazon.com/ec2/instance-types/ lists the instance types, some of which have more than enough memory (but are likely expensive).
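
For a rough sense of scale, here is a back-of-the-envelope memory estimate; the scene dimensions below are placeholder assumptions, not measurements of our data:

```python
# Back-of-the-envelope estimate of the RAM needed to hold a whole AVIRIS
# cube in memory. The dimensions are placeholder assumptions, not
# measurements of our actual data.
rows, cols, bands = 10_000, 700, 224   # hypothetical scene; AVIRIS has 224 bands
bytes_per_sample = 4                   # e.g. float32 after conversion

gib = rows * cols * bands * bytes_per_sample / 2**30
print(f'~{gib:.1f} GiB to hold the cube')   # ~5.8 GiB for these numbers
# An r5.xlarge (32 GiB) would comfortably fit this, but larger scenes and
# intermediate copies made during classification push toward bigger instances.
```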

@capstone-coal/19-usc-capstone-team does anyone have suggestions or ideas about how we could overcome these challenges?

After some thought and discussion, it seems that there are 2 options to try:

  1. Wait for @TinyBugBigProblem to finish and push his PyTorch parallelization. I can build on that to use PyTorch's existing support for distributing work, so that different pixels are classified on different machines. The caveat is that this support might only cover distributing the training of a model, rather than the classification of large quantities of pixels, so we won't know whether this works until we try it.

  2. Divide up the pixel ranges among the workers and send the results to one machine at the end of the task. That machine should be able to combine all the pixel data into a single ndarray and write it to a file without needing Redis or any other kind of database synchronization (a sketch of this hand-off is below).
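
A sketch of what that hand-off could look like using S3 as the intermediate store; the bucket name, key prefix, and worker count are made up, and the final write would use "save_classification" as in the earlier Redis sketch:

```python
# Sketch of option 2 with S3 as the hand-off point instead of Redis.
# Each worker uploads its classified row block as a .npy object; one machine
# downloads every block and stacks them into the full M x N array, which can
# then be written with "save_classification" as in the earlier sketch.
# The bucket name, key prefix, and worker count are hypothetical.
import io

import boto3
import numpy as np

BUCKET = 'pycoal-scratch'        # placeholder bucket
PREFIX = 'classified-chunks'
s3 = boto3.client('s3')

def upload_chunk(worker_index, classified):
    buf = io.BytesIO()
    np.save(buf, classified)     # serialize this worker's block
    s3.put_object(Bucket=BUCKET, Key=f'{PREFIX}/{worker_index}.npy',
                  Body=buf.getvalue())

def gather_chunks(num_workers):
    chunks = []
    for k in range(num_workers):
        obj = s3.get_object(Bucket=BUCKET, Key=f'{PREFIX}/{k}.npy')
        chunks.append(np.load(io.BytesIO(obj['Body'].read())))
    return np.vstack(chunks)     # the full M x N classification
```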

Hi @Lactem, having discussed this on today's call, let's shelve this issue for the time being. We will come back to it once the Dask and/or PyTorch tickets are resolved. Thanks for the thorough commentary; this is excellent.

Thanks for your work above, @Lactem. I'm going to close this off for the time being. The idea would be to utilize the GPU with PyTorch before anything else.