labsyspharm/quantification

Memory issues

Closed this issue · 19 comments

@JoshuaHess12 We are running out of memory on large images, and I would like to brainstorm options for reducing the memory requirements. Have you tried splitting the code so we can run each channel separately?
@clarenceyapp How did you parallelize your code in Python? That approach might make sense here.

@DenisSch my Python code isn't parallelized. Artem sent me this, which might help: https://docs.python.org/3.4/library/multiprocessing.html?highlight=process

I have not had a chance to take a look yet.

@ArtemSokolov can we launch the quantification of each channel as a separate job? Similar to the way UNet processes each TMA core as a separate job after dearraying?

Nextflow operates at the level of files. If you want it to process each channel in parallel, then the previous step (segmenter) needs to generate a separate file for each channel.

...and you would need the quantifier to write a separate output file for each channel.
...and you would also need a separate tool that then combines all those files into a single one.

So, it's probably too messy to do it at the pipeline level.

I will play around with the code to make it more efficient! I will keep you all updated.

@ArtemSokolov if we did that, wouldn't saving each channel be better done by Ashlar when each channel is being stitched and registered? Segmenter just generates masks based on a handful of channels.

@clarenceyapp But ilastik needs all channels. Let's keep it as it is for now. It's easier to adapt the code.

@DenisSch can't ilastik load all channels from separate files?

Maybe you can try clearing variables in Python, like in Matlab. Is it already running out of memory on the 1st channel, or somewhere halfway?

@clarenceyapp See my post on Slack. The issue is in combining the output of multiple quantification processes.

Posting here for completeness:

Basically, Nextflow can spawn a bunch of processes for the same (Image.tif, Mask.tif) tuple:

docker run labsyspharm/quantification python app.py -f Image.tif --mask Mask.tif -m Marker1
docker run labsyspharm/quantification python app.py -f Image.tif --mask Mask.tif -m Marker2
docker run labsyspharm/quantification python app.py -f Image.tif --mask Mask.tif -m Marker3
...

but each process would generate its own output file, so those would need to be combined somehow.
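One way that combining step could look, sketched with in-memory CSV text standing in for the real per-marker output files (the column names and values are hypothetical):

```python
import csv
import io

# Hypothetical per-marker outputs: each process writes CellID -> mean
# intensity for a single marker. In-memory strings stand in for files.
marker_outputs = {
    "Marker1": "CellID,Marker1\n1,10.0\n2,20.0\n",
    "Marker2": "CellID,Marker2\n1,1.5\n2,2.5\n",
}

# Merge on CellID into one wide table, one column per marker.
combined = {}
for marker, text in marker_outputs.items():
    for row in csv.DictReader(io.StringIO(text)):
        combined.setdefault(row["CellID"], {})[marker] = row[marker]

print(combined)
```

The merge itself is cheap; the messiness Artem mentions is mostly in wiring this extra step into the pipeline.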

This commit should improve memory usage. With exemplar-001, it now peaks at about half the previous memory.
Master:

Current memory usage is 3.872657MB; Peak was 349.515514MB

New branch:

Current memory usage is 3.79084MB; Peak was 160.486444MB
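The numbers above look like output from Python's built-in tracemalloc module; a minimal sketch of how comparable figures can be produced:

```python
import tracemalloc

tracemalloc.start()
data = [0] * 1_000_000  # stand-in for loading a channel into memory
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Same format as the figures quoted above.
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
```

Tracking the current/peak split like this is useful here because it distinguishes transient allocations (peak) from what is actually retained (current).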

Even with the update I can't push the memory usage below 250GB. @JoshuaHess12 Any ideas on how to improve this? Split the image into pieces artificially?

The image I am working with is about 66GB in size.

I see now. If we could read tiles in the x-y dimensions of the full image (I think OpenSlide or Bio-Formats may be able to do this, but not scikit-image), we could create a nifty workaround that keeps track of single-cell contours from the regionprops function to extract mean pixel intensities. I am not sure what it would look like yet, and it may take a decent amount of effort, but it should be doable.
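A toy sketch of that streaming idea, assuming a tiled reader exists: accumulate per-label pixel sums and counts one tile at a time, then divide at the end, so mean intensities come out without the full image ever being resident. The 4x4 lists below stand in for real tile reads from disk:

```python
# Toy image and label mask; a real reader would fetch each tile from disk.
image = [[10, 10, 20, 20],
         [10, 10, 20, 20],
         [30, 30, 0, 0],
         [30, 30, 0, 0]]
mask = [[1, 1, 2, 2],
        [1, 1, 2, 2],
        [3, 3, 0, 0],
        [3, 3, 0, 0]]  # 0 = background

sums, counts = {}, {}
tile = 2  # tile edge length
for r in range(0, 4, tile):
    for c in range(0, 4, tile):
        # Only this tile's pixels are "in memory" at a time.
        for i in range(r, r + tile):
            for j in range(c, c + tile):
                label = mask[i][j]
                if label:  # skip background
                    sums[label] = sums.get(label, 0) + image[i][j]
                    counts[label] = counts.get(label, 0) + 1

means = {label: sums[label] / counts[label] for label in sums}
print(means)
```

Sums and counts compose across tiles, so cells that straddle tile boundaries are handled correctly without tracking contours explicitly; contour-dependent features (shape, eccentricity, etc.) would need more bookkeeping.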

@DenisSch @JoshuaHess12 can you confirm whether the line skimage.io.imread(image,img_num=z,plugin='tifffile') actually allows you to select a single image plane via the img_num argument? The documentation I'm looking at for this library doesn't mention it. https://scikit-image.org/docs/dev/api/skimage.io.html
Can you check the memory allocation immediately after reading an image to confirm if it's in fact the size of one channel? If the memory goes up by 66GB, then we know it's loading the whole dataset.

I'm using tifffile (https://pypi.org/project/tifffile/) which has explicit documentation on loading a single channel like this: tifffile.imread(path/to/image, key=index) where index is the channel/Z plane you want.

@clarenceyapp @DenisSch I can confirm the line skimage.io.imread(image,img_num=z,plugin='tifffile') reads a single plane from ome.tif images. I traced the memory allocation and array size after reading in an image and it seems to be doing what it is supposed to.

@JoshuaHess12 I tried multiple strategies and I believe that we (may) have to sacrifice speed to reduce memory load by writing out the results for each channel independently as I have done in histoCAT headless (https://github.com/DenisSch/histoCAT/blob/master/Headless_histoCAT_loading.m).

Do you have the capacity to implement it? That way we could run all channels, plus the spatial extraction (x, y, etc.), in parallel, save the results in independent files (HDF5?), and combine them in a subsequent step.