usnistgov/MIST

Live Stitching

Opened this issue · 15 comments

Hello,

Thank you for this nice stitching project.

Is it possible to stitch while scanning the tissue?

FFTW is fast, but if I take many tiles, for example 1000 or more, stitching them takes too much time. The final image is larger than 50,000 x 50,000 pixels.

My tiles are larger than 2048 x 1536 pixels. Sometimes I also activate the linear blending method, and the result is fine, but blending also takes a lot of time. If I deactivate blending and only use assembling (with Python), the stitching result is not good; there are some artifacts in the overlap areas (I can also open a new issue for that one).

I actually tried to stitch row by row, building up stitched rows while scanning, but then encoding/decoding the big intermediate images takes too much time.

Does MIST have this live-stitching functionality, or would it be possible to add it? I am not good at Java, but maybe I can help on the Python branch.

What hardware are you using (disk, CPU, RAM)? The algorithm will scale to the number of cores, assuming you have sufficient memory. Also double check that the software is not falling back to sequential execution. In our experience we have stitched very large grids, and on our hardware (32-core, SSD, 256 GB RAM machines) we often see stitching times of a couple of minutes, with the majority of the time spent on assembly.

We thought a while back about live stitching, but the algorithm requires three passes through the images, from stitching to assembly. The first pass calculates a search space for the global optimization, the second pass identifies the optimal translations, and the final pass traverses the grid to assemble the image based on the highest-quality translations. A streaming algorithm would therefore need code that handles the global optimization and assembly on the fly, which we believe would greatly impact accuracy.
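To make the three-pass structure concrete, here is a heavily simplified Python sketch of that pipeline. All names are illustrative, not MIST's actual API, and the first two passes are stubs that just trust the nominal stage positions — which is exactly the shortcut a streaming version would be tempted to take, and why accuracy would suffer:

```python
import numpy as np

def pass1_search_space(grid_shape, nominal_overlap):
    # Pass 1 (sketch): MIST builds a stage model here that bounds where
    # each pairwise translation can plausibly lie.  This stub simply
    # reports the nominal overlap for every tile.
    rows, cols = grid_shape
    return {(r, c): nominal_overlap for r in range(rows) for c in range(cols)}

def pass2_positions(search, tile_h, tile_w):
    # Pass 2 (sketch): MIST refines each translation with FFT-based
    # correlation plus a global optimization; this stub just converts
    # the nominal overlaps into absolute grid positions.
    return {(r, c): (r * (tile_h - ov), c * (tile_w - ov))
            for (r, c), ov in search.items()}

def pass3_assemble(tiles, positions, tile_h, tile_w):
    # Pass 3: traverse the grid and paste each tile at its resolved
    # position (overlay blending: later tiles overwrite the overlap).
    max_y = max(y for y, x in positions.values())
    max_x = max(x for y, x in positions.values())
    canvas = np.zeros((max_y + tile_h, max_x + tile_w))
    for key, tile in tiles.items():
        y, x = positions[key]
        canvas[y:y + tile_h, x:x + tile_w] = tile
    return canvas
```

The point of the sketch is that pass 3 depends on the *globally* optimized positions from pass 2, which in turn depend on statistics over *all* pairwise translations from pass 1 — so no tile's final position is known until every tile has been seen.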

There are two ideas to consider:
(1) Improve your hardware, e.g., PCIe gen5 NVMe SSDs, more cores, and more RAM.
(2) Help with updating the CUDA version. There are a couple of functions that need to be reimplemented, outlined in this issue: #25

As for Python, I'm not too sure how much better a Python version would perform compared to the Java version. We utilize native bindings to C-level calls in the Java version, so the majority of the computation is done close to the metal rather than in the virtual machine. Disk I/O could have a massive detrimental effect, though. However, I think it would be fantastic to get support with developing the Python version, and @mmajurski can provide additional details about its status.

We do have a legacy C++ version of the MIST algorithm, but I'm not 100% sure that all of the bells and whistles added since its inception have made it in. So there may be some performance discrepancies between the Java and C++ versions.

The python version is a work in progress. I have vague aspirations to use a framework like pytorch to build the stitching compute graph and then let its dataflow engine orchestrate, but I lack the cycles to do it.

I need to get a bare bones (and likely slower) numpy version of the stitching working and validated. If I recall, I am about 70% done with the numpy version. If you are inclined to python and have the spare time, feel free to make pull requests against the python branch to move it towards completion.
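For anyone diving into the Python branch, the pairwise translation estimate at the heart of the method is FFT phase correlation, which a bare-bones NumPy version can sketch in a few lines (a simplified illustration, not the branch's actual code):

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the (dy, dx) shift for which b ~= np.roll(a, (dy, dx))."""
    # Cross-power spectrum, normalized to keep only phase information.
    f = np.conj(np.fft.fft2(a)) * np.fft.fft2(b)
    f /= np.maximum(np.abs(f), 1e-12)
    corr = np.fft.ifft2(f).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Fold the circular peak location into a signed displacement.
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return int(dy), int(dx)
```

MIST does considerably more on top of this (peak interpolation, quality metrics, the stage model), but this is the kernel that FFTW accelerates in the Java version.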

I do know the C++ version is missing many of the improvements we have made since starting the Java Fiji plugin version. So I would not use that; the result will be sub-par.

Turns out the basics of the Python version are complete there. Good job, past me...

It lacks some of the bells and whistles of the Fiji plugin, but should cover the basics.

Hello,

Thanks for the information

My hardware:

CPU: i7, 4.90 GHz (turbo), 16 threads on 8 physical cores
RAM: 40 GB total, 25-30 GB available
SSD: Samsung Evo 970 Plus

Stitching statistics (tiles on HDD):

Stitching took 166 seconds for about 1000 tiles of 1520 x 2688 pixels
Blending Profile: Init Time: 92 seconds
Blend Time: 49 seconds
Blend Call Time: 48 seconds
Post Proccess Time: 92 seconds

Total time with blending: 432 seconds for about 1000 tiles of 1520 x 2688 pixels

Stitching statistics (tiles on SSD):

Stitching took 159 seconds for about 1000 tiles of 1520 x 2688 pixels
Blending Profile: Init Time: 96 seconds
Blend Time: 55 seconds
Blend Call Time: 55 seconds
Post Proccess Time: 50 seconds

Total time with blending: 373 seconds for about 1000 tiles of 1520 x 2688 pixels

In the normal case when scanning at 40x, 2-3 MP resolution for the tiles is enough, but sometimes I can only downsample the tiles to 5-6 MP resolution (the camera has 12 MP resolution).

I know this affects the speed a lot, but when it comes to digital zoom, 2-3 MP is sometimes not enough.

Please take a look: I have shared the tiles, img-statistics, and the final stitched image. On my side the tiles are saved as .bmp, but I converted them to .jpeg (2 GB upload limit).

If you also want to stitch them, note that JPEG decoding takes time; you can convert them back to .bmp (to measure the real stitch time).

CUDA support could handle this issue, but sorry, I have no JCUDA knowledge.

On the Python side, I would love to contribute to MIST. First I need to play with the Python version and understand how the internal algorithm works.

Later I can also play with CuPy (a Python GPU library) in the Python version; it is compatible with NumPy. And if I have enough time, I will explore how to convert it to live stitching.
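Since CuPy mirrors the NumPy API, a common pattern is to parameterize the array module so the same kernel runs on GPU when CuPy (and a CUDA device) is available, and on CPU otherwise. A hedged sketch, not code from the MIST Python branch:

```python
# `xp` is whichever array module is available: CuPy mirrors the NumPy
# API, so the same kernel runs on the GPU when CuPy is installed and
# falls back to NumPy on the CPU otherwise.
try:
    import cupy as xp
except ImportError:
    import numpy as xp

def correlation_peak(a, b):
    """Locate the phase-correlation peak between two overlapping tiles."""
    f = xp.conj(xp.fft.fft2(a)) * xp.fft.fft2(b)
    f /= xp.maximum(xp.abs(f), 1e-12)   # keep phase only
    corr = xp.fft.ifft2(f).real
    return xp.unravel_index(xp.argmax(corr), corr.shape)
```

One caveat with this pattern: host-device transfers can dominate for small tiles, so batching tiles on the GPU usually matters more than the FFT itself.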

Thanks again for this nice stitching tool.
I hope I can contribute to MIST.

Sounds great! Contributions are always welcome!

In the meantime, could you also post the log file from your execution? I'd like to take a look to make sure it's not actually running in sequential mode. I'd also check that the JVM memory is increased from the default to use more of your 40 GB. We automatically select the sequential path if we detect there is insufficient memory, so I want to rule that scenario out.

Hello,

This is my command:

java.exe -Xmx40G -jar MIST_-2.1-jar-with-dependencies.jar stitch_args..

After I run this command, there is no GUI, and the log file contains:

IJ Log string was null. 
This is likely due to the stitching being run in Headless mode.

I also set headless=False, but the log file content is the same. So to work around this, I redirected the stdout of the Java process:

import shlex, subprocess

with open("stdout_log.txt", "w") as stdout_log, open("stderr_log.txt", "w") as stderr_log:
    proc = subprocess.Popen(shlex.split(cmd), stderr=stderr_log, stdout=stdout_log)
    proc.wait()

Here are stdout_log.txt (loglevel = verbose) and statistics.txt:

statistics.txt
stdout_log.txt

Overall it looks good! I don't see any performance issues with your execution of the Java version. If you have access to a higher-end machine, it would be interesting to see how things scale. The current execution time is certainly not ideal, but we do utilize every core on the machine and cache data where we can to avoid hitting the disk.

I could try to download the tiles, but I'm a little wary of using the fastupload site. Do you happen to have access to a Google Drive account where you could upload the uncompressed tiles and share them? I could then try out the stitching on a variety of machines to see how things scale.

@tblattner I am interested in this portion: "Blending Profile: Init Time: 96 seconds"
How is the blend taking 96 seconds to init?

Stitching took 159 seconds for about 1000 tiles of 1520 x 2688 pixels
Blending Profile: Init Time: 96 seconds
Blend Time: 55 seconds
Blend Call Time: 55 seconds
Post Proccess Time: 50 seconds

Total time with blending: 373 seconds for about 1000 tiles of 1520 x 2688 pixels

@alexlsa is the statistics.txt file you shared from the run with -Xmx40G? The file shows:

Free memory available to JVM (GB): 1
Total memory available to JVM (GB): 1
Max memory for JVM (GB): 40

With only 1 GB available free to the JVM, despite 40 being the max.

Also @tblattner, we need to add timestamps to the MIST logging. Reading through the out file tells me nothing about when things happened. I have been spoiled by the Python logging utility.

According to the statistics.txt, it did not use sequential stitching. Yeah, the blending is painfully slow, mainly because everything in Java is a pain when you try to do it in a scalable way, haha. Aside from writing a new tiled TIFF writer, there is sadly not much more we can do to optimize that section of code.

But yeah, I was also confused by our output of total available memory. It might be a typo, or a misunderstanding of how we interpret those values, since the code deemed that sequential mode was not needed and that all tiles could be kept in memory.

@mmajurski yes, I also ran them with -Xmx40G.

I also tried without -Xmx40G; this time it took much longer, and it said:

Free memory available to JVM (GB): 0
Total memory available to JVM (GB): 0
Max memory for JVM (GB): 9
Running sequential version (LOW MEMORY): false
Keep all pixel data in memory: false

I also ran MIST via Fiji, with Maximum Memory set to 40525 MB and Parallel Threads set to 16. Here are the log and statistics:
img-log.txt
img-statistics.txt

@tblattner here is the drive link; I can also add tiles from different tissues.

Thank you

Hello @tblattner,

Did you have time to stitch the tiles?

Will try to find some cycles either tomorrow or sometime next week. Been a busy month!

Yes, I understand. Thank you for your interest and time.

Had a chance to run it on my local desktop, which is not extremely high-end but should be representative. One thing is clear: the linear blend is very slow, so opportunities to accelerate it could help a lot with assembly. Using the overlay blending mode instead brought the full image time to around 34 seconds, versus 262 seconds for the linear blend.
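To illustrate why overlay is so much cheaper than linear blending, here is a small NumPy sketch (not MIST's actual code; the feathering weight function is a generic distance-to-edge ramp, chosen for illustration): overlay is one array assignment per tile, while linear blending touches every overlap pixel once per covering tile and then needs a normalization pass.

```python
import numpy as np

def overlay_paste(canvas, tile, y, x):
    # Overlay: later tiles simply overwrite the overlap region --
    # a single array assignment per tile, hence the speed.
    h, w = tile.shape
    canvas[y:y + h, x:x + w] = tile

def linear_accumulate(canvas, weight, tile, y, x):
    # Linear blend: every pixel becomes a weighted average of all tiles
    # covering it, so each overlap pixel is touched once per covering
    # tile, and a final normalization pass (canvas / weight) is needed.
    h, w = tile.shape
    wy = np.minimum(np.arange(h) + 1, np.arange(h)[::-1] + 1)
    wx = np.minimum(np.arange(w) + 1, np.arange(w)[::-1] + 1)
    w2d = np.outer(wy, wx).astype(float)   # weight peaks at the tile centre
    canvas[y:y + h, x:x + w] += tile * w2d
    weight[y:y + h, x:x + w] += w2d
```

After accumulating all tiles, the blended image is `canvas / weight` over the covered region, which is the extra post-process cost that overlay mode skips entirely.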

Here are all of the stats on this machine:
Total time for experiment (ms): 201171
Total Stitching Time (ms): 165469
Relative Displacement Time (ms): 119202
Global Optimization Time (ms): 46258
Stage Model Build Time (ms): 74
Global Position Time (ms): 8
Output Full Image Time (ms): 33999

Running on an i9-10900 CPU (20 logical cores), an NVMe gen4 SSD, and 64 GB of RAM.

A faster SSD and more cores can help with the relative displacement time. Having enough RAM to hold the tiles (avoiding re-reads) will help across the board. And using overlay instead of linear blend will help with the full image time.
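On the "enough RAM to hold tiles" point, even a simple memoized loader avoids re-reading and re-decoding tiles on the later passes. A hypothetical sketch (`np.load` stands in for whatever BMP/JPEG decoder the pipeline uses):

```python
from functools import lru_cache
import numpy as np

# Hypothetical tile loader: memoizing decoded tiles means later passes
# reuse them from RAM instead of re-reading and re-decoding them.
# Cost: roughly tile_bytes * maxsize of extra memory.  Note that
# lru_cache returns the *same* array object on a hit, so callers must
# treat cached tiles as read-only.
@lru_cache(maxsize=256)
def load_tile(path):
    # np.load stands in for the real decoder (BMP/JPEG/TIFF).
    return np.load(path)
```

Sizing `maxsize` to the available RAM (e.g. 256 tiles of a 12 MP 8-bit image is roughly 3 GB) is the knob that trades memory for disk traffic.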

Will try to run the experiment on some higher-end hardware another time, but this is a good example dataset for playing with scaling. We have some servers with 192 cores and faster SSDs, so I'm curious how things will scale. Thanks for sharing it!

Hello @tblattner,
I'm sorry for the late reply.

Thank you for your time. Overlay blending is a good alternative to plain assembling, but linear blending gives a perfect stitching result; as you know, though, it is slow.

If possible, could you please share your assembled, stitched final image (and the stitching parameters) so that I can compare it with my result?

I can also scan the tissue at resolutions varying from 1 MP to 12 MP; I can share those with you via Google Drive whenever you want.

Thanks again for your help