scverse/rapids_singlecell

"IOStream.flush timed out" is printed when running rsc.tl.harmony_integrate

Closed this issue · 13 comments

I have finished PCA, but when I run rsc.tl.harmony_integrate, the message "IOStream.flush timed out" is printed repeatedly and Jupyter stays stuck in the "connecting" state. Restarting the Jupyter kernel does not help; the error still occurs.

rsc.tl.harmony_integrate(adata,'sample')

2023-08-09 10:51:35,319 - harmonypy_gpu - INFO - Iteration 1 of 10
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
2023-08-09 13:59:08,718 - harmonypy_gpu - INFO - Iteration 2 of 10
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out

Which version of rapids-singlecell are you running?

0.7.2

Can you update to 0.7.5 and see if the error is still there?
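One way to confirm which version is installed before and after updating is sketched below. It assumes the distribution is registered under the name "rapids-singlecell" and that versions are plain dotted numbers; the helper names are hypothetical and not part of rapids-singlecell.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.7.5' into (0, 7, 5), stopping at any non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'post1'
        parts.append(int(piece))
    return tuple(parts)


def installed_version(dist_name="rapids-singlecell"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def needs_update(current, minimum="0.7.5"):
    """True if `current` is older than `minimum`."""
    return version_tuple(current) < version_tuple(minimum)


current = installed_version()
if current is None:
    print("rapids-singlecell is not installed in this environment")
elif needs_update(current):
    print(f"{current} is older than 0.7.5, consider updating")
else:
    print(f"{current} is 0.7.5 or newer")
```

Run this in the same environment (and kernel) that the notebook uses, since a different conda environment may report a different version.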

Ok, I'll try after updating to 0.7.5.
Everything works fine so far when running on a dataset with a small number of cells, but when I use a dataset with a large number of cells I get this error.

It might be related to #35 which was fixed with version 0.7.5

@YH-Zheng if it works with 0.7.5 let me know so I can close the issue

@Intron7 Of course. I have now updated to 0.7.5 and started running harmony_gpu; I will let you know the result as soon as possible.

@Intron7 After 11 hours of computation, the first iteration has still not completed. With sc, one iteration takes about 3 hours. The Jupyter kernel shows "connecting", but the GPU can still be seen performing computation.
[Screenshot, 2023-08-10 09:20:18]

The GPU activity has now stopped and one CPU core is busy, but the Jupyter kernel still cannot connect. When I checked the Jupyter log file, I again found the "IOStream.flush timed out" error.

Traceback (most recent call last):
File "/home/zhengyuhui/anaconda3/envs/scanpy/lib/python3.8/site-packages/tornado/websocket.py", line 1089, in wrapper
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
Task exception was never retrieved
future: <Task finished name='Task-114505' coro=<WebSocketProtocol13.write_message.<locals>.wrapper() done, defined at /home/zhengyuhui/anaconda3/envs/scanpy/lib/python3.8/site-packages/tornado/websocket.py:1085> exception=WebSocketClosedError()>
Traceback (most recent call last):
File "/home/zhengyuhui/anaconda3/envs/scanpy/lib/python3.8/site-packages/tornado/websocket.py", line 1087, in wrapper
await fut
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/zhengyuhui/anaconda3/envs/scanpy/lib/python3.8/site-packages/tornado/websocket.py", line 1089, in wrapper
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out
IOStream.flush timed out

What kind of GPU are you using and how big is your dataset?

GPU: 2080ti Mem: 768G
Data shape: 6903432, 36326
hvg: 3000
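For a rough sense of scale, the numbers above can be turned into a dense-matrix footprint. This is only a back-of-the-envelope sketch: it assumes the first dimension is cells, that the data is held as dense float32, and that 50 principal components are used (the thread does not state the number of PCs). It says nothing about the additional working memory Harmony itself allocates.

```python
# Bytes per element for the dtypes considered in this sketch.
ITEMSIZE = {"float32": 4, "float64": 8}


def dense_nbytes(n_rows, n_cols, dtype="float32"):
    """Size in bytes of a dense n_rows x n_cols matrix of the given dtype."""
    return n_rows * n_cols * ITEMSIZE[dtype]


n_cells = 6_903_432  # first entry of the reported data shape
n_hvg = 3_000        # reported number of highly variable genes
n_pcs = 50           # assumption: not stated in the thread

hvg_gb = dense_nbytes(n_cells, n_hvg) / 1e9
pca_gb = dense_nbytes(n_cells, n_pcs) / 1e9

print(f"dense HVG matrix: {hvg_gb:.1f} GB")   # dense HVG matrix: 82.8 GB
print(f"PCA embedding:    {pca_gb:.2f} GB")   # PCA embedding:    1.38 GB
```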

When I inspected adata, I found that the Harmony results had been added, but I don't know whether they can be trusted given the errors still in the Jupyter log. The run took about 13 hours, which is close to the time sc needs for this step, and no progress information was printed.
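A basic sanity check on the stored embedding could look like the sketch below. It assumes the corrected embedding is stored under "X_pca_harmony" and the input under "X_pca", and that both are NumPy arrays; `check_embedding` is a hypothetical helper, not part of rapids-singlecell. Passing these checks only rules out obviously broken output; it does not show that the integration is biologically sound.

```python
import numpy as np


def check_embedding(obsm, n_obs, key="X_pca_harmony", reference_key="X_pca"):
    """Return a list of problems found in obsm[key]; an empty list means none."""
    if key not in obsm:
        return [f"{key} is missing"]
    problems = []
    emb = np.asarray(obsm[key])
    if emb.shape[0] != n_obs:
        problems.append(f"expected {n_obs} rows, found {emb.shape[0]}")
    if not np.isfinite(emb).all():
        problems.append("embedding contains NaN or infinite values")
    if reference_key in obsm:
        ref = np.asarray(obsm[reference_key])
        if ref.shape == emb.shape and np.array_equal(ref, emb):
            problems.append("embedding is identical to the uncorrected input")
    return problems


# Synthetic stand-in for adata.obsm, used only to demonstrate the helper.
rng = np.random.default_rng(0)
pca = rng.normal(size=(100, 5))
fake_obsm = {"X_pca": pca, "X_pca_harmony": pca + 0.1}
print(check_embedding(fake_obsm, n_obs=100))
```

With a real object the call would be `check_embedding(adata.obsm, adata.n_obs)`.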

@YH-Zheng, the root of the problem might indeed be the memory limitation of the GPU being used. The 2080ti, while a powerful card, comes with 11GB of VRAM. This is often sufficient for many tasks, but when dealing with large datasets like the one you've mentioned, it can quickly become a limitation.

Even with the RAPIDS Memory Manager (RMM) enabled, which allows for GPU memory oversubscription, you can usually only exceed the native memory by about twice its size, i.e. roughly three times the native memory in total. For an 11GB card, this gives an upper bound of around 33GB of usable GPU memory for your computations. Given the size of your dataset and the memory-intensive nature of operations like Harmony, it's likely that this limitation is causing the issues you're observing.
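The arithmetic behind the 33GB figure can be sketched as follows, reading "a factor of 2" as roughly twice the native memory on top of the card's own 11GB. The helper names are hypothetical. `enable_managed_memory` shows one way to switch on RMM managed memory for CuPy; it assumes an RMM release that provides `rmm.allocators.cupy` and is only defined here, not called, because it needs a GPU.

```python
def oversubscription_bound_gb(vram_gb, extra_factor=2.0):
    """Rough upper bound on usable memory: native VRAM plus extra_factor times VRAM.

    extra_factor=2.0 is the rule of thumb from the comment above, not a hard limit.
    """
    return vram_gb * (1.0 + extra_factor)


def enable_managed_memory():
    """Route CuPy allocations through RMM with managed (unified) memory.

    Assumption: cupy and rmm are installed and a CUDA GPU is available.
    """
    import cupy as cp
    import rmm
    from rmm.allocators.cupy import rmm_cupy_allocator

    rmm.reinitialize(managed_memory=True, pool_allocator=False)
    cp.cuda.set_allocator(rmm_cupy_allocator)


print(oversubscription_bound_gb(11))  # 33.0
```

Managed memory lets allocations spill into host RAM, but data paged between host and GPU is much slower to access, so runs that rely heavily on oversubscription can take far longer than ones that fit in VRAM.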

@Intron7, I understand what you mean, but the 2080ti I use has had its memory increased to 22GB. Does this mean that I can use up to 66GB of memory, and that because my dataset is so large even this may not be enough, which led to these errors?

Thank you for your reply!