For multithreaded inference on a GPU machine with preload_model=True and default_workers_per_model=2, I am getting the following error
msameedkhan opened this issue · 1 comment
msameedkhan commented
**At the very start, I was getting this error:**
Traceback (most recent call last):
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/local/lib/python3.6/dist-packages/mms/service.py", line 108, in predict
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 79, in handle
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - result = _service.inference(image)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 41, in inference
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - batch_size=50)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 382, in readtext
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - add_margin, add_free_list_margin, False)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 305, in detect
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - False, self.device)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 111, in get_textbox
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - bboxes, polys = test_net(canvas_size, mag_ratio, detector, image, text_threshold, link_threshold, low_text, poly, device)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 37, in test_net
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - x = x.to(device)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 164, in _lazy_init
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - "Cannot re-initialize CUDA in forked subprocess. " + msg)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, yo
**After adding this line**
torch.multiprocessing.set_start_method('spawn', True)
**I'm now getting the following error**
Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Backend worker process died
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 241, in <module>
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - worker.run_server()
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 213, in run_server
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - p.start()
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self._popen = self._Popen(self)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return _default_context.get_context().Process._Popen(process_obj)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return Popen(process_obj)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
2020-11-11 05:55:46,340 [ERROR] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
at io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
2020-11-11 05:55:46,340 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - super().__init__(process_obj)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self._launch(process_obj)
2020-11-11 05:55:46,341 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-f6cb4ddf Worker disconnected. WORKER_STARTED
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - reduction.dump(process_obj, fp)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
2020-11-11 05:55:46,341 [DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at com.amazonaws.ml.mms.wlm.WorkerThread.runWorker(WorkerThread.java:145)
at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ForkingPickler(file, protocol).dump(obj)
2020-11-11 05:55:46,342 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - TypeError: can't pickle module objects
2020-11-11 05:55:46,343 [WARN ] W-9000-get-text com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: get-text, error: Worker died.
2020-11-11 05:55:46,343 [DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-get-text State change WORKER_STARTED -> WORKER_STOPPED
2020-11-11 05:55:46,344 [INFO ] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-f6cb4ddf in 1 seconds.
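If it helps, my reading of this second failure is that with the 'spawn' start method the child process is created by pickling the Process target and its arguments (the reduction.dump call in the traceback above), and with preload_model=True the object being pickled already holds module/model references. Below is a minimal, self-contained sketch of that mechanism (hypothetical names, not the actual MMS code):

```python
# Minimal sketch of the pickling failure (hypothetical example, not MMS code):
# with the 'spawn' start method, multiprocessing pickles the Process target and
# its arguments via reduction.dump before the child starts. Any object that
# keeps a module (or anything holding module references, such as a preloaded
# model/service) as an attribute cannot be pickled.
import multiprocessing as mp
import json  # stands in for any module kept as an instance attribute


class PreloadedService:
    def __init__(self):
        self.codec = json  # a module object stored on the instance


def start_worker(service):
    print("worker got", service)


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    svc = PreloadedService()
    p = mp.Process(target=start_worker, args=(svc,))
    p.start()  # raises: TypeError: can't pickle module objects
    p.join()
```

So the spawn workaround seems to move the problem from CUDA re-initialization in a forked child to pickling whatever the preloaded service holds onto.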
Any help would be highly appreciated. Thanks!
wdh234 commented
With preload_model=True, the model is already loaded onto the GPU inside the model_service_worker.py service process. But multiprocessing.Process(target=self.start_worker, args=(cl_socket,)) then makes the child process share the GPU data that lives in the main service process. This means GPU data is shared across different processes (and different GPUs), which is why the error is raised.
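A rough sketch of the pattern this points to (a hypothetical handler layout; the easyocr.Reader arguments and the lazy-initialization idea are assumptions, not the original service.py): keep preload_model=False so the parent server process never initializes CUDA, and build the reader inside the backend worker process on first use.

```python
# Hypothetical handler sketch: defer all GPU work to the backend worker process
# by constructing the easyocr.Reader lazily on the first request, instead of
# preloading it in the parent server process (preload_model=False).
import easyocr


class OCRService:
    def __init__(self):
        self.reader = None  # nothing touches CUDA at import/preload time

    def _ensure_reader(self):
        # Runs inside the backend worker process, so CUDA is initialized
        # in the child process rather than in the parent that created it.
        if self.reader is None:
            self.reader = easyocr.Reader(['en'], gpu=True)

    def inference(self, image):
        self._ensure_reader()
        return self.reader.readtext(image, batch_size=50)


_service = OCRService()


def handle(data, context):
    if data is None:
        return None
    return [_service.inference(item.get('body')) for item in data]
```

With the model created only inside each worker, neither the forked-subprocess CUDA error nor the spawn pickling error should be triggered, at the cost of every worker loading its own copy of the model.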