awslabs/multi-model-server

Error during multithreaded inference on a GPU machine with preload_model=True and default_workers_per_model=2

msameedkhan opened this issue · 1 comments

**Initially I was getting this error:**

Traceback (most recent call last):
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/mms/service.py", line 108, in predict
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 79, in handle
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     result = _service.inference(image)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/service.py", line 41, in inference
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     batch_size=50)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 382, in readtext
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     add_margin, add_free_list_margin, False)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/easyocr.py", line 305, in detect
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     False, self.device)
2020-11-10 15:14:25,940 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 111, in get_textbox
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     bboxes, polys = test_net(canvas_size, mag_ratio, detector, image, text_threshold, link_threshold, low_text, poly, device)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/home/model-server/tmp/models/4a3eed4c207edc8eb1d4e78a953f1151424ac604/easyocr/detection.py", line 37, in test_net
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     x = x.to(device)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 164, in _lazy_init
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     "Cannot re-initialize CUDA in forked subprocess. " + msg)
2020-11-10 15:14:25,941 [INFO ] W-get-text-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, yo

**After adding this line:**

`torch.multiprocessing.set_start_method('spawn', True)`

**I now get the following error:**

Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Backend worker process died
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 241, in <module>
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     worker.run_server()
2020-11-11 05:55:46,338 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/mms/model_service_worker.py", line 213, in run_server
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     p.start()
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._popen = self._Popen(self)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _default_context.get_context().Process._Popen(process_obj)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return Popen(process_obj)
2020-11-11 05:55:46,339 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
2020-11-11 05:55:46,340 [ERROR] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
	at io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
2020-11-11 05:55:46,340 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     super().__init__(process_obj)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._launch(process_obj)
2020-11-11 05:55:46,341 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-f6cb4ddf Worker disconnected. WORKER_STARTED
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     reduction.dump(process_obj, fp)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
2020-11-11 05:55:46,341 [DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
	at com.amazonaws.ml.mms.wlm.WorkerThread.runWorker(WorkerThread.java:145)
	at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:208)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2020-11-11 05:55:46,341 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ForkingPickler(file, protocol).dump(obj)
2020-11-11 05:55:46,342 [INFO ] W-9000-get-text-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - TypeError: can't pickle module objects
2020-11-11 05:55:46,343 [WARN ] W-9000-get-text com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: get-text, error: Worker died.
2020-11-11 05:55:46,343 [DEBUG] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-get-text State change WORKER_STARTED -> WORKER_STOPPED
2020-11-11 05:55:46,344 [INFO ] W-9000-get-text com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-f6cb4ddf in 1 seconds.
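
The final `TypeError` in the log above comes from `reduction.dump(process_obj, fp)`: with the `spawn` start method, the `Process` object and its target have to be pickled before the child starts. The sketch below reproduces that kind of failure without MMS or a GPU. The `Worker` class and its `backend` attribute are hypothetical stand-ins; which attribute of the real MMS worker holds a module is not known from the log.

```python
import pickle
import types


class Worker:
    """Hypothetical stand-in for an object that keeps a module as an attribute."""

    def __init__(self):
        self.backend = types  # any module object triggers the same problem

    def start_worker(self):
        return "started"


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the exception pickle raised."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError) as exc:
        return exc
    return None


worker = Worker()
# A bound method is pickled together with the instance it is bound to,
# so every attribute of that instance must be picklable as well.
error = try_pickle(worker.start_worker)
print(type(error).__name__, error)
```

If this is what happens here, switching to `spawn` only trades the CUDA fork error for a pickling error, because the process target is a bound method of an object that cannot be serialized.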

Any help would be highly appreciated. Thanks.

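With preload_model=True, the model is already loaded onto the GPU inside the service process in model_service_worker.py. However, multiprocessing.Process(target=self.start_worker, args=(cl_socket,)) then runs in a child process that shares the GPU data held by the main service process. GPU data therefore ends up shared across different processes and different GPUs, which is why the error is raised.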
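
One way to act on the explanation above is to make sure the parent process never touches the GPU, and to build the model only inside the worker that serves requests. The following is a minimal sketch of that lazy-loading idea, not a tested fix for MMS. `LazyModel` and `fake_loader` are hypothetical names; in a real `service.py` the loader would be whatever constructs the OCR reader on the GPU.

```python
class LazyModel:
    """Defer an expensive, process-bound load until the first request."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable that builds the model
        self._model = None     # nothing is loaded at construction time

    def get(self):
        # Load on first use only, so the load happens in whichever
        # process actually handles the first request.
        if self._model is None:
            self._model = self._loader()
        return self._model


load_calls = []


def fake_loader():
    # Stand-in for the real model construction (assumed to initialize CUDA).
    load_calls.append("loaded")
    return object()


lazy = LazyModel(fake_loader)
calls_after_construction = len(load_calls)  # still 0: nothing loaded yet
first = lazy.get()                          # triggers the single load
second = lazy.get()                         # reuses the cached model
```

In a handler, `lazy.get()` would be called inside `handle` rather than at import time. Whether this alone is enough depends on the parent not initializing CUDA anywhere else, which presumably also means running with preload_model=False.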