awslabs/multi-model-server

`com.amazonaws.ml.mms.metrics.MetricCollector - java.io.IOException: Broken pipe` and `error while loading shared libraries: libpython3.7m.so.1.0`

llorenzo-matterport opened this issue · 0 comments

Hi there!

We're encountering an issue with MMS when deploying MXNet models. We thought it was related to the way we're packaging the model, but after some digging it seems to be related to MMS with MXNet in CPU mode.

The errors we're seeing come from the metrics collector throwing exceptions, on hosts both with and without GPU devices. Steps to reproduce:

1. `docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-cpu-py37-ubuntu16.04`
2. `docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb` (replace with your image ID)
3. `multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar`
Full session output:

```
% docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb
root@eb4f03280c9c:/# multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
root@eb4f03280c9c:/# 2022-02-04T22:35:40,112 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /usr/local/lib/python3.7/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 1547 M
Python executable: /usr/local/bin/python3.7
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: N/A
Initial Models: squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-02-04T22:35:40,125 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar  preload_model: false
2022-02-04T22:35:41,145 [WARN ] main com.amazonaws.ml.mms.ModelServer - Failed to load model: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
com.amazonaws.ml.mms.archive.DownloadModelException: Failed to download model from: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar , code: 403
	at com.amazonaws.ml.mms.archive.ModelArchive.download(ModelArchive.java:156) ~[model-server.jar:?]
	at com.amazonaws.ml.mms.archive.ModelArchive.downloadModel(ModelArchive.java:72) ~[model-server.jar:?]
	at com.amazonaws.ml.mms.wlm.ModelManager.registerModel(ModelManager.java:99) ~[model-server.jar:?]
	at com.amazonaws.ml.mms.ModelServer.initModelStore(ModelServer.java:212) [model-server.jar:?]
	at com.amazonaws.ml.mms.ModelServer.start(ModelServer.java:315) [model-server.jar:?]
	at com.amazonaws.ml.mms.ModelServer.startAndWait(ModelServer.java:103) [model-server.jar:?]
	at com.amazonaws.ml.mms.ModelServer.main(ModelServer.java:86) [model-server.jar:?]
2022-02-04T22:35:41,160 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-02-04T22:35:41,449 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-02-04T22:35:41,451 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-02-04T22:35:41,459 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2022-02-04T22:35:41,477 [ERROR] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector -
java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_292]
	at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_292]
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_292]
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_292]
	at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_292]
	at com.amazonaws.ml.mms.metrics.MetricCollector.run(MetricCollector.java:76) [model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_292]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

root@eb4f03280c9c:/#
```
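
Despite the MetricCollector exception, the server itself does come up (see the last bullet under "Extra info" below). For reference, this is roughly how I sanity-check it from another shell inside the container; these curl calls are mine and are not part of the transcript above, and they assume MMS's default `/ping` health-check and `/models` management endpoints on the container-internal ports:

```bash
# Health check against the inference API (container port 8080, mapped to 60000 on the host)
curl http://127.0.0.1:8080/ping

# List the models currently registered with the management API (container port 8081)
curl http://127.0.0.1:8081/models
```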

#### After a good while (1-2 minutes):
```
root@eb4f03280c9c:/# 2022-02-04T22:36:41,413 [ERROR] Thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - /usr/local/bin/python3.7: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
```
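
The loader error is reported by the metrics collector for the Python process it launches (`/usr/local/bin/python3.7`). These are the kinds of checks that seem relevant inside the container; the paths are assumptions based on the DLC Dockerfile, not verified output:

```bash
# Can the interpreter resolve its shared libraries at all?
ldd /usr/local/bin/python3.7

# Where, if anywhere, does the library actually live in the image?
find / -name 'libpython3.7m.so*' 2>/dev/null

# What does the dynamic linker search path look like in this shell?
echo "$LD_LIBRARY_PATH"
cat /etc/ld.so.conf.d/*.conf
```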

Extra info:

  • I've tried MMS versions 1.1.7 and 1.1.8; same effect.
  • This happens only with CPU docker containers. I cannot reproduce this error on a GPU host unless I forget to pass the --gpus all flag to a GPU-enabled container, in which case we get similar Java exceptions, but related to CUDA, which makes sense.
  • The libpython3.7m.so.1.0 error hints to me that when the MMS worker sets up its Python execution environment, LD_LIBRARY_PATH is missing or wrongly set. You can reproduce this specific .so error by installing Python without setting LD_LIBRARY_PATH: for example, follow the steps in https://github.com/aws/deep-learning-containers/blob/master/mxnet/inference/docker/1.8/py3/Dockerfile.cpu up to line 91 and then try to run pip; at that point, setting LD_LIBRARY_PATH solves the issue. I've tried manually setting LD_LIBRARY_PATH prior to executing multi-model-server --start (...), but no luck (see the sketch after this list).
  • I've managed to load our model and serve it successfully, even with the errors reported above appearing in the logs/stdout.
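
For reference, this is roughly what I mean by setting the library path by hand; a minimal sketch to run inside the container before multi-model-server --start, assuming libpython3.7m.so.1.0 ends up under /usr/local/lib (adjust to wherever `find` locates it in your image):

```bash
# Locate the shared library first; /usr/local/lib below is an assumption
find / -name 'libpython3.7m.so.1.0' 2>/dev/null

# Option 1: export the path before starting MMS (this is what I tried, without luck)
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar

# Option 2: register the directory with the dynamic linker cache instead, so that
# subprocesses find the library even if they don't inherit LD_LIBRARY_PATH
echo /usr/local/lib > /etc/ld.so.conf.d/libpython3.7.conf
ldconfig
```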

Any help to understand this would be appreciated, thanks!