NVIDIA/ai-assisted-annotation-client

Slicer default server needs an upgrade to latest version of AIAA/Clara

SachidanandAlle opened this issue · 14 comments

@lassoan, creating this issue to track the Clara v3 release and the upgrade of the default public server for 3D Slicer.

Hi all
The Clara v3 release is here: https://ngc.nvidia.com/containers/nvidia:clara-train-sdk
To upgrade the default server (http://skull.cs.queensu.ca:8123/) to v3, we need to do the steps below:

  1. The deep neural network weights (those model.trt.pb files) can be re-used, so we first need to back up the AIAA workspace by copying its models/ folder to disk so it persists.

  2. docker pull the v3 container and start it following the instructions in the documentation; this part is similar to how you started the server before. Note that when running the container, use -p 8123:80, and don't specify ports when running start_aas.sh.

  3. Now that the new server is running, let's load all the models back using the new configs provided as attached.
    configs.zip

    The command to load a model back would be:

#!/bin/bash
MODEL_NAME="segmentation_ct_spleen"
MODEL_PATH=<where you backed up the workspace>

curl -X PUT "http://skull.cs.queensu.ca:8123/admin/model/$MODEL_NAME" \
       -F "config=@$MODEL_NAME.json" \
       -F "data=@$MODEL_PATH/$MODEL_NAME.trt.pb"
  4. One new model has been added, the DeepGrow model; it can be downloaded here:
    https://ngc.nvidia.com/catalog/models/nvidia:clara_train_deepgrow_aiaa_inference_only/
    You can use the following command to load it into your AIAA server:
#!/bin/bash
MODEL_NAME="clara_deepgrow"

curl -X PUT "http://skull.cs.queensu.ca:8123/admin/model/$MODEL_NAME" \
       -F "data=@files.zip"

@SachidanandAlle it took me very long to get back to this request, but I'm here now.

I've tried to run Clara 3.0 on the same server by running this command:

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=30000000 -it --restart unless-stopped -p 8123:80 -v /var/nvidia/aiaa3:/workspace --name=aiaa3 nvcr.io/nvidia/clara-train-sdk:v3.0 start_aas.sh --workspace /workspace --monitor true

I've uploaded models that are listed here: https://ngc.nvidia.com/catalog/collections/nvidia:claratrainframework

For example:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/med/clara_train_deepgrow_aiaa_inference_only/versions/1/zip -O clara_train_deepgrow_aiaa_inference_only_1.zip
curl -X PUT "http://127.0.0.1:8123/admin/model/clara_deepgrow" -F "data=@clara_train_deepgrow_aiaa_inference_only_1.zip"

Model upload was successful (with a few exceptions), and the models appear correctly in the model selectors in Slicer.

All 5 annotation models work perfectly:
clara_mri_annotation_brain_tumors_t1ce_tc_amp_1.zip
clara_mri_annotation_brain_tumors_t1ce_tc_no_amp_1.zip
clara_ct_annotation_spleen_amp_1.zip
clara_ct_annotation_spleen_no_amp_1.zip
clara_train_covid19_annotation_ct_lung_1.zip

However, all auto-segmentation models fail with the same error. For example, clara_train_covid19_ct_lung_seg_1.zip:

Traceback (most recent call last):
  File "/home/perklab/.config/NA-MIC/Extensions-29402/NvidiaAIAssistedAnnotation/lib/Slicer-4.11/qt-scripted-modules/SegmentEditorNvidiaAIAALib/SegmentEditorEffect.py", line 390, in onClickSegmentation
    extreme_points, result_file = self.logic.segmentation(in_file, session_id, model)
  File "/home/perklab/.config/NA-MIC/Extensions-29402/NvidiaAIAssistedAnnotation/lib/Slicer-4.11/qt-scripted-modules/SegmentEditorNvidiaAIAALib/SegmentEditorEffect.py", line 1038, in segmentation
    params = aiaaClient.inference(model, {}, image_in, result_file, session_id=session_id)
  File "/home/perklab/.config/NA-MIC/Extensions-29402/NvidiaAIAssistedAnnotation/lib/Slicer-4.11/qt-scripted-modules/NvidiaAIAAClientAPI/client_api.py", line 378, in inference
    raise AIAAException(AIAAError.SERVER_ERROR, 'Status: {}; Response: {}'.format(status, form))
NvidiaAIAAClientAPI.client_api.AIAAException: (3, 'Status: 500; Response: b\'{"error":{"message":["Unable to get status for \\\'clara_train_covid19_ct_lung_seg\\\'"],"type":"TimeoutError"},"success":false}\\n\'')

I can find these errors in aiaa_apache.log:

[2020-11-09 21:14:58.942032] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.www.api.api_v1:api_v1_inference) - Running Inference for: clara_train_covid19_ct_lung_seg
[2020-11-09 21:15:00.823231] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.www.api.api_v1:run_inference) - Using FileName: /tmp/tmp_s__pmff/tmp4l44zm33.nii.gz
[2020-11-09 21:15:00.823294] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.www.api.api_v1:run_inference) - Using Params: {}
[2020-11-09 21:15:00.823343] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:run) - Load Data from: /tmp/tmp_s__pmff/tmp4l44zm33.nii.gz
[2020-11-09 21:15:00.823380] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:run) - Using Image: /tmp/tmp_s__pmff/tmp4l44zm33.nii.gz
[2020-11-09 21:15:00.823415] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:run) - Using Params: {}
[2020-11-09 21:15:00.823465] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:_pre_processing) - Run Pre Processing
[2020-11-09 21:15:00.823505] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:_pre_processing) - Pre-Processing Input Keys: dict_keys(['image', 'image_path', 'params'])
[2020-11-09 21:15:00.823544] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Run Transforms
[2020-11-09 21:15:00.823591] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Input Keys: dict_keys(['image', 'image_path', 'params'])
[2020-11-09 21:15:02.126409] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Time consumed by Transform (LoadNifti): 1.302718162536621
[2020-11-09 21:15:02.126556] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Time consumed by Transform (ConvertToChannelsFirst): 8.249282836914062e-05
[2020-11-09 21:15:05.889478] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Time consumed by Transform (ScaleByResolution): 3.762814998626709
[2020-11-09 21:15:06.061586] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.inference_utils:run_transforms) - PRE - Time consumed by Transform (ScaleIntensityRange): 0.1719968318939209
[2020-11-09 21:15:06.061645] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:run) - ++ Total Time consumed for pre-processing: 5.2382636070251465
[2020-11-09 21:15:06.061680] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.actions.inference_engine:run) - Pre-Processing Output Keys: dict_keys(['image', 'image_path', 'params', 'image.affine', 'image.original_affine', 'image.file_name', 'image.file_format', 'image.original_shape', 'image.original_shape_format', 'image.spacing', 'image.as_canonical', 'image.shape_format'])
[2020-11-09 21:15:06.061724] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.trtis_inference:inference) - Run TRTIS Inference for: clara_train_covid19_ct_lung_seg
[2020-11-09 21:15:06.061783] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.trtis_inference:_init_context) - Using TRTIS: {"ip": "localhost", "port": 8001, "protocol": "grpc", "verbose": false, "streaming": false, "shmem": "no", "model_timeout": 30, "use_cupy": false}
[2020-11-09 21:15:06.061817] [pid 314:tid 140079523043072] [AIAA_INFO] (nvmidl.apps.aas.inference.trtis_inference:_init_context) - Creating ScanWindowInferer (roi: [224, 224, 32], sw_batch_size: 1)
[2020-11-09 21:15:06.088266] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: Connect Failed
[2020-11-09 21:15:06.088276] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:06.088278] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:06.088279] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:06.088281] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:06.088283] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:06.088285] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:06.088288] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: Connect Failed
[2020-11-09 21:15:06.088291] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:11.094451] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: Connect Failed
[2020-11-09 21:15:11.094493] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:11.094504] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:11.094515] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:11.094525] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:11.094534] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:11.094543] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:11.094556] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: Connect Failed
[2020-11-09 21:15:11.094572] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:16.097841] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:16.097883] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:16.097894] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:16.097905] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:16.097915] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:16.097924] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:16.097937] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:16.097953] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:16.097970] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:21.101831] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:21.101875] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:21.101886] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:21.101897] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:21.101907] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:21.101916] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:21.101929] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:21.101942] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:21.101958] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:26.108041] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:26.108082] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:26.108093] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:26.108104] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:26.108114] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:26.108123] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:26.108133] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:26.108145] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:26.108161] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:31.113816] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:31.113856] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:31.113866] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 37, in fetch_trtis_model_info
[2020-11-09 21:15:31.113876] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 551, in get_server_status
[2020-11-09 21:15:31.113886] [pid 314:tid 140079523043072]     self._ctx, byref(cstatus), byref(cstatus_len))))
[2020-11-09 21:15:31.113895] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/tensorrtserver/api/__init__.py", line 238, in _raise_if_error
[2020-11-09 21:15:31.113904] [pid 314:tid 140079523043072]     raise ex
[2020-11-09 21:15:31.113917] [pid 314:tid 140079523043072] tensorrtserver.api.InferenceServerException: [ 0] GRPC client failed: 14: channel is in state TRANSIENT_FAILURE
[2020-11-09 21:15:31.113933] [pid 314:tid 140079523043072] 
[2020-11-09 21:15:35.460252] [pid 316:tid 140084425045760] [AIAA_INFO] (schedule:run) - Running job Every 5 minutes do cleanup_sessions({}) (last run: [never], next run: 2020-11-09 21:15:35)
[2020-11-09 21:15:35.467384] [pid 316:tid 140084425045760] [AIAA_INFO] (nvmidl.apps.aas.actions.sessions:remove_expired) - Removing expired; current ts: 1604956535
[2020-11-09 21:15:35.467421] [pid 316:tid 140084425045760] {"name": "b0417136-225c-11eb-b1f2-0242ac110002", "path": "/workspace/sessions/b0417136-225c-11eb-b1f2-0242ac110002", "image": "/workspace/sessions/b0417136-225c-11eb-b1f2-0242ac110002/tmpycwiow7x.nii.gz", "image_original": "/workspace/sessions/b0417136-225c-11eb-b1f2-0242ac110002/tmpycwiow7x.nii.gz", "meta": {}, "create_ts": 1604906705, "last_access_ts": 1604906785, "expiry": 3600}
[2020-11-09 21:15:35.467441] [pid 316:tid 140084425045760] 
[2020-11-09 21:15:35.469803] [pid 316:tid 140084425045760] [AIAA_INFO] (nvmidl.apps.aas.actions.sessions:remove_expired) - Removing expired; current ts: 1604956535
[2020-11-09 21:15:35.469836] [pid 316:tid 140084425045760] {"name": "b16e3626-225b-11eb-b1f2-0242ac110002", "path": "/workspace/sessions/b16e3626-225b-11eb-b1f2-0242ac110002", "image": "/workspace/sessions/b16e3626-225b-11eb-b1f2-0242ac110002/tmpexvynzum.nii.gz", "image_original": "/workspace/sessions/b16e3626-225b-11eb-b1f2-0242ac110002/tmpexvynzum.nii.gz", "meta": {}, "create_ts": 1604906277, "last_access_ts": 1604906461, "expiry": 3600}
[2020-11-09 21:15:35.469856] [pid 316:tid 140084425045760] 
[2020-11-09 21:15:35.471306] [pid 316:tid 140084425045760] [AIAA_INFO] (nvmidl.apps.aas.actions.sessions:remove_expired) - Removing expired; current ts: 1604956535
[2020-11-09 21:15:35.471337] [pid 316:tid 140084425045760] {"name": "405d2554-225c-11eb-879f-0242ac110002", "path": "/workspace/sessions/405d2554-225c-11eb-879f-0242ac110002", "image": "/workspace/sessions/405d2554-225c-11eb-879f-0242ac110002/tmpguyzqf5h.nii.gz", "image_original": "/workspace/sessions/405d2554-225c-11eb-879f-0242ac110002/tmpguyzqf5h.nii.gz", "meta": {}, "create_ts": 1604906517, "last_access_ts": 1604906546, "expiry": 3600}
[2020-11-09 21:15:35.471356] [pid 316:tid 140084425045760] 
[2020-11-09 21:15:36.127156] [pid 314:tid 140079523043072] [AIAA_ERROR] (nvmidl.apps.aas.www.api.api_v1:handle_error) - Unable to get status for 'clara_train_covid19_ct_lung_seg'
[2020-11-09 21:15:36.127196] [pid 314:tid 140079523043072] Traceback (most recent call last):
[2020-11-09 21:15:36.127207] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
[2020-11-09 21:15:36.127233] [pid 314:tid 140079523043072]     rv = self.dispatch_request()
[2020-11-09 21:15:36.127242] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
[2020-11-09 21:15:36.127252] [pid 314:tid 140079523043072]     return self.view_functions[rule.endpoint](**req.view_args)
[2020-11-09 21:15:36.127262] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/flask_monitoringdashboard/core/measurement.py", line 127, in wrapper
[2020-11-09 21:15:36.127272] [pid 314:tid 140079523043072]     raise raised_exception
[2020-11-09 21:15:36.127281] [pid 314:tid 140079523043072]   File "/usr/local/lib/python3.6/dist-packages/flask_monitoringdashboard/core/measurement.py", line 107, in evaluate
[2020-11-09 21:15:36.127291] [pid 314:tid 140079523043072]     result = route_handler(*args, **kwargs)
[2020-11-09 21:15:36.127300] [pid 314:tid 140079523043072]   File "apps/aas/www/api/api_v1.py", line 357, in api_v1_inference
[2020-11-09 21:15:36.127310] [pid 314:tid 140079523043072]   File "apps/aas/www/api/api_v1.py", line 263, in run_inference
[2020-11-09 21:15:36.127318] [pid 314:tid 140079523043072]   File "apps/aas/www/api/api_v1.py", line 182, in run_infer
[2020-11-09 21:15:36.127327] [pid 314:tid 140079523043072]   File "apps/aas/actions/inference_engine.py", line 59, in run
[2020-11-09 21:15:36.127337] [pid 314:tid 140079523043072]   File "apps/aas/actions/inference_engine.py", line 153, in _run_inference
[2020-11-09 21:15:36.127346] [pid 314:tid 140079523043072]   File "apps/aas/inference/trtis_inference.py", line 82, in inference
[2020-11-09 21:15:36.127355] [pid 314:tid 140079523043072]   File "apps/aas/inference/trtis_inference.py", line 75, in _init_context
[2020-11-09 21:15:36.127364] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_model_loader.py", line 47, in load_trtis_model
[2020-11-09 21:15:36.127373] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_session.py", line 41, in __init__
[2020-11-09 21:15:36.127382] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_session.py", line 50, in fetch_model_info
[2020-11-09 21:15:36.127391] [pid 314:tid 140079523043072]   File "apps/aas/trtis/trtis_utils.py", line 49, in fetch_trtis_model_info
[2020-11-09 21:15:36.127404] [pid 314:tid 140079523043072] TimeoutError: Unable to get status for 'clara_train_covid19_ct_lung_seg'
[2020-11-09 21:15:36.127420] [pid 314:tid 140079523043072] 

Do you have any advice how to resolve this error?

I've reinstalled "Nvidia Docker 2.0", restarted everything, and now the automatic lung segmentation model works. I'm testing the others now.

Nvidia Docker 2.0? Do you mean you rolled back to an older version?
@YuanTingHsieh can you help? I guess some of the clara_train_covid19_* models might not have an AIAA config.

Thanks for your help.

Nvidia Docker 2.0? Do you mean you rolled back to an older version?

I did not intend to downgrade anything. I installed nvidia-docker2 as instructed here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-ubuntu-and-debian
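For reference, a quick way to confirm that Docker can actually see the GPUs after installing nvidia-docker2 is to run nvidia-smi in a throwaway CUDA container (a minimal sketch; the CUDA base image tag is only an example):

#!/bin/bash
# Sanity check for the NVIDIA container runtime: nvidia-smi should list the GPUs.
# The image tag is an example; use whichever CUDA base image is available to you.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi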


I've tested the server more, and it seems that after starting the container the first (or first few) segmentations work, but then I get the [AIAA_ERROR] (nvmidl.apps.aas.trtis.trtis_utils:fetch_trtis_model_info) - [ 0] GRPC client failed: 14: Connect Failed error.

For example, here are the logs from starting the container and running auto-segmentation with clara_train_covid19_ct_lung_seg successfully twice (it nicely segmented the lungs in the CTChest Slicer sample data set), then failing on the third run: https://1drv.ms/t/s!Arm_AFxB9yqHxKYelX9PF_MEVnTkAQ?e=BLS1Xv. I did not change anything anywhere; I just clicked "Start" in the Auto-segmentation section.

@SachidanandAlle @YuanTingHsieh do you have any suggestions on how to keep TRTIS running in the container? It stops responding after one or a few segmentations.
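In case it helps the investigation, here is a rough sketch of what I can check the next time TRTIS stops responding (the container name aiaa3 comes from the command above; the trtserver process name and the HTTP status port 8000 are assumptions about the bundled TRTIS v1):

#!/bin/bash
# Check whether the TRTIS process is still alive inside the AIAA container
# and whether its status endpoint still answers (assumed: binary "trtserver",
# HTTP port 8000 next to the gRPC port 8001 seen in the log above).
docker exec aiaa3 bash -c "ps aux | grep -v grep | grep trtserver"
docker exec aiaa3 bash -c "curl -s http://localhost:8000/api/status | head -n 20"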

@YuanTingHsieh we should fix this issue in the new container. Anytime it dies due to OOM etc., it should recover automatically.

It would be great if this was fixed. Let us know if there is a version that is ready to be tested.

Hi guys, do you have any update on this? Is a new container with a more robust recovery mechanism available now?

@YuanTingHsieh is it possible to dig into the server and see what's happening after running inference a couple of times?
If there are no multiple accesses happening, i.e. for a single user, TRITON should not throw OOM if the GPU has enough memory to run a single inference. If required, please ask @lassoan to provide access to the server (in a separate email thread) to understand/debug the issue. If needed, we can ask the TRITON folks to provide their support.

We'll update the default Slicer server within a few weeks. Hopefully with the new hardware and latest Clara server everything will work robustly.

Hi @lassoan ,

Thanks for your effort.
In Clara v4.0, we are using PyTorch to train the models, so we will need to download new models from NGC.
The lists of available models are here:

Auto Segmentations:

Annotation models (DExtr3D):

DeepGrow:

You can download them and load them into the AIAA server, or you can load them using their NGC path:
https://docs.nvidia.com/clara/clara-train-sdk/aiaa/loading_models.html#loading-from-ngc
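For example, following that page, loading a model directly from its NGC path should look roughly like this (a sketch; the model name, port, and payload fields are examples to be checked against the linked documentation):

#!/bin/bash
# Load a model into AIAA directly from its NGC path (example model and payload;
# verify the exact fields against the "loading from NGC" documentation above).
AIAA_PORT=5000   # adjust to the port your AIAA server listens on
curl -X PUT "http://127.0.0.1:$AIAA_PORT/admin/model/clara_ct_seg_spleen_amp" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{"path":"nvidia/med/clara_ct_seg_spleen_amp","version":"1"}'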

We are separating the Triton inference server from the Clara Train container, and we use docker-compose to run them now:
https://docs.nvidia.com/clara/clara-train-sdk/aiaa/quickstart.html#running-aiaa

As you can see, we have the flag "restart: unless-stopped" there to restart the Triton server.
So if OOM happens, it will get restarted.
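If you want to double-check this on a running deployment, something along these lines should work (a sketch; replace the container name with the actual Triton service container from your docker-compose setup):

#!/bin/bash
# Confirm the restart policy and see how many times the container has restarted
# (the container name is an example; use `docker ps` to find yours).
docker inspect -f '{{.HostConfig.RestartPolicy.Name}} restarts={{.RestartCount}}' tritonserver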

Let me know if you need any other help.
Thanks!

The default 3D Slicer segmentation server has been upgraded to Clara 4.0 and supports all segmentation modes (auto-segmentation, segmentation from boundary points, and DeepGrow), with many new models. It will be available the day after #88 is merged.

@YuanTingHsieh with this new server and latest software stack everything seems to work robustly. Thanks for all your help with the investigations.