CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered)
emilwallner opened this issue · 5 comments
🐛 Describe the bug
Hey,
First of all, thanks for creating such a fantastic open-source production server.
I'm reaching out due to an unexpected issue I can't solve. I've been running a TorchServe server in production for over a year (several million requests per week) and it's been working great; however, a few weeks ago it started crashing every 1-5 days.
I enabled export CUDA_LAUNCH_BLOCKING=1, and it gives me a CUDA error: device-side assert triggered and CUDA out of memory when I move my data to the GPU. I also log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().
I thought some unique edge case caused a memory leak, mismatched shapes or NaN values when moving data to the GPU, or excessive memory allocation. However, the models use 6180 MiB of 23028 MiB, and torch.cuda.max_memory_allocated() logs around 366 MB.
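For reference, the memory logging in my handler is roughly this (a minimal sketch; the helper name and tag are illustrative):

import logging
import torch

logger = logging.getLogger(__name__)

def log_gpu_memory(tag):
    # Bytes currently held by tensors vs. the process-lifetime high-water mark
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    logger.info(f"[{tag}] allocated={allocated_mb:.1f} MiB, peak={peak_mb:.1f} MiB")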
When I SSH into an instance that has crashed it looks like this:
Screen.Recording.2024-04-25.at.22.35.24.mov
The memory is at 6180 MiB, GPU utilization flickers between 0% and 16%, and it gives me CUDA error: device-side assert triggered and CUDA out of memory.
Unfortunately, I can't find a way to reproduce the error; it happens at random every 1-5 days, and I have to reset the server and allocate a new instance. I've done everything I can think of to validate the data before moving it to the GPU and to reduce memory pressure or any potential memory leak.
Error logs
Installation instructions
torchserve==0.10.0
Docker image: nvcr.io/nvidia/pytorch:22.12-py3
Ubuntu 20.04 including Python 3.8
NVIDIA CUDA® 11.8.0
NVIDIA cuBLAS 11.11.3.6
NVIDIA cuDNN 8.7.0.84
NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
NVIDIA RAPIDS™ 22.10.01 (For x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph.)
Apex
rdma-core 36.0
NVIDIA HPC-X 2.13
OpenMPI 4.1.4+
GDRCopy 2.3
TensorBoard 2.9.0
Nsight Compute 2022.3.0.0
Nsight Systems 2022.4.2.1
NVIDIA TensorRT™ 8.5.1
Torch-TensorRT 1.1.0a0
NVIDIA DALI® 1.20.0
MAGMA 2.6.2
JupyterLab 2.3.2 including Jupyter-TensorBoard
TransformerEngine 0.3.0
Model Packaging
def create_pil_image(self, image_data):
    try:
        image = Image.open(io.BytesIO(image_data)).convert("RGB")
        return image
    except IOError:
        # If the image data is not valid or not provided, create a blank image.
        width, height = 776, 776  # Desired dimensions for the blank image
        color = (255, 255, 255)  # White
        image = Image.new("RGB", (width, height), color)
        return image
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1:
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    return images_batch
def preprocess(self, data):
    images = []
    fns = []
    texts = []
    size = []
    merges = []
    org_images = []
    watermarks = []
    white_balance_list = []
    auto_color_list = []
    temperature_list = []
    saturation_list = []
    for row in data:
        image = row["image"]
        fn = self.decode_field(row["fn"])
        text = self.decode_field(row["text"])
        merged = self.decode_field(row["merged"])
        merged = True if merged.lower() == 'true' else False
        resolution = self.decode_field(row["resolution"])
        white_balance = self.decode_field(row["white_balance"])
        auto_color = self.decode_field(row["auto_color"])
        temperature = float(self.decode_field(row["temperature"]))
        saturation = float(self.decode_field(row["saturation"]))
        auto_color = True if auto_color == 'true' else False
        white_balance = True if white_balance == 'true' else False
        watermark = True if 'watermarked' in resolution else False
        if isinstance(image, str):
            logger.info("Image data should not be a string. Please provide the image data as bytes.")
            width, height = 224, 224  # Desired dimensions for the blank image
            color = (255, 255, 255)  # White
            image = Image.new("RGB", (width, height), color)
        if isinstance(image, (bytearray, bytes)):
            image = self.create_pil_image(image)
        image = self.resize_image(image, resolution)
        org_images.append(image)
        texts.append(text)
        images.append(image)
        fns.append(fn)
        merges.append(merged)
        watermarks.append(watermark)
        white_balance_list.append(white_balance)
        temperature_list.append(temperature)
        saturation_list.append(saturation)
        auto_color_list.append(auto_color)
    texts_raw = self.tokenizer(texts)  # type(torch.int32)
    texts = self.token_embedding(texts_raw).type(torch.float16)
    texts = texts + self.positional_embedding.type(torch.float16)
    images_batch = self.preprocess_and_stack_images(images)
The error occurs when I move images_batch to the GPU.
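Concretely, the transfer looks roughly like this (a sketch; the explicit synchronize is only there so that a pending device-side assert surfaces at this line instead of on a later call):

images_batch = images_batch.to(self.device).detach()
torch.cuda.synchronize()  # force any pending CUDA error to be reported here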
config.properties
inference_address=http://0.0.0.0:8510
management_address=http://0.0.0.0:8511
metrics_address=https://0.0.0.0:8512
number_of_netty_threads=8
netty_client_threads=8
async_logging=true
enable_metrics_api=false
default_workers_per_model=1
max_request_size=20000000
max_response_size=20000000
job_queue_size=100
model_store=./model_store
load_models=all
models={
    "palette_caption": {
        "1.0": {
            "defaultVersion": true,
            "marName": "palette_caption.mar",
            "minWorkers": 1,
            "maxWorkers": 3,
            "batchSize": 4,
            "maxBatchDelay": 20,
            "responseTimeout": 180
        }
    },
    "palette_colorizer": {
        "1.0": {
            "defaultVersion": true,
            "marName": "palette_colorizer.mar",
            "minWorkers": 2,
            "maxWorkers": 4,
            "batchSize": 4,
            "maxBatchDelay": 20,
            "responseTimeout": 120
        }
    },
    "palette_ref_colorizer": {
        "1.0": {
            "defaultVersion": true,
            "marName": "palette_ref_colorizer.mar",
            "minWorkers": 1,
            "maxWorkers": 2,
            "batchSize": 4,
            "maxBatchDelay": 20,
            "responseTimeout": 120
        }
    }
}
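As a side note, worker state for the models above can be checked through the management API on port 8511; a quick sketch using requests:

import requests

# Describe one of the registered models; the response lists each worker
# with its status, pid, and memory usage
resp = requests.get("http://0.0.0.0:8511/models/palette_colorizer")
print(resp.json())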
Versions
Pip freeze:
absl-py==1.3.0
aiohttp==3.8.4
aiosignal==1.3.1
aniso8601==9.0.1
annoy==1.17.1
ansi2html==1.9.1
anyio==4.3.0
apex==0.1
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.1.0
audioread==3.0.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
blinker==1.7.0
blis==0.7.9
cachetools==5.2.0
catalogue==2.0.8
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
cmake==3.24.1.1
comm==0.1.2
confection==0.0.3
contourpy==1.0.6
cuda-python @ file:///rapids/cuda_python-11.7.0%2B0.g95a2041.dirty-cp38-cp38-linux_x86_64.whl
cudf @ file:///rapids/cudf-22.10.0a0%2B316.gad1ba132d2.dirty-cp38-cp38-linux_x86_64.whl
cugraph @ file:///rapids/cugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl
cuml @ file:///rapids/cuml-22.10.0a0%2B56.g3a8dea659.dirty-cp38-cp38-linux_x86_64.whl
cupy-cuda118 @ file:///rapids/cupy_cuda118-11.0.0-cp38-cp38-linux_x86_64.whl
cycler==0.11.0
cymem==2.0.7
Cython==0.29.32
dask @ file:///rapids/dask-2022.9.2-py3-none-any.whl
dask-cuda @ file:///rapids/dask_cuda-22.10.0a0%2B23.g62a1ee8-py3-none-any.whl
dask-cudf @ file:///rapids/dask_cudf-22.10.0a0%2B316.gad1ba132d2.dirty-py3-none-any.whl
debugpy==1.6.4
decorator==5.1.1
defusedxml==0.7.1
distributed @ file:///rapids/distributed-2022.9.2-py3-none-any.whl
entrypoints==0.4
exceptiongroup==1.0.4
execnet==1.9.0
executing==1.2.0
expecttest==0.1.3
fastapi==0.110.1
fastjsonschema==2.16.2
fastrlock==0.8.1
Flask==3.0.3
Flask-RESTful==0.3.10
fonttools==4.38.0
frozenlist==1.4.1
fsspec==2022.11.0
ftfy==6.1.1
google-auth==2.15.0
google-auth-oauthlib==0.4.6
graphsurgeon @ file:///workspace/TensorRT-8.5.1.7/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl
grpcio==1.51.1
gunicorn==20.1.0
h11==0.14.0
HeapDict==1.0.1
httptools==0.6.1
hypothesis==5.35.1
idna==3.4
importlib-metadata==5.1.0
importlib-resources==5.10.1
iniconfig==1.1.1
intel-openmp==2021.4.0
ipykernel==6.19.2
ipython==8.7.0
ipython-genutils==0.2.0
itsdangerous==2.2.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.10
jsonschema==4.17.3
jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a
jupyter_client==7.4.8
jupyter_core==5.1.0
jupyterlab==2.3.2
jupyterlab-pygments==0.2.2
jupyterlab-server==1.2.0
jupytext==1.14.4
kiwisolver==1.4.4
kornia==0.7.2
kornia_rs==0.1.3
langcodes==3.3.0
librosa==0.9.2
llvmlite==0.39.1
locket==1.0.0
Markdown==3.4.1
markdown-it-py==2.1.0
MarkupSafe==2.1.1
matplotlib==3.6.2
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.3
mdurl==0.1.2
mistune==2.0.4
mkl==2021.1.1
mkl-devel==2021.1.1
mkl-include==2021.1.1
mock==4.0.3
mpmath==1.2.1
msgpack==1.0.4
multidict==6.0.5
murmurhash==1.0.9
nbclient==0.7.2
nbconvert==7.2.6
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==2.6.3
notebook==6.4.10
numba==0.56.4
numpy==1.22.2
nvgpu==0.9.0
nvidia-dali-cuda110==1.20.0
nvidia-pyindex==1.0.9
nvtx==0.2.5
oauthlib==3.2.2
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
opencv @ file:///opencv-4.6.0/modules/python/package
packaging==22.0
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
partd==1.3.0
pathy==0.10.1
pexpect==4.8.0
pickleshare==0.7.5
pillow==10.2.0
pillow-avif-plugin==1.4.2
pillow-heif==0.14.0
pkgutil_resolve_name==1.3.10
platformdirs==2.6.0
pluggy==1.0.0
polygraphy==0.43.1
pooch==1.6.0
preshed==3.0.8
prettytable==3.5.0
prometheus-client==0.15.0
prompt-toolkit==3.0.36
protobuf==3.20.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow @ file:///rapids/pyarrow-9.0.0-cp38-cp38-linux_x86_64.whl
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.10.1
pycocotools @ git+https://github.com/nvidia/cocoapi.git@8b8fd68576675c3ee77402e61672d65a7d826ddf#subdirectory=PythonAPI
pycparser==2.21
pydantic==1.9.2
Pygments==2.13.0
pylibcugraph @ file:///rapids/pylibcugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl
pylibraft @ file:///rapids/pylibraft-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl
pynvml==11.4.1
pyparsing==3.0.9
pyrsistent==0.19.2
pytest==7.2.0
pytest-rerunfailures==10.3
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.2
python-dotenv==1.0.1
python-hostlist==1.22
python-multipart==0.0.5
pytorch-quantization==2.1.2
pytz==2022.6
PyYAML==6.0
pyzmq==24.0.1
raft-dask @ file:///rapids/raft_dask-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
resampy==0.4.2
rmm @ file:///rapids/rmm-22.10.0a0%2B38.ge043158.dirty-cp38-cp38-linux_x86_64.whl
rsa==4.9
scikit-learn @ file:///rapids/scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl
scipy==1.6.3
Send2Trash==1.8.0
six==1.16.0
smart-open==6.3.0
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.11.0
soupsieve==2.3.2.post1
spacy==3.4.4
spacy-legacy==3.0.10
spacy-loggers==1.0.4
sphinx-glpi-theme==0.3
srsly==2.4.5
stack-data==0.6.2
starlette==0.37.2
sympy==1.11.1
tabulate==0.9.0
tbb==2021.7.1
tblib==1.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorrt @ file:///workspace/TensorRT-8.5.1.7/python/tensorrt-8.5.1.7-cp38-none-linux_x86_64.whl
termcolor==2.4.0
terminado==0.17.1
thinc==8.1.5
threadpoolctl==3.1.0
tinycss2==1.2.1
tinydb==4.7.0
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch==1.14.0a0+410ce96
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl
torchserve==0.10.0
torchtext @ git+https://github.com/pytorch/text@fae8e8cabf7adcbbc2f09c0520216288fd53f33b
torchvision @ file:///opt/pytorch/vision
tornado==6.1
tqdm==4.64.1
traitlets==5.7.1
transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@73166c4e3f6cf0e754045ba22ff461ef96453aeb
treelite @ file:///rapids/treelite-2.4.0-py3-none-manylinux2014_x86_64.whl
treelite-runtime @ file:///rapids/treelite_runtime-2.4.0-py3-none-manylinux2014_x86_64.whl
typer==0.7.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.11.0
ucx-py @ file:///rapids/ucx_py-0.27.0a0%2B29.ge9e81f8-cp38-cp38-linux_x86_64.whl
uff @ file:///workspace/TensorRT-8.5.1.7/uff/uff-0.6.9-py2.py3-none-any.whl
urllib3==1.26.13
uvicorn==0.20.0
uvloop==0.19.0
wasabi==0.10.1
watchfiles==0.21.0
wcwidth==0.2.5
webencodings==0.5.1
websockets==12.0
Werkzeug==3.0.2
xdoctest==1.0.2
xgboost @ file:///rapids/xgboost-1.6.2-cp38-cp38-linux_x86_64.whl
yarl==1.9.4
zict==2.2.0
zipp==3.11.0
Repro instructions
Unfortunately, I can't find a way to reproduce the error; it appears at random every 1-5 days.
Possible Solution
There are a few things that are a bit odd about this issue:
- The server ran fine for over a year; I only made a few updates a few months back, and all of a sudden it started crashing frequently.
- I thought some edge case was crashing the server, but it only crashes some of the running instances.
- It happens at random every 1-5 days, which is why I assumed a memory leak, but I can't find any evidence of one.
- I get a device-side assert triggered and CUDA out of memory, yet plenty of memory appears to be available, and I check for NaN values and wrong shapes before placing data on the GPU.
I've run out of ideas; any thoughts or feedback would be much appreciated.
Hi @emilwallner,
thanks for the extensive issue report.
My thoughts on this are:
- You're looking at the server after the crash, right? Meaning that the worker process has died, gets restarted, and thus memory is back to normal.
- I can't find the line from your stack trace in your code, but I assume it's basically the next line after your code. Detach does not create a copy of the data, so you should still have a single batch on device.
- You're resizing the images with a resolution coming from the requests and then re-resizing the tensor in preprocess_and_stack_images to (3,768,768). Then you're stacking them along the channel dimension, creating e.g. (6,768,768), before you add a batch dimension with unsqueeze. Not sure about your model, but maybe it does something funky when it gets (1,6,768,768) instead of (2,3,768,768).
- What is your batch size? Did you try using batch_size=1 for some time?
- In the video there are multiple processes on the GPU; do you use multiple workers for the same model?
That's all I have for now, but happy to continue spitballing and iterating on this until you find a solution!
Best
Matthias
Really, really appreciate your input, @mreso!
- The worker crashes, returns 507, and doesn't recover.
- Yeah, I added detach to make sure requires_grad is set to False.
- Yeah, that could be it; the shape check below shows the difference between stack and cat.
- I switched the batch size to 1 following your suggestion. I also check that the batch has the correct dtype and final shape.
- Yes, multiple workers per model.
I also realized CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.
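For reference, a quick shape check of torch.stack vs. torch.cat on two (3, 768, 768) tensors (hypothetical REPL snippet):

import torch

a = torch.zeros(3, 768, 768)
b = torch.zeros(3, 768, 768)

torch.stack([a, b], dim=0).shape             # torch.Size([2, 3, 768, 768]) -- new batch dim
torch.cat([a, b], dim=0).shape               # torch.Size([6, 768, 768])    -- channels merged
torch.cat([a, b], dim=0).unsqueeze(0).shape  # torch.Size([1, 6, 768, 768])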
Here's my updated check:
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1 or preprocessed_img.dtype != torch.float32:
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    # Second test: check that the batch has shape (1, 3, 768, 768)
    if images_batch.shape != (1, 3, 768, 768):
        # Log information about the batch that doesn't meet the requirements
        logger.info(f"Batch shape {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
        images_batch = torch.zeros((1, 3, 768, 768))
    return images_batch
Again, really appreciate the brainstorming; let's keep at it until we crack this!
Yeah, performance will suffer significantly from CUDA_LAUNCH_BLOCKING as kernels will no longer run asynchronously, so only activate it when really necessary for debugging.
You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). Wondering why this actually seems to work in the first place.
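A minimal version of that experiment might look like this (a sketch; the Conv2d stands in for the real model and, like it, expects 3-channel input):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().eval()  # stand-in model

for shape in [(2, 3, 768, 768), (1, 6, 768, 768)]:
    torch.cuda.reset_peak_memory_stats()
    x = torch.rand(*shape, device="cuda")
    try:
        with torch.no_grad():
            model(x)
        torch.cuda.synchronize()
        peak_mb = torch.cuda.max_memory_allocated() / 1024**2
        print(f"{shape}: ok, peak {peak_mb:.1f} MiB")
    except RuntimeError as e:
        print(f"{shape}: failed: {e}")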
I haven't tried the (1,6,768,768) input yet, but since our model is based on three channels, it should throw an error during execution.
I now double-check the shape (1,3,768,768) and dtype, and ensure the values are in the correct range. Despite that, I'm still hitting a CUDA error: device-side assert triggered when moving the batch with images_batch = images_batch.to(self.device).detach().
Got any more suggestions on what might be causing this?