Segment Fault Happens when Applying Gradient Update on TransferCompound on CUDA in Debug Mode
Zhaoxian-Wu opened this issue · 6 comments
Zhaoxian-Wu commented
Description
When I try to run optimizer.step()
on TransferCompound in Debug mode, a segment fault occurs. It happens for the CUDA version.
How to reproduce
I followed the following steps:
- Compile the code in the debug mode
conda create -n aihwkit-cuda-dev python=3.10 -y
conda activate aihwkit-cuda-dev
git clone https://github.com/IBM/aihwkit.git ; cd aihwkit
pip install -r requirements.txt
conda install mkl mkl-include -y
export CXX=/usr/bin/g++
export CC=/usr/bin/gcc
export MKLROOT=$CONDA_PREFIX
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
# export CUDA_VERSION=11.3
export CUDA_VERSION=11.1
export CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
export CUDA_TOOLKIT_ROOT_DIR=${CUDA_HOME}
export CUDA_LIB_PATH=${CUDA_HOME}/lib64
export CUDA_INCLUDE_DIRS=${CUDA_HOME}/include
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
make build_inplace_cuda flags="-DRPU_DEBUG=ON"
- run the Python script
main.py
(provided below)
(aihwkit-cuda-dev) MrFive@server:~/Desktop/aihwkit$ python main.py
/home/MrFive/Desktop/aihwkit/./src/aihwkit/__init__.py
RPUSimple<float>(3,2)
rpu.cpp:264 : RPUSimple constructed.
rpu_pulsed.cpp:96 : RPUPulsed constructed
rpu_pulsed.cpp:190 : BL = 31, A = 1.79605, B = 1.79605
RPUSimple<float>(3,2)
rpu.cpp:341 : RPUSimple copy constructed.
cuda_util.cu:455 : Create context on GPU -1 with shared stream (on id 0)
cuda_util.cu:426 : Init context...
cuda_util.cu:434 : Create context on GPU 0
cuda_util.cu:245 : GET BLAS env.
cuda_util.cu:259 : CUBLAS Host initialized.
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
rpucuda.cu:93 : RPUCudaSimple constructed from RPUSimple on shared stream
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
rpucuda_pulsed.cu:64 : RPUCudaPulsed constructed
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 8, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 96, 512, 96, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 48, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 96, 512, 96, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 48, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 16, 512, 16, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
RPUPulsed<float>[Transfer(2): SoftBoundsReference -> SoftBoundsReference](3,2)
rpu_pulsed.cpp:143 : RPUPulsed DESTRUCTED
rpu.cpp:288 : RPUSimple DESTRUCTED
cuda_util.cu:813 : Get SHARED float buffer ID 0, size 2, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 1, size 3, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1315 : Copy to host (hsize,P,W,H): 4, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 0, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 1, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 0, size 3, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 1, size 2, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 0, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 1, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 24, 512, 96, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 24, 512, 96, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 2, 512, 8, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 4, 512, 16, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1281 : Assign device (S, P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
bit_line_maker.cu:1541 : BLM init BL buffers with batch 1 and BL 31.
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 16, 1
cuda_util.cu:1085 : Set (hsize,P,W,H): 3, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
Segmentation fault (core dumped)
Expected behavior
The code can run without error
Other information
main.py
import sys
sys.path.insert(0, './src')
import torch
import aihwkit
print(aihwkit.__file__)
# Imports from aihwkit.
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import (
UnitCellRPUConfig,
TransferCompound,
SoftBoundsReferenceDevice)
rpu_config = UnitCellRPUConfig(
device=TransferCompound(
unit_cell_devices=[
SoftBoundsReferenceDevice(),
SoftBoundsReferenceDevice(),
]
)
)
in_dim = 2
model = AnalogLinear(2, 3, bias=True, rpu_config=rpu_config)
opt = AnalogSGD(model.parameters(), lr=0.1)
x = torch.ones(in_dim)
x = x.cuda()
model.cuda()
opt.zero_grad()
pred = model(x)
loss = pred.norm()**2
loss.backward()
opt.step()
- Pytorch version: 2.1.2+cu121
- Package version: 0.8.0
- OS: Ubuntu 20.04.2
- Python version: Python 3.10
- Conda version (or N/A): conda 23.10.0
maljoras commented
@Zhaoxian-Wu indeed, debug mode might not be working with python. The debug mode is only used for C++ environments.