Segmentation Fault Error Running CrystalFormer Sampling
Closed this issue · 4 comments
Osgood001 commented
Description:
I get a Segmentation fault when running the CrystalFormer sampling script in my CUDA 12.1 environment. Below are the steps I followed and the relevant script output:
Expected Behavior:
- The script should execute without errors, completing its intended task.
Actual Behavior:
- After correcting the command by removing the unrecognized option, the script still ended with a Segmentation fault.
python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0
2024-06-27 16:41:50.440252: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.1 which is older than the ptxas CUDA version (12.5.40). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
number of available cpu: 8
num_io_process should not exceed number of available cpu, reset to 8
/opt/mamba/envs/crystalgpt/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
25 Pmm2 5
g, a, w, m, symbol, x: 25 3 4 1 1d 25 Pmm2 5
g, a, w, m, symbol, x: 25 82 4 1 1d [0.5 0.5 0.55085454]
g, a, w, m, symbol, x: 25 14 1 1 1a [0. 0. 0.28531504]
[0.5 0.5 0.50596338]
g, a, w, m, symbol, x: 25 7 2 1 1b g, a, w, m, symbol, x: 25 74 1 1 1a [0. 0.5 0.15366418]
[0. 0. 0.00040354]
g, a, w, m, symbol, x: 25 8 3 1 1c g, a, w, m, symbol, x: 25 7 3 1 1c[0.5 0. 0.47407905]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5 0. 0.49964437]
[0.5 0.5 0.03555206]g, a, w, m, symbol, x:
==================================
...
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 29 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 33 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 7 3 3 3c [0. 0.5 0.5]
['1a' '1b' '3c'] [29 33 7] [1 2 3] 5
===================================
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
(21, 119)
Segmentation fault
Reproduction Steps:
- Follow the provided bash scripts in order.
- Observe the error after the corrected script execution.
- Environment Activation & GPU Check:
conda activate crystalgpt
- CUDA and JAX Version Check:
nvcc --version
- JAX Installation:
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
- Requirements Installation:
pip install -r requirements.txt
- Model Download & Rename:
wget "https://drive.usercontent.google.com/u/0/uc?id=1koHC6n38BqsY2_z3xHTi40HcFbVesUKd&export=download"
mv downloaded_model_file epoch_003800.pkl
- Initial Script Execution with Error:
python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0 --use_foriloop
No --use_foriloop available.
- Corrected Script Execution Resulting in Segmentation Fault:
python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0
Notes:
- The script prints warnings about a CUDA version mismatch and a potential deadlock before the Segmentation fault.
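The version-mismatch warning quoted in the log names two versions: the CUDA version supported by the driver (12.1) and the ptxas version shipped with the installed JAX wheels (12.5.40). As a rough sketch, the two can be pulled out of a captured log and compared. The function names here are hypothetical, and the regular expression only assumes the warning wording shown in this issue; other XLA releases may phrase it differently.

```python
import re

# Matches the XLA warning text quoted in this issue. The wording is an
# assumption taken from the log above, not a stable XLA interface.
_WARNING_RE = re.compile(
    r"driver's CUDA version is (\d+(?:\.\d+)*)"
    r".*?ptxas CUDA version \((\d+(?:\.\d+)*)\)",
    re.DOTALL,
)


def parse_cuda_mismatch(log_text):
    """Return (driver_version, ptxas_version) as int tuples, or None."""
    match = _WARNING_RE.search(log_text)
    if match is None:
        return None
    driver = tuple(int(part) for part in match.group(1).split("."))
    ptxas = tuple(int(part) for part in match.group(2).split("."))
    return driver, ptxas


def driver_is_older(driver, ptxas):
    """Compare on major.minor only, ignoring any patch component."""
    return driver[:2] < ptxas[:2]
```

For the warning in this issue, the parsed pair would be (12, 1) and (12, 5, 40), so the driver counts as older. This only detects the mismatch; it does not show that the mismatch causes the crash.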
Attachments:
nvidia-smi
Thu Jun 27 16:47:26 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    27W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
zdcao121 commented
- The --use_foriloop parameter has been deprecated; I will update the doc ASAP.
- When using --restore_path, you only need to specify the path to the folder that contains the parameter file, so change --restore_path /data/model/epoch_003800.pkl to --restore_path /data/model/
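To illustrate the --restore_path advice, here is a minimal sketch of a helper that turns a checkpoint file path into its containing folder. The name restore_dir is hypothetical and is not part of CrystalFormer. The sketch assumes checkpoint files end in .pkl, as in the command above, and keeps the trailing slash only to match the form suggested in the comment.

```python
from pathlib import PurePosixPath


def restore_dir(restore_path):
    """Return the folder to pass to --restore_path.

    Hypothetical helper: a path ending in .pkl is treated as a checkpoint
    file and replaced by its parent folder; any other path is assumed to
    already be a folder.
    """
    path = PurePosixPath(restore_path)
    directory = str(path.parent if path.suffix == ".pkl" else path)
    # Keep a trailing slash to match the form used in the comment above.
    return directory if directory.endswith("/") else directory + "/"


print(restore_dir("/data/model/epoch_003800.pkl"))
```

The helper works on the path string alone and does not check that the folder exists or contains a parameter file.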
Osgood001 commented
A similar error message persists:
python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/ --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0
2024-06-27 21:19:01.921329: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.1 which is older than the ptxas CUDA version (12.5.40). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
number of available cpu: 8
num_io_process should not exceed number of available cpu, reset to 8
/opt/mamba/envs/crystalgpt/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
25 Pmm2 5
g, a, w, m, symbol, x: 25 3 4 1 1d [0.5 0.5 0.55085454]
25 Pmm2 g, a, w, m, symbol, x:5 25
14 1 1 1a g, a, w, m, symbol, x: 25 81 4 1 1d [0. 0. 0.28531504]
g, a, w, m, symbol, x: 25 7 2 1 1b [0. 0.5 0.15366418]
[0.5 0.5 0.5057673]g, a, w, m, symbol, x:
25 8 3 1 1c g, a, w, m, symbol, x: 25 44 1 1 [0.5 0. 0.47407905]1a
g, a, w, m, symbol, x: 25 8 4 1 1d [0. 0. 0.00033725]
[0.5 0.5 0.03555206]
g, a, w, m, symbol, x: 25 8 3 1 1c ['1a' '1b' '1c' '1d' '1d'] [14 7 8 3 8][0.5 0. 0.49976346]
[1 2 3 4 4] 5
g, a, w, m, symbol, x: 25 8 4 1 1d ===================================
[0.5 0.5 0.00473415]
g, a, w, m, symbol, x: 25 9 2 1 1b [0. 0.5 0.49946523]
['1a' '1b' '1c' '1d' '1d'] [44 9 8 81 8] [1 2 3 4 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 82 4 1 1d [0.5 0.5 0.50596338]
g, a, w, m, symbol, x: 25 74 1 1 1a [0. 0. 0.00040354]
g, a, w, m, symbol, x: 25 7 3 1 1c [0.5 0. 0.49964437]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5 0.5 0.00486688]
g, a, w, m, symbol, x: 25 9 2 1 1b [0. 0.5 0.49949773]
['1a' '1b' '1c' '1d' '1d'] [74 9 7 82 8] [1 2 3 4 4] 5
...
25 Pmm2 5
g, a, w, m, symbol, x: 25 56 1 1 1a [0. 0. 0.00096878]
g, a, w, m, symbol, x: 25 44 4 1 1d [0.5 0.5 0.50393304]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5 0. 0.50006739]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5 0.5 0.00410809]
g, a, w, m, symbol, x: 25 9 2 1 1b [0. 0.5 0.50043428]
['1a' '1b' '1c' '1d' '1d'] [56 9 8 44 8] [1 2 3 4 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 39 1 1 1a [0. 0. 0.91067428]
g, a, w, m, symbol, x: 25 49 4 1 1d [0.5 0.5 0.57043078]
g, a, w, m, symbol, x: 25 16 2 1 1b [0. 0.5 0.36299416]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5 0. 0.69224078]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5 0.5 0.04031656]
['1a' '1b' '1c' '1d' '1d'] [39 16 8 49 8] [1 2 3 4 4] 5
===================================
99 P4mm 4
g, a, w, m, symbol, x: 99 27 2 991 1bP4mm 4
g, a, w, m, symbol, x: 99 72[0.5 0.5 0.43551539] 2
1 1b g, a, w, m, symbol, x: 99 74 1 1 1a [0. 0. 0.0935678]
g, a, w, m, symbol, x: 99 8 3 2 2c [0.5 0.5 0.52762345]
g, a, w, m, symbol, x: 99 13 1 1 1a [0.5 0. 0.36321747]
[0. 0. 0.23811871]
g, a, w, m, symbol, x:g, a, w, m, symbol, x: 9999 88 23 12 1b2c [0.5 0.5 0.92262586]
['1a' '1b' '1b' '2c'] [74 27 8 8] [0.5 0. 0.41935511]
g, a, w, m, symbol, x: 99 8[1 2 2 3] 2 51
1b ===================================
[0.5 0.5 0.99443926]
['1a' '1b' '1b' '2c'] [13 72 8 8] [1 2 2 3] 5
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 3 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 13 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 8 4 3 3d [0.5 0. 0. ]
['1a' '1b' '3d'] [13 3 8] [1 2 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 4 4 1 1d [0.5 0.5 0.50295577]
g, a, w, m, symbol, x: 25 33 1 1 1a [0. 0. 0.18006878]
g, a, w, m, symbol, x: 25 7 2 1 1b [0. 0.5 0.42020007]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5 0. 0.42690466]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5 0.5 0.94932273]
['1a' '1b' '1c' '1d' '1d'] [33 7 8 4 8] [1 2 3 4 4] 5
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 19 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 51 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 8 4 3 3d [0.5 0. 0. ]
['1a' '1b' '3d'] [51 19 8] [1 2 4] 5
===================================
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
(21, 119)
Segmentation fault
zdcao121 commented
- Maybe cuda12.1 is incompatible with the latest version of jax; try a newer version of CUDA, such as cuda12.5.
- The way you installed jax has been deprecated. Please see the new installation guide of jax.
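Before and after reinstalling, it can help to record which jax and jaxlib versions are present in the environment. The sketch below uses only the standard library and reads package metadata without importing jax, so it still runs when importing jax would itself crash. The function names are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("jax", "jaxlib")):
    """Map each package name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


print(report())
```

Once the versions look right, running python -c "import jax; print(jax.devices())" is a quick way to check that jax can see the GPU.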
Osgood001 commented
Thanks! Already fixed by switching to cuda12.5 and reinstalling jax!