deepmodeling/CrystalFormer

Segmentation Fault Error Running CrystalFormer Sampling

Closed this issue · 4 comments

Description:

I am getting a segmentation fault when running the CrystalFormer sampling script in a CUDA 12.1 environment. Below are the steps I followed and the relevant script output:

Expected Behavior:

  • The script should execute without errors, completing its intended task.

Actual Behavior:

  • After correcting the script by removing the unrecognized option, the script still resulted in a Segmentation fault.
python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000  --batchsize 1000 --temperature 1.0
2024-06-27 16:41:50.440252: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.1 which is older than the ptxas CUDA version (12.5.40). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
number of available cpu:  8
num_io_process should not exceed number of available cpu, reset to  8
/opt/mamba/envs/crystalgpt/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
25 Pmm2 5
g, a, w, m, symbol, x: 25 3 4 1 1d 25 Pmm2 5
g, a, w, m, symbol, x: 25 82 4 1 1d [0.5        0.5        0.55085454]
g, a, w, m, symbol, x: 25 14 1 1 1a [0.         0.         0.28531504]
[0.5        0.5        0.50596338]
g, a, w, m, symbol, x: 25 7 2 1 1b g, a, w, m, symbol, x: 25 74 1 1 1a [0.         0.5        0.15366418]
[0.         0.         0.00040354]
g, a, w, m, symbol, x: 25 8 3 1 1c g, a, w, m, symbol, x: 25 7 3 1 1c[0.5        0.         0.47407905] 
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5        0.         0.49964437]
[0.5        0.5        0.03555206]g, a, w, m, symbol, x:
==================================
...
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 29 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 33 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 7 3 3 3c [0.  0.5 0.5]
['1a' '1b' '3c'] [29 33  7] [1 2 3] 5
===================================
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(21, 119)
Segmentation fault

Reproduction Steps:

  1. Follow the provided bash scripts in order.

  2. Observe the error after the corrected script execution.

  3. Environment Activation & GPU Check:

    conda activate crystalgpt
  4. CUDA and JAX Version Check:

    nvcc --version
  5. JAX Installation:

    pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
  6. Requirements Installation:

    pip install -r requirements.txt
  7. Model Download & Rename:

    wget "https://drive.usercontent.google.com/u/0/uc?id=1koHC6n38BqsY2_z3xHTi40HcFbVesUKd&export=download"
    mv downloaded_model_file epoch_003800.pkl
  8. Initial Script Execution with Error:

    python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0 --use_foriloop

The --use_foriloop option is not recognized.

  9. Corrected Script Execution Resulting in Segmentation Fault:
    python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/epoch_003800.pkl --spacegroup 160 --num_samples 1000 --batchsize 1000 --temperature 1.0

Notes:

  • The script outputs warnings regarding CUDA version mismatch and potential deadlock before the Segmentation fault.
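
The CUDA mismatch warning in the log above names two versions: the driver's CUDA version and the ptxas version. Below is a minimal sketch of extracting and comparing them. It assumes the message wording matches the one shown in this issue and that the message sits on a single line (re-join wrapped terminal output first). The helper name `parse_xla_cuda_warning` is hypothetical and is not part of JAX or CrystalFormer.

```python
import re

# Pattern assumes the warning wording shown in this issue's log;
# other XLA versions may phrase the message differently.
_WARNING_RE = re.compile(
    r"driver's CUDA version is (\d+)\.(\d+) which is older than "
    r"the ptxas CUDA version \((\d+)\.(\d+)(?:\.\d+)?\)"
)


def parse_xla_cuda_warning(log_line):
    """Return ((driver_major, driver_minor), (ptxas_major, ptxas_minor)), or None."""
    match = _WARNING_RE.search(log_line)
    if match is None:
        return None
    d_major, d_minor, p_major, p_minor = (int(g) for g in match.groups())
    return (d_major, d_minor), (p_major, p_minor)


if __name__ == "__main__":
    line = (
        "The NVIDIA driver's CUDA version is 12.1 which is older than "
        "the ptxas CUDA version (12.5.40)."
    )
    driver, ptxas = parse_xla_cuda_warning(line)
    # Tuples compare element by element, so (12, 1) < (12, 5).
    print(driver, ptxas, driver < ptxas)
```

Comparing versions as integer tuples avoids the pitfalls of string comparison (for example, "12.10" sorting before "12.5").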

Attachments:

nvidia-smi
Thu Jun 27 16:47:26 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    27W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
  • The --use_foriloop parameter has been deprecated; I will update the docs ASAP.
  • With --restore_path, you only need to specify the folder that contains the parameter file, so change --restore_path /data/model/epoch_003800.pkl to --restore_path /data/model/
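
To illustrate the folder-based --restore_path convention, here is a hypothetical sketch of how a loader could pick the newest checkpoint from a folder. The `epoch_NNNNNN.pkl` naming pattern is inferred from the file name in this issue, and `find_latest_checkpoint` is an illustrative helper, not CrystalFormer's actual function.

```python
import os
import re

# Assumed naming scheme, inferred from "epoch_003800.pkl" in this issue.
_CKPT_RE = re.compile(r"^epoch_(\d+)\.pkl$")


def find_latest_checkpoint(folder):
    """Return the path of the highest-numbered epoch_*.pkl in folder, or None."""
    best_epoch, best_name = -1, None
    for name in os.listdir(folder):
        match = _CKPT_RE.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    return os.path.join(folder, best_name) if best_name else None
```

Under this assumption, passing a file path instead of a folder would make the directory listing fail, which is one reason to pass /data/model/ rather than the .pkl file itself.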

A similar error message persists:

python /home/CrystalFormer/src/main.py --optimizer none --test_path /home/CrystalFormer/data/mini.csv --restore_path /data/model/ --spacegroup 160 --num_samples 1000  --batchsize 1000 --temperature 1.0
2024-06-27 21:19:01.921329: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.1 which is older than the ptxas CUDA version (12.5.40). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
number of available cpu:  8
num_io_process should not exceed number of available cpu, reset to  8
/opt/mamba/envs/crystalgpt/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
25 Pmm2 5
g, a, w, m, symbol, x: 25 3 4 1 1d [0.5        0.5        0.55085454]
25 Pmm2 g, a, w, m, symbol, x:5 25 
14 1 1 1a g, a, w, m, symbol, x: 25 81 4 1 1d [0.         0.         0.28531504]
g, a, w, m, symbol, x: 25 7 2 1 1b [0.         0.5        0.15366418]
[0.5       0.5       0.5057673]g, a, w, m, symbol, x:
 25 8 3 1 1c g, a, w, m, symbol, x: 25 44 1 1 [0.5        0.         0.47407905]1a 
g, a, w, m, symbol, x: 25 8 4 1 1d [0.         0.         0.00033725]
[0.5        0.5        0.03555206]
g, a, w, m, symbol, x: 25 8 3 1 1c ['1a' '1b' '1c' '1d' '1d'] [14  7  8  3  8][0.5        0.         0.49976346] 
[1 2 3 4 4] 5
g, a, w, m, symbol, x: 25 8 4 1 1d ===================================
[0.5        0.5        0.00473415]
g, a, w, m, symbol, x: 25 9 2 1 1b [0.         0.5        0.49946523]
['1a' '1b' '1c' '1d' '1d'] [44  9  8 81  8] [1 2 3 4 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 82 4 1 1d [0.5        0.5        0.50596338]
g, a, w, m, symbol, x: 25 74 1 1 1a [0.         0.         0.00040354]
g, a, w, m, symbol, x: 25 7 3 1 1c [0.5        0.         0.49964437]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5        0.5        0.00486688]
g, a, w, m, symbol, x: 25 9 2 1 1b [0.         0.5        0.49949773]
['1a' '1b' '1c' '1d' '1d'] [74  9  7 82  8] [1 2 3 4 4] 5
...
25 Pmm2 5
g, a, w, m, symbol, x: 25 56 1 1 1a [0.         0.         0.00096878]
g, a, w, m, symbol, x: 25 44 4 1 1d [0.5        0.5        0.50393304]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5        0.         0.50006739]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5        0.5        0.00410809]
g, a, w, m, symbol, x: 25 9 2 1 1b [0.         0.5        0.50043428]
['1a' '1b' '1c' '1d' '1d'] [56  9  8 44  8] [1 2 3 4 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 39 1 1 1a [0.         0.         0.91067428]
g, a, w, m, symbol, x: 25 49 4 1 1d [0.5        0.5        0.57043078]
g, a, w, m, symbol, x: 25 16 2 1 1b [0.         0.5        0.36299416]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5        0.         0.69224078]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5        0.5        0.04031656]
['1a' '1b' '1c' '1d' '1d'] [39 16  8 49  8] [1 2 3 4 4] 5
===================================
99 P4mm 4
g, a, w, m, symbol, x: 99 27 2 991  1bP4mm  4
g, a, w, m, symbol, x: 99 72[0.5        0.5        0.43551539] 2
 1 1b g, a, w, m, symbol, x: 99 74 1 1 1a [0.        0.        0.0935678]
g, a, w, m, symbol, x: 99 8 3 2 2c [0.5        0.5        0.52762345]
g, a, w, m, symbol, x: 99 13 1 1 1a [0.5        0.         0.36321747]
[0.         0.         0.23811871]
g, a, w, m, symbol, x:g, a, w, m, symbol, x:  9999  88  23  12  1b2c  [0.5        0.5        0.92262586]
['1a' '1b' '1b' '2c'] [74 27  8  8] [0.5        0.         0.41935511]
g, a, w, m, symbol, x: 99 8[1 2 2 3] 2  51
 1b ===================================
[0.5        0.5        0.99443926]
['1a' '1b' '1b' '2c'] [13 72  8  8] [1 2 2 3] 5
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 3 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 13 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 8 4 3 3d [0.5 0.  0. ]
['1a' '1b' '3d'] [13  3  8] [1 2 4] 5
===================================
25 Pmm2 5
g, a, w, m, symbol, x: 25 4 4 1 1d [0.5        0.5        0.50295577]
g, a, w, m, symbol, x: 25 33 1 1 1a [0.         0.         0.18006878]
g, a, w, m, symbol, x: 25 7 2 1 1b [0.         0.5        0.42020007]
g, a, w, m, symbol, x: 25 8 3 1 1c [0.5        0.         0.42690466]
g, a, w, m, symbol, x: 25 8 4 1 1d [0.5        0.5        0.94932273]
['1a' '1b' '1c' '1d' '1d'] [33  7  8  4  8] [1 2 3 4 4] 5
===================================
221 Pm-3m 3
g, a, w, m, symbol, x: 221 19 2 1 1b [0.5 0.5 0.5]
g, a, w, m, symbol, x: 221 51 1 1 1a [0. 0. 0.]
g, a, w, m, symbol, x: 221 8 4 3 3d [0.5 0.  0. ]
['1a' '1b' '3d'] [51 19  8] [1 2 4] 5
===================================
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(21, 119)
Segmentation fault
  • CUDA 12.1 may be incompatible with the latest version of JAX; try a newer version of CUDA, such as CUDA 12.5.
  • The method you used to install JAX has been deprecated. Please see the new JAX installation guide.
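
Before and after reinstalling, it can help to confirm which JAX packages are actually installed in the environment. The following is a small sketch using only the standard library. The package names checked are an assumption about a typical JAX install, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("jax", "jaxlib")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Once a GPU build is installed, running `python -c "import jax; print(jax.devices())"` should list the GPU rather than only a CPU device.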

Thanks! Fixed by switching to CUDA 12.5 and reinstalling JAX.