k2-fsa/icefall

Plans to make installation simpler


Hello,

Thanks for the whole k2-icefall-sherpa-lhotse framework. I have used it successfully in the past (about a year ago), but it seems that every time I try to get back to it with the latest updates, I hit new issues in the installation process. I guess for someone with root access it is relatively easy to install, but without it there is always one step or another failing, and once you solve it another one comes up. The main bottleneck seems to be that k2 needs a non-conda installation of CUDA and cuDNN.

This time (with the latest icefall as of today) I followed these steps:

  1. Followed https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-12-1. The activate-cuda script there already seems to be missing the line export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include, which makes the k2 installation fail later.
  2. Installed PyTorch with pip; that was easy.
  3. Tried and failed to install k2 from source, first because of the issue mentioned in 1., and then, even after adding that export, because it still fails with Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR) (found version "12.1.66").
  4. Installed k2 from a wheel with wget https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl, which seemed to work.
  5. Installed lhotse, which I had never had issues with before. This time I got another error related to lilcom; I found what seemed to be a solution in danpovey/lilcom#50 (comment).
  6. Tried to install lilcom from conda, which also failed because lilcom-1.4-py3.10 requires python >=3.10,<3.11.0a0 and my environment is Python 3.12.

So it is quite difficult to get all of icefall's dependencies installed, and because every time you fix an error you get another one, it is not user friendly at all.

The simplest way to reach the largest number of users would definitely be to support a conda environment, where you could simply provide a YAML environment file that installs everything in one command. It seems this is currently not possible, mostly because of the cuDNN dependency?
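For illustration only, here is a rough sketch of what such a one-command setup could look like; the file name, channels, and pins below are hypothetical and do not exist in icefall today:

# Hypothetical sketch only; package names and pins are illustrative.
cat > icefall-env.yml <<'EOF'
name: icefall
channels: [pytorch, nvidia, conda-forge]
dependencies:
  - python=3.10
  - pytorch=2.2.1
  - pytorch-cuda=12.1
  - pip
  - pip:
      - lhotse
      - k2   # assumes a CUDA-enabled k2 build installable this way, which may not exist
EOF
conda env create -f icefall-env.yml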

Is there any plan to make the installation process easier? Thanks!

I guess for someone with root access it is relatively easy to install

I think none of our dependencies needs root access.

Is there any plan to make the installation process easier?

You can use Docker; see https://github.com/k2-fsa/icefall/tree/master/docker
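For example, something along the following lines (the image tag and the path inside the container are only illustrative; check the docker directory above for the tags that actually exist for your torch/CUDA combination):

# Illustrative only; pick a real tag from the docker directory linked above.
sudo docker run --gpus all --rm -it k2fsa/icefall:torch2.2.0-cuda12.1 /bin/bash
# inside the container (the checkout path may differ between images):
cd /workspace/icefall/egs/yesno/ASR
./prepare.sh
./tdnn/train.py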

We normally install dependencies with pip; conda is not recommended.

As for 5 and 6, I will have a look and fix them.

Using Python 3.10 the installation looks like it completes (following the steps of the installation guide); however, when trying to run the yesno training I am getting the following:

2024-03-20 10:31:29,424 INFO [asr_datamodule.py:255] About to get test cuts
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
...
Traceback (most recent call last):
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 575, in <module>
    main()
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 571, in main
    run(rank=0, world_size=1, args=args)
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 536, in run
    train_one_epoch(
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 417, in train_one_epoch
    loss.backward()
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: GET was unable to find an engine to execute this computation
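This kind of undefined-symbol error between libcudnn_cnn_train.so.8 and libcudnn_cnn_infer.so.8 usually points to two different cuDNN copies being mixed (for example the locally unpacked one and the one bundled with the PyTorch wheel). A rough way to check this, sketched here as an assumption rather than something taken from the logs above:

# Compare the cuDNN version PyTorch actually loads with the copies visible on the
# library path; a mismatch would explain the error above.
python3 -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"
echo $LD_LIBRARY_PATH
ldconfig -p | grep libcudnn_cnn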

I also checked that the CUDA version I installed locally is compatible with the installed driver:

# got this installer cuda_12.1.0_530.30.02_linux.run
nvidia-smi 
Wed Mar 20 10:44:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+

Here is another attempt, using CUDA 11.8 instead, after what looks like a successful installation:

$ ./tdnn/train.py 
2024-03-20 11:07:31,740 INFO [train.py:481] Training started
2024-03-20 11:07:31,740 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'seed': 42, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.23.0.dev+git.d3106cf.clean', 'torch-version': '2.2.1+cu118', 'torch-cuda-available': True, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ea92fc3-clean', 'icefall-git-date': 'Tue Mar 19 14:57:16 2024', 'icefall-path': '/home/pe.honnet/Projects/tl_icefall', 'k2-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'tlzhsrv010', 'IP address': '127.0.1.1'}}
2024-03-20 11:07:31,741 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2024-03-20 11:07:31,742 INFO [train.py:495] device: cuda:0
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:146] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:247] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:149] About to create train dataset
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:201] Using SimpleCutSampler.
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:207] About to create train dataloader
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:220] About to get test cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:255] About to get test cuts
Segmentation fault (core dumped)

For the segmentation fault, please see
#674


By the way, you are the first one who has run into so many issues setting up the icefall environment in the past 6 months.

It would be great if you could tell us the exact commands you have run and whether you have strictly followed the installation docs for both k2 and icefall.

For the installation based on CUDA 11.8, here is the full shell history (a sketch of the edited activate-cuda-11.8.sh follows the history):

  318  python3.10 -m venv venv2
  319  source venv2/bin/activate
  320  cd cuda/
  321  wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
  322  chmod +x cuda_11.8.0_520.61.05_linux.run 
  323  ./cuda_11.8.0_520.61.05_linux.run   --silent   --toolkit   --installpath=$PWD/cuda-11.8.0   --no-opengl-libs   --no-drm   --no-man-page
  324  wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz
  325  tar xvf cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz --strip-components=1 -C  $PWD/cuda-11.8.0
  326  cd ..
  327  cp activate-cuda-12.1.sh activate-cuda-11.8.sh 
  328  nano activate-cuda-11.8.sh 
  329  source activate-cuda-11.8.sh 
  330  which nvcc
  331  nvcc --version
  332  pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
  333  cd k2_wheel/
  334  wget https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
  335  ls
  336  pip install k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
  337  cd ..
  338  pip install git+https://github.com/lhotse-speech/lhotse
  339  pip install -r requirements.txt 
  340  export PYTHONPATH=$PWD:$PYTHONPATH
  341  cd egs/yesno/ASR/
  342  ./prepare.sh 
  343  ./tdnn/train.py 
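The activate-cuda-11.8.sh edited at steps 327-328 is just the activate-cuda-12.1.sh from the k2 CUDA/cuDNN docs with the paths switched to 11.8. Roughly, it boils down to something like the sketch below; the exact set of variables in the official script may differ:

# Sketch only; follow the k2 cuda-cudnn docs for the authoritative contents.
export CUDA_HOME=$PWD/cuda/cuda-11.8.0            # wherever the toolkit was unpacked
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:$LD_LIBRARY_PATH
export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
export CUDAToolkit_ROOT=$CUDA_HOME
export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include   # the extra export mentioned earlier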

In the other case (CUDA 12.1) I used the same approach, but based on CUDA 12.1 (from https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-12-1) and the matching k2 wheel. Here is a new attempt (with the same error as reported in my previous comment):

  360  python3.10 -m venv venv3
  361  source venv3/bin/activate
  362  cd cuda/
  363  ./cuda_12.1.0_530.30.02_linux.run   --silent   --toolkit   --installpath=$PWD/cuda-12.1.0   --no-opengl-libs   --no-drm   --no-man-page
  364  tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C $PWD/cuda-12.1.0
  365  cd ..
  366  source activate-cuda-12.1.sh 
  367  pip install torch torchaudio
  368  cd k2_wheel
  369  pip install k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  370  pip install git+https://github.com/lhotse-speech/lhotse
  371  cd ..
  372  pip install -r requirements.txt 
  373  export PYTHONPATH=$PWD:$PYTHONPATH
  374  cd egs/yesno/ASR/
  375  ./prepare.sh 
  376  ./tdnn/train.py 

Thanks! Your commands look good.

Tried and failed to install k2 from source, first because of the issue mentioned in 1., and then, even after adding that export, because it still fails with Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR) (found version "12.1.66").

Could you give the complete error logs for this?


Installed lhotse, which I had never had issues with before. This time I got another error related to lilcom.

Could you give the complete error logs for this?

OK, so I retried the same thing (i.e. installing k2 from source). I followed the instructions in https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#set-environment-variables-for-cuda-12-1 and then ran:

git clone https://github.com/k2-fsa/k2.git
cd k2
export K2_MAKE_ARGS="-j6"
python3 setup.py install

At first I get the error shown in the attached log file error_k2_from_source_1.log.

I fixed it by adding this line to the activate-cuda script:

export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include

Then, trying to install k2 from source again, I get the error in the log file error_k2_from_source_2.log:

-- Unable to find cuda_runtime.h in "/home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/include" for CUDAToolkit_INCLUDE_DIR.
-- Unable to find cublas_v2.h in either "" or "/home/pe.honnet/Projects/tl_icefall/math_libs/include"

even though both cuda_runtime.h and cublas_v2.h are present in $CUDAToolkit_INCLUDE_DIR.
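As a rough double-check of that mismatch (sketch only; CUDAToolkit_ROOT is a standard hint for CMake's FindCUDAToolkit module, and whether setting it helps this particular k2 build is untested):

ls "$CUDAToolkit_INCLUDE_DIR"/cuda_runtime.h "$CUDAToolkit_INCLUDE_DIR"/cublas_v2.h   # confirm the headers really are there
export CUDAToolkit_ROOT=$CUDA_HOME   # standard CMake hint; untested for this build
python3 setup.py install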

error_k2_from_source_1.log
error_k2_from_source_2.log

Regarding the lhotse issue, it was the lilcom problem solved by danpovey/lilcom#50 (comment).
The cause was that I was first creating a conda environment and then creating a virtualenv inside it (because python3.10-venv was not installed on the system; I have since asked the admin to add it and dropped the conda workaround).

Regarding the seg fault, I am not able to solve it with the solution you shared: the data preparation works, but training still fails.

Regarding the seg fault, I am not able to solve it with the solution you shared: the data preparation works, but training still fails.

Are you able to use the command in the link posted above to find out more?

Sure, this is the output log (with the CUDA 11.8 environment; with CUDA 12.1 I get the error I reported before):
gdb_output.log

Here is a finding based on this comment: https://discuss.pytorch.org/t/could-not-load-library-libcudnn-cnn-train-so-8-but-im-sure-that-i-have-set-the-right-ld-library-path/190277/3
I simply removed the link libcudnn_cnn_train.so.8 in the folder .../cuda-12.1.0/lib and was then able to run the tdnn/train.py script.
It is surprising that no one else has hit this error before (I had the same error on two different servers, with both old and recent GPUs).
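Concretely, the workaround amounts to something like this (sketch only; the path is illustrative, and moving the file aside instead of deleting it keeps a way back):

cd /path/to/cuda-12.1.0/lib                                # adjust to the local CUDA 12.1 install
mv libcudnn_cnn_train.so.8 libcudnn_cnn_train.so.8.bak     # the link can no longer be picked up at runtime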

@csukuangfj I am closing this issue since in the end I was able to make it work, but I think my last comment (about removing libcudnn_cnn_train.so.8) may be something to keep in mind, as other people will probably face it too.