Plans to make installation simpler
Closed this issue · 14 comments
Hello,
Thanks for the whole k2-icefall-sherpa-lhotse framework. I used it successfully in the past (about 1 year ago), but every time I come back to it with the latest updates I hit new issues in the installation process. I guess for someone with root access it is relatively easy to install, but without it there is always one step or another failing, and once you solve it another one comes up. The main bottleneck seems to be that k2 needs a non-conda installation of cuda and cudnn.
This time (with the latest icefall as of today) I did the following:
- followed https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-12-1. Here the activate cuda script already seems to be missing the definition of
export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include
which makes the k2 installation fail later.
- installed pytorch with pip, which was easy.
- tried and failed to install k2 from source: first because of the issue I mentioned in 1., and after adding that line it still fails with
Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR) (found version "12.1.66")
- installed k2 from a wheel with
wget https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
which seemed to work.
- installed lhotse, which I had not had issues with before. Here I get another error related to lilcom.
- found what seemed to be a solution: danpovey/lilcom#50 (comment)
- tried to install lilcom from conda, which also failed because lilcom-1.4-py3.10 requires python >=3.10,<3.11.0a0 and my environment is python3.12.
So it is quite difficult to get all the dependencies of icefall installed, and because every error you fix is followed by another one, the process is not user friendly at all.
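For reference, a minimal sketch of an activation helper that includes the extra variable from the first list item. The function name `activate_cuda` and the install prefix are hypothetical; only the `CUDAToolkit_INCLUDE_DIR` line comes from this report, and the other exports are the usual variables for a user-local CUDA toolkit rather than a copy of the script in the k2 documentation.

```shell
#!/bin/sh
# Sketch of an activation helper for a CUDA toolkit installed without root.
# Assumption: the toolkit lives under the prefix passed as the first argument.
activate_cuda() {
  CUDA_HOME="$1"
  export CUDA_HOME
  # Put nvcc first on PATH.
  export PATH="$CUDA_HOME/bin:$PATH"
  # Make the toolkit's shared libraries visible to the dynamic loader.
  export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$CUDA_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  # The line reported as missing in this issue.
  export CUDAToolkit_INCLUDE_DIR="$CUDA_HOME/targets/x86_64-linux/include"
}

# Hypothetical usage; replace the path with your own install prefix:
# activate_cuda "$HOME/cuda/cuda-12.1.0"
```

The function needs to be sourced into the current shell (not executed as a separate script) so that the exported variables survive.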
The simplest way to reach the largest number of users would be to support conda environments, where you could provide a YAML environment file that installs everything in one command. It seems that this is not currently possible, mostly because of the cudnn dependency?
Is there any plan on making the installation process easier? Thanks!
I guess for someone with root access it is relatively easy to install
I think none of our dependencies needs root access.
Is there any plan on making the installation process easier?
You can use docker, see https://github.com/k2-fsa/icefall/tree/master/docker
We normally install dependencies with pip; conda is not recommended.
As for items 5 and 6, I will have a look and fix them.
Using python3.10, the installation appears to complete (following the steps of the installation guide). However, when trying to run the yesno training I get the following:
2024-03-20 10:31:29,424 INFO [asr_datamodule.py:255] About to get test cuts
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
...
Traceback (most recent call last):
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 575, in <module>
    main()
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 571, in main
    run(rank=0, world_size=1, args=args)
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 536, in run
    train_one_epoch(
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 417, in train_one_epoch
    loss.backward()
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: GET was unable to find an engine to execute this computation
I also checked that the cuda version I installed locally should be compatible with the installed drivers:
# got this installer cuda_12.1.0_530.30.02_linux.run
nvidia-smi
Wed Mar 20 10:44:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
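The undefined-symbol message above names two different cuDNN libraries. That could mean libraries from two cuDNN builds are being mixed, for example a local copy on `LD_LIBRARY_PATH` and the copy that pip installs alongside PyTorch. This is an assumption, not something the log proves. A small sketch for listing every cuDNN library reachable through `LD_LIBRARY_PATH` follows; the function name `list_cudnn_on_path` is hypothetical.

```shell
#!/bin/sh
# List every libcudnn*.so* file found in a colon-separated library path.
# With no argument, the current LD_LIBRARY_PATH is scanned.
list_cudnn_on_path() {
  # Split the search path into one directory per line.
  printf '%s\n' "${1:-$LD_LIBRARY_PATH}" | tr ':' '\n' | while IFS= read -r dir; do
    # Skip empty or non-existent entries.
    [ -d "$dir" ] || continue
    for f in "$dir"/libcudnn*.so*; do
      # An unmatched glob stays literal, so check that the file exists.
      if [ -e "$f" ]; then
        printf '%s\n' "$f"
      fi
    done
  done
}

# Usage: compare the listed files with the version PyTorch reports, e.g.
#   list_cudnn_on_path
#   python3 -c "import torch; print(torch.backends.cudnn.version())"
```

If the listing shows a cuDNN release different from the one PyTorch reports, that mismatch is a candidate cause worth ruling out.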
Another attempt, this time using cuda 11.8, with what looks like a successful installation:
$ ./tdnn/train.py
2024-03-20 11:07:31,740 INFO [train.py:481] Training started
2024-03-20 11:07:31,740 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'seed': 42, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.23.0.dev+git.d3106cf.clean', 'torch-version': '2.2.1+cu118', 'torch-cuda-available': True, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ea92fc3-clean', 'icefall-git-date': 'Tue Mar 19 14:57:16 2024', 'icefall-path': '/home/pe.honnet/Projects/tl_icefall', 'k2-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'tlzhsrv010', 'IP address': '127.0.1.1'}}
2024-03-20 11:07:31,741 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2024-03-20 11:07:31,742 INFO [train.py:495] device: cuda:0
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:146] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:247] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:149] About to create train dataset
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:201] Using SimpleCutSampler.
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:207] About to create train dataloader
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:220] About to get test cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:255] About to get test cuts
Segmentation fault (core dumped)
For the segmentation fault, please see #674
By the way, you are the first one in the past 6 months to have so many issues setting up the icefall environment.
It would be great if you could tell us the exact commands you ran and whether you strictly followed the installation docs for both k2 and icefall.
For the installation based on cuda 11.8 here is the full history:
python3.10 -m venv venv2
source venv2/bin/activate
cd cuda/
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --silent --toolkit --installpath=$PWD/cuda-11.8.0 --no-opengl-libs --no-drm --no-man-page
wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz
tar xvf cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz --strip-components=1 -C $PWD/cuda-11.8.0
cd ..
cp activate-cuda-12.1.sh activate-cuda-11.8.sh
nano activate-cuda-11.8.sh
source activate-cuda-11.8.sh
which nvcc
nvcc --version
pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
cd k2_wheel/
wget https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
ls
pip install k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
cd ..
pip install git+https://github.com/lhotse-speech/lhotse
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
cd egs/yesno/ASR/
./prepare.sh
./tdnn/train.py
In the other case (cuda 12.1) I used the same approach, but based on cuda 12.1 (from https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-12-1) and with the corresponding k2 wheel. Here is a new attempt (with the same error as reported in my previous comment):
python3.10 -m venv venv3
source venv3/bin/activate
cd cuda/
./cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=$PWD/cuda-12.1.0 --no-opengl-libs --no-drm --no-man-page
tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C $PWD/cuda-12.1.0
cd ..
source activate-cuda-12.1.sh
pip install torch torchaudio
cd k2_wheel
pip install k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
pip install git+https://github.com/lhotse-speech/lhotse
cd ..
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
cd egs/yesno/ASR/
./prepare.sh
./tdnn/train.py
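The k2 wheel file names used in these commands encode a CUDA version, a torch version, and a CPython tag, and a mismatch with the active environment (for example a cp312 wheel in a python3.10 venv) is easy to miss. A small sketch that extracts those three fields from a wheel file name so they can be compared by eye; the function name `parse_k2_wheel` is hypothetical, and the pattern is inferred only from the file names shown in this thread.

```shell
#!/bin/sh
# Extract the CUDA version, torch version, and CPython tag from a k2 wheel
# file name of the form seen in this thread. Prints nothing if the name does
# not match that pattern.
parse_k2_wheel() {
  basename "$1" | sed -n -E \
    's/^k2-[^+]+\+cuda([0-9.]+)\.torch([0-9.]+)-cp([0-9]+)-.*$/cuda=\1 torch=\2 python=cp\3/p'
}

# Usage:
#   parse_k2_wheel k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#   prints: cuda=12.1 torch=2.2.1 python=cp310
```

The printed values can then be checked against `python3 --version` and `python3 -c "import torch; print(torch.__version__, torch.version.cuda)"`.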
Thanks! Your commands look good.
tried and failed to install k2 from source (first because of the issue I mentioned in 1. but then after adding that it still fails with Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR) (found version "12.1.66")
Could you give the complete error logs for this?
installing lhotse which I had not had issues before. Here I get another error related to lilcom
Could you give the complete error logs for this?
OK, so I retried the same thing (i.e. installing k2 from source). I followed the instructions in https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#set-environment-variables-for-cuda-12-1
and then ran:
git clone https://github.com/k2-fsa/k2.git
cd k2
export K2_MAKE_ARGS="-j6"
python3 setup.py install
I first get the error shown in the log file error_k2_from_source_1.log
I fixed it by adding this line to the activate-cuda script
export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include
Then, trying to install k2 from source again, I get the error in the log file error_k2_from_source_2.log:
-- Unable to find cuda_runtime.h in "/home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/include" for CUDAToolkit_INCLUDE_DIR.
-- Unable to find cublas_v2.h in either "" or "/home/pe.honnet/Projects/tl_icefall/math_libs/include"
although $CUDAToolkit_INCLUDE_DIR does contain cuda_runtime.h and cublas_v2.h.
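The log lines above show the build looking under `<CUDA_HOME>/include`, while the headers were confirmed under `targets/x86_64-linux/include`. A short sketch for checking which of those two directories actually contains a given header; the function name `find_cuda_header` is hypothetical, and the two candidate directories are simply the ones mentioned in this thread.

```shell
#!/bin/sh
# Print the full path of a CUDA header for each candidate include directory
# under the given toolkit prefix where it exists. Prints nothing if the header
# is in neither directory.
find_cuda_header() {
  header="$1"
  cuda_home="$2"
  for dir in "$cuda_home/include" "$cuda_home/targets/x86_64-linux/include"; do
    if [ -f "$dir/$header" ]; then
      printf '%s\n' "$dir/$header"
    fi
  done
}

# Usage (assumes CUDA_HOME is set by the activation script):
#   find_cuda_header cuda_runtime.h "$CUDA_HOME"
#   find_cuda_header cublas_v2.h "$CUDA_HOME"
```

If a header only shows up under the `targets` directory, the `<CUDA_HOME>/include` entry (possibly a missing or broken symlink) would be the next thing to inspect.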
Regarding the lhotse issue, it was the lilcom issue solved by danpovey/lilcom#50 (comment)
The cause was that I was first creating a conda environment and then creating a virtualenv inside it, because python3.10-venv was not installed on the system. I have since asked the admin to install it and dropped the conda workaround.
Regarding the seg fault, I am not able to solve it with the solution you shared - the data preparation works, but it fails in the training.
Regarding the seg fault, I am not able to solve it with the solution you shared - the data preparation works, but it fails in the training.
Are you able to use the command in the above posted link to find out more?
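One generic way to capture a native backtrace for a crashing Python script is to run it under gdb in batch mode. This is a sketch and not necessarily the exact command from the linked issue; the function name `run_under_gdb` and the `GDB` override variable are hypothetical.

```shell
#!/bin/sh
# Run a command under gdb in batch mode and save the output, including the
# backtrace printed after a crash, to a log file.
# Set GDB to use a gdb binary that is not on PATH.
run_under_gdb() {
  log="$1"
  shift
  # -batch: exit once the listed commands have finished
  # -ex:    execute the given gdb command ("run", then "bt" for a backtrace)
  # --args: everything after it is the program and its arguments
  "${GDB:-gdb}" -batch -ex run -ex bt --args "$@" > "$log" 2>&1
}

# Usage:
#   run_under_gdb gdb_output.log python3 ./tdnn/train.py
```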
Sure, this is the output log (with the cuda 11.8 environment; with cuda 12.1 I get the error I reported before):
gdb_output.log
Here is a finding from this comment: https://discuss.pytorch.org/t/could-not-load-library-libcudnn-cnn-train-so-8-but-im-sure-that-i-have-set-the-right-ld-library-path/190277/3
I simply removed the link libcudnn_cnn_train.so.8 in the folder .../cuda-12.1.0/lib and then seemed to be able to run the tdnn/train.py script.
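A reversible variant of that workaround is to rename the link instead of deleting it, so that the dynamic loader no longer finds it under its original name but it can be restored later. This is a sketch of that alternative, not what was actually run in this thread; the function name `disable_lib` and the `.disabled` suffix are hypothetical.

```shell
#!/bin/sh
# Rename a shared library (or a symlink to one) so it is no longer found under
# its original name. Restore it by renaming it back.
disable_lib() {
  lib="$1"
  # -L also catches symlinks whose target is missing.
  if [ -e "$lib" ] || [ -L "$lib" ]; then
    mv -- "$lib" "$lib.disabled"
  else
    echo "not found: $lib" >&2
    return 1
  fi
}

# Usage (fill in the directory of your own local CUDA install):
#   disable_lib "$CUDA_HOME/lib/libcudnn_cnn_train.so.8"
# Undo:
#   mv "$CUDA_HOME/lib/libcudnn_cnn_train.so.8.disabled" "$CUDA_HOME/lib/libcudnn_cnn_train.so.8"
```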
It is surprising that no one else got that error before (I had the same error on two different servers with old and recent GPUs).
@csukuangfj I am closing this issue as in the end I was able to make it work, but I think that my last comment (about removing libcudnn_cnn_train.so.8) may be something to keep in mind, as other people will probably face it too.