bghira/SimpleTuner

Dependencies broken for fresh quickstart install

aa956 opened this issue · 12 comments

aa956 commented

On Debian 12, a fresh install following the Flux Quickstart.

release branch is currently at a9fde80

$ git clone --branch=release https://github.com/bghira/SimpleTuner.git simpletuner-flux
$ cd simpletuner-flux
$ python3.11 -m venv .venv
$ source .venv/bin/activate
$ pip install -U poetry pip
$ poetry config virtualenvs.create false
$ poetry install

Breaks with:

  - Installing pytorch-triton (3.1.0+cf34004b8a): Failed

  RuntimeError

  Hash for pytorch-triton (3.1.0+cf34004b8a) from archive pytorch_triton-3.1.0+cf34004b8a-cp311-cp311-linux_x86_64.whl not found in known hashes (was: sha256:b4a64c048b090cb9781de1e1cc9022d03195b867450644d3bb3207be2190e3a3)

  at ~/.local/share/pipx/venvs/poetry/lib/python3.11/site-packages/poetry/installation/executor.py:812 in _validate_archive_hash
      808│ 
      809│         archive_hash = f"{hash_type}:{get_file_hash(archive, hash_type)}"
      810│ 
      811│         if archive_hash not in known_hashes:
    → 812│             raise RuntimeError(
      813│                 f"Hash for {package} from archive {archive.name} not found in"
      814│                 f" known hashes (was: {archive_hash})"
      815│             )
      816│ 

Cannot install pytorch-triton.

  - Installing sentencepiece (0.2.0)
  - Installing tensorboard (2.18.0)
  - Installing torch-optimi (0.2.1)
  - Installing torchao (0.5.0+cu124)
  - Installing torchaudio (2.5.0.dev20240929+cu124)
  - Installing torchmetrics (1.4.2)
  - Installing torchsde (0.2.6)
  - Installing triton (3.0.0)
  - Installing triton-library (1.0.0rc4)
  - Installing wandb (0.18.2)
Warning: The file chosen for install of aiohappyeyeballs 2.4.2 (aiohappyeyeballs-2.4.2-py3-none-any.whl) is yanked. Reason for being yanked: Regression: https://github.com/aio-libs/aiohappyeyeballs/issues/100

I'll try to investigate further with manual triton package upgrades/downgrades, but this is the state at the moment.

Could this be a problem with my machine's pip/poetry caches, given the "not found in known hashes" message?
On top of that, some of the packages pull in a yanked dependency.
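For context, the failing check just recomputes the archive's digest and compares it against the lockfile's recorded hashes. A minimal standalone sketch (illustrative names, not poetry's actual API): if PyTorch republishes a wheel under the same filename with different bytes, the recomputed hash stops matching the lockfile and poetry raises exactly this kind of RuntimeError, no matter what's in the local cache.

```python
# Sketch of poetry's archive-hash validation (illustrative, not poetry's code).
import hashlib
from pathlib import Path


def validate_archive_hash(archive: Path, known_hashes: set[str],
                          hash_type: str = "sha256") -> str:
    # Recompute the digest of the downloaded archive...
    digest = hashlib.new(hash_type)
    digest.update(archive.read_bytes())
    archive_hash = f"{hash_type}:{digest.hexdigest()}"
    # ...and reject it if the lockfile has never seen that hash.
    if archive_hash not in known_hashes:
        raise RuntimeError(
            f"Hash for archive {archive.name} not found in known hashes"
            f" (was: {archive_hash})"
        )
    return archive_hash
```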

you're on python3.12, which isn't supported. this is also mentioned in the guide, right

this error usually occurs with the wrong python version, but not really sure why it happens there. i do several fresh installs regularly, but the recommended distro is Ubuntu Noble.
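A quick preflight sketch of that version gate; the supported range (3.10 through 3.11) is taken from this thread rather than from any official compatibility docs:

```python
# Hypothetical helper: check the interpreter before running `poetry install`.
# Per this thread, the nightly deps want Python 3.10 or 3.11; 3.12 breaks.
import sys


def python_is_supported(version=None) -> bool:
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor <= (3, 11)


print("Python", sys.version.split()[0], "supported:", python_is_supported())
```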

aa956 commented

Possible, but unfortunately an OS reinstall is a little too drastic a change to try right now.

Regarding python versions - I've tried with the OS 3.11.2 and pyenv's 3.11.9 and 3.10.14 - errors are the same.

This issue will probably resolve itself as soon as the torch and/or pytorch-triton versions are updated during development, so there's no point investing your time here.

As a temporary workaround for anyone hitting the same issue, I've installed the offending packages using pip inside the venv:

$ pip install aiohappyeyeballs
$ pip install pytorch-triton --index-url=https://download.pytorch.org/whl/nightly/cu124

maybe pytorch republished this file. it doesn't list a last-modified time on their list:
[screenshot: PyTorch nightly wheel index, with no last-modified times shown]

aa956 commented

Most probably. The joy of using nightly builds :)

Anyway, I got everything working after manually installing the packages (inside the .venv, of course) from the pytorch-nightly source in pyproject.toml:

pip install --upgrade torch torchvision torchaudio pytorch-triton --index-url=https://download.pytorch.org/whl/nightly/cu124
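If it helps anyone, the same workaround can be kept in a small requirements file (the filename is made up; the index URL is the one used above, and pip resolves the current nightly versions at install time):

```text
# nightly-cu124.txt (hypothetical filename); install with:
#   pip install --upgrade -r nightly-cu124.txt
--index-url https://download.pytorch.org/whl/nightly/cu124
torch
torchvision
torchaudio
pytorch-triton
```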

Afterwards, train.sh ran poetry install, which downgraded some torch packages, but there were no more errors and training is running now.

Probably OK to close the issue as wontfix?
If anyone needs this kind of workaround, the issue can be found by search.

yep it's weird because no matter what i do it's not wanting to reassess the checksum 🤔 this is a mistake from the pytorch project almost certainly

ran the fix
pip install --upgrade torch torchvision torchaudio pytorch-triton --index-url=https://download.pytorch.org/whl/nightly/cu124
still getting this:

/workspace/SimpleTuner/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Everything seems to run fine until:

Write embeds to disk:   0%|                                                                            | 0/1 [00:00<?, ?it/s]

Processing prompts:   0%|                                                                              | 0/1 [00:15<?, ?it/s]

which never progresses. Could this still be an issue with the above?

config.json

maybe something in my config.json is causing this stall at Write embeds to disk: 0%| ?

you might want to make sure you have nvidia-cuda-toolkit installed and that your container image uses python 3.11 w/ CUDA 12.4
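a rough sketch of those two checks as a container preflight (illustrative, not part of SimpleTuner):

```python
# Hypothetical container preflight for the advice above: Python 3.11 plus an
# nvcc from nvidia-cuda-toolkit on PATH. Returns a list of problems found.
import shutil
import sys


def preflight() -> list[str]:
    problems = []
    if sys.version_info[:2] != (3, 11):
        problems.append(f"expected Python 3.11, found {sys.version.split()[0]}")
    if shutil.which("nvcc") is None:
        problems.append("nvcc not found; is nvidia-cuda-toolkit installed?")
    return problems


for issue in preflight():
    print("WARN:", issue)
```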

it was all a dream. bad pod. moving along. though CUDA 12.4 on the new pod did do the trick.

as we gain more performance improvements we start to rely solely on CUDA 12.4 or newer

unfortunately that's the way the world goes. CUDA 11 is >2 years old and the ancient torch images with CUDA 11.4 don't even support the Ada family (4090). but CUDA 12.4 is pretty nice, and has support for a broad range of hardware including some crappy Maxwell stuff you wouldn't even want to train models on.


i'm going to make install/nvidia-nightly follow the latest nightly build for those who desire the performance increase and revert the main branch back to 2.4.1 (soon 2.5)

to use the prev deps, use poetry -C install/nvidia-nightly install