reproducing the pre-training
Hyperparameter groups: [{'weight_decay': 0.0}]
[2024-04-25 11:09:58,048][main][INFO] - Optimizer group 0 | 10 tensors | weight_decay 0.1
[2024-04-25 11:09:58,048][main][INFO] - Optimizer group 1 | 9 tensors | weight_decay 0.0
Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
I'm getting the above error when reproducing the pre-training. What is the reason for it?
Sorry, I'm running the following command:
python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=false \
  model=caduceus \
  model.config.d_model=128 \
  model.config.n_layer=4 \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=add \
  model.config.bidirectional_weight_tie=true \
  model.config.rcps=true \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  +trainer.val_check_interval=10000 \
  wandb=null
If you need more error information, please let me know. Thanks.
Can you perhaps provide a bit more information? I am not sure I see what error you are referring to above.
The error file:
Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=1024', 'dataset.mlm=true', 'dataset.mlm_probability=0.15', 'dataset.rc_aug=false', 'model=caduceus', 'model.config.d_model=64', 'model.config.n_layer=1', 'model.config.bidirectional=true', 'model.config.bidirectional_strategy=add', 'model.config.bidirectional_weight_tie=true', 'model.config.rcps=true', 'optimizer.lr=8e-3', 'train.global_batch_size=8', 'trainer.max_steps=10000', '+trainer.val_check_interval=100', 'wandb=null']
The out file:
[2024-04-25 10:39:41,224][src.dataloaders.genomics][INFO] - HG38Using Char-level tokenizer
finish self.tokenizer
sta init_datasets
Hyperparameter groups: [{'weight_decay': 0.0}]
[2024-04-25 10:39:43,731][main][INFO] - Optimizer group 0 | 10 tensors | weight_decay 0.1
[2024-04-25 10:39:43,731][main][INFO] - Optimizer group 1 | 9 tensors | weight_decay 0.0
Sanity Checking: 0it [00:00, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
OSError: [Errno 8] Exec format error: '.conda/envs/caduceus/lib/python3.10/site-packages/triton/compiler/../third_party/cuda/bin/ptxas'
And if I use run_pretrain_caduceus.sh, this is the output and error:
config:
  _target_: caduceus.configuration_caduceus.CaduceusConfig
  d_model: 256
  n_layer: 8
  vocab_size: 12
  ssm_cfg:
    d_state: 16
    d_conv: 4
    expand: 2
    dt_rank: auto
    dt_min: 0.001
    dt_max: 0.1
    dt_init: random
    dt_scale: 1.0
    dt_init_floor: 0.0001
    conv_bias: true
    bias: false
    use_fast_path: true
  rms_norm: true
  fused_add_norm: true
  residual_in_fp32: false
  pad_vocab_size_multiple: 8
  norm_epsilon: 1.0e-05
  initializer_cfg:
    initializer_range: 0.02
    rescale_prenorm_residual: true
    n_residuals_per_layer: 1
  bidirectional: true
  bidirectional_strategy: add
  bidirectional_weight_tie: true
  rcps: true
  complement_map: null
[2024-04-24 21:34:23,794][main][WARNING] - Sleeping for 36 seconds
[2024-04-24 21:34:59,853][main][WARNING] - Sleeping for 60 seconds
[2024-04-24 21:35:59,939][main][WARNING] - Sleeping for 38 seconds
Are you torch.compile-ing the model? Do you know why the Triton benchmarking code is being triggered? I do not think I see this in my logs.
Here are the packages installed in my environment.
I have installed triton 2.1.0, but my system architecture is aarch64. I'm not sure if that's the reason.
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 51_gnu
abseil-cpp 20211102.0 h22f4aa5_0
accelerate 0.29.3 pypi_0 pypi
aiohttp 3.9.3 py310h998d150_0
aiosignal 1.2.0 pyhd3eb1b0_0
antlr-python-runtime 4.9.3 pyhd8ed1ab_1 conda-forge
anyio 4.2.0 py310hd43f75c_0
appdirs 1.4.4 pyhd3eb1b0_0
argon2-cffi 21.3.0 pyhd3eb1b0_0
argon2-cffi-bindings 21.2.0 py310h2f4d8fa_0
arrow-cpp 14.0.2 h001d45f_1
asttokens 2.0.5 pyhd3eb1b0_0
async-lru 2.0.4 py310hd43f75c_0
async-timeout 4.0.3 py310hd43f75c_0
attrs 23.1.0 py310hd43f75c_0
aws-c-auth 0.6.19 h998d150_0
aws-c-cal 0.5.20 h6ac735f_0
aws-c-common 0.8.5 h998d150_0
aws-c-compression 0.2.16 h998d150_0
aws-c-event-stream 0.2.15 h419075a_0
aws-c-http 0.6.25 h998d150_0
aws-c-io 0.13.10 h998d150_0
aws-c-mqtt 0.7.13 h998d150_0
aws-c-s3 0.1.51 h6ac735f_0
aws-c-sdkutils 0.1.6 h998d150_0
aws-checksums 0.1.13 h998d150_0
aws-crt-cpp 0.18.16 h419075a_0
aws-sdk-cpp 1.10.55 h3140d82_0
babel 2.11.0 py310hd43f75c_0
beautifulsoup4 4.12.2 py310hd43f75c_0
biopython 1.79 py310h7cee911_1 conda-forge
blas 1.0 openblas
bleach 4.1.0 pyhd3eb1b0_0
boost-cpp 1.82.0 hb8fdbf2_2
bottleneck 1.3.7 py310hf6ef57e_0
brotli 1.0.9 h998d150_7
brotli-bin 1.0.9 h998d150_7
brotli-python 1.0.9 py310h419075a_7
bzip2 1.0.8 h998d150_5
c-ares 1.19.1 h998d150_0
ca-certificates 2024.3.11 hd43f75c_0
cached-property 1.5.2 py_0
cachetools 4.2.2 pyhd3eb1b0_0
causal-conv1d 1.2.0.post2 pypi_0 pypi
certifi 2024.2.2 py310hd43f75c_0
cffi 1.16.0 py310h998d150_0
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.7 py310hd43f75c_0
colorama 0.4.6 py310hd43f75c_0
comm 0.2.1 py310hd43f75c_0
contourpy 1.2.0 py310hb8fdbf2_0
cycler 0.11.0 pyhd3eb1b0_0
datasets 2.12.0 py310hd43f75c_0 anaconda
debugpy 1.6.7 py310h419075a_0
decorator 5.1.1 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
dill 0.3.6 py310hd43f75c_0
discrete-key-value-bottleneck-pytorch 0.1.1 pypi_0 pypi
docker-pycreds 0.4.0 pyhd3eb1b0_0
einops 0.7.0 pyhd8ed1ab_1 conda-forge
einx 0.2.1 pypi_0 pypi
enformer-pytorch 0.8.8 pypi_0 pypi
exceptiongroup 1.2.0 py310hd43f75c_0
executing 0.8.3 pyhd3eb1b0_0
filelock 3.13.1 py310hd43f75c_0
flash-attn 2.5.7 pypi_0 pypi
fonttools 4.51.0 py310h998d150_0
freetype 2.12.1 h6df46f4_0
frozendict 2.4.2 pypi_0 pypi
frozenlist 1.4.0 py310h998d150_0
fsspec 2023.9.2 py310hd43f75c_0 anaconda
future 0.18.3 py310hd43f75c_0
gdown 5.1.0 pypi_0 pypi
genomic-benchmarks 0.0.9 pypi_0 pypi
gflags 2.2.2 h419075a_1
git-lfs 3.5.1 h8af1aa0_0 conda-forge
gitdb 4.0.7 pyhd3eb1b0_0
gitpython 3.1.37 py310hd43f75c_0
glog 0.5.0 h419075a_1
grpc-cpp 1.48.2 hdefc9b7_1
h11 0.14.0 py310hd43f75c_0
h5py 3.11.0 nompi_py310h7a20aa2_100 conda-forge
hdf5 1.14.3 nompi_ha486f32_100 conda-forge
httpcore 1.0.2 py310hd43f75c_0
httpx 0.26.0 py310hd43f75c_0
huggingface-hub 0.19.4 pypi_0 pypi
huggingface_hub 0.20.3 py310hd43f75c_0
hydra-core 1.3.2 pyhd8ed1ab_0 conda-forge
icu 73.1 h419075a_0
idna 3.4 py310hd43f75c_0
importlib-metadata 7.0.1 py310hd43f75c_0
importlib_metadata 7.0.1 hd3eb1b0_0
importlib_resources 6.1.1 py310hd43f75c_1
iniconfig 1.1.1 pyhd3eb1b0_0
ipdb 0.13.13 pyhd8ed1ab_0 conda-forge
ipykernel 6.28.0 py310hd43f75c_0
ipython 8.20.0 py310hd43f75c_0
jedi 0.18.1 py310hd43f75c_1
jinja2 3.1.3 py310hd43f75c_0
joblib 1.2.0 py310hd43f75c_0
jpeg 9e h998d150_1
json5 0.9.6 pyhd3eb1b0_0
jsonschema 4.19.2 py310hd43f75c_0
jsonschema-specifications 2023.7.1 py310hd43f75c_0
jupyter-lsp 2.2.0 py310hd43f75c_0
jupyter_client 8.6.0 py310hd43f75c_0
jupyter_core 5.5.0 py310hd43f75c_0
jupyter_events 0.8.0 py310hd43f75c_0
jupyter_server 2.10.0 py310hd43f75c_0
jupyter_server_terminals 0.4.4 py310hd43f75c_1
jupyterlab 4.1.6 pyhd8ed1ab_0 conda-forge
jupyterlab_pygments 0.1.2 py_0
jupyterlab_server 2.25.1 py310hd43f75c_0
kiwisolver 1.4.4 py310h419075a_0
krb5 1.20.1 h2e2fba8_1
lcms2 2.12 h5246980_0
ld_impl_linux-aarch64 2.38 h8131f2d_1
lerc 3.0 h22f4aa5_0
libaec 1.1.3 h2f0025b_0 conda-forge
libboost 1.82.0 hda0696e_2
libbrotlicommon 1.0.9 h998d150_7
libbrotlidec 1.0.9 h998d150_7
libbrotlienc 1.0.9 h998d150_7
libcurl 8.5.0 hfa2bbb0_0
libdeflate 1.17 h998d150_1
libedit 3.1.20230828 h998d150_0
libev 4.33 hfd63f10_1
libevent 2.1.12 h6ac735f_1
libffi 3.4.4 h419075a_0
libgcc-ng 13.2.0 hf8544c7_5 conda-forge
libgfortran-ng 13.2.0 he9431aa_5 conda-forge
libgfortran5 13.2.0 h582850c_5 conda-forge
libgomp 13.2.0 hf8544c7_5 conda-forge
libnghttp2 1.57.0 hb788212_0
libnsl 2.0.1 h31becfc_0 conda-forge
libopenblas 0.3.21 hc2e42e2_0
libpng 1.6.39 h998d150_0
libprotobuf 3.20.3 h94b7715_0
libsodium 1.0.18 hfd63f10_0
libsqlite 3.45.3 h194ca79_0 conda-forge
libssh2 1.10.0 h6ac735f_2
libstdcxx-ng 13.2.0 h9a76618_5 conda-forge
libthrift 0.15.0 hb2e9abc_2
libtiff 4.5.1 h419075a_0
libuuid 2.38.1 hb4cce97_0 conda-forge
libwebp-base 1.3.2 h998d150_0
libxcrypt 4.4.36 h31becfc_1 conda-forge
libzlib 1.2.13 h31becfc_5 conda-forge
lightning-utilities 0.9.0 py310hd43f75c_0
lz4-c 1.9.4 h419075a_0
mamba-ssm 1.2.0.post1 pypi_0 pypi
markdown-it-py 2.2.0 py310hd43f75c_1
markupsafe 2.1.3 py310h998d150_0
matplotlib 3.8.4 py310hbbe02a8_0 conda-forge
matplotlib-base 3.8.4 py310hfb1e5ee_0
matplotlib-inline 0.1.6 py310hd43f75c_0
mdurl 0.1.0 py310hd43f75c_0
mistune 2.0.4 py310hd43f75c_0
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.4 py310h998d150_0
multiprocess 0.70.14 py310hd43f75c_0 anaconda
nbclient 0.8.0 py310hd43f75c_0
nbconvert 7.10.0 py310hd43f75c_0
nbformat 5.9.2 py310hd43f75c_0
ncurses 6.4.20240210 h0425590_0 conda-forge
nest-asyncio 1.6.0 py310hd43f75c_0
networkx 3.3 pypi_0 pypi
ninja 1.11.1.1 pypi_0 pypi
ninja-base 1.10.2 h59a28a9_5
notebook 7.1.3 pyhd8ed1ab_0 conda-forge
notebook-shim 0.2.3 py310hd43f75c_0
numexpr 2.8.7 py310hbc6faf5_0
numpy 1.26.4 py310he45c16d_0
numpy-base 1.26.4 py310h15d264d_0
nvidia-ml-py 12.535.133 py310hd43f75c_0
nvitop 1.3.2 py310h4c7bcd0_0 conda-forge
omegaconf 2.3.0 pyhd8ed1ab_0 conda-forge
openjpeg 2.4.0 hf3eb033_0
openssl 3.2.1 h31becfc_1 conda-forge
orc 1.7.4 h7ed1058_1
overrides 7.4.0 py310hd43f75c_0
packaging 23.2 py310hd43f75c_0
pandas 2.2.2 py310hf9cab1f_0 conda-forge
pandocfilters 1.5.0 pyhd3eb1b0_0
parso 0.8.3 pyhd3eb1b0_0
pathtools 0.1.2 pyhd3eb1b0_1
patsy 0.5.3 py310hd43f75c_0
pexpect 4.8.0 pyhd3eb1b0_3
pillow 10.2.0 py310h998d150_0
pip 23.3.1 py310hd43f75c_0
platformdirs 3.10.0 py310hd43f75c_0
pluggy 1.5.0 pyhd8ed1ab_0 conda-forge
polars 0.20.22 pypi_0 pypi
portalocker 2.3.0 py310hd43f75c_1
prometheus_client 0.14.1 py310hd43f75c_0
prompt-toolkit 3.0.43 py310hd43f75c_0
prompt_toolkit 3.0.43 hd3eb1b0_0
protobuf 3.20.3 py310h419075a_0
psutil 5.9.0 py310h998d150_0
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
pyarrow 14.0.2 py310hcc88a3e_0
pycparser 2.21 pyhd3eb1b0_0
pyfaidx 0.8.1.1 pyhdfd78af_0 bioconda
pygments 2.15.1 py310hd43f75c_1
pyparsing 3.0.9 py310hd43f75c_0
pysocks 1.7.1 py310hd43f75c_0
pytest 8.1.1 pyhd8ed1ab_0 conda-forge
python 3.10.14 hbbe8eec_0_cpython conda-forge
python-dateutil 2.8.2 pyhd3eb1b0_0
python-fastjsonschema 2.16.2 py310hd43f75c_0
python-json-logger 2.0.7 py310hd43f75c_0
python-tzdata 2023.3 pyhd3eb1b0_0
python-xxhash 2.0.2 py310h998d150_1
python_abi 3.10 2_cp310 conda-forge
pytorch-lightning 1.9.0 pyhd3eb1b0_1 forklift
pytz 2023.3.post1 py310hd43f75c_0
pyvcf3 1.0.3 pyhdfd78af_0 bioconda
pyyaml 6.0.1 py310h998d150_0
pyzmq 25.1.2 py310h419075a_0
re2 2022.04.01 h22f4aa5_0
readline 8.2 h998d150_0
redis-py 5.0.4 pyhd8ed1ab_0 conda-forge
referencing 0.30.2 py310hd43f75c_0
regex 2023.10.3 py310h998d150_0
requests 2.31.0 py310hd43f75c_1
responses 0.13.3 pyhd3eb1b0_0
rfc3339-validator 0.1.4 py310hd43f75c_0
rfc3986-validator 0.1.1 py310hd43f75c_0
rich 13.7.1 pyhd8ed1ab_0 conda-forge
rpds-py 0.10.6 py310h7f3cb11_0
s2n 1.3.27 h6ac735f_0
safetensors 0.4.2 py310hdd6b545_0
scikit-learn 1.4.2 py310hc266c7b_0 conda-forge
scipy 1.12.0 py310he45c16d_0
seaborn 0.13.2 hd8ed1ab_0 conda-forge
seaborn-base 0.13.2 pyhd8ed1ab_0 conda-forge
send2trash 1.8.2 py310hd43f75c_0
sentry-sdk 1.9.0 py310hd43f75c_0
setproctitle 1.2.2 py310h2f4d8fa_0
setuptools 68.2.2 py310hd43f75c_0
six 1.16.0 pyhd3eb1b0_1
smmap 4.0.0 pyhd3eb1b0_0
snappy 1.1.10 h419075a_1
sniffio 1.3.0 py310hd43f75c_0
soupsieve 2.5 py310hd43f75c_0
sqlite 3.41.2 h998d150_0
stack_data 0.2.0 pyhd3eb1b0_0
statsmodels 0.14.0 py310hf6ef57e_0
sympy 1.12 pypi_0 pypi
termcolor 2.1.0 py310hd43f75c_0
terminado 0.17.1 py310hd43f75c_0
threadpoolctl 2.2.0 pyh0d69192_0
timm 0.9.16 pyhd8ed1ab_0 conda-forge
tinycss2 1.2.1 py310hd43f75c_0
tk 8.6.13 h194ca79_0 conda-forge
tokenizers 0.15.1 py310hb4c1b22_0
toml 0.10.2 pyhd3eb1b0_0
tomli 2.0.1 py310hd43f75c_0
torch 2.0.1+cu118 pypi_0 pypi
torchaudio 2.0.2+cu118 pypi_0 pypi
torchdata 0.5.1 pyh2db4395_0 conda-forge
torchmetrics 1.3.2 pyhd8ed1ab_0 conda-forge
torchtext 0.17.0a0+f3b7a01 pypi_0 pypi
torchvision 0.15.2+cu118 pypi_0 pypi
tornado 6.3.3 py310h998d150_0
tqdm 4.66.2 pyhd8ed1ab_0 conda-forge
traitlets 5.7.1 py310hd43f75c_0
transformers 4.39.3 pyhd8ed1ab_0 conda-forge
triton 2.1.0 pypi_0 pypi
typing-extensions 4.9.0 py310hd43f75c_1
typing_extensions 4.9.0 py310hd43f75c_1
tzdata 2024a h04d1e81_0
unicodedata2 15.1.0 py310h998d150_0
urllib3 2.1.0 py310hd43f75c_1
utf8proc 2.6.1 h998d150_1
vector-quantize-pytorch 1.14.7 pypi_0 pypi
wandb 0.13.10 pyhd3eb1b0_0 forklift
wcwidth 0.2.5 pyhd3eb1b0_0
webencodings 0.5.1 py310hd43f75c_1
websocket-client 0.58.0 py310hd43f75c_4
wheel 0.41.2 py310hd43f75c_0
xxhash 0.8.0 h2f4d8fa_3
xz 5.4.6 h998d150_0
yaml 0.2.5 hfd63f10_0
yarl 1.9.3 py310h998d150_0
zeromq 4.3.5 h419075a_0
zipp 3.17.0 py310hd43f75c_0
zlib 1.2.13 h31becfc_5 conda-forge
zstd 1.5.5 h6a09583_0
Apologies, but I am not sure what is causing your issue. Perhaps try a fresh env created using the yaml file in this repo?
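For anyone hitting the same OSError: [Errno 8] Exec format error, one possible diagnostic is to check which CPU architecture the ptxas binary named in the traceback was built for and compare it with the host. Errno 8 often means the kernel was asked to run a binary built for a different architecture, which would fit an aarch64 machine, though that is not confirmed in this thread. The sketch below is only an illustration: elf_machine and check_binary are hypothetical helper names, not part of this repo or of Triton, and the lookup table covers only a few common ELF machine codes.

```python
import platform
import struct

# Small subset of ELF e_machine codes; anything else is reported as hex.
ELF_MACHINES = {0x03: "x86", 0x28: "arm", 0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(path):
    """Return the architecture an ELF file targets, or None if it is not ELF."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA) encodes endianness: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at byte offset 18 of the ELF header.
    (code,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(code, hex(code))


def check_binary(path):
    """Return (binary architecture, host architecture, whether they match)."""
    built_for = elf_machine(path)
    host = platform.machine()
    return built_for, host, built_for == host
```

Calling check_binary with the full path to the ptxas file from the traceback returns something like ("x86_64", "aarch64", False) if the bundled binary does not match the machine. In that case a Triton build that ships or points to a ptxas for the host architecture would be needed.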