kuleshov-group/caduceus

Questions about experimental code

Closed this issue · 5 comments

Hello, I'm very interested in your model. Following your guidance for the genome benchmark, I ran into two problems: the dataloader length is 0, and the loss is infinite. I don't know if this is normal. Can you help me figure out the cause?
Q1:
RUN:

python -m train \
  experiment=hg38/genomic_benchmark \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \
  dataset.dataset_name="dummy_mouse_enhancers_ensembl" \
  dataset.train_val_split_seed=1 \
  dataset.batch_size=128 \
  dataset.rc_aug=false \
  +dataset.conjoin_train=false \
  +dataset.conjoin_test=false \
  loader.num_workers=2 \
  model=caduceus \
  model.name=dna_embedding_caduceus \
  +model.config_path="" \
  +model.conjoin_test=false \
  +decoder.conjoin_train=true \
  +decoder.conjoin_test=false \
  optimizer.lr="1e-3" \
  trainer.max_epochs=10 \
  train.pretrained_model_path="<path to .ckpt file>" \
  wandb=null
ERROR:
(error screenshot not reproduced)

Q2:
RUN:

python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=false \
  model=caduceus \
  model.config.d_model=128 \
  model.config.n_layer=4 \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=add \
  model.config.bidirectional_weight_tie=true \
  model.config.rcps=true \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  +trainer.val_check_interval=10000 \
  wandb=null
ERROR:
(error screenshot not reproduced)

Regarding Q1, this is an error I haven't hit before. Can you provide a bit more of the console output? Also, it looks like these two fields were left unfilled in the command you used to launch; they need to be filled with arguments that correspond to a pre-trained model:

+model.config_path=""
train.pretrained_model_path="<path to .ckpt file>"
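As a quick sanity check before launching, something like the following hypothetical helper (not part of the repo; names are illustrative) would fail fast if either field is empty or points at a nonexistent file:

```python
from pathlib import Path

def preflight(config_path, ckpt_path):
    """Fail fast if either pre-trained-model argument is empty or missing."""
    for name, p in (("model.config_path", config_path),
                    ("train.pretrained_model_path", ckpt_path)):
        if not p or not Path(p).is_file():
            raise FileNotFoundError(
                f"{name} must point to an existing file, got {p!r}")
```

An empty `+model.config_path=""` would be caught here before Hydra even builds the model.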

Regarding Q2, can you post the LR and training-loss curves from wandb? Did the model ever hit a NaN loss during training?
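One simple way to answer the NaN question from logged values (a minimal sketch, not part of the caduceus code) is to scan the loss history for the first non-finite entry:

```python
import math

def first_bad_step(losses):
    """Return the first step index whose loss is NaN or infinite, else None."""
    for step, loss in enumerate(losses):
        if math.isnan(loss) or math.isinf(loss):
            return step
    return None

# Example: the loss blows up at step 3.
print(first_bad_step([2.1, 1.8, 1.7, float("inf"), float("nan")]))  # -> 3
```

If the loss was finite up to some step and then diverged, the LR curve around that step is usually the place to look.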

Q1: Sorry, that was my fault; the code I uploaded had issues. Here are more error screenshots.
RUN:
python -m train \
  experiment=hg38/genomic_benchmark \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \
  dataset.dataset_name="human_nontata_promoters" \
  dataset.train_val_split_seed=2 \
  dataset.batch_size=128 \
  dataset.rc_aug=false \
  +dataset.conjoin_train=false \
  +dataset.conjoin_test=false \
  loader.num_workers=2 \
  model=caduceus \
  model.name=dna_embedding_caduceus \
  +model.config_path="/home/gyc/caduceus-main/outputs/2024-03-11/20-21-19-995417/model_config.json" \
  +model.conjoin_test=false \
  +decoder.conjoin_train=true \
  +decoder.conjoin_test=false \
  optimizer.lr="1e-3" \
  trainer.max_epochs=10 \
  train.pretrained_model_path="/home/gyc/caduceus-main/outputs/2024-03-11/20-21-19-995417/checkpoints/last.ckpt" \
  wandb=null
ERROR:
(error screenshots not reproduced)

I just tried running this and did not hit the division-by-zero error. Can you confirm that the data was properly downloaded to ./data/genomic_benchmark/human_nontata_promoters/ by the genomic_benchmarks library?

That directory should look like this:

data/genomic_benchmark/human_nontata_promoters/
├── test
│   ├── negative
│   └── positive
└── train
    ├── negative
    └── positive

These directories should contain .txt files with the sequences.
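A quick way to confirm the layout (a hypothetical helper, not part of the repo) is to count the .txt files under each of the four expected directories; an empty directory would explain a zero-length dataloader and the resulting division by zero:

```python
from pathlib import Path

def check_dataset(root):
    """Verify the train/test x negative/positive layout; count .txt files in each."""
    root = Path(root)
    counts = {}
    for split in ("train", "test"):
        for label in ("negative", "positive"):
            d = root / split / label
            if not d.is_dir():
                return None  # missing directory: dataset was not downloaded
            counts[f"{split}/{label}"] = sum(1 for _ in d.glob("*.txt"))
    return counts

# e.g. check_dataset("data/genomic_benchmark/human_nontata_promoters")
```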

Thanks for the reminder. I've successfully run your code and it works great!

Glad to hear it!