Remora model finetuning
Hello,
I am trying to finetune this model to call 6mA (dna_r10.4.1_e8.2_5khz_400bps_sup_v4.2.0_6ma_v2.pt), but regardless of how many layers I try to freeze (I have also tried with 0), I run into this error. Any idea what the issue could be?
Traceback (most recent call last):
File "./bin/remora", line 8, in <module>
sys.exit(run())
File "./lib/python3.10/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "./lib/python3.10/site-packages/remora/parsers.py", line 857, in run_model_train
train_model(
File "./lib/python3.10/site-packages/remora/train_model.py", line 180, in train_model
ckpt, model = model_util.continue_from_checkpoint(
File "./lib/python3.10/site-packages/remora/model_util.py", line 247, in continue_from_checkpoint
if ckpt["state_dict"] is None:
File "./lib/python3.10/site-packages/torch/jit/_script.py", line 862, in __getitem__
return self.forward_magic_method("__getitem__", idx)
File "./lib/python3.10/site-packages/torch/jit/_script.py", line 855, in forward_magic_method
raise NotImplementedError()
NotImplementedError
For reference, I am using remora v3.1.0, and this is the command:
remora model train \
${wd}/data/prepData/train_dataset.jsn \
--model ${wd}/data/ONT/ConvLSTM_w_ref.py \
--finetune-path ${wd}/dna_r10.4.1_e8.2_5khz_400bps_sup_v4.2.0_6ma_v2.pt \
--device 0 \
--chunk-context 50 50 \
--output-path ${wd}/data/models/train_results
The resume feature requires a checkpoint model, not a torchscript model (which is optimized for inference and lacks training state). You can recreate the checkpoint with the following snippet. I will try to add this to the core API to make this a bit simpler.
import torch
from remora import model_util

# model_path: the torchscript .pt model to convert
# out_path: where to write the resulting checkpoint
model, model_metadata = model_util._raw_load_torchscript_model(model_path)
model_metadata["epoch"] = 0
state_dict = model.state_dict()
# drop profiling counters if present; they are not model weights
state_dict.pop("total_ops", None)
state_dict.pop("total_params", None)
model_metadata["state_dict"] = state_dict
model_metadata["opt"] = None
torch.save(model_metadata, out_path)
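If it helps, the snippet can be wrapped into a small standalone script along these lines (a rough sketch; the --model/--output argument names are only illustrative):

import argparse

import torch

from remora import model_util


def main():
    parser = argparse.ArgumentParser(
        description="Convert a torchscript Remora model to a training checkpoint."
    )
    parser.add_argument("--model", required=True, help="Input torchscript .pt model")
    parser.add_argument("--output", required=True, help="Output checkpoint path")
    args = parser.parse_args()

    # Same steps as the snippet above, parameterized by command-line arguments
    model, model_metadata = model_util._raw_load_torchscript_model(args.model)
    model_metadata["epoch"] = 0
    state_dict = model.state_dict()
    state_dict.pop("total_ops", None)
    state_dict.pop("total_params", None)
    model_metadata["state_dict"] = state_dict
    model_metadata["opt"] = None
    torch.save(model_metadata, args.output)


if __name__ == "__main__":
    main()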
Thanks for clarifying!
I have managed to produce the checkpoint file to use in the remora training command. However, I am now running into a different error. Any idea where the issue is coming from?
******************** WARNING [11:35:38.198:MainProcess:MainThread:train_model.py:196]: Size mismatch between pretrained model and selected size. Using pretrained model size. ********************
Traceback (most recent call last):
File "./bin/remora", line 8, in <module>
sys.exit(run())
^^^^^
File "./lib/python3.12/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "./lib/python3.12/site-packages/remora/parsers.py", line 857, in run_model_train
train_model(
File "./lib/python3.12/site-packages/remora/train_model.py", line 202, in train_model
raise RemoraError(
remora.RemoraError: The chunk context of the pre-trained model and the dataset do not match.
Reference command line:
remora model train \
${wd}/data/prepData/train_dataset.jsn \
--model ${wd}/data/ONT/ConvLSTM_w_ref.py \
--finetune-path ${wd}/model_checkpoint.pth \
--device 0 \
--chunk-context 50 50 \
--output-path ${wd}/data/models/train_results
Training from a checkpoint file requires that the same data input size be used. Setting --chunk-context to the same value as the pre-trained model should resolve this issue. We will look into setting these parameters automatically from the pre-trained model in the future. The best way to check this value is either to load the model using the Python API, or to export the .pt file to a Dorado model using the remora model export command and view the metadata in the config.toml file produced. I will flag up adding a remora model inspect command to print out this information more easily in the future.
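For example, something along these lines should print the value from Python (a minimal sketch; the chunk_context and kmer_context_bases key names are an assumption based on the metadata fields shown elsewhere in this thread):

import torch

# Load the checkpoint produced by the conversion snippet above and print
# the input sizes recorded in its metadata. On newer torch versions you
# may need to pass weights_only=False to load non-tensor metadata.
ckpt = torch.load("model_checkpoint.pth", map_location="cpu")
print(ckpt.get("chunk_context"))
print(ckpt.get("kmer_context_bases"))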
I have double-checked: both the pre-trained model and the training dataset have the same chunk_context [100, 100]. Changing this parameter in the remora model train command does not solve the issue.
Could you post the exact command and error message here to help resolve the issue?
Here is the full set of commands I used:
remora dataset prepare \
--output-path ${wd}data/prepData/MOCK_6mA \
--refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
--refine-rough-rescale \
--motif A 0 \
--mod-base-control \
--max-chunks-per-read 20 \
--num-extract-alignment-workers 24 \
--num-extract-chunks-workers 24 \
--chunk-context 100 100 \
--kmer-context-bases 4 4 \
${wd}data/6mA_unmeth.pod5 \
${wd}data/6mA_unmeth.pass.bam
remora dataset prepare \
--output-path ${wd}data/prepData/MOD_6mA \
--refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
--refine-rough-rescale \
--motif A 0 \
--mod-base a 6mA \
--max-chunks-per-read 20 \
--num-extract-alignment-workers 24 \
--num-extract-chunks-workers 24 \
--chunk-context 100 100 \
--kmer-context-bases 4 4 \
${wd}data/7_6mA.pod5 \
${wd}data/7_6mA.pass.bam
remora dataset make_config \
${wd}data/prepData/train_dataset.jsn \
${wd}data/prepData/MOCK_6mA \
${wd}data/prepData/MOD_6mA \
--dataset-weights 1 1 \
--log-filename ${wd}data/prepData/train_dataset.log
python make_checkpoint.py \
--model dna_r10.4.1_e8.2_5khz_400bps_sup_v4.2.0_6ma_v2.pt \
--output ${wd}data/models/checkpoint_model.pth
remora model train \
${wd}data/prepData/train_dataset.jsn \
--model ${wd}data/ONT/ConvLSTM_w_ref.py \
--finetune-path ${wd}data/models/checkpoint_model.pth \
--freeze-num-layers 15 \
--device 0 \
--chunk-context 100 100 \
--output-path ${wd}data/models/train_results_freeze15 \
--kmer-context-bases 4 4
make_checkpoint.py is the snippet you provided, taking the .pt file as input and returning the model checkpoint as output.
Here is the full error message arising after the last command:
[11:30:12.563] Seed selected is 1106960644
[11:30:12.637] Loading dataset from Remora dataset config
[11:30:15.316] Dataset summary:
size : 33,433,400
modified_base_labels : True
mod_bases : ['a']
mod_long_names : ['6mA']
kmer_context_bases : (4, 4)
chunk_context : (100, 100)
motifs : [('A', 0)]
reverse_signal : False
chunk_extract_base_start : False
chunk_extract_offset : 0
sig_map_refiner : Loaded 9-mer table with 7 central position. Rough re-scaling will be executed.
[11:30:15.317] Loading model
[11:30:15.467] Model structure:
network(
(sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
(sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
(sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
(sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
(seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
(seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
(merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(lstm1): LSTM(64, 64)
(lstm2): LSTM(64, 64)
(fc): Linear(in_features=64, out_features=2, bias=True)
(dropout): Dropout(p=0.3, inplace=False)
)
******************** WARNING [11:30:15.499:MainProcess:MainThread:train_model.py:196]: Size mismatch between pretrained model and selected size. Using pretrained model size. ********************
Traceback (most recent call last):
File "./envs/remora_v3.1.0/bin/remora", line 8, in <module>
sys.exit(run())
^^^^^
File "./envs/remora_v3.1.0/lib/python3.12/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "./envs/remora_v3.1.0/lib/python3.12/site-packages/remora/parsers.py", line 857, in run_model_train
train_model(
File "./envs/remora_v3.1.0/lib/python3.12/site-packages/remora/train_model.py", line 202, in train_model
raise RemoraError(
remora.RemoraError: The chunk context of the pre-trained model and the dataset do not match.
I've made some minor changes around this logic in the latest version. Could you upgrade and report whether this is resolved?
Update using the newly released remora v3.2.0: I re-ran all the commands mentioned above, but got a very similar error:
[11:47:09.706] Seed selected is 442297807
[11:47:09.790] Loading dataset from Remora dataset config
[11:47:09.849] Dataset summary:
size : 33,432,925
modified_base_labels : True
mod_bases : ['a']
mod_long_names : ['6mA']
kmer_context_bases : (4, 4)
chunk_context : (100, 100)
motifs : [('A', 0)]
reverse_signal : False
chunk_extract_base_start : False
chunk_extract_offset : 0
pa_scaling : None
sig_map_refiner : Loaded 9-mer table with 7 central position. Rough re-scaling will be executed.
[11:47:09.850] Loading model
[11:47:09.949] Model structure:
network(
(sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
(sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
(sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
(sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
(seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
(seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
(merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(lstm1): LSTM(64, 64)
(lstm2): LSTM(64, 64)
(fc): Linear(in_features=64, out_features=2, bias=True)
(dropout): Dropout(p=0.3, inplace=False)
)
[11:47:09.949] Gradients will be clipped (by value) at 0.00 MADs above the median of the last 1000 gradient maximums.
******************** WARNING [11:47:10.002:MainProcess:MainThread:train_model.py:289]: Size mismatch between pretrained model and selected size. Using pretrained model size. ********************
Traceback (most recent call last):
File "./envs/remora_v3.2.0/bin/remora", line 8, in <module>
sys.exit(run())
^^^^^
File "./envs/remora_v3.2.0/lib/python3.12/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "./envs/remora_v3.2.0/lib/python3.12/site-packages/remora/parsers.py", line 1008, in run_model_train
train_model(
File "./envs/remora_v3.2.0/lib/python3.12/site-packages/remora/train_model.py", line 295, in train_model
raise RemoraError(
remora.RemoraError: The chunk context of the pre-trained model and the dataset do not match.
This was simply an issue of the values being lists and not tuples. I have fixed this in a script added to the repository under scripts/convert_pt_to_checkpoint.py. This should resolve the issue, but please reopen this issue if you have any further questions.
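For context, a list never compares equal to a tuple in Python ([100, 100] == (100, 100) is False), so the checkpoint's list-valued chunk context failed the match against the dataset's tuple even though the numbers agreed. Roughly, the fix amounts to a coercion like this (a sketch; the exact key names are assumptions):

def coerce_metadata_tuples(model_metadata):
    # Metadata round-tripped through the conversion snippet may hold lists;
    # coerce to tuples so equality checks against the dataset's tuple-valued
    # metadata pass.
    for key in ("chunk_context", "kmer_context_bases"):
        if isinstance(model_metadata.get(key), list):
            model_metadata[key] = tuple(model_metadata[key])
    return model_metadata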