KeyError of `embedding_dict` during training
Yunkun-Zhang opened this issue · 1 comments
Thank you for sharing the code of your marvelous work!
I followed the instructions in ./dataset/README.md, downloaded and unzipped the PrimeKG data, and processed it with the scripts in ./dataset. When I try to run the training script on a single GPU, I get the following exception:
Traceback (most recent call last):
File ".../BioBridge/train_bind.py", line 190, in <module>
fire.Fire(main)
File ".../python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File ".../python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File ".../python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File ".../BioBridge/train_bind.py", line 182, in main
trainer.train()
File ".../python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File ".../python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File ".../python3.10/site-packages/accelerate/data_loader.py", line 451, in __iter__
current_batch = next(dataloader_iter)
File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File ".../python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File ".../python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File ".../python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File ".../python3.10/site-packages/transformers/trainer_utils.py", line 772, in __call__
return self.data_collator(features)
File ".../BioBridge/src/collator.py", line 18, in __call__
tail_emb = torch.tensor(embedding_dict[row["y_index"]])
KeyError: 14736
It seems something went wrong during data processing?
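Before opening an issue like this, it can help to check which node indices in the training triplets are missing from the embedding dictionary. A minimal sketch (the helper name and the file paths in the commented usage are assumptions, not part of the BioBridge codebase):

```python
import pickle


def find_missing_indices(node_indices, embedding_dict):
    """Return sorted node indices that have no entry in embedding_dict.

    These are exactly the indices that would raise a KeyError in the
    data collator's embedding lookup.
    """
    return sorted(set(node_indices) - set(embedding_dict))


# Hypothetical usage against the processed dataset files:
# with open("./dataset/embedding_dict.pkl", "rb") as f:
#     embedding_dict = pickle.load(f)
# missing = find_missing_indices(train_df["y_index"], embedding_dict)
# print(missing)  # e.g. would include 14736 in this case
```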
Hi Yunkun, thanks for your question.
This is because the UniMol model fails to encode the SMILES strings of the following drugs:
#node_index: 14186, 14736, 14737, 19929, 19948, 20103, 20271, 20832
#drugbank id: 'DB00515', 'DB00526', 'DB00958', 'DB08276', 'DB01929', 'DB04156', 'DB04100', 'DB13145'
I have updated the dataset processing pipeline so that these drugs are dropped from the training data. Alternatively, you can try to encode these drugs yourself with UniMol (https://github.com/dptech-corp/Uni-Mol/tree/main/unimol_tools) and add their representations to embedding_dict.pkl.