RyanWangZf/BioBridge

KeyError of `embedding_dict` during training

Yunkun-Zhang opened this issue · 1 comment

Thank you for sharing the code of your marvelous work!

I followed the instructions in ./dataset/README.md, downloaded and unzipped the PrimeKG data, and processed it with the scripts in ./dataset. When I try to run the training script on a single GPU, I get the following exception:

Traceback (most recent call last):
  File ".../BioBridge/train_bind.py", line 190, in <module>
    fire.Fire(main)
  File ".../python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File ".../python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File ".../BioBridge/train_bind.py", line 182, in main
    trainer.train()
  File ".../python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File ".../python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File ".../python3.10/site-packages/accelerate/data_loader.py", line 451, in __iter__
    current_batch = next(dataloader_iter)
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File ".../python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".../python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File ".../python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File ".../python3.10/site-packages/transformers/trainer_utils.py", line 772, in __call__
    return self.data_collator(features)
  File ".../BioBridge/src/collator.py", line 18, in __call__
    tail_emb = torch.tensor(embedding_dict[row["y_index"]])
KeyError: 14736

Is there something wrong with the data processing?
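In case it helps, here is a minimal sketch to list every node index the training triples reference but `embedding_dict.pkl` lacks. The triple file name and the `x_index` column are assumptions; `y_index` and `embedding_dict` come from the collator line in the traceback.

```python
# Sketch: find node indices that have no precomputed embedding.
# "train_triples.csv" and "x_index" are assumptions about the processed data;
# "y_index" and embedding_dict.pkl come from the traceback above.
import pickle

import pandas as pd

with open("embedding_dict.pkl", "rb") as f:
    embedding_dict = pickle.load(f)

triples = pd.read_csv("train_triples.csv")
referenced = set(triples["x_index"]) | set(triples["y_index"])
missing = sorted(referenced - set(embedding_dict.keys()))
print(f"{len(missing)} node indices have no embedding: {missing}")
```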

Hi Yunkun, thanks for your question.

This is because the UniMol model fails to encode the SMILES strings of the following drugs:

node_index: 14186, 14736, 14737, 19929, 19948, 20103, 20271, 20832
drugbank_id: 'DB00515', 'DB00526', 'DB00958', 'DB08276', 'DB01929', 'DB04156', 'DB04100', 'DB13145'

I have updated the dataset processing pipeline so these drugs are dropped from the training data. Alternatively, you can try to encode these drugs yourself with UniMol (https://github.com/dptech-corp/Uni-Mol/tree/main/unimol_tools) and add their representations to embedding_dict.pkl; a rough sketch follows.
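If you want to try the second route, here is a rough sketch. The `UniMolRepr` / `get_repr` usage follows the unimol_tools README, but the exact arguments and return keys may differ across versions, and the SMILES placeholders must be filled in from DrugBank/PrimeKG.

```python
# Sketch: re-encode the failing drugs with UniMol and patch embedding_dict.pkl.
# unimol_tools usage per its README; verify against your installed version.
import pickle

import numpy as np
from unimol_tools import UniMolRepr

# node_index -> SMILES for the eight drugs listed above (placeholders to fill in)
missing_smiles = {
    14186: "...",  # DB00515
    14736: "...",  # DB00526
    # ... remaining six drugs
}

encoder = UniMolRepr(data_type="molecule")
reprs = encoder.get_repr(list(missing_smiles.values()))

with open("embedding_dict.pkl", "rb") as f:
    embedding_dict = pickle.load(f)

# Use the CLS-token representation as the node embedding (assumption; check
# how the other drug nodes in embedding_dict.pkl were encoded).
for node_index, cls_repr in zip(missing_smiles, reprs["cls_repr"]):
    embedding_dict[node_index] = np.asarray(cls_repr)

with open("embedding_dict.pkl", "wb") as f:
    pickle.dump(embedding_dict, f)
```

Note that UniMol may still fail on these particular SMILES strings (that is why they were dropped in the first place); if it does, using the updated pipeline that removes them is the simpler fix.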