RyanWangZf/BioBridge

KeyError of `embedding_dict` during training

Yunkun-Zhang opened this issue · 1 comment

Thank you for sharing the code of your marvelous work!

I followed the instructions in ./dataset/README.md, downloaded and unzipped the PrimeKG data, and processed it with the scripts in ./dataset. When I try to run the training script on a single GPU, I get the following exception:

Traceback (most recent call last):
  File ".../BioBridge/train_bind.py", line 190, in <module>
    fire.Fire(main)
  File ".../python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File ".../python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File ".../BioBridge/train_bind.py", line 182, in main
    trainer.train()
  File ".../python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File ".../python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File ".../python3.10/site-packages/accelerate/data_loader.py", line 451, in __iter__
    current_batch = next(dataloader_iter)
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File ".../python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File ".../python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".../python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File ".../python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File ".../python3.10/site-packages/transformers/trainer_utils.py", line 772, in __call__
    return self.data_collator(features)
  File ".../BioBridge/src/collator.py", line 18, in __call__
    tail_emb = torch.tensor(embedding_dict[row["y_index"]])
KeyError: 14736

Is there something wrong with the data processing?
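In case it helps, here is a minimal sketch to list every node index the training triples reference but `embedding_dict.pkl` lacks. The triple file name and the `x_index` column are assumptions; `y_index` and `embedding_dict` come from the collator line in the traceback.

```python
# Sketch: find node indices that have no precomputed embedding.
# "train_triples.csv" and "x_index" are assumptions about the processed data;
# "y_index" and embedding_dict.pkl come from the traceback above.
import pickle

import pandas as pd

with open("embedding_dict.pkl", "rb") as f:
    embedding_dict = pickle.load(f)

triples = pd.read_csv("train_triples.csv")
referenced = set(triples["x_index"]) | set(triples["y_index"])
missing = sorted(referenced - set(embedding_dict.keys()))
print(f"{len(missing)} node indices have no embedding: {missing}")
```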

Hi Yunkun, thanks for your question.

This is because the UniMol model fails to encode the SMILES strings of the following drugs:

node_index: 14186, 14736, 14737, 19929, 19948, 20103, 20271, 20832
drugbank_id: 'DB00515', 'DB00526', 'DB00958', 'DB08276', 'DB01929', 'DB04156', 'DB04100', 'DB13145'

I have updated the dataset processing pipeline so these drugs are dropped from the training data. Alternatively, you can try to encode these drugs yourself with UniMol (https://github.com/dptech-corp/Uni-Mol/tree/main/unimol_tools) and add their representations to embedding_dict.pkl; a rough sketch follows.
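If you want to try the second route, here is a rough sketch. The `UniMolRepr` / `get_repr` usage follows the unimol_tools README, but the exact arguments and return keys may differ across versions, and the SMILES placeholders must be filled in from DrugBank/PrimeKG.

```python
# Sketch: re-encode the failing drugs with UniMol and patch embedding_dict.pkl.
# unimol_tools usage per its README; verify against your installed version.
import pickle

import numpy as np
from unimol_tools import UniMolRepr

# node_index -> SMILES for the eight drugs listed above (placeholders to fill in)
missing_smiles = {
    14186: "...",  # DB00515
    14736: "...",  # DB00526
    # ... remaining six drugs
}

encoder = UniMolRepr(data_type="molecule")
reprs = encoder.get_repr(list(missing_smiles.values()))

with open("embedding_dict.pkl", "rb") as f:
    embedding_dict = pickle.load(f)

# Use the CLS-token representation as the node embedding (assumption; check
# how the other drug nodes in embedding_dict.pkl were encoded).
for node_index, cls_repr in zip(missing_smiles, reprs["cls_repr"]):
    embedding_dict[node_index] = np.asarray(cls_repr)

with open("embedding_dict.pkl", "wb") as f:
    pickle.dump(embedding_dict, f)
```

Note that UniMol may still fail on these particular SMILES strings (that is why they were dropped in the first place); if it does, using the updated pipeline that removes them is the simpler fix.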