HFAiLab/alphafold-optimized

dataset issue

maowayne123 opened this issue · 1 comments

Thanks for your great work, but I ran into a bug.

I used the dataset (ffdataset format) provided by hfai.

I obtained the (pdb_code, mmcif_string, bfd_hits, mgnify_hits, pdb70_hits, uniref90_hits) fields from the ffrecord dataloader. However, for the sample '2ljb', the content of the ffrecord file differs from that of the original OpenFold dataset. Specifically, the ffrecord dataset contains only chains A and B, whereas the original dataset contains four chains: A, B, C, and D.
As a result, when I start training, I get the following error:
Traceback (most recent call last):
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/wayne/project/folding/alphafold-optimized/train_fold.py", line 100, in main
for idx, batch in enumerate(dataloader):
File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 485, in _batch_prop_gen
for batch in iterator:
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/ffrecord-1.4.0+0cebd18-py3.8-linux-x86_64.egg/ffrecord/torch/dataloader.py", line 155, in fetch
data = self.dataset[indexes]
File "/home/wayne/project/folding/alphafold-optimized/hfaidataset.py", line 58, in __getitem__
mmcifdata = self.transform(*mmcifdata)
File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 151, in __call__
data = self._parse_mmcif(
File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 130, in _parse_mmcif
data = self.data_pipeline.process_mmcif_hfai(
File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_pipeline.py", line 820, in process_mmcif_hfai
mmcif_feats = make_mmcif_features(mmcif, chain_id)
File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_pipeline.py", line 97, in make_mmcif_features
input_sequence = mmcif_object.chain_to_seqres[chain_id]
KeyError: 'D'

Because the file ID is 2ljb_D, the transform function tries to index chain D in 2ljb.cif, but the CIF file has no chain D. I checked the original dataset from OpenFold, and its mmCIF file does contain a chain D, so I think something is wrong with the ffrecord dataset. How can I fix this? Thank you very much.
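To find every affected sample rather than only 2ljb_D, a check along the following lines might help. This is only a sketch: `split_file_id` and `find_missing_chains` are hypothetical helpers, not part of openfold or ffrecord, and the `chains_by_pdb` dictionary stands in for whatever `chain_to_seqres` mapping the parsed mmCIF objects expose. The sequences are placeholders.

```python
from typing import Dict, Iterable, List, Tuple


def split_file_id(file_id: str) -> Tuple[str, str]:
    """Split an ID such as '2ljb_D' into ('2ljb', 'D')."""
    pdb_code, _, chain_id = file_id.rpartition("_")
    if not pdb_code or not chain_id:
        raise ValueError(f"expected '<pdb_code>_<chain_id>', got {file_id!r}")
    return pdb_code, chain_id


def find_missing_chains(
    file_ids: Iterable[str],
    chains_by_pdb: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return the file IDs whose chain is absent from the parsed structure.

    chains_by_pdb maps a PDB code to a chain_to_seqres-like dict
    (chain ID -> sequence). This layout is an assumption for illustration.
    """
    missing = []
    for file_id in file_ids:
        pdb_code, chain_id = split_file_id(file_id)
        if chain_id not in chains_by_pdb.get(pdb_code, {}):
            missing.append(file_id)
    return missing


# Toy data mirroring this report: the ffrecord copy of 2ljb has only A and B.
# "SEQ_A" and "SEQ_B" are placeholder strings, not real sequences.
chains = {"2ljb": {"A": "SEQ_A", "B": "SEQ_B"}}
print(find_missing_chains(["2ljb_A", "2ljb_D"], chains))  # ['2ljb_D']
```

Running such a check over the whole index before training would show whether 2ljb is an isolated case or whether many entries were written with missing chains.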

BTW, I found that a cache, 'cachePath = f"../full_dataset/mmcif_parse_cache/{file_id}.pkl"', has been added in the hfai implementation. I am worried that it will cause heavy I/O and defeat the purpose of the ffrecord dataset. Am I wrong? Thanks!
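To make the concern concrete, here is a minimal sketch of how I assume a per-sample pickle cache of this kind behaves. I have not verified this against the actual hfai code; `load_with_disk_cache`, `parse_fn`, and `raw` are hypothetical names, and `str.upper` is just a stand-in for the real mmCIF parser.

```python
import os
import pickle
import tempfile


def load_with_disk_cache(file_id, cache_dir, parse_fn, raw):
    """Return (parsed, status) using one small .pkl file per sample.

    A hit costs one extra file open and read; a miss costs a parse plus
    one file write. Both happen outside the ffrecord file.
    """
    cache_path = os.path.join(cache_dir, f"{file_id}.pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), "hit"
    parsed = parse_fn(raw)
    with open(cache_path, "wb") as f:
        pickle.dump(parsed, f)
    return parsed, "miss"


with tempfile.TemporaryDirectory() as cache_dir:
    # str.upper stands in for an expensive parse step.
    first = load_with_disk_cache("2ljb_D", cache_dir, str.upper, "data_2ljb")
    second = load_with_disk_cache("2ljb_D", cache_dir, str.upper, "data_2ljb")
    print(first[1], second[1])  # miss hit
```

If the real cache works this way, every sample triggers a separate small-file access in addition to the ffrecord read, which is the access pattern I understood ffrecord was meant to avoid.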

The random seed I used is 4242022.