instructions for generating vocab.pkl?
Opened this issue · 15 comments
Hi. Would it be possible for the authors to either upload vocab.pkl (for the pretrained model), or give instructions and code about how to generate the vocab.pkl file from the CHEMBL24 dataset (or any other dataset used)? Thanks
Hi. I'm sorry but I lost access to the resources.
Does this issue help?
#11
Well, I meant to mention this comment. I'm glad if it helps.
#11 (comment)
Hi Shion,
I have tried to generate the vocab.pkl from ChEMBL 24. Using default parameters in the build_vocab.py file, I get a vocabulary size of 75.
If I am not mistaken, this is not compatible with the pretrained model provided:
size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([75]).
Thanks!
M
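For readers hitting the same error, here is a minimal sketch of checking a vocabulary size against the checkpoint before loading it. The helper name `find_vocab_mismatches` is hypothetical and not part of the repository. The shapes are copied from the error message above. The comment about reading shapes with PyTorch assumes the checkpoint is a plain state dict.

```python
def find_vocab_mismatches(checkpoint_shapes, vocab_size,
                          keys=("embed.weight", "out.weight", "out.bias")):
    """Return {param_name: (checkpoint_rows, vocab_size)} for every mismatch.

    The first dimension of each listed parameter must equal the vocabulary size.
    """
    mismatches = {}
    for key in keys:
        shape = checkpoint_shapes.get(key)
        if shape is not None and shape[0] != vocab_size:
            mismatches[key] = (shape[0], vocab_size)
    return mismatches


# Shapes taken from the error message in this comment.
# With PyTorch they could be read as (assuming a plain state dict):
#   {k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu").items()}
checkpoint_shapes = {
    "embed.weight": (45, 256),
    "out.weight": (45, 256),
    "out.bias": (45,),
}

print(find_vocab_mismatches(checkpoint_shapes, 75))  # vocab built with default parameters
print(find_vocab_mismatches(checkpoint_shapes, 45))  # vocab matching the checkpoint
```

An empty result means the vocabulary has the size the pretrained weights expect. Any other result lists the parameters that would fail to load.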
Thanks for reporting.
That's strange. Then I might have used different parameters... I'm sorry that it's not set properly.
Does it help?
#19
Hi..! Unfortunately the vocab.pkl file from #19 does not help either...
size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([50]).
I was able to reproduce the vocab.pkl with the following steps:
- Download the ChEMBL 24 data from chembl_24_1, in the file named chembl_24_1_chemreps.txt.gz. This is the same data the author mentioned in this issue.
- Then open the 01_data_prepare.ipynb file and start running from the following cell.
- After obtaining the csv file, run the build_corpus.py file. I only changed the file reading location and the pandas dataframe column used to obtain the SMILES. Running this file will take some time.
- After obtaining data/chembl24_corpus.txt from the step above, run the build_vocab.py file.
- Now this vocab will have len(vocab)==45. I am attaching the obtained result below.
PS: don't forget to change n_layers from 3 to 4: trfm = TrfmSeq2seq(len(vocab), 256, len(vocab), 4)
Thanks
Regards,
Dinabandhu
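The different sizes reported in this thread (75, 50, 45) suggest that the vocabulary depends on how the corpus was prepared. The sketch below illustrates that effect on a toy corpus of space-separated tokens. It is not the repository's build_vocab.py. The special tokens, the `build_vocab` function and the `min_freq` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical special tokens; the real vocabulary may use different ones.
SPECIALS = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"]


def build_vocab(lines, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIALS + kept)}


# Toy corpus: one tokenized SMILES string per line.
corpus = ["C C O", "C C ( = O ) O", "c 1 c c c c c 1 Br"]

vocab_all = build_vocab(corpus)
vocab_frequent = build_vocab(corpus, min_freq=2)
print(len(vocab_all), len(vocab_frequent))
```

Filtering the molecules or changing the frequency threshold changes the number of tokens, and therefore the first dimension of embed.weight, out.weight and out.bias.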
@dinabandhu50
Thank you so much!!
copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the
Have you solved this mismatch problem?
It indeed solves the mismatch problem
Where could I find 01_data_prepare.ipynb? Thanks.
I found the data_prepare.ipynb, but I still have a problem at the step of running build_corpus.py. At first it said I don't have the utils module, so I installed it with pip install utils. However, when I ran it again, it showed the error "cannot import name 'split' from 'utils'". I am using Python 3 to run this command. Do you have any suggestions? Thanks.
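One possible explanation, not confirmed in this thread, is that build_corpus.py expects a utils.py that ships with the repository, and that the package installed by pip install utils is unrelated and has no split function. Under that assumption, the sketch below shows how to make sure a local utils.py is the one that gets imported. The helper `import_local_utils` is hypothetical. The utils.py written here is only a stand-in for the real file.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_local_utils(repo_dir):
    """Import `utils` from repo_dir, ahead of any installed package with that name."""
    repo_dir = str(Path(repo_dir).resolve())
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)   # search this directory first
    sys.modules.pop("utils", None)     # forget a previously imported `utils`
    importlib.invalidate_caches()
    return importlib.import_module("utils")


# Demo with a stand-in utils.py. The real file is assumed to live in the repository.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "utils.py").write_text("def split(sm):\n    return ' '.join(sm)\n")
    utils = import_local_utils(tmp)
    demo_tokens = utils.split("CCO")

print(demo_tokens)
```

In practice, running the script from the directory that contains utils.py usually has the same effect, because Python puts the script's directory at the front of the module search path.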
Where did you get the "01_data_prepare.ipynb"?
I think that is 'prepare_data.ipynb' in the 'experiments' folder.
I was able to generate the file from that notebook.