allenai/specter

Create preprocessed training files: metadata.json is missing ids in the train.txt, test.txt and val.txt

shauryr opened this issue · 6 comments

When I run the following -

python specter/data_utils/create_training_files.py \
--data-dir data/training \
--metadata data/training/metadata.json \
--outdir data/preprocessed/

The output is "done getting triplets, success rate:0.00%"

and my data-metrics.json looks like -

{
  "train": 0,
  "val": 0,
  "test": 0
}

I debugged the code and found that at line
there is a KeyError when self.metadata is accessed.
It looks like the ids in train.txt, val.txt and test.txt are not in the metadata.json file.
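
For anyone who wants to confirm this on their own copy of the data, here is a minimal sketch of the check. It assumes that metadata.json is a JSON object keyed by paper id and that each split file lists one paper id per line. Neither assumption is confirmed in this thread, and the helper names are hypothetical (they are not part of specter).

```python
import json
from pathlib import Path


def read_ids(path):
    """Return the non-empty, whitespace-stripped lines of a split file."""
    # Assumption: one paper id per line.
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def missing_ids(split_ids, metadata):
    """Return the ids from split_ids that are not keys of metadata, in order."""
    # Assumption: metadata is a dict keyed by paper id.
    return [pid for pid in split_ids if pid not in metadata]


def report(data_dir, metadata_path):
    """Map each split name to the list of its ids that metadata.json lacks."""
    metadata = json.loads(Path(metadata_path).read_text())
    return {
        split: missing_ids(read_ids(Path(data_dir) / f"{split}.txt"), metadata)
        for split in ("train", "val", "test")
    }
```

Calling `report("data/training", "data/training/metadata.json")` with the paths from the command above would list, per split, the ids that trigger the KeyError.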

Please help and share the correct metadata.json file.

I got the same problem.
It seems that metadata.json requires 'paper_id' in addition to 'title' and 'abstract'.
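
A quick way to test this guess is to scan the metadata for entries that lack any of the three fields. This is only a sketch: it assumes metadata.json loads into a dict of dicts keyed by paper id, and `entries_missing_fields` is a hypothetical helper, not a specter function.

```python
REQUIRED_FIELDS = ("paper_id", "title", "abstract")


def entries_missing_fields(metadata, required=REQUIRED_FIELDS):
    """Map each paper id to the required fields its metadata entry lacks."""
    problems = {}
    for pid, entry in metadata.items():
        # Collect the required fields that are absent from this entry.
        absent = [field for field in required if field not in entry]
        if absent:
            problems[pid] = absent
    return problems
```

An empty result would mean every entry carries all three fields, in which case the failure is more likely caused by mismatched ids than by missing fields.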

The sample metadata file was updated and this should be fixed now. Let us know if you still have issues.

I still have the same problem. Apparently, most paper_ids do not match. For example:

2020-10-27 11:38:16,851,851 ERROR [create_training_files.py:358] '1a090df137014acab572aa5dc23449b270db64b4'
2020-10-27 11:38:16,852,852 INFO [create_training_files.py:362] done getting triplets, success rate:0.00%,total: 15

@armancohan any updates here?

yrrah commented

The data.json contains many ids that don't exist in metadata.json.
I made up a new data.json that works:
data.txt
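
For readers who would rather regenerate such a file than download it, below is a rough sketch of one way to drop the unknown ids. It is not necessarily how the attached file was produced. It assumes data.json is a nested JSON object of the form {query_id: {candidate_id: {...}}} and that metadata.json is keyed by paper id. Both are assumptions, and `filter_to_known_ids` is a hypothetical name.

```python
def filter_to_known_ids(data, metadata):
    """Keep only queries and candidates whose ids are keys of metadata."""
    filtered = {}
    for query_id, candidates in data.items():
        if query_id not in metadata:
            continue  # the query paper itself has no metadata entry
        kept = {cid: info for cid, info in candidates.items() if cid in metadata}
        if kept:  # skip queries left with no usable candidates
            filtered[query_id] = kept
    return filtered
```

The filtered dict can then be written back with `json.dump` and used in place of the original data.json. Note that the split files may still list ids that were dropped here.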

@yrrah thanks for the solution. It works for me!