allenai/specter

Training data and code to reproduce model training

urmeya opened this issue · 3 comments

Can you please release the training data and code to reproduce model training?
What is the expected timeline?

We've already released the pretrained model weights and instructions on how to use them (please see the README.md).

But if you are interested in training the model from scratch, here are some instructions:

./scripts/run-exp-simple.sh -c experiment_configs/simple.jsonnet \
-s [output-dir] --num-epochs [num-epochs] --batch-size [batch-size] \
--train-path [path-to-train.pkl] --dev-path [path-to-dev.pkl] \
--cuda-device 0 

We will release instructions for using the model on custom data in the future. In the meantime, if you want to use our pickled training data, you can download it here (the files are relatively large):
train: link [15.6G]
validation: link [3.5G]

@armancohan Do these files also contain paper abstracts? I unpickled one instance, and these were the fields:

{'source_title': <allennlp.data.fields.text_field.TextField object at 0x7f991f313490>,
 'pos_title': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d73650>,
 'neg_title': <allennlp.data.fields.text_field.TextField object at 0x7f98b3da0890>,
 'source_venue': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d2cf50>,
 'pos_venue': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d321d0>,
 'neg_venue': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d324d0>,
 'source_paper_id': <allennlp.data.fields.metadata_field.MetadataField object at 0x7f98b3d32810>,
 'pos_paper_id': <allennlp.data.fields.metadata_field.MetadataField object at 0x7f98b3d328d0>,
 'neg_paper_id': <allennlp.data.fields.metadata_field.MetadataField object at 0x7f98b3d32950>,
 'source_authors': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d329d0>,
 'source_author_positions': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d32c10>,
 'pos_authors': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d32e50>,
 'pos_author_positions': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d36050>,
 'neg_authors': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d36250>,
 'neg_author_positions': <allennlp.data.fields.text_field.TextField object at 0x7f98b3d36410>,
 'data_source': <allennlp.data.fields.metadata_field.MetadataField object at 0x7f98b3d36610>}
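
For anyone who wants to repeat this check, here is a minimal sketch of one way to list the field names per record and see whether any of them mention abstracts. It assumes the .pkl file holds back-to-back pickled records, which this thread does not confirm, and the helper names (`iter_pickled`, `group_field_names`, `has_abstract`) are hypothetical, not part of the repo. The demo uses plain dicts as stand-ins; loading the real files would also need a compatible allennlp install, and pickles should only be loaded from a source you trust.

```python
import io
import pickle
from collections import defaultdict


def iter_pickled(stream):
    """Yield successive objects from a binary stream of back-to-back pickles.

    Assumption: the file was written with repeated pickle dumps.
    """
    unpickler = pickle.Unpickler(stream)
    while True:
        try:
            yield unpickler.load()
        except EOFError:
            # No more pickled objects in the stream.
            return


def group_field_names(names):
    """Group field names by their role prefix: source, pos, neg, or other."""
    groups = defaultdict(list)
    for name in names:
        role, _, rest = name.partition("_")
        if role in ("source", "pos", "neg"):
            groups[role].append(rest)
        else:
            groups["other"].append(name)
    return dict(groups)


def has_abstract(names):
    """Return True if any field name mentions 'abstract'."""
    return any("abstract" in name for name in names)


if __name__ == "__main__":
    # Stand-in records with a few of the field names reported above;
    # the values are placeholders, not real allennlp fields.
    buffer = io.BytesIO()
    for _ in range(2):
        record = {"source_title": None, "pos_title": None,
                  "neg_title": None, "data_source": None}
        pickle.dump(record, buffer)
    buffer.seek(0)

    for record in iter_pickled(buffer):
        names = list(record)
        print(group_field_names(names), "abstract present:", has_abstract(names))
```

With the real data, the in-memory buffer would be replaced by a file opened in binary mode, and the field names would come from whatever mapping each unpickled record exposes.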

@armancohan Could you provide the dataset, including data.json and metadata.json, as described in the README?