This repo contains the code and data used for the analysis in *Large Language Models Struggle to Learn Long-Tail Knowledge*.
As part of this project, we ran the DBpedia Spotlight entity linker on large-scale LM pre-training datasets such as The Pile, C4, ROOTS, OpenWebText, and Wikipedia. The entity linking data is hosted on the Hugging Face Hub. To get this data, you can download it from the web UI, download it with Python via the `huggingface_hub` library (see the sketch below), or clone the Hugging Face repo with

```
git clone https://huggingface.co/datasets/nkandpa2/pretraining_entities
```

Note that `git lfs` must be installed for the clone to actually download the data.
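For the Python route, a minimal sketch using the `huggingface_hub` package (install it first, e.g. `pip install huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download all files in the dataset repo to a local cache directory;
# snapshot_download returns the path of the downloaded snapshot.
local_dir = snapshot_download(
    repo_id="nkandpa2/pretraining_entities",
    repo_type="dataset",
)
print(local_dir)
```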
There are five files:

- `the_pile_entity_map.npz`
- `c4_entity_map.npz`
- `roots_entity_map.npz`
- `openwebtext_entity_map.npz`
- `wikipedia_entity_map.npz`
Each file can be loaded with `numpy.load` and acts as a dictionary mapping DBpedia URI strings to NumPy arrays of pre-training dataset indices where that entity occurs.
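A minimal loading sketch (the commented-out URI is hypothetical; use `.files` to see the keys actually present):

```python
import numpy as np

# Load the entity map for The Pile; np.load returns a dict-like NpzFile.
entity_map = np.load("the_pile_entity_map.npz")

# Peek at a few of the DBpedia URI keys stored in the file.
print(entity_map.files[:5])

# Hypothetical lookup: indices of pre-training documents mentioning an entity.
# Replace the URI with one that actually appears in entity_map.files.
# indices = entity_map["http://dbpedia.org/resource/Barack_Obama"]
```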
To analyze the effect of pre-training entities on QA performance, we entity link Natural Questions and TriviaQA. The QA entity linking data is also hosted on the Hugging Face Hub. As before, you can download it from the web UI, fetch it with the `huggingface_hub` library, or clone the Hugging Face repo with

```
git clone https://huggingface.co/datasets/nkandpa2/qa_entities
```

Again, `git lfs` must be installed to actually download the data.
There are four files:

- `nq_train_entities.jsonl`
- `nq_validation_entities.jsonl`
- `trivia_qa_unfiltered.nocontext_train_entities.jsonl`
- `trivia_qa_unfiltered.nocontext_validation_entities.jsonl`
The Natural Questions files follow the example order of the `nq_open` dataset, and the TriviaQA files follow the example order of the `unfiltered.nocontext` configuration of the `trivia_qa` dataset, so line *i* of each file corresponds to example *i* of the matching split.
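For instance, a hedged sketch of pairing the entity links with the underlying questions, assuming the `datasets` package is installed and `nq_train_entities.jsonl` is in the working directory:

```python
import json
from datasets import load_dataset

# Line i of nq_train_entities.jsonl corresponds to example i of nq_open's train split.
nq_train = load_dataset("nq_open", split="train")

with open("nq_train_entities.jsonl") as f:
    for example, line in zip(nq_train, f):
        entities = json.loads(line)
        print(example["question"], entities["q_entities"])
        break  # just inspect the first example
```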
Each of these is a JSON Lines file, where each line is a dictionary with the following structure:

```
{
  'q_entities': <list of entities found in the question>,
  'a_entities': <list of entities found in the answer aliases>
}
```
The entities in the `q_entities` and `a_entities` lists are themselves dictionaries with the structure:
```
{
  'URI': <DBpedia entity URI>,
  'support': <DBpedia support>,
  'types': <DBpedia types>,
  'surfaceForm': <string in Q or A that was linked to this entity>,
  'offset': <location in Q or A where the surface form is found>,
  'similarityScore': <DBpedia similarity score>,
  'percentageOfSecondRank': <DBpedia percentage of second-ranked entity>
}
```
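A small sketch of reading one of these files and inspecting the per-entity fields (the file name is taken from the list above):

```python
import json

# Walk the TriviaQA validation entity links and print each linked entity.
with open("trivia_qa_unfiltered.nocontext_validation_entities.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for ent in record["q_entities"] + record["a_entities"]:
            print(ent["URI"], ent["surfaceForm"], ent["similarityScore"])
        break  # first record only
```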
- The code and instructions for running the entity linker can be found in `entity_linking/`.
- The code and instructions for running the relevant document counting heuristic can be found in `relevant_docs/`.