Is there any notion of relevance or rank?
JohnGiorgi opened this issue · 7 comments
Hi @complementizer,
Just wondering if there's any notion of "relevance" or "rank" for each of the documents in a cluster (and if so, is this how the 10 and 100 documents of WCEP-10 and WCEP-100 were chosen?). I thought it might be the "probability" field in the jsonl files, but I noticed that sometimes one article has multiple "probability" values, so I am not sure.
Hi @JohnGiorgi,
The documents in WCEP-10 and WCEP-100 were selected as follows: for each event/cluster we included all original source articles from the WCEP website and topped them up (to reach 10 or 100) with automatically added related articles from CommonCrawl. These related articles were shuffled, i.e. WCEP-100 shouldn't contain less relevant articles on average than WCEP-10.
There is a measure of relevance/rank that was used to filter these related articles in the first place: we used a logistic regression classifier with simple features to decide whether an <article, event/summary> pair matches (described in the paper). We included all articles that obtained a probability of at least 0.9 from the classifier. If an article matched multiple events/summaries, we only assigned it to the higher-scored one. Each article in the dataset that was collected from CommonCrawl has an "events" list, where each entry has an event "id" and a "probability" given to this <article, event> pair. The "id" associated with the highest probability in this list should match the event/cluster that the article is stored under.
If you need to rank the articles by relevance you can identify the probability that is linked to the id of the current cluster. I hope this helps - let me know if this was not clear or if you have more questions! I'll add information about the jsonl structure in the readme.
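For instance, a minimal sketch of that lookup (assuming a single cluster c parsed from one line of the jsonl, as in wcep-getting-started.ipynb; the helper name is just for illustration) could look like this:
# Rough sketch (name is illustrative): rank a cluster's CommonCrawl articles by the
# probability assigned to this cluster's id; original WCEP source articles have no
# "events" list and are skipped here
def rank_articles_by_relevance(c):
    cluster_id = c["id"]
    scored = []
    for article in c["articles"]:
        for event in article.get("events", []):
            if event["id"] == cluster_id:
                scored.append((event["probability"], article))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest probability first
    return [article for _, article in scored]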
Thanks, @complementizer! That is very helpful. The reason I ask is not that I want to sort the documents in order of relevance but because I wanted to figure out if they were already sorted in order of relevance, particularly in the copies of WCEP-10 that are being used to train MDS models (e.g. see ccdv/WCEP-10, which originally comes from the PRIMERA authors).
I did a quick-and-dirty analysis and found that in 42% of the examples of these copies of WCEP-10, the documents were sorted in order of relevance (where relevance is the probability score of the article for the event, as you described above). If the documents of each example are shuffled randomly, then only ~15% of examples are sorted according to this probability.
Do you have any intuition on why this many examples would be sorted?
Interesting, it seems I remembered this wrong! Thanks @JohnGiorgi, let me check again and come back to you!
No worries, thanks for double-checking! Here is the code I used to arrive at that 43% number (not very pretty, but I think it works!). I ran it right in the wcep-getting-started.ipynb notebook.
# Install the datasets library to load WCEP-10 from HuggingFace
!pip install datasets
import random
from datasets import load_dataset
def sanitize_text(text: str, lowercase: bool = False) -> str:
    """Cleans text by removing whitespace, newlines and tabs and (optionally) lowercasing."""
    sanitized_text = " ".join(text.strip().split())
    sanitized_text = sanitized_text.lower() if lowercase else sanitized_text
    return sanitized_text

# For each example in the full WCEP validation data (val_data, loaded earlier in the
# notebook), build a dictionary of event probabilities keyed by document text
doc_probs = []
for example in val_data:
    id_ = example["id"]
    doc_probs.append({})
    for article in example["articles"]:
        if "events" not in article:
            continue
        # Get the probability that this doc belongs to this event
        probabilities = [event["probability"] for event in article["events"] if event["id"] == id_]
        key = sanitize_text(article["text"][:100], lowercase=True)
        doc_probs[-1][key] = probabilities[0]

# Load WCEP-10 from HuggingFace, keep track of which examples have documents sorted by relevance
sorted_docs = []
wcep_10 = load_dataset("ccdv/WCEP-10", "list", split="validation")

# Sanity check that these correspond to the same examples
assert len(wcep_10["document"]) == len(doc_probs)

for i, docs in enumerate(wcep_10["document"]):
    probs = []
    for doc in docs:
        key = sanitize_text(doc[:100], lowercase=True)
        if key in doc_probs[i]:
            probs.append(round(doc_probs[i][key], 4))
    sorted_docs.append(probs == sorted(probs, reverse=True))

# Report the fraction of examples where documents are sorted by relevance
round(sum(sorted_docs) / len(sorted_docs), 4)
@JohnGiorgi Thanks for sharing. In my released WCEP-100 validation dataset (used in the notebook), the documents are only sorted by the probability in 21 out of 1020 examples:
n_sorted = 0
for c in val_data:
    probs = []
    for a in c["articles"]:
        if "events" in a:
            events = sorted(a["events"], key=lambda x: x["probability"], reverse=True)
            assert events == a["events"]  # sanity check: the events list is already sorted by probability
            probs.append(events[0]["probability"])
    if probs == sorted(probs, reverse=True):
        n_sorted += 1
print(n_sorted, len(val_data))
# 21 1020
If you shuffle them you'll get a similar count.
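For reference, a quick sketch of that shuffled baseline (reusing val_data from the snippet above) could look like this:
import random

# Shuffled baseline: shuffle the articles within each cluster and recount how
# many clusters happen to be sorted by probability
n_sorted_shuffled = 0
for c in val_data:
    articles = [a for a in c["articles"] if "events" in a]
    random.shuffle(articles)
    probs = [max(e["probability"] for e in a["events"]) for a in articles]
    if probs == sorted(probs, reverse=True):
        n_sorted_shuffled += 1
print(n_sorted_shuffled, len(val_data))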
The documents in ccdv/WCEP-10 seem to have a different order and are partly sorted by that probability as you showed. I don't really know how the PRIMERA authors derived their version of the dataset! Maybe they used some similarity measure to sort documents and it only agrees with our classifier in 43% of the clusters, but I'm not sure.
Ahh, okay, thanks @complementizer! That was helpful in narrowing this down. I will ask the PRIMERA authors if they remember how their copy of WCEP-10 got (partially) sorted. Update: the 10 documents were selected according to those probability scores: allenai/PRIMER#16 (comment). I guess my only remaining confusion is why only 40-50% of examples are sorted, but I will have to ask the PRIMERA authors.
One final question, if that's okay... there are articles without "events" keys in the original WCEP data. Are these special? Asking because in the PRIMERA copy of WCEP-10 they always come first.
@JohnGiorgi Ah, good to know, thanks for checking! Yes, the docs without "events" are special because they are the original source articles cited by human editors under each summary. There is usually 1 per cluster, but sometimes more. Each article in the jsonl dataset has an "origin" field, which is either "WCEP" (original source) or "CommonCrawl" (automatically added). Note that the original articles are not always the most similar in word overlap to the summary. I'm still not sure about the unsorted clusters in the PRIMERA version, since the original articles don't have an event probability.
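For example, splitting a cluster into its original sources and the automatically added articles with that "origin" field could look roughly like this (helper name is illustrative):
# Separate the human-cited WCEP source articles from the automatically added
# CommonCrawl articles using the "origin" field
def split_by_origin(c):
    originals = [a for a in c["articles"] if a.get("origin") == "WCEP"]
    crawled = [a for a in c["articles"] if a.get("origin") == "CommonCrawl"]
    return originals, crawled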
I'll add more information to the readme on how to use the dataset (and how it has been used so far) and about the jsonl format, and provide a 10-article version to download.