Alex-Fabbri/Multi-News

Discrepancy between documents and reference

Alvin-Dey opened this issue · 1 comments

While parsing the raw data, I found that for some examples the source documents do not match their corresponding reference summary.
For example : Topic : 16490
Documents : Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain. ||||| This will appear next to all of your comments NEWLINE_CHAR NEWLINE_CHAR This will NOT appear anywhere on Newser |||||

Summary : – An online retailer has pulled a costume from its website that depicted Holocaust victim Anne Frank. Screenshots of the costume for sale at HalloweenCostumes.com posted to social media show a smiling girl wearing World War II-era clothing and a beret, the AP reports. The costume was quickly criticized on Twitter. Per the Arizona Republic, the description that accompanied the photo called Frank a hero and noted "we can always learn from the struggles of history." Carlos Galindo-Elvira, who leads the Anti-Defamation League's Arizona office, said on Twitter that the costume trivializes the memory of Frank, known from the diary she wrote while in hiding from the Nazis during the war. "There r better ways 2 commemorate Anne Frank," he wrote. A spokesman tweeted Sunday that the costume had been pulled from the site. He explained that the company sells costumes for activities other than Halloween, like "school projects and plays," and he apologized for any offense caused by the costume. Fun.com, based in North Mankato, Minn., runs the website.

Thank you for pointing that out. It looks like those two messages were returned when retrieving some of the source documents. Unfortunately, that's a problem with scraped datasets, as pointed out in Section 3.3 of the paper Neural Text Summarization: A Critical Evaluation. Please let me know if you find any other irregularities, and I will upload an updated version.

I updated the README with a link to the src files with these documents removed (I also removed the ||||| that appears at the end of each example, because it's not necessary). Below are the statistics for the #docs/example for train, validation and test; after the filtering, some of the examples now have only a single doc.
[image: statistics for the #docs/example for train, validation and test]
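For anyone who wants to apply a similar filter to their own copy of the raw data, here is a minimal sketch of the idea, not the exact script used for the updated files. It assumes each example is a single line with documents separated by `|||||`, and it only knows the two boilerplate messages quoted in this issue; the names `BOILERPLATE_PREFIXES`, `split_documents`, `is_boilerplate` and `filter_example` are made up for illustration.

```python
from typing import List

SEPARATOR = "|||||"

# Assumption: only the two scraper messages quoted in this issue are listed.
# The actual filtering may have covered other strings as well.
BOILERPLATE_PREFIXES = (
    "Focused crawls are collections of frequently-updated webcrawl data",
    "This will appear next to all of your comments",
)


def split_documents(example: str) -> List[str]:
    """Split one raw example into its non-empty source documents."""
    return [doc.strip() for doc in example.split(SEPARATOR) if doc.strip()]


def is_boilerplate(doc: str) -> bool:
    """Return True if the document starts with a known scraper message."""
    return doc.startswith(BOILERPLATE_PREFIXES)


def filter_example(example: str) -> str:
    """Drop boilerplate documents and rejoin the rest without a trailing separator."""
    kept = [doc for doc in split_documents(example) if not is_boilerplate(doc)]
    return f" {SEPARATOR} ".join(kept)
```

Calling `len(split_documents(filter_example(line)))` on each line gives the #docs/example after filtering. Note that an example like Topic 16490 above, where every document is boilerplate, comes back as an empty string, so such examples need to be dropped or handled separately.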