/cc-citations

Scientific articles using or citing Common Crawl data


Common Crawl Citations – BibTeX Database

BibTeX files are in bib/.

Note: this is a work in progress and still contains only a fraction of recent articles.

Fields Specific to Common Crawl

The following non-standard fields are used to describe how the publications relate to Common Crawl:

cc-author-affiliation
affiliation of the authors
cc-class
classification of the publication: domain of research, topics, keywords
cc-snippet
snippet citing Common Crawl
cc-dataset-used
subset of Common Crawl used, e.g., CC-MAIN-2016-07
cc-derived-dataset-about
the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-used
a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-cited
a derived dataset is cited but not used
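
As an illustration of how these fields might be read back out of an entry, here is a minimal Python sketch. The sample entry is entirely hypothetical (key, authors, title and all field values are placeholders, and the format of the cc-class value is an assumption), and the regular expression assumes one brace-delimited field per line without nested braces. It is not a general BibTeX parser and not part of this repository.

```python
import re

# Hypothetical entry: the key, authors, title and all field values are
# placeholders for illustration, not a real publication from bib/.
SAMPLE_ENTRY = r"""
@inproceedings{example2016placeholder,
  author = {Doe, Jane and Roe, Richard},
  title = {A Placeholder Title},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {nlp/word-embeddings},
  cc-snippet = {We trained on text extracted from Common Crawl.},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches `name = {value}` where the name starts with "cc-".
# Assumption: one field per line, brace-delimited, no nested braces.
CC_FIELD = re.compile(
    r"^\s*(cc-[a-z-]+)\s*=\s*\{([^{}]*)\}\s*,?\s*$",
    re.MULTILINE,
)


def extract_cc_fields(entry: str) -> dict[str, str]:
    """Return the Common Crawl specific fields of one BibTeX entry."""
    return {name: value.strip() for name, value in CC_FIELD.findall(entry)}


if __name__ == "__main__":
    for name, value in extract_cc_fields(SAMPLE_ENTRY).items():
        print(f"{name}: {value}")
```

Standard fields such as author or year are ignored because their names do not start with cc-. For real files with nested braces or multi-line values, a dedicated BibTeX parser would be a safer choice.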

Formatting and Export of Citations

The Makefile contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.
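
Before running the Makefile, it can help to confirm that the three tools are installed. The following Python sketch does this with `shutil.which`. The function name `missing_tools` and its `which` parameter are hypothetical helpers introduced here, and the check only assumes that the tools are expected on the PATH under the names listed above.

```python
import shutil

# Tool names as listed in this README.
REQUIRED_TOOLS = ("bibtex2html", "bibclean", "bibtool")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed locally.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing BibTeX tools:", ", ".join(missing))
    else:
        print("All required BibTeX tools found.")
```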

Citations from Google Scholar Alerts

As an initial step to increase coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.