The purpose of this tool is to enable quick and precise extraction of large sets of newspaper articles from online archives, in support of various language-related tasks. See the Motivation behind the project section for more info.
News:
- La Repubblica news scraping is up and running!
The project has the following dependencies:
- Python 3.5 or higher
- certifi (for web certificates)
- urllib3 (to perform web requests)
- tqdm (for progress bars in the console)
- BeautifulSoup (for scraping)
Simply run `pip install -r requirements.txt` to install all the dependencies automatically.
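For reference, a `requirements.txt` matching the dependency list above would look roughly like this (unpinned; the repository's own file is authoritative):

```
# Sketch of requirements.txt derived from the dependency list above;
# the repository's own file is authoritative. beautifulsoup4 is the
# PyPI package that provides BeautifulSoup.
certifi
urllib3
tqdm
beautifulsoup4
```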
As of now, three modes of execution are supported:

- `python run.py filename newspaper_name day month year` will generate a CSV file named `filename` containing all articles scraped from newspaper `newspaper_name`'s archives for date `day/month/year`.
- `python run.py filename newspaper_name day month year page_num` does the same, but only for a specific page of that day's archives.
- `python run.py filename newspaper_name day month year day_end month_end year_end` is like the first command, but extracts all articles between `day/month/year` and `day_end/month_end/year_end`.
Examples:

```bash
# Scrapes all articles published on March 8, 2019 on
# La Repubblica and saves them in 8-march-articles.csv.
python run.py 8-march-articles.csv LaRepubblica 8 3 2019

# Scrapes all articles on the second page of the March 8, 2019
# archives of La Repubblica and saves them in 8-march-page-2-articles.csv.
python run.py 8-march-page-2-articles.csv LaRepubblica 8 3 2019 2

# Scrapes all articles published in March 2019 on
# La Repubblica and saves them in march-articles.csv.
python run.py march-articles.csv LaRepubblica 1 3 2019 1 4 2019
```
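The resulting CSV can then be consumed by any downstream pipeline. Here is a minimal sketch using only the standard library (the file name matches the first example above; since the exact column layout is not documented here, the header row is read dynamically rather than assumed):

```python
import csv

# Minimal sketch for consuming the scraper's output.
# The exact column layout of the generated CSV is not documented here,
# so we discover the field names from the header row instead of
# hard-coding them.
with open("8-march-articles.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    print("Columns:", reader.fieldnames)
    for row in reader:
        # Each row is one scraped article, keyed by the header names.
        print({k: v[:60] for k, v in row.items() if v})
        break  # peek at the first article only
```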
While very effective article-scraping tools already exist (most notably the Newspaper3k library by codelucas), they are mostly specialized in scraping single articles, relying on general extraction rules that usually perform well but often cannot precisely harvest all the metadata that different media websites expose.
In this sense, Newscrapy provides an easily customizable interface for ad-hoc article scraping: by creating new `Article` and `Extractor` child classes for a specific newspaper, one can specify how that newspaper's articles should be extracted from its archives and which newspaper-specific information should be retained from them.
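As a rough illustration of this workflow: the `Article` and `Extractor` base classes come from the project itself, but everything else in the sketch below (the import path, method names, constructor signatures, and CSS selectors) is an assumption made for illustration, not Newscrapy's actual API:

```python
from bs4 import BeautifulSoup

# Hypothetical import path; the real module layout may differ.
from newscrapy import Article, Extractor


class LaStampaArticle(Article):
    """Retains the metadata peculiar to this (hypothetical) newspaper."""

    def __init__(self, title, body, author, subtitle):
        # Constructor signature is assumed for illustration.
        super().__init__(title, body)
        self.author = author
        self.subtitle = subtitle


class LaStampaExtractor(Extractor):
    """Specifies how articles are laid out on this newspaper's pages."""

    def extract(self, html):
        # The method name and the CSS selectors below are assumptions;
        # a real subclass would target the actual page structure.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("h1", class_="article-title").get_text(strip=True)
        subtitle = soup.find("h2", class_="article-subtitle").get_text(strip=True)
        author = soup.find("span", class_="author-name").get_text(strip=True)
        body = "\n".join(
            p.get_text(strip=True) for p in soup.select("div.article-body p")
        )
        return LaStampaArticle(title, body, author, subtitle)
```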
Currently supported: