UBC HistoryLab

This repository contains the code needed to download newspapers listed in a CSV file and then filter the downloaded content for articles that match given keywords.

Sources Supported

Currently the following newspaper sources are supported:

  • "ChroniclingAmerica" : chroniclingamerica.loc.gov
  • "BC" : open.library.ubc.ca
  • "Oregon" : oregonnews.uoregon.edu
  • "NewYork" : nyshistoricnewspapers.org
  • "Georgia" : gahistoricnewspapers.galileo.usg.edu
  • "Newspaper" : newspaper.com
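
The source names in the list above are the values passed to the scripts on the command line. A minimal sketch of checking a name before starting a long download is shown here. The `SUPPORTED_SOURCES` mapping and the `check_source` helper are illustrative names, not part of the repository, and case-sensitive matching is an assumption.

```python
# Hypothetical lookup table built from the list of supported sources above.
SUPPORTED_SOURCES = {
    "ChroniclingAmerica": "chroniclingamerica.loc.gov",
    "BC": "open.library.ubc.ca",
    "Oregon": "oregonnews.uoregon.edu",
    "NewYork": "nyshistoricnewspapers.org",
    "Georgia": "gahistoricnewspapers.galileo.usg.edu",
    "Newspaper": "newspaper.com",
}


def check_source(name):
    """Return the website for a source name, or raise ValueError if unsupported."""
    try:
        # Assumption: names are matched exactly, including letter case.
        return SUPPORTED_SOURCES[name]
    except KeyError:
        valid = ", ".join(sorted(SUPPORTED_SOURCES))
        raise ValueError(f"Unknown source {name!r}; expected one of: {valid}") from None


print(check_source("BC"))  # prints: open.library.ubc.ca
```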

Downloading full articles

To download the articles from a specific source, you need a CSV file with exactly one column, headed TxtURL, containing links from that source. Let us assume that the file is called Links_BC.csv.
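
If you do not already have such a file, the sketch below shows one way to produce and check it with Python's standard `csv` module. The helper names `write_links_csv` and `read_links_csv` are illustrative and do not come from the repository; the links you pass in must be real links from the chosen source.

```python
import csv


def write_links_csv(path, urls):
    """Write a one-column CSV whose header is TxtURL, with one link per row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["TxtURL"])  # the only header the file should have
        for url in urls:
            writer.writerow([url])


def read_links_csv(path):
    """Read the links back, rejecting files that do not have exactly one TxtURL column."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != ["TxtURL"]:
            raise ValueError(f"Expected a single TxtURL column, found {reader.fieldnames}")
        return [row["TxtURL"] for row in reader]
```

Calling `write_links_csv("Links_BC.csv", urls)` with a list of links would create the file assumed in this section, and `read_links_csv("Links_BC.csv")` can be used to confirm its layout before running the download.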

Create an empty folder called raw_data_[Source_Name]. For example, to download newspapers from BC, create an empty folder called raw_data_BC.

In the parser folder, use the parser.py file to download the newspapers:

> python parser.py <csv_file_with_links> <source>

Filtering articles with keywords

To filter the articles, first create an empty folder whose name is the raw data folder's name with the prefix "output_" added. For example, raw_data_BC becomes output_raw_data_BC.
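
The two folder names follow a fixed pattern, so they can be derived from the source name. The sketch below creates both folders with `pathlib`. The `prepare_folders` helper is an illustrative name, not part of the repository, and it assumes the folders should live next to each other.

```python
from pathlib import Path


def prepare_folders(source, base="."):
    """Create raw_data_<source> and output_raw_data_<source> under base."""
    raw = Path(base) / f"raw_data_{source}"
    out = Path(base) / f"output_{raw.name}"
    # exist_ok=True keeps an existing folder as it is; it does not empty it,
    # so remove leftover files yourself if you need a clean start.
    raw.mkdir(parents=True, exist_ok=True)
    out.mkdir(parents=True, exist_ok=True)
    return raw, out
```

For the source BC, `prepare_folders("BC")` returns paths named raw_data_BC and output_raw_data_BC.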

Next, run the script article_parser.py, passing the output folder followed by the keywords to filter the articles.
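
This README does not describe how article_parser.py matches keywords, so the following is only a sketch of one plausible rule. It assumes an article is kept when it contains every keyword given, ignoring letter case. The function names are hypothetical and the actual script may behave differently.

```python
def matches_keywords(text, keywords):
    """Return True when every keyword occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    # Assumption: all keywords must be present; the real script may use another rule.
    return all(keyword.lower() in lowered for keyword in keywords)


def filter_articles(articles, keywords):
    """Keep only the articles that match all of the keywords."""
    return [article for article in articles if matches_keywords(article, keywords)]


sample = ["Hello from the world desk", "Hello again", "Nothing relevant"]
print(filter_articles(sample, ["hello", "world"]))  # prints: ['Hello from the world desk']
print(filter_articles(sample, ["hello"]))  # prints: ['Hello from the world desk', 'Hello again']
```

Under this assumption, adding a second keyword narrows the result, which is why a single keyword returns more articles in the sample above.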

Full Example

Here is a full example of downloading and filtering articles. It downloads the links in the Links.csv file from source BC and then finds articles with the keywords hello and world. The second keyword is optional.

> mkdir raw_data_BC
> python parser.py Links.csv BC
> mkdir output_raw_data_BC
> python article_parser.py output_raw_data_BC hello world