This script is intended to download, filter, and concatenate data files from the Dun & Bradstreet "Global Archive" (see the user guide). The data is stored on Princeton servers, and this script will only work if you are connected to the Princeton network (either directly or through a VPN; see below).
This script has been tested on Windows 10, macOS 10.15, Manjaro Linux and Raspberry Pi OS. Before running, make sure your machine satisfies the following requirements:
- Firefox (note: the script should also work with Chrome, but this has not been tested)
- Python 3
- Python 3 modules: requests, browser_cookie3, beautifulsoup4, pandas, progressbar2
This script relies on the access cookies stored in the browser (i.e. Firefox). In order for it to run correctly, follow these steps (a sketch of how the cookies are used appears after the list):
- Browse to https://vpn.princeton.edu/https-443/dss2.princeton.edu/dandb/dandbarchives/LINK/ in Firefox.
- If you get a Central Authentication Service (CAS) login page, use your Princeton credentials to log in. If instead you can access the link directly (i.e. you see the file directory), you are already ready to run the script.
- On the following page you should get a Duo prompt. Log in via your preferred method, but make sure you check "Remember me for 90 days".
- If you see the file directory, you are done.
- Double-check that the browser stored your credentials by restarting it and visiting the link again. If you can access the link directly, you are ready to run the script.
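For reference, here is a minimal sketch of how the script is expected to reuse these cookies, assuming the `browser_cookie3` and `requests` modules listed above. The URL is just an example, and this is not the script's actual code:

```python
# Minimal sketch: load the Firefox cookies saved by the CAS/Duo login and use
# them to request the protected index page. Not the actual DB_scrape.py code.
import browser_cookie3
import requests

# Example URL (same form as the one used above); LINK is a placeholder.
INDEX_URL = "https://dss2.princeton.edu/dandb/dandbarchives/LINK/AF/"

cookies = browser_cookie3.firefox()                  # read cookies from the Firefox profile
response = requests.get(INDEX_URL, cookies=cookies)
response.raise_for_status()                          # fails if the session cookies are missing or expired
print("Index page fetched:", response.status_code)
```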
- Clone this repository to your local machine and change into its directory (detailed guide). The easiest way is to type the following two commands in your terminal:
  ```
  git clone https://github.com/acarril/DB-Global-Archive.git
  cd DB-Global-Archive
  ```
- Run `DB_scrape.py`, giving the file index URL as its argument. Make sure to include the final slash. For example, to download the data corresponding to Africa, type

  ```
  python3 DB_scrape.py 'https://dss2.princeton.edu/dandb/dandbarchives/LINK/AF/'
  ```
- The script will output its progress. Once it has finished, it will write a `csv` file in the directory (e.g. `AF.csv`).
```
python3 DB_scrape.py "<url>" [writedisk] [suboperation]
```
You can carry out all the operations in memory by running `python3 DB_scrape.py "<url>"`. This will fetch all the links from the URL, read the data, filter it, and append it to a final dataset that is written to disk only in the last step. This is best when there is little available disk space, and it is also faster. However, it can be memory intensive, and any error during the operation will result in the loss of all progress.
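As a rough illustration of this in-memory workflow (not the script's actual implementation; the single-CSV-per-ZIP assumption, the output filename, and the filtering step are placeholders):

```python
# Sketch of the in-memory mode: fetch ZIP links, read each archive in memory,
# filter, append, and write to disk only once at the end. Assumptions: the
# index page links to .zip files and each ZIP contains a single CSV.
import io
import zipfile

import browser_cookie3
import pandas as pd
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://dss2.princeton.edu/dandb/dandbarchives/LINK/AF/"  # example URL
cookies = browser_cookie3.firefox()

# Collect the ZIP links listed on the index page.
index = requests.get(INDEX_URL, cookies=cookies)
soup = BeautifulSoup(index.text, "html.parser")
zip_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".zip")]

frames = []
for link in zip_links:
    # Download the archive into memory and read its CSV without touching disk.
    data = requests.get(INDEX_URL + link, cookies=cookies).content
    with zipfile.ZipFile(io.BytesIO(data)) as zf, zf.open(zf.namelist()[0]) as f:
        df = pd.read_csv(f, low_memory=False)
    frames.append(df)  # the actual filtering criteria live in DB_scrape.py

# Everything stays in memory until this single final write.
pd.concat(frames, ignore_index=True).to_csv("AF.csv", index=False)
```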
Alternatively, the script can download and write the ZIP files to disk, and then expand them and write the corresponding CSV files to disk, by running `python3 DB_scrape.py "<url>" writedisk`. This is less memory intensive and more robust to errors (since progress is saved to disk as it goes), but it uses more disk space.
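A rough sketch of this disk-based variant for a single file is shown below; the filename is a placeholder, as the actual script derives the names from the index page:

```python
# Sketch of the writedisk mode: save each ZIP to disk, then expand it, so an
# interrupted run can resume from the files already present. Filenames are
# placeholders, not the script's actual internals.
import zipfile
from pathlib import Path

import browser_cookie3
import requests

INDEX_URL = "https://dss2.princeton.edu/dandb/dandbarchives/LINK/AF/"  # example URL
cookies = browser_cookie3.firefox()

zip_name = "example_file.zip"  # in practice taken from the index page links
zip_path = Path(zip_name)

# Skip archives already downloaded by a previous (possibly interrupted) run.
if not zip_path.exists():
    with requests.get(INDEX_URL + zip_name, cookies=cookies, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Expand the archive; the extracted CSV can then be filtered and rewritten to disk.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(".")
```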
This method can be further customized by passing an additional argument with a sub-operation (see the sketch after this list). It can be one of the following:
- `download`, which only downloads the ZIP files from the specified URL
- `filter`, which takes ZIP files already downloaded in the directory and filters the corresponding CSV files, writing them to disk
- `join`, which takes the CSV files in the current directory and joins them into one
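The sketch below illustrates how such a sub-operation argument could be dispatched. The argument names mirror the usage line above, but the control flow shown is an assumption meant only as an illustration, not the script's verbatim code:

```python
# Illustrative dispatch of the optional arguments. The function bodies are
# stubs; the real logic lives in DB_scrape.py.
import sys

def download(url):
    """Download the ZIP files listed at `url` (stub)."""

def filter_zips():
    """Expand local ZIP files and write the filtered CSVs to disk (stub)."""

def join_csvs():
    """Concatenate the CSV files in the current directory into one file (stub)."""

url = sys.argv[1]                                     # e.g. "https://.../AF/"
mode = sys.argv[2] if len(sys.argv) > 2 else None     # "writedisk" or absent
subop = sys.argv[3] if len(sys.argv) > 3 else None    # "download", "filter", or "join"

if mode == "writedisk":
    if subop in (None, "download"):
        download(url)
    if subop in (None, "filter"):
        filter_zips()
    if subop in (None, "join"):
        join_csvs()
else:
    # No "writedisk" argument: everything happens in memory (see above).
    pass
```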