Primary language: Shell. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Parallel Cloud Downloader

A distributed (patent) downloader (and metadata extractor) that runs across several Linux instances on the cloud.

This was developed to assist a researcher in downloading terabytes of data from a patent repository and extracting the metadata. While its use was specific, it could easily be adapted to download and process other large datasets in parallel.

Its design assumes that it runs on Linux (CentOS 6.5 was used), with one server running an NFS server (or another form of shared file system) and acting as the master, while the other Linux instances run as slaves and coordinate with the master via the shared file system. It also assumes that the scripts are all in the same directory (generally the directory it is run from) and that the data is downloaded into a subdirectory of it.

Source Files

The scripts in the src directory are the main components of the system.

graballrt.sh

graballrt.sh is the main program and controls the other components. Run it to start the downloads.

Usage: graballrt.sh [-v] [-u base-URL] [-m maxjobs] [-x extract-dir] csvfile

This script downloads multiple series of zipped patent files from the patent repository, according to the ranges specified in the csv file, which is the script's only compulsory argument. For each range, zipped files are downloaded into that range's directory, while extracted files are placed in the specified subdirectory. The optional -m parameter gives the maximum number of parallel downloads (default 20). The -v flag enables verbose output for debugging or monitoring. The other parameters are self-explanatory.

To run it in parallel across several instances, run it on each instance with the same csv file.
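
The README does not spell out how instances that share one csv file avoid fetching the same series. The following is a minimal sketch of one common approach, assuming a lock directory per series on the shared file system: mkdir fails if the directory already exists, so only the first instance to call it wins the claim. The function name claim_series, the .lock suffix, and the series names are hypothetical and are not taken from graballrt.sh.

```shell
#!/usr/bin/env bash
# claim_series LOCKROOT NAME
# Succeeds only for the first caller that claims NAME (hypothetical helper).
claim_series() {
    local lockroot=$1 name=$2
    # mkdir creates the directory or fails if it already exists, in one step.
    mkdir "$lockroot/$name.lock" 2>/dev/null
}

# A temporary directory stands in for a directory on the shared file system.
lockroot=$(mktemp -d)

# The second attempt on s1 plays the role of another instance.
for series in s1 s2 s1; do
    if claim_series "$lockroot" "$series"; then
        echo "claimed $series"
    else
        echo "skipping $series (already claimed)"
    fi
done
# prints:
#   claimed s1
#   claimed s2
#   skipping s1 (already claimed)
```

With this pattern every instance can walk the same list in the same order and simply skip whatever another instance has already claimed.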

rtdl.sh

The rtdl.sh script downloads a series of zipped patent files from the patent repository. The -v flag enables verbose output for debugging or monitoring. The other parameters are self-explanatory. It is called by graballrt.sh.

Usage: rtdl [-v] [-u base-URL] [-s suffix] [-d directory] [-m maxjobs] first [last]
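
The -m maxjobs option caps the number of simultaneous downloads. The following is a minimal sketch of one way to enforce such a cap in bash, by polling the count of running background jobs before starting another. It is an illustration only and not necessarily how rtdl.sh does it. The names fetch_one and run_limited are hypothetical, and fetch_one only pretends to download.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a real download step.
fetch_one() {
    local item=$1 outdir=$2
    sleep 0.2                           # pretend the transfer takes a while
    echo "done" > "$outdir/$item.done"  # leave a marker for the finished item
}

# run_limited MAXJOBS OUTDIR ITEM...
# Starts one background job per item, never more than MAXJOBS at a time.
run_limited() {
    local maxjobs=$1 outdir=$2
    shift 2
    local item
    for item in "$@"; do
        # Poll until fewer than MAXJOBS background jobs are running.
        while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
            sleep 0.1
        done
        fetch_one "$item" "$outdir" &
    done
    wait    # let the last jobs finish before returning
}

outdir=$(mktemp -d)
run_limited 2 "$outdir" a b c d e
ls "$outdir"
```

Polling is cruder than waiting on a specific job, but it relies only on the jobs builtin, so it also works on older bash versions.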

get_rt_archive.sh

The get_rt_archive.sh script downloads a zipped patent file from a patent archive. The -v flag is for verbose (debugging or monitoring) output. It is called by rtdl.sh.

Usage: get_rt_archive.sh [-v] baseurl, directory, file, suffix, index, urlcount

unpackrt.sh

The unpackrt.sh script extracts the patent metadata (*.tsv files) and optionally (if given the -z flag) zips the resulting data up into an archive. It is called by graballrt.sh.

Usage: unpackrt.sh [-z] series-dir dest-dir

kill_graball.sh

The kill_graball.sh script gracefully shuts down the running graballrt.sh process, allowing it to complete the current downloads before exiting.

Usage: kill_graball.sh
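
How kill_graball.sh tells graballrt.sh to stop is not described here. The following is a minimal sketch of one way to get that "finish what is running, start nothing new" behaviour, assuming a flag file that the main loop checks before each item. The flag path and the names process_item and run_until_stopped are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical flag file; on a cluster it would live on the shared file system.
workdir=$(mktemp -d)
STOP_FLAG="$workdir/STOP"

# Hypothetical stand-in for downloading one series.
process_item() {
    echo "processing $1"
    # Simulate a shutdown request arriving while series2 is in progress.
    if [ "$1" = "series2" ]; then
        touch "$STOP_FLAG"
    fi
}

# Work through the items, checking for the flag before starting each one.
run_until_stopped() {
    local item
    for item in "$@"; do
        if [ -e "$STOP_FLAG" ]; then
            echo "stop requested, not starting $item"
            return 0
        fi
        process_item "$item"
    done
}

run_until_stopped series1 series2 series3
# prints:
#   processing series1
#   processing series2
#   stop requested, not starting series3
```

Because the flag is only checked between items, the item in progress always runs to completion.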

onfull_graball.sh

The onfull_graball.sh script handles the case where the disk is close to full, and is generally called from a periodic cron job. Note that it must be run in the same directory as the downloads.

It takes three parameters. The first specifies the action to take when the disk is considered full: 'stop' to simply stop further downloads, 'delete' to delete the oldest download, or 'kill' to abort the graballrt.sh program (this can only be done from the server). The second specifies the file system to monitor, for example /data1. The third specifies the fullness threshold as an integer percentage, for example '45' for 45%.

Usage: onfull_graball.sh (stop | delete | kill) DISK THRESHOLD
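
The DISK and THRESHOLD parameters imply a comparison of the file system's usage against an integer percentage. The following is a minimal sketch of such a check, assuming df -P is used to read the usage and that "full" means usage at or above the threshold. The function names are hypothetical, and the real script may measure or compare differently.

```shell
#!/usr/bin/env bash
# disk_usage_pct DISK
# Prints the used capacity of the file system holding DISK as an integer.
disk_usage_pct() {
    # df -P prints one line per file system in a fixed POSIX layout;
    # column 5 is the capacity, for example "45%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# is_disk_full DISK THRESHOLD
# Succeeds when usage is at or above THRESHOLD (assumed meaning of "full").
is_disk_full() {
    local disk=$1 threshold=$2
    [ "$(disk_usage_pct "$disk")" -ge "$threshold" ]
}

# Example: check the root file system against a 45% threshold.
if is_disk_full / 45; then
    echo "disk is considered full"
else
    echo "disk has room"
fi
```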

dlmon.sh

dlmon.sh is a simple script for monitoring the progress of the downloads. It takes two optional parameters: the number of most recent '.error' files to watch (default 1) and the number of lines of each file to show (default 10).

Usage: dlmon.sh [nfiles [nlines]]
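
The same idea can be sketched in a few lines: pick the most recently modified '.error' files and show the tail of each. This is an illustration under assumptions and not the contents of dlmon.sh. The function name show_latest_errors and the sample file names are hypothetical, and file names are assumed to contain no newlines.

```shell
#!/usr/bin/env bash
# show_latest_errors DIR [NFILES [NLINES]]
# Shows the last NLINES lines of the NFILES newest *.error files in DIR.
show_latest_errors() {
    local dir=$1 nfiles=${2:-1} nlines=${3:-10}
    local f
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/*.error 2>/dev/null | head -n "$nfiles" | while IFS= read -r f; do
        echo "==> $f <=="
        tail -n "$nlines" "$f"
    done
}

# Demo with two sample files; old.error is given an older timestamp.
demo=$(mktemp -d)
printf 'o1\no2\no3\n' > "$demo/old.error"
touch -t 202001010000 "$demo/old.error"
printf 'n1\nn2\nn3\n' > "$demo/new.error"

show_latest_errors "$demo" 1 2    # shows n2 and n3 from new.error
```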

Example crontab

There are two example crontabs in the crontab directory. Both monitor disk space: rtdl-master_crontab.sh deletes the oldest download when space runs low, and rtdl-slave_crontab.sh exits after the current downloads finish.
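
For orientation, a crontab entry for this kind of check might look like the line printed by the sketch below. It is built from the documented onfull_graball.sh parameters, but the ten-minute interval, the /data1/downloads working directory, and the helper name make_onfull_cron_line are assumptions, and the shipped example crontabs may differ.

```shell
#!/usr/bin/env bash
# make_onfull_cron_line MINUTES WORKDIR ACTION DISK THRESHOLD
# Prints a crontab entry that runs the disk check every MINUTES minutes.
make_onfull_cron_line() {
    local minutes=$1 workdir=$2 action=$3 disk=$4 threshold=$5
    # cd first, because the script must run in the download directory.
    printf '*/%s * * * * cd %s && ./onfull_graball.sh %s %s %s\n' \
        "$minutes" "$workdir" "$action" "$disk" "$threshold"
}

make_onfull_cron_line 10 /data1/downloads stop /data1 45
# prints: */10 * * * * cd /data1/downloads && ./onfull_graball.sh stop /data1 45
```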

Example Download List

An example list of patent series to download is provided by rt_download_list.csv in the download_list directory.