Package up microPublication.org and other journals for archiving into Portico and PMC

PubArchiver

A program that creates archives of articles from specific journal sites (currently microPublication and Prompt) for sending to Portico and PMC.

Authors: Michael Hucka, Tom Morrell
Repository: https://github.com/caltechlibrary/pubarchiver
License: BSD/MIT derivative – see the LICENSE file for more information

☀ Introduction

The Caltech Library is the publisher of a few academic journals and provides services for them. The services include archiving in a dark archive (specifically, Portico) as well as submitting articles to PMC. The archiving process involves pulling down articles from the journals and packaging them up in a format suitable for sending to the archives. PubArchiver is a program to help automate this process.

✺ Installation

There are multiple ways of installing PubArchiver. Please choose the alternative that suits you.

Alternative 1: installing PubArchiver using pipx

You can use pipx to install PubArchiver. Pipx installs it into a separate Python environment that isolates PubArchiver's dependencies from other Python programs on your system, yet the resulting pubarchiver command will be executable from any shell, like any normal program on your computer. If you do not already have pipx on your system, it can be installed in a variety of easy ways; it is best to consult pipx's installation guide for instructions. Once you have pipx on your system, you can install PubArchiver with the following command:

pipx install pubarchiver

Pipx can also let you run PubArchiver directly using pipx run pubarchiver, although in that case, you must always prefix every pubarchiver command with pipx run. Consult the documentation for pipx run for more information.

Alternative 2: installing PubArchiver using pip

The instructions below assume you have a Python 3 interpreter installed on your computer. Note that the default on macOS at least through 10.14 (Mojave) is Python 2 – please first install Python version 3 and familiarize yourself with running Python programs on your system before proceeding further.

On Linux, macOS, and Windows operating systems, you should be able to install pubarchiver with pip for Python 3. To install pubarchiver from the Python package repository (PyPI), run the following command:

python3 -m pip install pubarchiver

As an alternative to getting it from PyPI, you can use pip to install pubarchiver directly from GitHub:

python3 -m pip install git+https://github.com/caltechlibrary/pubarchiver.git

If you already installed PubArchiver once before, and want to update to the latest version, add --upgrade to the end of either command line above.

Alternative 3: installing PubArchiver from sources

If you prefer to install PubArchiver directly from the source code, you can do that too. To get a copy of the files, you can clone the GitHub repository:

git clone https://github.com/caltechlibrary/pubarchiver

Alternatively, you can download the files as a ZIP archive directly from your browser using this link: https://github.com/caltechlibrary/pubarchiver/archive/refs/heads/main.zip

Next, after getting a copy of the files, run setup.py inside the code directory:

cd pubarchiver
python3 setup.py install

▶︎ Usage

PubArchiver is a command-line program. The installation process should put a program named pubarchiver in a location normally searched by your shell interpreter. For help with usage at any time, run pubarchiver with the option --help (or -h for short).

pubarchiver -h
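
If the command is not found after installation, a quick check of the shell search path can help narrow down the problem. The following is a minimal sketch using only the Python standard library; the function name find_pubarchiver is made up for this example and is not part of PubArchiver.

```python
import shutil


def find_pubarchiver(program="pubarchiver"):
    """Return the full path of the program if it is on PATH, else None."""
    # shutil.which searches the same directories the shell would search.
    return shutil.which(program)


if __name__ == "__main__":
    path = find_pubarchiver()
    if path:
        print(f"pubarchiver found at {path}")
    else:
        print("pubarchiver is not on PATH; revisit the installation steps")
```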

Basic usage

Options to pubarchiver use a dash (-) as the prefix character on macOS and Linux, and forward slash (/) on Windows.

The journal whose articles are to be archived must be indicated using the required option --journal (or -j for short). To see a list of supported journals, you can use --journal list like this:

pubarchiver --journal list

If not given any additional options besides a --journal option to select the journal, pubarchiver will proceed to contact the journal website as well as either DataCite or Crossref, and create an archive containing articles and their metadata for all articles published to date by the journal. The options below can be used to select articles and influence other pubarchiver behaviors.
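
For scripted or scheduled runs, pubarchiver can be started from Python with the standard subprocess module. The sketch below is illustrative only: the helper names archive_command and run_archive are hypothetical, it uses the dash-prefixed option form for macOS and Linux, and it assumes pubarchiver is already installed and on the search path.

```python
import subprocess


def archive_command(journal, extra_options=()):
    """Build the argument list for one pubarchiver run (dash-prefix form)."""
    return ["pubarchiver", "--journal", journal, *extra_options]


def run_archive(journal, extra_options=()):
    """Run pubarchiver and return its exit code.

    Raises FileNotFoundError if pubarchiver is not installed or not on PATH.
    """
    completed = subprocess.run(archive_command(journal, extra_options))
    return completed.returncode
```

For example, run_archive("micropublication", ["--preview"]) would invoke pubarchiver --journal micropublication --preview.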

Printing information without doing anything

The option --list-dois (or -l for short) can be used to obtain a list of all DOIs for all articles published by the selected journal. When --list-dois is used, pubarchiver prints the list to the terminal and exits without doing further work. This can be useful if you intend to use the --doi-file option discussed below.

If given the option --preview (or -p for short), pubarchiver will only print a list of articles it will archive and stop short of creating the archive. This is useful to see what would be produced without actually doing it. Note, however, that because it does not attempt to download the articles and associated files, it cannot report errors that might occur when actually creating an archive. Consequently, do not use the output of --preview as a prediction of eventual success or failure.

Selecting the archive format and archive output location

The value supplied after the option --dest (or -d for short) can be used to tell pubarchiver the intended destination where the archive will be sent. The option changes the structure and content of the archive created by pubarchiver. The possible alternatives are portico and pmc. Portico is assumed to be the default destination if no --dest option is given.

By default, pubarchiver will write its output to a new subdirectory it creates in the directory from which pubarchiver is being run. The option --output-dir (or -o for short; /o on Windows) can be used to select a different location. For example,

pubarchiver --journal micropublication --output-dir /tmp/micropub

The materials for each article will be written to an individual subdirectory named after the DOI of the article. The output for each article will consist of an XML metadata file describing the article, the article itself in PDF format, and (if the journal provides JATS) a subdirectory named jats containing the article in JATS XML format along with any image that may appear in the article. The image is always converted to uncompressed TIFF format, because it is considered a good preservation format. The PMC structure follows the naming and delivery specifications defined at https://www.ncbi.nlm.nih.gov/pmc/pub/filespec-delivery/.
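
To spot-check unzipped output (for example, after a run with --no-zip), a small script can confirm that each article subdirectory holds the files described above. This is a rough sketch, not part of PubArchiver. It assumes the article subdirectories are direct children of the directory you point it at, and it matches files by extension only, because the exact file names are not specified here. The function names are hypothetical.

```python
from pathlib import Path


def summarize_article_dir(article_dir):
    """List the PDF and XML files in one article directory and note JATS."""
    article_dir = Path(article_dir)
    return {
        "pdf": sorted(p.name for p in article_dir.glob("*.pdf")),
        "xml": sorted(p.name for p in article_dir.glob("*.xml")),
        "has_jats": (article_dir / "jats").is_dir(),
    }


def incomplete_articles(output_dir):
    """Return names of article subdirectories lacking a PDF or an XML file."""
    missing = []
    for sub in sorted(Path(output_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip stray files next to the article directories
        info = summarize_article_dir(sub)
        if not info["pdf"] or not info["xml"]:
            missing.append(sub.name)
    return missing
```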

Unless the option --no-zip (or -Z for short) is given, the output will be archived in ZIP format. If the output structure (as determined by the --dest option) is being generated for PMC, each article will be put into its own individual ZIP archive; otherwise, the default action is to put the collected output of all articles into a single ZIP archive file. The option --no-zip makes pubarchiver leave the output unarchived in the directory determined by the --output-dir option.
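
The difference between the two packaging policies can be illustrated with the standard zipfile module. The sketch below is not PubArchiver's own code: the function names, the archive names, and the paths stored inside the archives are all assumptions made for illustration (the real PMC archives follow the PMC naming specification). It only shows the idea of one ZIP per article versus one ZIP for everything.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Put every file under source_dir into a new ZIP archive at zip_path."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so that the top-level
                # directory name is kept inside the archive.
                arcname = path.relative_to(source_dir.parent).as_posix()
                zf.write(path, arcname)
    return Path(zip_path)


def package(output_dir, dest="portico"):
    """Return the list of ZIP files created for the given destination."""
    output_dir = Path(output_dir)
    if dest == "pmc":
        # One archive per article subdirectory (hypothetical naming).
        subdirs = [p for p in sorted(output_dir.iterdir()) if p.is_dir()]
        return [zip_directory(s, output_dir / (s.name + ".zip")) for s in subdirs]
    # Otherwise, a single archive holding the output of all articles.
    return [zip_directory(output_dir, output_dir.parent / (output_dir.name + ".zip"))]
```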

Selecting a subset of articles

If the option --after-date is given, pubarchiver will download only articles whose publication dates are after the given date. Valid date descriptors are those accepted by the Python dateparser library. Make sure to enclose descriptions within single or double quotes. Examples:

  pubarchiver --after-date "2014-08-29"   ....
  pubarchiver --after-date "12 Dec 2014"  ....
  pubarchiver --after-date "July 4, 2013"  ....
  pubarchiver --after-date "2 weeks ago"  ....

The option --doi-file (or -f for short) can be used to tell pubarchiver to read a file containing DOIs and only fetch those particular articles instead of asking the journal for all articles. The format of the file indicated after the --doi-file option must be a simple text file containing one DOI per line.
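
Because the file format is just one DOI per line, such a file is easy to produce or check with a few lines of Python. The helpers below are a minimal sketch with made-up names; they strip surrounding whitespace and skip blank lines, which is a convenience of this example rather than documented pubarchiver behavior.

```python
from pathlib import Path


def write_doi_file(dois, path):
    """Write the given DOIs to path, one per line; return how many were written."""
    cleaned = [d.strip() for d in dois if d.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


def read_doi_file(path):
    """Read a one-DOI-per-line text file, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The resulting file can then be passed to pubarchiver with --doi-file.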

When both options are given, pubarchiver first reads the list of articles from the file given to --doi-file and then applies the --after-date filter. The --after-date option can thus be used to filter by date the articles whose DOIs are provided.
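
The order of the two selection steps can be sketched as follows. This is an illustration of the logic described above, not PubArchiver's implementation: the function name and the (DOI, date) pair representation are hypothetical, and the comparison is assumed to be strictly "later than" the given date.

```python
from datetime import date


def select_articles(articles, doi_list=None, after=None):
    """Filter (doi, publication_date) pairs: DOI list first, then date."""
    selected = list(articles)
    if doi_list is not None:
        wanted = set(doi_list)
        selected = [a for a in selected if a[0] in wanted]
    if after is not None:
        # Assumption: keep only articles published strictly after the date.
        selected = [a for a in selected if a[1] > after]
    return selected


if __name__ == "__main__":
    # Hypothetical DOIs and dates, for illustration only.
    articles = [
        ("10.1234/example.0001", date(2020, 1, 1)),
        ("10.1234/example.0002", date(2021, 6, 1)),
    ]
    print(select_articles(articles, after=date(2020, 12, 31)))
```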

Writing a report

As it works, pubarchiver writes information to the terminal about the articles it puts into the archive, including whether any problems are encountered. To save this information to a file, use the option --rep-file (or -r for short), which will make pubarchiver write a report file. By default, the format of the report file is CSV; the option --rep-fmt (or -s for short) can be used to select csv or html (or both) as the format. The title of the report will be based on the current date, unless the option --rep-title (or -t for short) is used to supply a different title.

Additional command-line options

When pubarchiver downloads the JATS XML version of articles from the journal site, it will by default validate the XML content against the JATS DTD. To skip the XML validation step, use the option --no-check (or -X for short).

pubarchiver will print informational messages as it works. To reduce messages to only warnings and errors, use the option --quiet (or -q for short). Also, output is color-coded by default unless the --no-color option (or -C for short) is given; this option can be helpful if the color control sequences create problems for your terminal emulator.

If given the --debug option (or -@ for short), this program will output a detailed real-time trace of what it is doing. The output will be written to the given destination, which can be a dash character (-) to indicate console output, or a file path.

If given the --version option (or -V for short), this program will print version information and exit without doing anything else.

Return values

This program exits with a return code of 0 if no problems are encountered while fetching data from the server. It returns a nonzero value otherwise, following conventions for use in shells such as bash which only understand return code values of 0 to 255. If no network is detected, it returns a value of 1; if it is interrupted (e.g., using ctrl-c) it returns a value of 2; if it encounters a fatal error, it returns a value of 3. If it encounters any non-fatal problems (such as a missing PDF file or JATS validation error), it returns a nonzero value equal to 100 + the number of articles that had failures. Summarizing the possible return codes:

Return value   Meaning
0              No errors were encountered – success
1              No network detected – cannot proceed
2              The user interrupted program execution
3              An exception or fatal error occurred
100 + n        Encountered non-fatal problems on a total of n articles
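
A wrapper script can translate these return codes into messages. The following is a minimal sketch based only on the table above; the function name describe_exit_code is made up for this example, and any code not listed in the table is reported as unrecognized.

```python
def describe_exit_code(code):
    """Map a pubarchiver return code to a short description."""
    fixed = {
        0: "No errors were encountered",
        1: "No network detected",
        2: "The user interrupted program execution",
        3: "An exception or fatal error occurred",
    }
    if code in fixed:
        return fixed[code]
    if code > 100:
        # Codes of the form 100 + n report n articles with non-fatal problems.
        return f"Non-fatal problems on {code - 100} article(s)"
    return f"Unrecognized return code {code}"
```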

Summary of command-line options

The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):

Short   Long form opt     Meaning                                     Default                     Note
-a A    --after-date A    Only get articles published after date A    Get all articles            ⬥
-C      --no-color        Don't color-code info messages              Color-code terminal output
-d D    --dest D          Destination for archive: Portico or PMC     Portico
-f F    --doi-file F      Only get articles whose DOIs are in file F  Get all articles
-j J    --journal J       Work with journal J                                                     ★
-l      --list-dois       Print a list of all known DOIs & exit       Do other actions instead
-o O    --output-dir O    Write output in directory O                 Write in current dir
-p      --preview         Preview what would be archived & exit       Obtain the articles
-q      --quiet           Only print important messages               Be chatty while working
-r R    --rep-file R      Write list of articles & results in file R  Don't write a report
-s S    --rep-fmt S       With -r, write either html or csv           csv
-t T    --rep-title T     With -r, use T as the report title          Use the current date
-V      --version         Print program version info & exit           Do other actions instead
-X      --no-check        Don't validate JATS XML files               Validate JATS XML
-Z      --no-zip          Don't put output into one ZIP archive       ZIP up the output
-@ OUT  --debug OUT       Debugging mode; write trace to OUT          Normal mode                 ⚑

⬥   Enclose the date in quotes if it contains space characters; e.g., "12 Dec 2014".
★   Required argument.
⚑   To write to the console, use the character - (a single dash) as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing

We would be happy to receive your help and participation with enhancing pubarchiver! Please visit the guidelines for contributing for some tips on getting started.

☥ License

Copyright © 2019-2022, Caltech. This software is freely distributed under a BSD 3-clause license. Please see the LICENSE file for more information.

❡ Authors and history

Tom Morrell developed the original algorithm for extracting metadata from DataCite and creating XML files for use with Portico submissions of microPublication.org articles. Starting with that original script, Mike Hucka created the much-expanded Microarchiver program (later renamed to PubArchiver).

♥︎ Acknowledgments

The vector artwork used as a starting point for the logo for this repository was created by Cuby Design for the Noun Project. It is licensed under the Creative Commons Attribution 3.0 Unported license. The vector graphic was modified by Mike Hucka to change the color.

Nick Stiffler from the microPublication.org team helped figure out the requirements for PMC output (introduced in Microarchiver version 1.9), helped guide development of Microarchiver/PubArchiver, and engaged in many discussions about microPublication.org's needs.

PubArchiver makes use of numerous open-source packages, without which it would have been effectively impossible to develop PubArchiver with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

  • Beautiful Soup – an HTML parsing library
  • bun – a set of basic user interface classes and functions
  • commonpy – a collection of commonly-useful Python functions
  • crossrefapi – a python library that implements the Crossref API
  • dateparser – parser for human-readable dates
  • humanize – make numbers more easily readable by humans
  • lxml – an XML parsing library for Python
  • Pillow – a fork of the Python Imaging Library
  • plac – a command line argument parser
  • recordclass – a mutable version of Python named tuples
  • setuptools – library for setup.py
  • sidetrack – simple debug logging/tracing package
  • slack-cli – a command-line interface to Slack written in Bash
  • urllib3 – a powerful HTTP library for Python
  • xmltodict – a module to make working with XML feel like working with JSON

Finally, we are grateful for computing & institutional resources made available by the California Institute of Technology.