open-meta-extraction

This repo will be removed in the future

Open Extraction Works

A set of services to spider and extract metadata (abstracts, titles, authors, pdf links) from given URLs.

Services may be run individually as command line applications, or as service that runs on a schedule (using PM2)

Overview

This project provides a set of services to take the URL of a webpage for a research paper, and extract metadata for that paper. Metadata includes the abstract, title, authors, and a URL for a PDF version of the paper. Spidering is done using an automated Chrome browser. Once a URL is loaded into the browser, a series of extraction rules are applied. The extractor checks for commonly used metadata schemas, including Highwire Press, Dublin Core, OpenGraph, along with non-standard variations of the most common schemas, and a growing list of journal and domain-specific methods of embedding metadata in the head or body of a webpage. Once a suitable schema is identified, the metadata fields are saved to a local file system, then returned to the caller. If changes are made to the spidering and/or extractor, such that re-running the system produces different results, that fact is returned along with the results.

Dev/Production machine setup

Requirements

node >= v16.15
- I install via n (https://github.com/tj/n) or other node version management
rush
- > npm install -g @microsoft/rush
pm2
- > npm install -g pm2
Chrome dependencies (runs headless via puppeteer)
- > sudo apt install libnss3-dev libatk1.0-0 libatk-bridge2.0-0 libcups2 libgbm1 libpangocairo-1.0-0 libgtk-3-0
HTML Tidy
- > sudo apt install tidy
dotenv (optional, helps with development)
MongoDB, tested against v4.4.14

Project install/update/building

Run ‘rush’ commands from project root

Initial installation
- > rush install
Update dependencies after version bump in package.json
- > rush update
Build/Full rebuild
- > rush build
- > rush rebuild
Run tests
- > rush test

Config file(s)

Create at least one of the three config files in project root (or in an ancestor directory):

config-test.json: to run tests
config-dev.json:  to run command line apps locally with a dev database instance
config-prod.json: for pm2 production deployment

Format:

{
    "openreview": {
        "restApi": "https://api.openreview.net",
        "restUser": "openreview-username",
        "restPassword": "openreview-password"
    },
    "mongodb": {
        "connectionUrl": "mongodb://localhost:27017/",
        "dbName": "meta-extract-(dev|prod|test)"
    }
}

Running from command line

All functionality is available from the command line via bin/cli in the root folder.

> bin/cli --env dev
run> node ./packages/services/dist/src/cli
cli <command>

Commands:
  cli extraction-summary  Show A Summary of Spidering/Extraction Progress
  cli run-relay-fetch     Fetch new OpenReview URLs into local DB for
                          spidering/extraction
  cli run-relay-extract   Spider new URLs, extract metadata, and POST results
                          back to OpenReview API
  cli mongo-tools         Create/Delete/Update Mongo Database
  cli pm2-restart         Notification/Restart scheduler
  cli test-scheduler      Testing app for scheduler
  cli scheduler           Notification/Restart scheduler
  cli echo                Echo message to stdout, once or repeating
  cli preflight-check     Run checks and stop pm2 apps if everything looks okay
  cli spider-url          spider the give URL, save results in corpus
  cli extract-url         spider and extract field from given URL
  cli extract-urls        spider and extract field from list of URLs in given
                          file

Options:
  --version  Show version number                                       [boolean]
  --help     Show help                                                 [boolean]

Examples

Spider a URL and save results to local filesystem (delete any previously downloaded files via --clean)
> ./bin/cli spider-url --corpus-root local.corpus.d --url 'https://doi.org/10.3389/fncom.2014.00139' --clean

Spider, then extract metadata from given URL, filesystem only (no mongo db)
> ./bin/cli extract-url --corpus-root local.corpus.d --url 'https://arxiv.org/abs/2204.09028' --log-level debug --clean

Drop/recreate collections in mongo db
> ./bin/cli --env=dev mongo-tools --clean

Fetch a batch of URLs from notes via OpenReview API, store in mongo
> ./bin/cli --env=dev run-relay-fetch --offset 100 --count 100

Spider/extract any unprocessed URLs in mongo, optionally posting results back to OpenReview API
> ./bin/cli --env=dev run-relay-extract --post-results=false

Show extraction stats for dev database
> ./bin/cli --env=dev extraction-summary

Running with PM2

PM2 wrapper script will set *_ENV evironment variables, flush pm2 logs, then run pm2 with correct *.ecosystem.json and tail the logfiles.

> bin/pm2-control
PM2 Control
Usage: bin/pm2-control [--(no-)verbose] [--(no-)dry-run] [--env <ENVMODE>] [--start] [--reset] [--restart] [-h|--help]
        --env: Env Mode; Required. Can be one of: 'dev', 'test' and 'prod' (default: 'unspecified')
        --start: Start pm2 with *-ecosystem.config
        --reset: stop/flush logs/del all
        --restart: reset + start
        -h, --help: Prints help

To restart the system with clean log files:
> bin/pm2-control --env=prod --restart

iesl/open-meta-extraction