arXiv external links

Clearinghouse for relations between arXiv e-prints and external resources

Background

A wide range of requirements and feature requests that we have received from stakeholders and end users involve attaching metadata about relations between arXiv e-prints and external resources. This includes things like links to datasets, code, and other online content, and better support for information about the published version of record.

Including this kind of relational metadata in the core arXiv metadata record is a poor fit given the way that e-prints are versioned, the requirement that secondary metadata be maintainable outside of the submission process, and the requirement that support for secondary metadata be as evolvable and extensible as possible.

Additionally, we need to bring forward into NG the automated routines that we use to harvest relational metadata (e.g. DOIs, journal citations) from other publishing platforms. A shortcoming of the classic system is that the provenance of these kinds of metadata is not tracked, which makes it hard for our partners to interpret and use the metadata downstream.

Goals

Store information about relationships between e-prints and external resources:
- Published versions, e.g. via DOIs
- Datasets
- Code repositories
- Multimedia
- Methods/protocols
- Related works
- Blogs and other websites
- Etc
Track provenance/history of this information.
- When it was added,
- How/by whom
Provide APIs for retrieving this information and for adding new relations.
Provide an intuitive user interface for authors to curate these relations for their e-prints.

Requirements

Author-owners can add, edit, and deactivate relationships via an HTML UI. They can also view aggregated relations and the detailed provenance log.
Authorized API clients can add, edit, and deactivate relationships via a JSON API. They can also read aggregated relations and the detailed provenance log.
Anonymous users and clients can view/read aggregated relations and the provenance of active relations.
Relation data are immutable.
- Add means create a new assertion about a relation.
- Edit means create a new assertion that supersedes a previous assertion.
- Deactivate means create a new assertion that a previous relation is incorrect and should be suppressed.
Relation data model includes
- Type of relation
- E-print id and version
- Type of resource
- Canonical identifier for resource (doi, uri, etc)
- Freeform description of relation
- Datetime added
- Client + user who created
- Identifier of relation superseded or suppressed
Emits event on Kinesis stream when data is added.
For each resource type, mechanism to verify that resource exists.
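
The immutability requirement and the data model above can be sketched as an append-only log of frozen records. This is a minimal illustration, not the project's actual domain code: the names `Action`, `Relation`, `previous`, and `active_relations` are assumptions made for the example, and the field names only mirror the list in the data model.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Action(Enum):
    """Hypothetical kinds of assertion: add, edit, deactivate."""

    ADD = "add"
    EDIT = "edit"
    DEACTIVATE = "deactivate"


@dataclass(frozen=True)
class Relation:
    """One immutable assertion about an e-print and an external resource."""

    identifier: str
    action: Action
    relation_type: str
    eprint_id: str
    eprint_version: int
    resource_type: str       # e.g. "doi", "uri"
    resource_id: str         # canonical identifier for the resource
    description: str
    added_at: datetime
    creator: str             # client and/or user who created the assertion
    previous: Optional[str] = None  # assertion superseded or suppressed


def active_relations(log: List[Relation]) -> List[Relation]:
    """Return the assertions that are neither replaced nor deactivations."""
    replaced = {r.previous for r in log if r.previous is not None}
    return [
        r for r in log
        if r.action is not Action.DEACTIVATE and r.identifier not in replaced
    ]
```

Because nothing is ever updated in place, the full log doubles as the provenance history, and the aggregated view is derived from it on read.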

Constraints

Flask app that follows the design approach outlined in https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html . Can be deployed as a Docker image, e.g. with a uWSGI application server.
Separate blueprints for API, user interface
Use arXiv base for base templates, error handling, etc
Use arXiv auth for authn/z
API documented with OpenAPI 3 and JSON schema

Code overview

This project follows the general design approach described at https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html .

The application source lives in relations/.

The application factory module relations/factory.py defines the construction of two Flask apps: (1) an API application that provides the REST API, and (2) a UI application that provides views for human users.

Note that these apps use the following general tooling from the arXiv project:

arxiv.base.Base, which adds some useful things like exception handlers, an arxiv URL converter, etc.
The arxiv.users library, which adds tooling for authn/z.

In general, it's a good idea to get comfortable with the arxiv namespaced packages, as there are several useful tools there.

HTTP routing is implemented in the routes module. The API and UI each have their own blueprint. Routing functions don't implement much logic; they are there to provide an interface to the controller functions.
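
As a rough sketch of the two-app, two-blueprint layout, the following uses only plain Flask. The blueprint names, the `/status` routes, and the factory function names are hypothetical; the real factory would also attach `arxiv.base.Base` and the `arxiv.users` auth tooling, which are left out here so the example runs on its own.

```python
from flask import Blueprint, Flask, jsonify

# Hypothetical blueprints; the real ones live in the routes module.
api = Blueprint("api", __name__, url_prefix="/api")
ui = Blueprint("ui", __name__)


@api.route("/status")
def api_status():
    """Thin routing function; real logic would be delegated to a controller."""
    return jsonify({"status": "ok"})


@ui.route("/status")
def ui_status():
    """Thin routing function for the human-facing app."""
    return "ok"


def create_api_app() -> Flask:
    """Build the app that serves only the API blueprint."""
    app = Flask(__name__)
    app.register_blueprint(api)
    return app


def create_ui_app() -> Flask:
    """Build the app that serves only the UI blueprint."""
    app = Flask(__name__)
    app.register_blueprint(ui)
    return app
```

Keeping the blueprints separate means each app exposes only its own routes, so the API and UI can be deployed and scaled independently.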

Controller functions do the work of handling requests. They are defined in relations/controllers.py. Controllers orchestrate the real work; they use domain objects and services (below) to carry out work to handle requests.
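
To illustrate the split between routes and controllers, here is a framework-free controller sketch. It assumes a convention in which a controller returns a `(data, status, headers)` tuple that the route then renders; the function name `get_relations` and the in-memory `_RELATIONS` store are stand-ins invented for this example, not the project's actual code.

```python
from http import HTTPStatus
from typing import Any, Dict, List, Tuple

# Assumed controller return shape: response data, status code, extra headers.
Response = Tuple[Dict[str, Any], int, Dict[str, str]]

# Hypothetical stand-in for a storage service.
_RELATIONS: Dict[str, List[Dict[str, str]]] = {
    "1901.00001": [{"resource_type": "doi", "resource_id": "10.1000/example"}],
}


def get_relations(arxiv_id: str) -> Response:
    """Look up the relations for an e-print and describe the HTTP outcome."""
    relations = _RELATIONS.get(arxiv_id)
    if relations is None:
        reason = f"no relations for {arxiv_id}"
        return {"reason": reason}, HTTPStatus.NOT_FOUND, {}
    return {"arxiv_id": arxiv_id, "relations": relations}, HTTPStatus.OK, {}
```

A routing function would call `get_relations` and pass the data to `jsonify` (API) or a template (UI), which keeps the controller testable without a running Flask app.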

The service domain is defined in relations/domain.py. The domain consists of classes or other structs that define the main concepts of the application, and the core domain logic/rules. See https://arxiv.github.io/arxiv-arxitecture/crosscutting/services.html#data-domain for details.

Service modules can be found in relations/services/. This is where (for example) a Kinesis notification producer would be implemented.
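
A Kinesis notification producer along these lines might look like the sketch below. The class names and the choice of the e-print ID as partition key are assumptions for illustration. The client is injected so that a boto3 Kinesis client (whose `put_record` takes `StreamName`, `Data`, and `PartitionKey`) can be used in production, while the small recording client here allows the module to be exercised locally.

```python
import json
from typing import Any, Dict, List


class RecordingClient:
    """Local stand-in that mimics the put_record call of a Kinesis client."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def put_record(self, StreamName: str, Data: bytes, PartitionKey: str) -> None:
        self.records.append(
            {"StreamName": StreamName, "Data": Data, "PartitionKey": PartitionKey}
        )


class RelationNotifier:
    """Hypothetical service that emits an event when a relation is added."""

    def __init__(self, client: Any, stream_name: str) -> None:
        self._client = client
        self._stream_name = stream_name

    def notify(self, relation: Dict[str, Any]) -> None:
        """Serialize the relation as JSON and put it on the stream."""
        self._client.put_record(
            StreamName=self._stream_name,
            Data=json.dumps(relation, sort_keys=True).encode("utf-8"),
            # Assumption: partition by e-print so its events stay ordered.
            PartitionKey=str(relation["arxiv_id"]),
        )


client = RecordingClient()
notifier = RelationNotifier(client, "relations-added")
notifier.notify({"arxiv_id": "1901.00001", "resource_id": "10.1000/example"})
```

Swapping `RecordingClient()` for `boto3.client("kinesis")` would send the same payload to a real stream named by `stream_name`.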

Quick-start

We use Pipenv for dependency management.

pipenv install --dev

You can run either the API or the UI using the Flask development server.

FLASK_APP=ui.py FLASK_DEBUG=1 pipenv run flask run

Dockerfiles are also provided in the root of this repository. These use uWSGI and the corresponding wsgi_[xxx].py entrypoints.

Contributing

Please see the arXiv contributor guidelines for tips on getting started.

Code of Conduct

All contributors are expected to adhere to the arXiv Code of Conduct.
