Primary language: Python. License: ISC.

OpenCitations Meta Software

OpenCitations Meta contains bibliographic metadata associated with the documents involved in the citations stored in the OpenCitations infrastructure. The OpenCitations Meta Software performs two main actions: a data curation of the provided CSV files and the generation of new RDF files compliant with the OpenCitations Data Model. An example of a raw CSV input file can be found in example.csv.

Meta

The Meta process is launched through the meta_process.py file from the command line:

    python -m oc_meta.run.meta_process -c <PATH>

Where:

  • -c --config : path to the configuration file.

The configuration file is a YAML file with the following keys (an example can be found in config/meta_config.yaml).

Setting Mandatory Description
triplestore_url Endpoint URL to load the output RDF
input_csv_dir Directory where raw CSV files are stored
base_output_dir The path to the base directory to save all output files
resp_agent A URI string representing the provenance agent which is considered responsible for the RDF graph manipulation
base_iri The base URI of entities on Meta. This setting can be safely left as is
context_path URL where the namespaces and prefixes used in the OpenCitations Data Model are defined. This setting can be safely left as is.
dir_split_number Number of files per folder. dir_split_number's value must be a multiple of items_per_file's value. This parameter is useful only if you choose to return the output in json-ld
items_per_file Number of items per file. This parameter is useful only if you choose to return the output in json-ld
default_dir This value is used as the default prefix if no prefix is specified. It is a deprecated parameter, valid only for backward compatibility and can safely be ignored
supplier_prefix A prefix for the sequential number in entities’ URIs. This setting can be safely left as is
rdf_output_in_chunks If True, save all the graphset and provset in one file, and save all the graphset on the triplestore. If False, the graphs are saved according to the usual OpenCitations strategy (the "complex" hierarchy of folders and subfolders for each type of entity)
zip_output_rdf If True, the folder specified in output_rdf_dir must contain zipped JSON files, and the output will be zipped
source Data source URL. This setting can be safely left as is
use_doi_api_service If True, use the DOI API service to check if DOIs are valid
workers_number Number of cores to devote to the Meta process
blazegraph_full_text_search True if Blazegraph was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://github.com/blazegraph/database/wiki/Rebuild_Text_Index_Procedure
fuseki_full_text_search True if Fuseki was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://jena.apache.org/documentation/query/text-query.html
virtuoso_full_text_search True if Virtuoso was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/
graphdb_connector_name The name of the Lucene connector if GraphDB was used as a provenance triplestore and a textual index was built to speed up queries. For more information, see https://graphdb.ontotext.com/documentation/free/general-full-text-search-with-connectors.html
cache_endpoint Specifies the provenance triplestore URL to use as a cache to make queries on provenance faster
cache_update_endpoint If your cache provenance triplestore uses different endpoints for reading and writing (e.g. GraphDB), specify the endpoint for writing in this parameter
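
The constraint between dir_split_number and items_per_file can be checked before launching Meta. The following is a minimal sketch, assuming the configuration has already been loaded into a dictionary. The function name check_split_settings is hypothetical and not part of oc_meta, and the numeric values are illustrative only.

```python
def check_split_settings(config: dict) -> None:
    """Raise ValueError if dir_split_number is not a multiple of items_per_file.

    Hypothetical helper, not part of oc_meta: it only mirrors the rule
    stated in the settings table above.
    """
    dir_split_number = config["dir_split_number"]
    items_per_file = config["items_per_file"]
    if items_per_file <= 0:
        raise ValueError("items_per_file must be a positive integer")
    if dir_split_number % items_per_file != 0:
        raise ValueError(
            f"dir_split_number ({dir_split_number}) must be a multiple "
            f"of items_per_file ({items_per_file})"
        )


# Illustrative values: 10000 is a multiple of 1000, so no error is raised
check_split_settings({"dir_split_number": 10000, "items_per_file": 1000})
```

Running such a check up front surfaces a misconfiguration before any output file is written.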

Plugins

Get a DOI-ORCID index

orcid_process.py generates an index between DOIs and the authors' ORCIDs using the ORCID Summaries Dump (e.g. ORCID_2019_summaries). The output is a folder containing CSV files with two columns, 'id' and 'value', where 'id' is a DOI or None, and 'value' is an ORCID. This process can be run via the following command:

    python -m oc_meta.run.orcid_process -s <PATH> -out <PATH> -t <INTEGER> -lm -v

Where:

  • -s --summaries: ORCID summaries dump path; subfolders are searched too.
  • -out --output: a directory where the output CSV files will be stored, that is, the ORCID-DOI index.
  • -t --threshold: threshold after which to update the output, not mandatory. A new file will be generated each time.
  • -lm --low-memory: specify this argument if the available RAM is insufficient to accomplish the task. Warning: the processing time will increase.
  • -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
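
Once generated, the index can be read back with the standard csv module. The sketch below is one possible way to do so, not the way oc_meta itself reads the index. The function name load_orcid_index is hypothetical, and the code assumes that each CSV file has a header row with the 'id' and 'value' columns and that a missing DOI appears as the literal string None or as an empty cell.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_orcid_index(index_dir: str) -> dict:
    """Map each DOI to the set of ORCIDs listed for it in the index folder.

    Hypothetical helper, not part of oc_meta. Assumes every CSV file in
    index_dir has a header row with the columns 'id' and 'value'.
    """
    index = defaultdict(set)
    for csv_path in sorted(Path(index_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as csv_file:
            for row in csv.DictReader(csv_file):
                doi = row["id"]
                # Assumption: rows without a DOI hold 'None' or an empty string
                if doi in ("", "None"):
                    continue
                index[doi].add(row["value"])
    return dict(index)
```

A set is used for the values because the same DOI can be associated with more than one ORCID.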

Get a Crossref member-name-prefix index

crossref_publishers_extractor.py generates an index between Crossref members' ids, names and DOI prefixes. The output is a CSV file with three columns, 'id', 'name', and 'prefix'. This process can be run via the following command:

    python -m oc_meta.run.crossref_publishers_extractor -o <PATH>

Where:

  • -o --output: path of the output CSV file in which the index is stored.
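
As an illustration of how the resulting file could be used, the sketch below looks up a publisher from a DOI through its prefix, that is, the part of the DOI before the first slash. Both function names are hypothetical and not part of oc_meta. The code assumes the CSV has a header row with the 'id', 'name' and 'prefix' columns and one prefix per row.

```python
import csv


def load_publishers_by_prefix(csv_path: str) -> dict:
    """Map each DOI prefix to the Crossref member id and name.

    Hypothetical helper, not part of oc_meta. Assumes a header row with the
    columns 'id', 'name' and 'prefix', and a single prefix per row.
    """
    publishers = {}
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            publishers[row["prefix"]] = {"id": row["id"], "name": row["name"]}
    return publishers


def publisher_of(doi: str, publishers: dict):
    """Return the publisher entry for a DOI, or None if its prefix is unknown."""
    # The prefix is everything before the first slash of the DOI
    prefix = doi.split("/", 1)[0]
    return publishers.get(prefix)
```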

Get raw CSV files from Crossref

This process generates raw CSV files using JSON files from the Crossref data dump (e.g. Crossref Works Dump - August 2019), enriching them with ORCID IDs from the ORCID-DOI Index generated by orcid_process.py. The process is launched through the crossref_process.py file from the command line:

    python -m oc_meta.run.crossref_process -cf <PATH> -o <PATH> -out <PATH> -w <PATH> -v

Where:

  • -cf --crossref: Crossref JSON files directory (input files).
  • -p --publishers: CSV file path containing information about publishers (id, name, prefix). This file can be generated via crossref_publishers_extractor.py.
  • -o --orcid: ORCID-DOI index filepath, generated by orcid_process.py.
  • -out --output: directory where CSVs will be stored.
  • -w --wanted: path of a CSV file listing which DOIs to process, not mandatory.
  • -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
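
When the process is driven from another Python script, the argument list can be assembled programmatically. The following is a minimal sketch using only the flags listed above. The function name build_crossref_command is hypothetical and not part of oc_meta, and treating -p as optional is an assumption inferred from the example command, which omits it.

```python
import sys


def build_crossref_command(crossref_dir, orcid_index, output_dir,
                           publishers=None, wanted=None, verbose=False):
    """Return the argument list for launching oc_meta.run.crossref_process.

    Hypothetical helper, not part of oc_meta. Only flags documented in this
    README are used; -p is assumed to be optional.
    """
    command = [
        sys.executable, "-m", "oc_meta.run.crossref_process",
        "-cf", crossref_dir,
        "-o", orcid_index,
        "-out", output_dir,
    ]
    if publishers is not None:
        command += ["-p", publishers]
    if wanted is not None:
        command += ["-w", wanted]
    if verbose:
        command.append("-v")
    return command
```

The returned list can be passed to subprocess.run(command, check=True), which avoids shell quoting issues with paths containing spaces.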

Since there are many parameters, you can also specify them via a YAML configuration file. In this case, the process is launched via the command:

    python -m oc_meta.run.crossref_process -c <PATH>

Where:

  • -c --config : path to the configuration file.

The configuration file is a YAML file with the following keys (an example can be found in config/crossref_config.yaml).

Setting Mandatory Description
crossref_json_dir Crossref JSON files directory (input files)
output Directory where output CSVs will be stored
orcid_doi_filepath ORCID-DOI index directory. It can be generated via oc_meta.run.orcid_process
wanted_doi_filepath Path of a CSV file listing which DOIs to process. This file can be generated via oc_meta.run.coci_process, if COCI's DOIs are needed
verbose Show a loading bar, elapsed time and estimated time. This setting can be safely left as is.

Get IDs from citations

You can get a CSV file containing all the IDs from citation data organized in the CSV format accepted by OpenCitations. This CSV file can be passed as input to the -w/--wanted argument of crossref_process.py. You can obtain it with the get_ids_from_citations.py script, as follows:

    python -m oc_meta.run.get_ids_from_citations -c <PATH> -out <PATH> -t <INTEGER> -v

Where:

  • -c --citations: the directory containing the citations files, either in CSV or ZIP format
  • -out --output: directory of the output CSV files
  • -t --threshold: number of processed files after which the output is saved.
  • -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.

Generate CSVs from triplestore

This plugin generates CSVs from the Meta triplestore. You can run the csv_generator.py script in the following way:

    python -m oc_meta.run.csv_generator -c <PATH>

Where:

  • -c --config : path to the configuration file.

The configuration file is a YAML file with the following keys (an example can be found in config/csv_generator_config.yaml).

Setting Mandatory Description
triplestore_url URL of the endpoint where the data are located
output_csv_dir Directory where the output CSV files will be stored
info_dir The folder where the counters of the various types of entities are stored.
base_iri The base IRI of entities on the triplestore. This setting can be safely left as is
supplier_prefix A prefix for the sequential number in entities’ URIs. This setting can be safely left as is
dir_split_number Number of files per folder. dir_split_number's value must be a multiple of items_per_file's value. This setting can be safely left as is
items_per_file Number of items per file. This setting can be safely left as is
verbose Show a loading bar, elapsed time and estimated time. This setting can be safely left as is

Prepare the multiprocess

Before running Meta in multiprocess, it is necessary to prepare the input files. In particular, the CSV files must be divided by publisher, while venues and authors having an identifier must be loaded on the triplestore, in order not to generate duplicates during the multiprocess. These operations can be done by simply running the following script:

    python -m oc_meta.run.prepare_multiprocess -c <PATH>

Where:

  • -c --config : Path to the same configuration file you want to use for Meta.

Afterwards, launch Meta in multiprocess by specifying the same configuration file. All the required modifications are done automatically.
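
The two steps can also be chained from a Python script. The sketch below is one possible way to do it, assuming both modules are invoked exactly as in the commands shown in this README. The function names multiprocess_commands and run_meta_multiprocess are hypothetical and not part of oc_meta.

```python
import subprocess
import sys


def multiprocess_commands(config_path: str) -> list:
    """Return the preparation and Meta commands, in the order they must run.

    Hypothetical helper, not part of oc_meta. Both steps receive the same
    configuration file, as required above.
    """
    return [
        [sys.executable, "-m", "oc_meta.run.prepare_multiprocess", "-c", config_path],
        [sys.executable, "-m", "oc_meta.run.meta_process", "-c", config_path],
    ]


def run_meta_multiprocess(config_path: str) -> None:
    """Run the preparation step, then Meta, stopping at the first failure."""
    for command in multiprocess_commands(config_path):
        # check=True raises CalledProcessError if a step exits with an error,
        # so Meta is not launched when the preparation fails
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the commands before anything is run.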