Python package to run InterPro production procedures

Primary language: Python. License: MIT.

pyinterprod

A centralised Python implementation of InterPro production procedures.

Getting started

Requirements:

  • Python 3.11+, with packages oracledb, mysqlclient, psycopg3, and mundone
  • GCC with the sqlite3.h header

Installation

python setup.py install

Configuration

The pyinterprod package relies on three configuration files:

  • main.conf: contains database connection strings, paths to files provided by/to UniProtKB, and various workflow parameters.
  • members.conf: contains paths to files used to update InterPro's member databases (e.g. files containing signatures, HMM files, etc.).
  • analyses.conf: contains settings for the InterProScan match calculation (ipr-calc).

All files can be renamed. main.conf is passed as a command line argument, and the paths to members.conf and analyses.conf are defined in main.conf.

main.conf

The expected format for database connection strings is user/password@host:port/service. For Oracle databases, user/password@service may work as well, depending on tnsnames.ora.
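As an illustration of the two accepted forms, here is a minimal sketch of how such a string could be split into its parts. The `ConnectionInfo` type and `parse_connection_string` function are hypothetical names introduced for this example; they are not part of pyinterprod, and the package may parse connection strings differently.

```python
from typing import NamedTuple, Optional


class ConnectionInfo(NamedTuple):
    """Hypothetical container for the parts of a connection string."""
    user: str
    password: str
    host: Optional[str]
    port: Optional[int]
    service: str


def parse_connection_string(value: str) -> ConnectionInfo:
    """Split 'user/password@host:port/service' or 'user/password@service'."""
    # Split on the last '@' so that a password containing '@' stays intact.
    credentials, sep, target = value.rpartition("@")
    if not sep:
        raise ValueError("missing '@' in connection string")

    user, sep, password = credentials.partition("/")
    if not sep:
        raise ValueError("missing '/' between user and password")

    if "/" in target:
        # Long form: host[:port]/service
        address, _, service = target.partition("/")
        host, _, port = address.partition(":")
        return ConnectionInfo(user, password, host,
                              int(port) if port else None, service)

    # Short form: only a service name (resolved through tnsnames.ora)
    return ConnectionInfo(user, password, None, None, target)


print(parse_connection_string("user/password@host:1521/service"))
print(parse_connection_string("user/password@service"))
```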

  • oracle:
    • ipro-interpro: connection string for the interpro user in the InterPro database
    • ipro-iprscan: connection string for the iprscan user in the InterPro database
    • ipro-uniparc: connection string for the uniparc user in the InterPro database
    • iscn-iprscan: connection string for the iprscan user in the InterProScan database
    • iscn-uniparc: connection string for the uniparc user in the InterProScan database
    • unpr-goapro: connection string for the GOA database
    • unpr-swpread: connection string for the Swiss-Prot database
    • unpr-uapro: connection string for the UniParc production database
    • unpr-uaread: connection string for the UniParc database
  • postgresql:
    • pronto: connection string
  • uniprot:
    • version: release number (e.g. 2019_08)
    • date: date for the public release (e.g. 18-Sep-2019)
    • swiss-prot: path to Swiss-Prot flat file
    • trembl: path to TrEMBL flat file
    • unirule: path to file listing InterPro entries and member database signatures used in UniRule
    • xrefs: path to directory where to export InterPro cross-references (generated for UniProt)
  • emails:
    • server: outgoing server (format: host:port)
    • sender: sender's email address (e.g. user running the workflow)
    • aa: email address of the Automatic Annotation team
    • aa_dev: email address of the Automatic Annotation development team
    • interpro: email address of the InterPro team
    • uniprot_db: email address of the UniProt database team
    • uniprot_db: email address of the UniProt production team
    • unirule: email address of the UniRule team (curators from EMBL-EBI, SIB, and PIR)
    • sib: email address of the Swiss-Prot team
  • misc:
    • analyses: path to the analyses.conf config file
    • members: path to the members.conf config file
    • scheduler: scheduler and queue (format: scheduler:queue, e.g. lsf:production)
    • pronto_url: URL of the Pronto curation application
    • data_dir: directory where to store staging files
    • match_calc_dir: directory where to run InterProScan match calculation
    • temporary_dir: directory for temporary files
    • workflows_dir: directory for workflow SQLite files and job input/output files
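The sketch below shows what a fragment of main.conf might look like and how it could be checked before launching a workflow. It assumes the file uses the same INI-style syntax as the members.conf and analyses.conf examples in this document, and reads it with Python's standard configparser module; whether pyinterprod itself does so is an assumption. All values are placeholders, `EXPECTED` covers only a few of the keys listed above, and `find_missing` is a hypothetical helper.

```python
import configparser

# A few of the sections/keys described above (deliberately not exhaustive).
EXPECTED = {
    "oracle": ["ipro-interpro", "ipro-iprscan"],
    "postgresql": ["pronto"],
    "uniprot": ["version", "date"],
    "misc": ["members", "analyses", "scheduler"],
}

# Placeholder content: every value here is made up for the example.
SAMPLE = """
[oracle]
ipro-interpro = user/password@host:1521/service
ipro-iprscan = user/password@host:1521/service

[postgresql]
pronto = user/password@host:5432/service

[uniprot]
version = 2019_08
date = 18-Sep-2019

[misc]
members = /path/to/members.conf
analyses = /path/to/analyses.conf
scheduler = lsf:production
"""


def find_missing(config: configparser.ConfigParser) -> list[str]:
    """Return 'section.key' for every expected key that is absent."""
    missing = []
    for section, keys in EXPECTED.items():
        for key in keys:
            # has_option() returns False if the section itself is missing
            if not config.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


config = configparser.ConfigParser()
config.read_string(SAMPLE)

# The scheduler property uses the scheduler:queue format
scheduler, queue = config["misc"]["scheduler"].split(":")
print("missing keys:", find_missing(config))
print("scheduler:", scheduler, "queue:", queue)
```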

members.conf

Each section corresponds to a member database (or a sequence feature database), e.g.

[profile]
signatures =

Supported properties are:

Name | Description
signatures | Path to the source of database signatures.
hmm | Path to an HMM file, used for databases that employ HMMER3-based models. Required when running ipr-hmm.
fasta | Path to sequences used by models, in the FASTA format.
members | Path to file containing the clan-signature mapping.
go-terms | Path to file or directory of GO annotations. PANTHER and NCBIfam only.
triage | Path to file containing the signatures/models to include in InterPro. NCBIfam only.
summary | Path to file of summary information. CDD only.
seed | Path to file of SEED alignments. Pfam only.
full | Path to file of full alignments. Pfam only.
clans | Path to file of clan information. Pfam only.
mapping | Path to file of model-signature mapping. CATH-Gene3D only.
classes | Path to file of information about classes. ELM only.
instances | Path to file of information about instances. ELM only.

analyses.conf

The DEFAULT section defines the default values for the following properties:

  • job_cpu: number of processes to request when submitting a job.
  • job_mem: the maximum amount of memory a job should be allowed to use (in MB).
  • job_size: the number of sequences to process in each job.
  • job_timeout: the number of hours a job is allowed to run for before being killed. Any value lower than 1 disables the timeout.

The default values can be overridden per analysis. For instance, adding the following sections below the DEFAULT section ensures that MobiDB-Lite jobs time out after 48 hours and that PRINTS jobs are allocated 16 GB of memory:

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
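The fallback from a named section to DEFAULT can be demonstrated with Python's standard configparser module, which implements exactly this behaviour for a section called DEFAULT. The sketch below is illustrative only: the values in the DEFAULT section are invented, `job_settings` is a hypothetical helper, and it is an assumption that pyinterprod reads the file this way.

```python
import configparser

# The DEFAULT values below are made up; only the two overrides
# come from the example above.
SAMPLE = """
[DEFAULT]
job_cpu = 8
job_mem = 8192
job_size = 100000
job_timeout = 0

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)


def job_settings(name: str) -> dict:
    """Resolve the settings of one analysis, falling back to DEFAULT."""
    section = config[name]
    timeout = section.getint("job_timeout")
    return {
        "cpu": section.getint("job_cpu"),
        "mem_mb": section.getint("job_mem"),
        "size": section.getint("job_size"),
        # Any value lower than 1 disables the timeout
        "timeout_hours": timeout if timeout >= 1 else None,
    }


for name in config.sections():
    print(name, job_settings(name))
```

Here, `mobidb-lite` inherits `job_mem` from DEFAULT and overrides only `job_timeout`, while `prints` overrides `job_mem` and keeps the (disabled) default timeout.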

Usage

Protein update

Update proteins and matches to the latest private UniProt release.

$ ipr-uniprot [OPTIONS] main.conf

The optional arguments are:

  • -t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
  • --dry-run: do not run tasks, only list those about to be run

Tasks

Name | Description | Dependencies
update-uniparc | Import UniParc cross-references | -
taxonomy | Import the latest taxonomy data from UniProt | -
import-ipm-matches | Import protein matches from ISPRO | update-uniparc
update-ipm-matches | Update partitioned table with matches | import-ipm-matches
import-ipm-sites | Import residue annotations from ISPRO | -
update-ipm-sites | Update partitioned table with site matches | import-ipm-sites
update-proteins | Import the new Swiss-Prot and TrEMBL proteins, and compare with the current ones | -
delete-proteins | Delete obsolete proteins in all production tables | update-proteins
check-proteins | Track UniParc sequences (UPI) associated with UniProt entries that need to be imported (e.g. new or updated sequence) | delete-proteins, update-uniparc
update-matches | Update protein matches for new or updated sequences, run various checks, and track changes in protein counts for InterPro entries | update-ipm-matches, check-proteins
update-fmatches | Update protein matches for sequence features (e.g. MobiDB-lite, COILS, etc.) | update-matches
export-sib | Export Oracle tables required by the Swiss-Prot team | update-matches
report-changes | Report recent integration changes to the UniRule team | update-matches
aa-iprscan | Build the AA_IPRSCAN table, required by the Automatic Annotation team | update-matches
xref-condensed | Build the XREF_CONDENSED table for the Automatic Annotation team (contains representations of protein matches for InterPro entries) | update-matches
xref-summary | Build the XREF_SUMMARY table for the Automatic Annotation team (contains protein matches for integrated member database signatures) | report-changes
export-xrefs | Export text files containing protein matches for the UniProt database team | xref-summary
notify-interpro | Notify the InterPro team that all tables required by the Automatic Annotation team are ready, so we can take a snapshot of our database | update-fmatches, aa-iprscan, xref-condensed, xref-summary
swissprot-de | Export Swiss-Prot descriptions associated with member database signatures in the public release of UniProt (i.e. the release we are updating *from*) | -
unirule | Update the list of signatures used by UniRule, so InterPro curators are warned if they attempt to unintegrate one of these signatures. | -
update-varsplic | Update splice variant matches | update-ipm-matches
update-sites | Update residue annotations | update-ipm-sites, update-matches
Pronto | Update the Pronto PostgreSQL table | taxonomy, update-fmatches, swissprot-de, unirule
send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks
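To make the Dependencies column easier to reason about, the sketch below derives a valid execution order for one task from a subset of the table, using the standard graphlib module. It only illustrates how to read the table: `DEPENDENCIES`, `required_tasks`, and `execution_order` are hypothetical names, and this document does not state how pyinterprod schedules tasks or whether selecting a task with -t also runs its dependencies.

```python
from graphlib import TopologicalSorter

# Subset of the table above: task -> tasks it depends on
DEPENDENCIES = {
    "update-uniparc": [],
    "import-ipm-matches": ["update-uniparc"],
    "update-ipm-matches": ["import-ipm-matches"],
    "update-proteins": [],
    "delete-proteins": ["update-proteins"],
    "check-proteins": ["delete-proteins", "update-uniparc"],
    "update-matches": ["update-ipm-matches", "check-proteins"],
    "update-fmatches": ["update-matches"],
}


def required_tasks(target: str) -> set[str]:
    """Return the target task and all its direct/indirect dependencies."""
    seen = set()
    stack = [target]
    while stack:
        task = stack.pop()
        if task not in seen:
            seen.add(task)
            stack.extend(DEPENDENCIES[task])
    return seen


def execution_order(target: str) -> list[str]:
    """Order the required tasks so that dependencies always come first."""
    needed = required_tasks(target)
    graph = {task: DEPENDENCIES[task] for task in needed}
    return list(TopologicalSorter(graph).static_order())


order = execution_order("update-matches")
print(" -> ".join(order))
```

Several orders can satisfy the same dependencies, so the exact sequence printed may vary; the only guarantee is that each task appears after the tasks it depends on.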

Member database update

Update models and protein matches for one or more member databases.

Before running the update, the following command must be run once for each member database to update. -n is the name of the database (case-insensitive), -d is the release date of the member database, and -v is the release version.

$ ipr-pre-memdb main.conf -n DATABASE -d YYYY-MM-DD -v VERSION

Then, the actual update can be run:

$ ipr-memdb [OPTIONS] main.conf database [database ...]

The optional arguments are:

  • -t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
  • --dry-run: do not run tasks, only list those about to be run

Tasks

Name | Description | Dependencies
import-ipm-matches | Import protein matches from ISPRO | update-uniparc
update-ipm-matches | Update partitioned table with matches | import-ipm-matches
load-signatures | Import member database signatures for the version to update to | -
track-changes | Compare signatures between versions (e.g. name, description, matched proteins) | load-signatures
delete-obsoletes | Remove signatures that are not in the latest version of the member database(s) | track-changes
update-signatures | Update metadata for existing signatures, and add new signatures | delete-obsoletes
update-matches | Update and check matches in production tables | update-ipm-matches, update-signatures
update-varsplic | Update splice variant matches | update-ipm-matches, update-signatures
persist-pfam-a | Parse Pfam-A files and store relevant information (only when updating Pfam) | update-ipm-matches, update-signatures
persist-pfam-c | Parse Pfam-C to store clan information (only when updating Pfam) | update-ipm-matches, update-signatures
update-features | Update sequence features for non-member databases (e.g. MobiDB-lite, COILS, etc.) | update-ipm-matches
update-fmatches | Update matches for sequence features | update-features
import-ipm-sites | Import residue annotations from ISPRO | -
update-ipm-sites | Update partitioned table with site matches | import-ipm-sites
update-sites | Update residue annotations (if updating a member database with residue annotations) | update-ipm-sites, update-matches
Pronto | Update the Pronto PostgreSQL tables | update-matches
send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks

Pronto

$ ipr-pronto [OPTIONS] main.conf

The optional arguments are:

  • -t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
  • --dry-run: do not run tasks, only list those about to be run

Tasks

Name | Description | Dependencies
go-terms | Import publications associated to protein annotations | -
go-constraints | Import GO taxonomic constraints | -
proteins-similarities | Import UniProt general annotations (comments) on sequence similarities | -
proteins-names | Import UniProt sequence names | -
databases | Import database information (e.g. version, release date) | -
proteins | Import general information on proteins (e.g. accession, length, species) | -
init-matches | Create the match table (empty) | -
export-matches | Export protein matches for member database signatures | init-matches
insert-matches | Insert protein matches for member database signatures | export-matches
insert-fmatches | Insert protein matches for sequence features (AntiFam, etc.) | init-matches
index-matches | Index and cluster the match table | insert-matches, insert-fmatches
insert-signature2proteins | Associate member database signatures with UniProt proteins, UniProt descriptions, taxonomic origins, and GO terms | export-matches, proteins-names
index-signature2proteins | Index the signature2proteins table | insert-signature2proteins
signatures | Import and compare member database signatures | databases, export-matches
taxonomy | Import UniProt taxonomy | -
structures | Import structural matches | -

InterProScan match calculation

$ ipr-calc main.conf [COMMAND] [OPTIONS]

The available commands (and their optional arguments) are:

  • import: import sequences from the UniParc Oracle database
    • --top-up: import new sequences only
  • clean: delete obsolete data
    • -a, --analyses: IDs of analyses to clean (default: all)
  • search: scan sequences using InterProScan
    • --dry-run: show the number of jobs to run and exit
    • -l, --list: list active analyses and exit
    • -a, --analyses: IDs of analyses to run (default: all)
    • -t, --threads: number of monitoring threads (default: 8)
    • --concurrent-jobs: maximum number of concurrently running InterProScan jobs (default: 1000)
    • --max-jobs: maximum number of jobs to run per analysis before exiting (default: disabled)
    • --max-retries: number of times a failed job is resubmitted (default: disabled)
    • --keep none|all|failed: keep input/output files (default: none)

Examples

Import new UniParc sequences:

ipr-calc main.conf import --top-up

Process jobs for analysis 42 only, allow each job to run up to three times (i.e. be resubmitted twice), and keep all input/output files, regardless of whether the job succeeds or fails:

ipr-calc main.conf search -a 42 --max-retries 2 --keep all

Run 10 jobs per analysis, and keep failed jobs to investigate:

ipr-calc main.conf search --max-jobs 10 --keep failed
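The relationship between --max-retries and the total number of runs in the analysis 42 example (two retries, hence up to three runs) can be sketched as follows. `run_with_retries` is a hypothetical function written for this illustration; it is not how ipr-calc is implemented, only a model of the counting.

```python
from typing import Callable


def run_with_retries(job: Callable[[], bool], max_retries: int) -> tuple[bool, int]:
    """Run job() until it succeeds or its retries are exhausted.

    Returns (success, number of attempts made).
    """
    attempts = 0
    # One initial run, plus at most max_retries resubmissions
    while attempts <= max_retries:
        attempts += 1
        if job():
            return True, attempts
    return False, attempts


def always_fails() -> bool:
    """Stand-in for a job that never succeeds."""
    return False


success, attempts = run_with_retries(always_fails, max_retries=2)
print(success, attempts)
```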

Clans update

Update clans and run profile-profile alignments.

$ ipr-clans [OPTIONS] main.conf database [database ...]

The optional arguments are:

  • -t, --threads: number of alignment workers
  • -T, --tempdir: directory to use for temporary files

HMMs update

Load HMMs into the database.

$ ipr-hmms main.conf database [database ...]