pyinterprod

A centralised Python implementation of InterPro production procedures.

Getting started

Requirements:

Python 3.11+, with packages oracledb, mysqlclient, psycopg3, and mundone
GCC with the sqlite3.h header

Installation

python setup.py install

Configuration

The pyinterprod package relies on three configuration files:

main.conf: contains database connection strings, paths to files provided by/to UniProtKB, and various workflow parameters.
members.conf: contains paths to files used to update InterPro's member databases (e.g. files containing signatures, HMM files, etc.).
analyses.conf: contains settings for the InterProScan match calculation (ipr-calc).

All files can be renamed. main.conf is passed as a command line argument, and the paths to members.conf and analyses.conf are defined in main.conf.

main.conf

The expected format for database connection strings is user/password@host:port/service. For Oracle databases, user/password@service may work as well, depending on tnsnames.ora.
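
The two formats described above can be checked before a workflow starts. The following is a minimal sketch, not part of pyinterprod: the `ConnectionInfo` type and the `parse_connection_string` function are hypothetical names, and the sample values in the usage lines are made up.

```python
import re
from typing import NamedTuple, Optional


class ConnectionInfo(NamedTuple):
    """Hypothetical container for the parts of a connection string."""
    user: str
    password: str
    host: Optional[str]
    port: Optional[int]
    service: str


# Accepts user/password@host:port/service, or user/password@service
_PATTERN = re.compile(
    r"^(?P<user>[^/@]+)/(?P<password>[^@]+)@"
    r"(?:(?P<host>[^:/]+):(?P<port>\d+)/)?(?P<service>[^/:]+)$"
)


def parse_connection_string(value: str) -> ConnectionInfo:
    """Split a connection string into its components, or raise ValueError."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"invalid connection string: {value!r}")
    port = match.group("port")
    return ConnectionInfo(
        user=match.group("user"),
        password=match.group("password"),
        host=match.group("host"),  # None for the short Oracle form
        port=int(port) if port else None,
        service=match.group("service"),
    )


# Made-up values, for illustration only
full = parse_connection_string("interpro/secret@dbhost:1521/IPPRO")
short = parse_connection_string("interpro/secret@IPPRO")
```

Whether the short form actually connects still depends on tnsnames.ora; the sketch only checks the syntax.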

oracle:
- ipro-interpro: connection string for the interpro user in the InterPro database
- ipro-iprscan: connection string for the iprscan user in the InterPro database
- ipro-uniparc: connection string for the uniparc user in the InterPro database
- iscn-iprscan: connection string for the iprscan user in the InterProScan database
- iscn-uniparc: connection string for the uniparc user in the InterProScan database
- unpr-goapro: connection string for the GOA database
- unpr-swpread: connection string for the Swiss-Prot database
- unpr-uapro: connection string for the UniParc production database
- unpr-uaread: connection string for the UniParc database
postgresql:
- pronto: connection string
uniprot:
- version: release number (e.g. 2019_08)
- date: date for the public release (e.g. 18-Sep-2019)
- swiss-prot: path to Swiss-Prot flat file
- trembl: path to TrEMBL flat file
- unirule: path to file listing InterPro entries and member database signatures used in UniRule
- xrefs: path to directory where to export InterPro cross-references (generated for UniProt)
emails:
- server: outgoing server (format: host:port)
- sender: sender's email address (e.g. user running the workflow)
- aa: email address of the Automatic Annotation team
- aa_dev: email address of the Automatic Annotation development team
- interpro: email address of the InterPro team
- uniprot_db: email address of the UniProt database team
- uniprot_prod: email address of the UniProt production team
- unirule: email address of the UniRule team (curators from EMBL-EBI, SIB, and PIR)
- sib: email address of the Swiss-Prot team
misc:
- analyses: path to the analyses.conf config file
- members: path to the members.conf config file
- scheduler: scheduler and queue (format: scheduler:queue, e.g. lsf:production)
- pronto_url: URL of the Pronto curation application
- data_dir: directory where to store staging files
- match_calc_dir: directory where to run InterProScan match calculation
- temporary_dir: directory for temporary files
- workflows_dir: directory for workflows SQLite files, and jobs' input/output files
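
As an illustration of how the misc properties listed above fit together, here is a minimal sketch that reads them with Python's standard configparser module. It assumes main.conf uses the INI syntax, with one section per group listed above; the `read_misc_settings` helper and the placeholder paths are hypothetical and not taken from pyinterprod.

```python
from configparser import ConfigParser


def read_misc_settings(config: ConfigParser) -> dict:
    """Return the paths of the other config files and the scheduler settings."""
    misc = config["misc"]
    # Documented format: scheduler:queue, e.g. lsf:production
    scheduler, _, queue = misc["scheduler"].partition(":")
    return {
        "members": misc["members"],
        "analyses": misc["analyses"],
        "scheduler": scheduler,
        "queue": queue or None,
    }


# Placeholder content, for illustration only
example = """
[misc]
analyses = /path/to/analyses.conf
members = /path/to/members.conf
scheduler = lsf:production
"""

config = ConfigParser()
config.read_string(example)
settings = read_misc_settings(config)
```

With a real file, `config.read("main.conf")` would replace the `read_string` call.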

members.conf

Each section corresponds to a member database (or a sequence feature database), e.g.

[profile]
signatures =

Supported properties are:

Name	Description
`signatures`	Path to the source of database signatures.
`hmm`	Path to an HMM file, used for databases that employ HMMER3-based models. Required when running `ipr-hmms`.
`fasta`	Path to sequences used by models, in the FASTA format.
`members`	Path to file containing the clan-signature mapping.
`go-terms`	Path to file or directory of GO annotations. PANTHER and NCBIfam only.
`triage`	Path to file containing the signatures/models to include in InterPro. NCBIfam only.
`summary`	Path to file of summary information. CDD only.
`seed`	Path to file of SEED alignments. Pfam only.
`full`	Path to file of full alignments. Pfam only.
`clans`	Path to file of clan information. Pfam only.
`mapping`	Path to file of model-signature mapping. CATH-Gene3D only.
`classes`	Path to file of information about classes. ELM only.
`instances`	Path to file of information about instances. ELM only.

analyses.conf

The DEFAULT section defines the default values for the following properties:

job_cpu: number of processes to request when submitting a job.
job_mem: the maximum amount of memory a job should be allowed to use (in MB).
job_size: the number of sequences to process in each job.
job_timeout: the number of hours a job is allowed to run for before being killed. Any value lower than 1 disables the timeout.

The default values can be overridden. For instance, adding the following block under the DEFAULT section ensures that MobiDB-Lite jobs time out after 48 hours and that PRINTS jobs are allocated 16 GB of memory:

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
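
The override behaviour matches how Python's standard configparser treats the DEFAULT section: any property missing from a section falls back to its DEFAULT value. The sketch below illustrates this, assuming analyses.conf is parsed as INI. The DEFAULT values and the `job_settings` helper are made up for the example and are not taken from pyinterprod.

```python
from configparser import ConfigParser

# The DEFAULT values below are invented for illustration
example = """
[DEFAULT]
job_cpu = 1
job_mem = 8192
job_size = 10000
job_timeout = 24

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
"""

config = ConfigParser()
config.read_string(example)


def job_settings(config: ConfigParser, analysis: str) -> dict:
    """Resolve the job properties for an analysis, falling back to DEFAULT."""
    section = config[analysis] if config.has_section(analysis) else config["DEFAULT"]
    timeout = section.getint("job_timeout")
    return {
        "cpu": section.getint("job_cpu"),
        "mem": section.getint("job_mem"),      # in MB
        "size": section.getint("job_size"),
        # Any value lower than 1 disables the timeout
        "timeout": timeout if timeout >= 1 else None,
    }


mobidb = job_settings(config, "mobidb-lite")
prints = job_settings(config, "prints")
```

Here `mobidb` keeps the default memory but gets a 48-hour timeout, while `prints` keeps the default timeout but gets 16384 MB of memory.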

Usage

Protein update

Update proteins and matches to the latest private UniProt release.

$ ipr-uniprot [OPTIONS] main.conf

The optional arguments are:

-t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
--dry-run: do not run tasks, only list those about to be run

Tasks

Name	Description	Dependencies
update-uniparc	Import UniParc cross-references
taxonomy	Import the latest taxonomy data from UniProt
import-ipm-matches	Import protein matches from ISPRO	update-uniparc
update-ipm-matches	Update partitioned table with matches	import-ipm-matches
import-ipm-sites	Import residue annotations from ISPRO
update-ipm-sites	Update partitioned table with site matches	import-ipm-sites
update-proteins	Import the new Swiss-Prot and TrEMBL proteins, and compare with the current ones
delete-proteins	Delete obsolete proteins in all production tables	update-proteins
check-proteins	Track UniParc sequences (UPI) associated to UniProt entries that need to be imported (e.g. new or updated sequence)	delete-proteins, update-uniparc
update-matches	Update protein matches for new or updated sequences, run various checks, and track changes in protein counts for InterPro entries	update-ipm-matches, check-proteins
update-fmatches	Update protein matches for sequence features (e.g. MobiDB-lite, Coils, etc.)	update-matches
export-sib	Export Oracle tables required by the Swiss-Prot team	update-matches
report-changes	Report recent integration changes to the UniRule team	update-matches
aa-iprscan	Build the AA_IPRSCAN table, required by the Automatic Annotation team	update-matches
xref-condensed	Build the XREF_CONDENSED table for the Automatic Annotation team (contains representations of protein matches for InterPro entries)	update-matches
xref-summary	Build the XREF_SUMMARY table for the Automatic Annotation team (contains protein matches for integrated member database signatures)	report-changes
export-xrefs	Export text files containing protein matches for the UniProt database team	xref-summary
notify-interpro	Notify the InterPro team that all tables required by the Automatic Annotation team are ready, so we can take a snapshot of our database	update-fmatches, aa-iprscan, xref-condensed, xref-summary
swissprot-de	Export Swiss-Prot descriptions associated with member database signatures in the public release of UniProt (i.e. the release we are updating from)
unirule	Update the list of signatures used by UniRule, so InterPro curators are warned if they attempt to unintegrate one of these signatures.
update-varsplic	Update splice variant matches	update-ipm-matches
update-sites	Update residue annotations	update-ipm-sites, update-matches
Pronto	Update the Pronto PostgreSQL table	taxonomy, update-fmatches, swissprot-de, unirule
send-report	Send reports to curators, and inform them that Pronto is ready	Pronto tasks
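
The Dependencies column defines a directed acyclic graph of tasks. As a rough illustration of what that implies for ordering, the sketch below uses the standard library's graphlib on a subset of the table above. It does not describe how pyinterprod or mundone schedule tasks, and the `execution_order` and `required_tasks` helpers are hypothetical.

```python
from graphlib import TopologicalSorter

# Subset of the task table above: task -> tasks it depends on
dependencies = {
    "update-uniparc": [],
    "import-ipm-matches": ["update-uniparc"],
    "update-ipm-matches": ["import-ipm-matches"],
    "update-proteins": [],
    "delete-proteins": ["update-proteins"],
    "check-proteins": ["delete-proteins", "update-uniparc"],
    "update-matches": ["update-ipm-matches", "check-proteins"],
    "update-fmatches": ["update-matches"],
}


def execution_order(deps: dict) -> list:
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def required_tasks(deps: dict, targets: list) -> set:
    """Return the targets together with all their direct and indirect dependencies."""
    needed = set()
    stack = list(targets)
    while stack:
        task = stack.pop()
        if task not in needed:
            needed.add(task)
            stack.extend(deps.get(task, []))
    return needed


order = execution_order(dependencies)
upstream = required_tasks(dependencies, ["check-proteins"])
```

Several orders can be valid for the same graph, so `order` is one possibility rather than the sequence the workflow will follow.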

Member database update

Update models and protein matches for one or more member databases.

Before running the update, the following command must be run once for each member database to update. -n is the name of the database (case-insensitive), -d is the release date (of the member database), and -v is the release version.

$ ipr-pre-memdb main.conf -n DATABASE -d YYYY-MM-DD -v VERSION

Then, the actual update can be run:

$ ipr-memdb [OPTIONS] main.conf database [database ...]

The optional arguments are:

-t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
--dry-run: do not run tasks, only list those about to be run

Tasks

Name	Description	Dependencies
import-ipm-matches	Import protein matches from ISPRO	update-uniparc
update-ipm-matches	Update partitioned table with matches	import-ipm-matches
load-signatures	Import member database signatures for the version to update to
track-changes	Compare signatures between versions (e.g. name, description, matched proteins)	load-signatures
delete-obsoletes	Remove signatures that are not in the latest version of the member database(s)	track-changes
update-signatures	Update metadata for existing signatures, and add new signatures	delete-obsoletes
update-matches	Update and check matches in production tables	update-ipm-matches, update-signatures
update-varsplic	Update splice variant matches	update-ipm-matches, update-signatures
persist-pfam-a	Parse Pfam-A files and store relevant information (only when updating Pfam)	update-ipm-matches, update-signatures
persist-pfam-c	Parse Pfam-C to store clan information (only when updating Pfam)	update-ipm-matches, update-signatures
update-features	Update sequence features for non-member databases (e.g. MobiDB-lite, COILS, etc.)	update-ipm-matches
update-fmatches	Update matches for sequence features	update-features
import-ipm-sites	Import residue annotations from ISPRO
update-ipm-sites	Update partitioned table with site matches	import-ipm-sites
update-sites	Update residue annotations (if updating a member database with residue annotations)	update-ipm-sites, update-matches
Pronto	Update the Pronto PostgreSQL tables	update-matches
send-report	Send reports to curators, and inform them that Pronto is ready	Pronto tasks

Pronto

$ ipr-pronto [OPTIONS] main.conf

The optional arguments are:

-t, --tasks: list of tasks to run; by default, all tasks are run (see Tasks for a description of available tasks)
--dry-run: do not run tasks, only list those about to be run

Tasks

Name	Description	Dependencies
go-terms	Import publications associated to protein annotations
go-constraints	Import GO taxonomic constraints
proteins-similarities	Import UniProt general annotations (comments) on sequence similarities
proteins-names	Import UniProt sequence names
databases	Import database information (e.g. version, release date)
proteins	Import general information on proteins (e.g. accession, length, species)
init-matches	Create the match table (empty)
export-matches	Export protein matches for member database signatures	init-matches
insert-matches	Insert protein matches for member database signatures	export-matches
insert-fmatches	Insert protein matches for sequence features (AntiFam, etc.)	init-matches
index-matches	Index and cluster the match table	insert-matches, insert-fmatches
insert-signature2proteins	Associate member database signatures with UniProt proteins, UniProt descriptions, taxonomic origins, and GO terms	export-matches, proteins-names
index-signature2proteins	Index the signature2proteins table	insert-signature2proteins
signatures	Import and compare member database signatures	databases, export-matches
taxonomy	Import UniProt taxonomy
structures	Import structural matches

InterProScan match calculation

$ ipr-calc main.conf [COMMAND] [OPTIONS]

The available commands (and their optional arguments) are:

import: import sequences from the UniParc Oracle database
- --top-up: import new sequences only
clean: delete obsolete data
- -a, --analyses: IDs of analyses to clean (default: all)
search: scan sequences using InterProScan
- --dry-run: show the number of jobs to run and exit
- -l, --list: list active analyses and exit
- -a, --analyses: IDs of analyses to run (default: all)
- -t, --threads: number of monitoring threads (default: 8)
- --concurrent-jobs: maximum number of concurrently running InterProScan jobs (default: 1000)
- --max-jobs: maximum number of jobs to run per analysis before exiting (default: disabled)
- --max-retries: number of times a failed job is resubmitted (default: disabled)
- --keep none|all|failed: keep input/output files (default: none)

Examples

Import new UniParc sequences:

ipr-calc main.conf import --top-up

Process jobs for analysis 42 only, allow each job to run up to three times (i.e. be resubmitted twice), and keep all temporary files, regardless of whether the job succeeds or fails:

ipr-calc main.conf search -a 42 --max-retries 2 --keep all

Run 10 jobs per analysis, and keep failed jobs to investigate:

ipr-calc main.conf search --max-jobs 10 --keep failed
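
To make the relationship between --max-retries and the number of runs explicit (a value of 2 allows up to three runs), here is a minimal sketch of the retry logic. It is an illustration only, not the code used by ipr-calc; `run_with_retries` is a hypothetical function, and a job is modelled as a callable that returns True on success.

```python
from typing import Callable, Tuple


def run_with_retries(job: Callable[[], bool], max_retries: int = 0) -> Tuple[bool, int]:
    """Run a job, resubmitting it after each failure up to max_retries times.

    Returns whether the job eventually succeeded and how many times it ran.
    """
    attempts = 0
    while True:
        attempts += 1
        if job():
            return True, attempts
        if attempts > max_retries:
            # Initial run plus max_retries resubmissions have all failed
            return False, attempts


# A job that fails once, then succeeds
outcomes = iter([False, True])
result = run_with_retries(lambda: next(outcomes), max_retries=2)
print(result)  # (True, 2)
```

A job that always fails with max_retries=2 runs three times in total before being reported as failed.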

Clans update

Update clans and run profile-profile alignments.

$ ipr-clans [OPTIONS] main.conf database [database ...]

The optional arguments are:

-t, --threads: number of alignment workers
-T, --tempdir: directory to use for temporary files

HMMs update

Load HMMs into the database.

$ ipr-hmms main.conf database [database ...]
