HINT_preprocess

Data pre-processing pipeline for HINT

Downloads

Downloaded files are saved to the $SOURCE_DIR specified in the script.

# SOURCE_DIR structure
.
|-- uniparc
|-- knowledgebase
    |-- idmapping
    |   |-- idmapping.dat.gz
    |-- complete
    |   |-- uniprot_sprot.fasta.gz  
    |   |-- uniprot_sprot_varsplic.fasta.gz  
    |   |-- uniprot_trembl.fasta.gz
    |-- docs

Data parsing and processing

1. Parse UniProt reference files

  • Run parse_source_data.py to process the reference files downloaded from the UniProt FTP site:
    python parse_source_data.py $SOURCE_DIR/knowledgebase
    
    The following files will be processed:
    • FASTA files (uniprot_sprot.fasta.gz, uniprot_sprot_varsplic.fasta.gz, uniprot_trembl.fasta.gz) - to extract protein meta information
    • species file docs/speclist.txt to extract species taxonomy information
    • secondary-to-primary accession mapping file docs/sec_ac.txt
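
As an illustration of the FASTA step, the sketch below pulls protein meta information out of UniProtKB FASTA header lines. It is not the code in parse_source_data.py; the function names (parse_fasta_header, iter_fasta_meta) and the returned field names are assumptions made for this example. It relies only on the standard UniProtKB header layout (>db|Accession|EntryName ProteinName OS=Organism OX=TaxID GN=Gene ...).

```python
import gzip
import re

# UniProtKB FASTA header layout:
# >db|Accession|EntryName ProteinName OS=Organism OX=TaxID [GN=Gene] PE=n SV=n
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxid>\d+)"
    r"(?:\s+GN=(?P<gene_name>\S+))?"
)


def parse_fasta_header(line):
    """Return a dict of meta fields for one header line, or None if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # "sp" marks reviewed (Swiss-Prot) entries, "tr" unreviewed (TrEMBL) entries.
    record["is_reviewed"] = record.pop("db") == "sp"
    return record


def iter_fasta_meta(path):
    """Yield one meta record per sequence in a gzipped FASTA file (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                record = parse_fasta_header(line)
                if record is not None:
                    yield record


# Example header written by hand for illustration.
example = ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
print(parse_fasta_header(example))
```

Sequence lines are skipped entirely, so even the large TrEMBL file can be streamed without holding sequences in memory.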

2. Parse and aggregate protein interaction dataset from different sources

Run prepare_dataset.py to prepare protein interaction data from the sources of interest. The script performs the following steps:

  1. Collect datasets from the sources of interest. Active datasets are downloaded from their source; inactive datasets are copied from our self-maintained cache directory. Raw data files are saved at $UPDATE_DIR/data/parseTargets

The following data sources are included (last update: 2024.6.28):

  • Active:
    • BioGRID
    • IntAct
    • PDB: generate IRES files (listed below) using scripts in pdb_data_prep/
      Note: consider switching to the up-to-date IRES files generated by the in-house nightly_script on the server
      • ires_perpdb_alltax.txt
      • ires_perpdb_alltax_pdblike.txt (newly added for large PDB structures stored in the pdb-bundle TAR format)
  • Inactive:
    • DIP (dip20170205.txt)
    • iRef (All.mitab.03022013.txt)
    • HPRD (BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt)
    • MIPS (mppi xml format)
    |-- static_datasets
        |-- dip20170205.txt
        |-- All.mitab.03022013.txt
        |-- mppi
        |-- BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt
    
  2. Parse the raw data and generate the initial raw_interactions.txt file, saved at $UPDATE_DIR/outputs/

  3. Revise the parsed raw interaction data and fill in UniProt IDs where the source provides them. Output files are saved at $UPDATE_DIR/outputs/cache/. The following files are generated in this step:

  • raw_interactions_filled_partial.txt: revised raw interaction file with UniProt IDs filled in where available in the source
  • mapping_targets_by_type.json: IDs that remain to be mapped to UniProt IDs, organized by ID type. Supported ID types can be found in constants.py
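
To make "organized by ID type" concrete, here is a minimal sketch of how unmapped IDs could be grouped before being written to JSON. The actual layout of mapping_targets_by_type.json is not documented here, so the structure shown (ID type mapped to a sorted list of IDs), the function name group_targets_by_type, and the ID type labels in the example are all assumptions; the supported labels are the ones defined in constants.py.

```python
import json
from collections import defaultdict


def group_targets_by_type(unmapped_ids):
    """Group (id_type, source_id) pairs into {id_type: sorted unique IDs}.

    Assumed structure for illustration only.
    """
    grouped = defaultdict(set)
    for id_type, source_id in unmapped_ids:
        grouped[id_type].add(source_id)
    return {id_type: sorted(ids) for id_type, ids in sorted(grouped.items())}


# Hypothetical input: IDs that had no UniProt ID in their source record.
example = [("DIP", "DIP-17064N"), ("GeneID", "7157"), ("DIP", "DIP-17064N")]
print(json.dumps(group_targets_by_type(example), indent=2))
```

Collecting the IDs in sets first removes duplicates, so each ID is looked up only once in the mapping step.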

3. Generate source-to-uniprot ID mapping

Run create_idmapping.py to parse idmapping.dat.gz from the UniProt FTP site and generate an ID mapping dictionary from source IDs to UniProt IDs.

Inputs

  • idmapping.dat.gz
  • mapping_targets_by_type.json

Outputs

  • target_type_to_uprot.json: dictionary of ID mappings organized by ID type (if an ID maps to multiple UniProt IDs, all of them are kept and joined with '|')
  • prot_gene_info.tsv: description of each UniProt ID, with columns (uprot | UniProtKB-ID | Gene_Name | Gene_ORFName | NCBI_TaxID)
  • target_result.json: mapping of target IDs to UniProt IDs when available
    Format: "ID_TYPE|SOURCE_ID": "UNIPROT_ID1(|UNIPROT_ID2|UNIPROT_ID3...)" (Example: '"DIP|DIP-17064N": "Q9TW27"')
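
The mapping step can be sketched as follows. idmapping.dat.gz has three tab-separated columns (UniProtKB accession, ID type, external ID), so building the source-to-UniProt dictionary amounts to inverting those rows for the IDs of interest. This is a sketch under assumptions, not the code in create_idmapping.py: the function name build_source_to_uniprot and the shape of the wanted argument are hypothetical, and the GeneID rows and their accessions in the example are made up.

```python
from collections import defaultdict


def build_source_to_uniprot(lines, wanted):
    """Invert idmapping rows into {"ID_TYPE|SOURCE_ID": "AC1|AC2|..."}.

    lines:  iterable of "UniProtKB-AC<TAB>ID_type<TAB>ID" rows
    wanted: {id_type: set of source IDs to map} (assumed input shape)
    """
    hits = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed rows
        uniprot_ac, id_type, source_id = parts
        if source_id in wanted.get(id_type, ()):
            key = f"{id_type}|{source_id}"
            if uniprot_ac not in hits[key]:
                hits[key].append(uniprot_ac)
    # Keep every UniProt ID for multi-mapped sources, joined with '|'.
    return {key: "|".join(acs) for key, acs in hits.items()}


# First row is the example from the format description; the rest are made up.
rows = [
    "Q9TW27\tDIP\tDIP-17064N\n",
    "P00001\tGeneID\t111\n",
    "P00002\tGeneID\t111\n",
    "P00003\tGeneID\t222\n",
]
targets = {"DIP": {"DIP-17064N"}, "GeneID": {"111"}}
print(build_source_to_uniprot(rows, targets))
```

For the real file, the same function can consume an open handle directly, for example with gzip.open(path, "rt") as handle: build_source_to_uniprot(handle, targets), which streams the file line by line.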

4. Proceed to remaining HINT data curation pipeline

Run the code in the Jupyter notebook 4-HINT_data_curation.ipynb. The following files are generated in the output directory, organized as shown below.

|-- taxid2name_short.txt
|-- raw_interactome.txt
|-- HINT_format
    |-- protein_meta.txt
    |-- binary_all.txt
    |-- binary_hq.txt
    |-- both_all.txt
    |-- both_hq.txt
    |-- cocomp_all.txt
    |-- cocomp_hq.txt
    |-- htb_hq.txt
    |-- htc_hq.txt
    |-- lcb_hq.txt
    |-- lcc_hq.txt
    |-- taxa
        |-- HomoSapiens
        |   |-- HomoSapiens_binary_all.txt
        |   |-- HomoSapiens_binary_hq.txt
        |   |-- HomoSapiens_both_all.txt
        |   |-- HomoSapiens_both_hq.txt
        |   |-- HomoSapiens_cocomp_all.txt
        |   |-- HomoSapiens_cocomp_hq.txt
        |   |-- HomoSapiens_htb_hq.txt
        |   |-- HomoSapiens_htc_hq.txt
        |   |-- HomoSapiens_lcb_hq.txt
        |   |-- HomoSapiens_lcc_hq.txt
        |-- MusMusculus
        |   |-- MusMusculus_binary_all.txt
        |   |-- MusMusculus_binary_hq.txt
        |   |-- MusMusculus_both_all.txt
        |   |-- MusMusculus_both_hq.txt
        |   |-- MusMusculus_cocomp_all.txt
        |   |-- MusMusculus_cocomp_hq.txt
        |   |-- MusMusculus_htb_hq.txt
        |   |-- MusMusculus_htc_hq.txt
        |   |-- MusMusculus_lcb_hq.txt
        |   |-- MusMusculus_lcc_hq.txt
        |-- ...
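
After the notebook finishes, a quick completeness check against the layout above can catch a partially failed run. The sketch below is not part of the pipeline; the helper names (expected_outputs, missing_outputs) are hypothetical, and the file names are taken directly from the tree.

```python
from pathlib import Path

# The ten interaction sets that appear in the tree above.
INTERACTION_SETS = [
    "binary_all", "binary_hq", "both_all", "both_hq", "cocomp_all",
    "cocomp_hq", "htb_hq", "htc_hq", "lcb_hq", "lcc_hq",
]


def expected_outputs(output_dir, species=()):
    """List the paths the curation step should produce for the given species folders."""
    root = Path(output_dir)
    hint = root / "HINT_format"
    paths = [
        root / "taxid2name_short.txt",
        root / "raw_interactome.txt",
        hint / "protein_meta.txt",
    ]
    paths += [hint / f"{name}.txt" for name in INTERACTION_SETS]
    for sp in species:
        paths += [hint / "taxa" / sp / f"{sp}_{name}.txt" for name in INTERACTION_SETS]
    return paths


def missing_outputs(output_dir, species=()):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, species) if not p.is_file()]


for path in expected_outputs("output", ["HomoSapiens"])[:5]:
    print(path)
```

Calling missing_outputs("output", ["HomoSapiens", "MusMusculus"]) returns an empty list when every listed file is present.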

5. Generate Venn diagrams for comparison with the last active version

Run 5-plot_venn.py to generate Venn diagrams of all interaction sets for human and yeast. By default the plots are saved as PDF files in $UPDATE_DIR/figures (modify the path settings in the script if necessary).
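
The numbers behind a two-set Venn diagram are simple set overlaps. The sketch below computes them for two versions of an interaction set; it is not taken from 5-plot_venn.py and does no plotting. The function names are hypothetical, and treating an interaction as an unordered protein pair (so that A-B equals B-A) is an assumption about how the interaction sets are compared.

```python
def normalize_pair(protein_a, protein_b):
    """Represent an interaction as an order-independent pair (assumption)."""
    return tuple(sorted((protein_a, protein_b)))


def venn_counts(previous, current):
    """Return the three region sizes of a two-set Venn diagram."""
    prev = {normalize_pair(a, b) for a, b in previous}
    curr = {normalize_pair(a, b) for a, b in current}
    return {
        "previous_only": len(prev - curr),
        "shared": len(prev & curr),
        "current_only": len(curr - prev),
    }


# Placeholder protein IDs for illustration.
previous_release = [("P1", "P2"), ("P2", "P3")]
current_release = [("P2", "P1"), ("P3", "P4")]
print(venn_counts(previous_release, current_release))
```

Checking these counts before plotting makes it easy to spot an unexpectedly large drop in shared interactions between versions.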

Handling archived accessions with UniParc

The UniProt Archive (UniParc) is a non-redundant protein sequence archive, containing all new and revised protein sequences from all publicly available sources.

Sometimes protein IDs in a source database map to UniProt IDs that have been deleted in the current release. In that case, protein information can be retrieved from UniParc data. Raw UniParc files are downloaded from https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/xml/all/ by the download.sh script.

Run parse_uniparc.sh to parse the raw XML UniParc files into tab-separated format and extract inactive entries.

  • Example of parsed file
accession	uniprot	is_reviewed	status	taxa	protein_name	gene_name	seq_length	source_file
UPI00000001D2	Q71UK0	False	Y	10090	Growth factor receptor (Fragment)		61	uniparc_p1
UPI000000075A	Q0NCH3	False	N	587201	Membrane protein	VARV_SAF65_102_152	277	uniparc_p1
UPI00000002D3	Q549A3	False	Y	4232	LIM domain protein PLIM1a	PLIM1a	219	uniparc_p1
UPI0000000563	Q6LCT9	False	Y	9606	Diabetes mellitus type I autoantigen	ICA1	87	uniparc_p1
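
A parsed file in this layout can be filtered with the standard csv module. The sketch below selects inactive entries and optionally restricts them to target species. It rests on an assumption that should be checked against the parsing script: that a status value of N marks an inactive entry. The function name and its parameters are hypothetical; the sample rows are copied from the example above.

```python
import csv
import io

# Header and two rows copied from the parsed-file example.
EXAMPLE = (
    "accession\tuniprot\tis_reviewed\tstatus\ttaxa\tprotein_name\tgene_name\tseq_length\tsource_file\n"
    "UPI00000001D2\tQ71UK0\tFalse\tY\t10090\tGrowth factor receptor (Fragment)\t\t61\tuniparc_p1\n"
    "UPI000000075A\tQ0NCH3\tFalse\tN\t587201\tMembrane protein\tVARV_SAF65_102_152\t277\tuniparc_p1\n"
)


def inactive_entries(handle, inactive_flag="N", taxa=None):
    """Return rows whose status equals inactive_flag (assumed to mean inactive).

    taxa: optional set of taxonomy IDs (as strings) to keep.
    """
    # QUOTE_NONE keeps quote characters in protein names from being interpreted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row for row in reader
        if row["status"] == inactive_flag and (taxa is None or row["taxa"] in taxa)
    ]


rows = inactive_entries(io.StringIO(EXAMPLE))
print([(row["accession"], row["uniprot"]) for row in rows])
```

Passing taxa={"9606"} would keep only human entries, which mirrors the optional species-specific extraction.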

[Optional] Continue with the second part of the parse_uniparc.py script to extract information for target species.