/mirnaome_somatic_mutations

Python scripts to find somatic mutations in miRNA genes based on TCGA data

Primary LanguagePythonMIT LicenseMIT

miRNome somatic mutations

Project containing Python scripts used to analyse somatic mutations in cancer (LUAD and LUSC) patients using somatic mutation data from TCGA

Python scripts may be reused for other data sources with input data prepared as somatic mutation data in TCGA (https://cancergenome.nih.gov/).

Results of all four algorithms available in TCGA database (muse, mutect2, somaticsniper, varscan2) were used.

For conditions to reuse of these scripts please refer to LICENSE file.

Pre-run preparation

All needed Python libraries are gathered in requirements.txt (Python 3 is needed).

Preparing ViennaRNA

Download ViennaRNA distribution from https://www.tbi.univie.ac.at/RNA/#download to chosen folder and install using instructions:

tar -zxvf ViennaRNA-2.4.11.tar.gz
cd ViennaRNA-2.4.11
./configure
make
sudo make install

Input data description

  1. Coordinates

    \t separated file without header with values like:

    chromosome\tstart\tstop\tsequence_name\n

    example:

    chr1	17343	17456	hsa-mir-6859-1
    chr1	30382	30483	hsa-mir-1302-2
    chr1	632306	632428	hsa-mir-6723
    
  2. Confidence file

    Excel file with columns: name, score, start, stop, Strand, id, confidence where name is name of miRNA, score is mirBase confidence score, start and stop are coordinates, Strand is strand with + and - values, id id mirBase_ID and confidence is miRBase confidence label with High and Low values.

    Example row: hsa-mir-1234, 0, 144400086, 144400165, -, MI0006324, Low

    Confidence file may be prepared with a script prepare_confidence_file.py based on four files downloaded from miRBase confidence.txt, confidence_score.txt, aliases.txt and mirna_chromosome_build.

    To run this script use:

    python3 prepare_confidence_file.py ~/Documents/files_for_mirnaome/confidence.txt \
    ~/Documents/files_for_mirnaome/confidence_score.txt \
    ~/Documents/files_for_mirnaome/aliases.txt ~/Documents/files_for_mirnaome/mirna_chromosome_build.txt \
    ~/Documents/files_for_mirnaome/confidence_file.xlsx
    
  3. mirgenedb file

    Genomic coordinates from mirgenedb (http://mirgenedb.org/download) for human. hsa.gff file

  4. Cancer exons

    Text file with names of exons that should be included in the first steps of the analysis (first mutations extraction from vcf files, are included in coordinates file) but are not miRNA genes. Single exon name in line.

    Example:

    EGFR_chr7.20
    EGFR_chr7.21
    EGFR_chr7.22
  5. Localization file

    Excel file with columns: chrom, name, start, stop, orientation, based_on_coordinates, arm, type where name is chrom is chromosome id, name of miRNA localization, start and stop are coordinates, orientation is strand with + and - values, based_on_coordinates states if localization is based on miRBase coordinates or structure prediction, arm is which arm of miRNA this sequence is on and type is type of localization within miRNA precursor.

    Example row: chr1, hsa-mir-6859-1-3p_post-seed, 17369, 17383, -, yes, 3p, post-seed

    Localization file may be prepared with a script prepare_localization_file.py based on two files downloaded from miRBase hairpin.fa, hsa.gff (here saved as hsa.gff.txt to differentiate from hsa.gff from mirgendb) and coordinates file. ViennaRNA is needed (for installation see above). Important: add absolute path to Vienna package

    To run this script use:

    python3 prepare_localization_file.py /home/<user>/Documents/ViennaRNA-2.4.11 ~/Documents/files_for_mirnaome/hairpin.fa \
    ~/Documents/files_for_mirnaome/hsa.gff3.txt\
    ~/Documents/files_for_mirnaome/new_coordinates_all_02.bed ~/Documents/output/Data_LUAD
    
  6. (optional) Chromosome file

    Tab-separated file with regions covered by NGS probes with columns TargetID, Interval, Regions, Size, Databases, Coverage, HighCoverage, LowCoverage

    Example row: HSA-LET-7A-1, chr9:96938239-96938318, 1, 80, CustomRegion, 100.0, 1, 0

Output data description

  1. temp folder

    Contains merged vcf files if there were multiple samples available per single patient.

  2. temp_reference folder

    Temporary files created in create localization file script.

  3. files_summary_count_per_patient.csv

    File with information how many files there are per patient per algorithm. Sanity check: if the merging of vcf files was successful, we should have only ones.

  4. files_summary.csv

    Files summary including user_id, file localization, sample id name and aliQ, and algorithm used.

  5. files_count_per_type.csv

    Count of files per algorithm used.

  6. not_unique_patients.csv

    Patients for which we had multiple files (multiple samples) that were combined in a single vcf.

  7. do_not_use.txt

    Files that were replaced with merged vcf files (stored in temp folder) that will not be used in next steps.

  8. results_muse.csv, results_mutect2.csv, results_somaticsniper.csv, varscan2.csv

    Mutations found in vcf files obtained in each of four algorithms.

  9. results_muse_eval.csv, results_mutect2_eval.csv, results_somaticsniper_eval.csv, varscan2_eval.csv

    Mutations found in vcf files obtained in each of four algorithms after additional evaluation methods based on read counts, SSC, BQ and QSS.

  10. all_mutations_filtered.csv

    Mutations found by each of four algorithms concatenated.

  11. all_mutations_filtered_merge_programs.csv

    All mutations grouped to deduplicate mutations found by multiple algorithms.

  12. all_mutations_filtered_mut_type_gene.csv

    All mutations within miRNA genes with gene information, localization and mutation type.

  13. complex.csv

    Complex mutations are multiple mutations in single miRNA in single patient. Column complex is 1 if mutation is treated as complex.

  14. miRNA_per_chromosome.csv

    How many miRNAs were mutated on single chromosome and how many mutations were found in total on each chromosome.

  15. occur.csv

    How many total mutations, unique mutations and patients with mutation found per gene.

  16. distinct.csv

    Unique mutations description with information how many patients had unique mutations.

  17. distinct_with_loc.csv

    Unique mutations description with localization within miRNA precursor.

  18. (optional) mirnas_outside_probes.csv

    Mutations in what miRNAs were detected in vcf files outside probes-defined regions.

  19. (optional) high_confidence_mirnas_per_chrom.csv

    miRNAs count per chromosome.

  20. (optional) mirnas_per_chrom.csv

    miRNAs mutated (found in vscf) count per chromosome.

  21. (optional) patients_per_chrom.csv

    Patients that had mutations in each chromosome.

How to use it

Example run is prepared in run_mirnaome.sh bash script.

To prepare confidence data run

python3 prepare_confidence_file.py ~/Documents/files_for_mirnaome/confidence.txt \
~/Documents/files_for_mirnaome/confidence_score.txt \
~/Documents/files_for_mirnaome/aliases.txt ~/Documents/files_for_mirnaome/mirna_chromosome_build.txt \
~/Documents/files_for_mirnaome/confidence_file.xlsx

Confidence file can be found in defined directory under defined filename.

To prepare localization data run

python3 prepare_localization_file.py ~/Documents/ViennaRNA-2.4.11 ~/Documents/files_for_mirnaome/hairpin.fa \
~/Documents/files_for_mirnaome/hsa.gff3.txt\
~/Documents/files_for_mirnaome/new_coordinates_all_02.bed ~/Documents/output/Data_LUAD

Localization file can be found in output folder.

To run analysis run

python3 run_mirnaome_analysis.py ~/dane/HNC/DATA_HNC ~/dane/HNC/RESULTS_HNC ./Reference/new_coordinates_all_02.bed \
./Reference/confidence.xlsx ./Reference/hsa.gff ./Reference/cancer_exons.txt

See additional features running

python3 run_mirnaome_analysis.py --help

To skip steps of the analysis (if first steps were already completed) use -s argument adding step from which script should start.

To include chromosome analysis use -c argument adding chromosome file path.

Authors

Martyna O. Urbanek-Trzeciak, Paulina Galka-Marciniak, Piotr Kozlowski

Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704, Poznan, Poland

Citation

Somatic Mutations in miRNA Genes in Lung Cancer—Potential Functional Consequences of Non-Coding Sequence Variants

Paulina Galka-Marciniak1, Martyna Olga Urbanek-Trzeciak1, Paulina Maria Nawrocka1, Agata Dutkiewicz1, Maciej Giefing2, Marzena Anna Lewandowska3,4, and Piotr Kozlowski1

1 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland

2 Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland

3 The F. Lukaszczyk Oncology Center, Department of Molecular Oncology and Genetics, Bydgoszcz, Poland

4 The Ludwik Rydygier Collegium Medicum, Department of Thoracic Surgery and Tumours, Nicolaus Copernicus University, Bydgoszcz, Poland

Cancers (MDPI) link to publication:

https://www.mdpi.com/2072-6694/11/6/793

Biorxiv link to preprint:

http://biorxiv.org/cgi/content/short/579011v1

Citation in MLA format

Galka-Marciniak, Paulina, et al. "Somatic Mutations in miRNA Genes in Lung Cancer—Potential Functional Consequences of Non-Coding Sequence Variants." Cancers 11.6 (2019): 793.

Contact

For any issues, please create a GitHub Issue.

Funding

This work was supported by research grants from the Polish National Science Centre [2016/22/A/NZ2/00184 (to P.K.) and 2015/17/N/NZ3/03629 (to M.O. U.-T.)] and the Polish Ministry of Science and Higher Education (support for young investigators to P.G.-M.)