miRNome somatic mutations

Project containing Python scripts used to analyse somatic mutations in cancer (LUAD and LUSC) patients using somatic mutation data from TCGA

Python scripts may be reused for other data sources with input data prepared as somatic mutation data in TCGA (https://cancergenome.nih.gov/).

Results of all four algorithms available in TCGA database (muse, mutect2, somaticsniper, varscan2) were used.

For conditions to reuse of these scripts please refer to LICENSE file.

Pre-run preparation

All needed Python libraries are gathered in requirements.txt (Python 3 is needed).

Preparing ViennaRNA

Download ViennaRNA distribution from https://www.tbi.univie.ac.at/RNA/#download to chosen folder and install using instructions:

tar -zxvf ViennaRNA-2.4.11.tar.gz
cd ViennaRNA-2.4.11
./configure
make
sudo make install

Input data description

Coordinates

\t separated file without header with values like:

chromosome\tstart\tstop\tsequence_name\n

example:
```
chr1	17343	17456	hsa-mir-6859-1
chr1	30382	30483	hsa-mir-1302-2
chr1	632306	632428	hsa-mir-6723
```
Confidence file

Excel file with columns: name, score, start, stop, Strand, id, confidence where name is name of miRNA, score is mirBase confidence score, start and stop are coordinates, Strand is strand with + and - values, id id mirBase_ID and confidence is miRBase confidence label with High and Low values.

Example row: hsa-mir-1234, 0, 144400086, 144400165, -, MI0006324, Low

Confidence file may be prepared with a script prepare_confidence_file.py based on four files downloaded from miRBase confidence.txt, confidence_score.txt, aliases.txt and mirna_chromosome_build.

To run this script use:
```
python3 prepare_confidence_file.py ~/Documents/files_for_mirnaome/confidence.txt \
~/Documents/files_for_mirnaome/confidence_score.txt \
~/Documents/files_for_mirnaome/aliases.txt ~/Documents/files_for_mirnaome/mirna_chromosome_build.txt \
~/Documents/files_for_mirnaome/confidence_file.xlsx
```
mirgenedb file

Genomic coordinates from mirgenedb (http://mirgenedb.org/download) for human. hsa.gff file
Cancer exons

Text file with names of exons that should be included in the first steps of the analysis (first mutations extraction from vcf files, are included in coordinates file) but are not miRNA genes. Single exon name in line.

Example:
```
EGFR_chr7.20
EGFR_chr7.21
EGFR_chr7.22
```
Localization file

Excel file with columns: chrom, name, start, stop, orientation, based_on_coordinates, arm, type where name is chrom is chromosome id, name of miRNA localization, start and stop are coordinates, orientation is strand with + and - values, based_on_coordinates states if localization is based on miRBase coordinates or structure prediction, arm is which arm of miRNA this sequence is on and type is type of localization within miRNA precursor.

Example row: chr1, hsa-mir-6859-1-3p_post-seed, 17369, 17383, -, yes, 3p, post-seed

Localization file may be prepared with a script prepare_localization_file.py based on two files downloaded from miRBase hairpin.fa, hsa.gff (here saved as hsa.gff.txt to differentiate from hsa.gff from mirgendb) and coordinates file. ViennaRNA is needed (for installation see above). Important: add absolute path to Vienna package

To run this script use:
```
python3 prepare_localization_file.py /home/<user>/Documents/ViennaRNA-2.4.11 ~/Documents/files_for_mirnaome/hairpin.fa \
~/Documents/files_for_mirnaome/hsa.gff3.txt\
~/Documents/files_for_mirnaome/new_coordinates_all_02.bed ~/Documents/output/Data_LUAD
```
(optional) Chromosome file

Tab-separated file with regions covered by NGS probes with columns TargetID, Interval, Regions, Size, Databases, Coverage, HighCoverage, LowCoverage

Example row: HSA-LET-7A-1, chr9:96938239-96938318, 1, 80, CustomRegion, 100.0, 1, 0

Output data description

temp folder

Contains merged vcf files if there were multiple samples available per single patient.
temp_reference folder

Temporary files created in create localization file script.
files_summary_count_per_patient.csv

File with information how many files there are per patient per algorithm. Sanity check: if the merging of vcf files was successful, we should have only ones.
files_summary.csv

Files summary including user_id, file localization, sample id name and aliQ, and algorithm used.
files_count_per_type.csv

Count of files per algorithm used.
not_unique_patients.csv

Patients for which we had multiple files (multiple samples) that were combined in a single vcf.
do_not_use.txt

Files that were replaced with merged vcf files (stored in temp folder) that will not be used in next steps.
results_muse.csv, results_mutect2.csv, results_somaticsniper.csv, varscan2.csv

Mutations found in vcf files obtained in each of four algorithms.
results_muse_eval.csv, results_mutect2_eval.csv, results_somaticsniper_eval.csv, varscan2_eval.csv

Mutations found in vcf files obtained in each of four algorithms after additional evaluation methods based on read counts, SSC, BQ and QSS.
all_mutations_filtered.csv

Mutations found by each of four algorithms concatenated.
all_mutations_filtered_merge_programs.csv

All mutations grouped to deduplicate mutations found by multiple algorithms.
all_mutations_filtered_mut_type_gene.csv

All mutations within miRNA genes with gene information, localization and mutation type.
complex.csv

Complex mutations are multiple mutations in single miRNA in single patient. Column complex is 1 if mutation is treated as complex.
miRNA_per_chromosome.csv

How many miRNAs were mutated on single chromosome and how many mutations were found in total on each chromosome.
occur.csv

How many total mutations, unique mutations and patients with mutation found per gene.
distinct.csv

Unique mutations description with information how many patients had unique mutations.
distinct_with_loc.csv

Unique mutations description with localization within miRNA precursor.
(optional) mirnas_outside_probes.csv

Mutations in what miRNAs were detected in vcf files outside probes-defined regions.
(optional) high_confidence_mirnas_per_chrom.csv

miRNAs count per chromosome.
(optional) mirnas_per_chrom.csv

miRNAs mutated (found in vscf) count per chromosome.
(optional) patients_per_chrom.csv

Patients that had mutations in each chromosome.

How to use it

Example run is prepared in run_mirnaome.sh bash script.

To prepare confidence data run

python3 prepare_confidence_file.py ~/Documents/files_for_mirnaome/confidence.txt \
~/Documents/files_for_mirnaome/confidence_score.txt \
~/Documents/files_for_mirnaome/aliases.txt ~/Documents/files_for_mirnaome/mirna_chromosome_build.txt \
~/Documents/files_for_mirnaome/confidence_file.xlsx

Confidence file can be found in defined directory under defined filename.

To prepare localization data run

python3 prepare_localization_file.py ~/Documents/ViennaRNA-2.4.11 ~/Documents/files_for_mirnaome/hairpin.fa \
~/Documents/files_for_mirnaome/hsa.gff3.txt\
~/Documents/files_for_mirnaome/new_coordinates_all_02.bed ~/Documents/output/Data_LUAD

Localization file can be found in output folder.

To run analysis run

python3 run_mirnaome_analysis.py ~/dane/HNC/DATA_HNC ~/dane/HNC/RESULTS_HNC ./Reference/new_coordinates_all_02.bed \
./Reference/confidence.xlsx ./Reference/hsa.gff ./Reference/cancer_exons.txt

See additional features running

python3 run_mirnaome_analysis.py --help

To skip steps of the analysis (if first steps were already completed) use -s argument adding step from which script should start.

To include chromosome analysis use -c argument adding chromosome file path.

Authors

Martyna O. Urbanek-Trzeciak, Paulina Galka-Marciniak, Piotr Kozlowski

Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704, Poznan, Poland

Citation

Somatic Mutations in miRNA Genes in Lung Cancer—Potential Functional Consequences of Non-Coding Sequence Variants

Paulina Galka-Marciniak¹, Martyna Olga Urbanek-Trzeciak¹, Paulina Maria Nawrocka¹, Agata Dutkiewicz¹, Maciej Giefing², Marzena Anna Lewandowska^3,4, and Piotr Kozlowski¹

¹ Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland

² Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland

³ The F. Lukaszczyk Oncology Center, Department of Molecular Oncology and Genetics, Bydgoszcz, Poland

⁴ The Ludwik Rydygier Collegium Medicum, Department of Thoracic Surgery and Tumours, Nicolaus Copernicus University, Bydgoszcz, Poland

Cancers (MDPI) link to publication:

https://www.mdpi.com/2072-6694/11/6/793

Biorxiv link to preprint:

http://biorxiv.org/cgi/content/short/579011v1

Citation in MLA format

Galka-Marciniak, Paulina, et al. "Somatic Mutations in miRNA Genes in Lung Cancer—Potential Functional Consequences of Non-Coding Sequence Variants." Cancers 11.6 (2019): 793.

Contact

For any issues, please create a GitHub Issue.

Funding

This work was supported by research grants from the Polish National Science Centre [2016/22/A/NZ2/00184 (to P.K.) and 2015/17/N/NZ3/03629 (to M.O. U.-T.)] and the Polish Ministry of Science and Higher Education (support for young investigators to P.G.-M.)

martynaut/mirnaome_somatic_mutations