Codes and datasets for our manuscript "Prediction of plant complex traits via integration of multi-omics data".
The SNP matrix was download from https://1001genomes.org/data/GMI-MPI/releases/v3.1/SNP_matrix_imputed_hdf5/1001_SNP_MATRIX.tar.gz
The transcriptomic data was downloaded from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE80744&format=file&file=GSE80744%5Fath1001%5Ftx%5Fnorm%5F2016%2D04%2D21%2DUQ%5FgNorm%5FnormCounts%5Fk4%2Etsv%2Egz which are read count files for 727 accessions
The gene body methylation data was download from http://signal-genet.salk.edu/1001.php , which contains 1107 methylomes, with the ID name. The tsv files for individual accessions were download from NCBI with GEO accession ID: GSE43857
Flowering time at 10℃ (1163 accessions) and 16℃ (1123 accessions) were downloaded from https://arapheno.1001genomes.org/study/12/ Other phenotypic data were download from https://arapheno.1001genomes.org/study/38/
Benchmark flowering time genes were downloaded from the FLOR-ID database http://www.phytosystems.ulg.ac.be/florid/
Related scripts for SNP matrix can be found in the folder Data_preprocessing/01_SNP_matrix
Get the SNP matrix for all accessions, and also output the list of all SNPs
python 01_extract_the_h5py_binary_SNP_matrix.py
Save the SNP matrix for 383 accession; convert the common allele to 1, and the alternative allele to -1; get rid of the rare variants (MAF < 5%)
python 02_get_snp_matrix_for_383_accessions.py
Related scripts can be found in the folder Data_preprocessing/02_transcriptomic data
To calculate the TPM (transcripts per kilobase million), we need to get the transcript length for genes at first. Get the GFF3 file ready before run this script. Here we used the TAIR10 GFF3 file, which is downloaded from the TAIR database
python 01_get_transcript_length_including_ncRNA.py
Calculate TPM using the function “calculateTPM” from the R package “scater”
Put the downloaded file GSE80744_ath1001_tx_norm_2016-04-21-UQ_gNorm_normCounts_k4.tsv and the transcript length file TAIR10_longest_mRNA_length_including_ncRNA.txt in your work directory
Rscript 02_calculate_TPM.r
Log the TPM and get the logged TPM matrix for 383 accessions
python 03_log_TPM_and_get_transcriptomic_matrix_for_383_accessions.py
Related scripts can be found in the folder Data_preprocessing/03_methylomic_matrix/1_gene_body_methylation
Parse the downloaded gene-body methylation matrix, and save the corresponding matrix for 383 accessions
python 01_get_methylation_for_383_accessions.py
Impute missing data
python 02_knn_imputation.py input_file Test_for_383_accessions.txt
Download individual methylation data for 383 accessions
python 01_download_individual_methylation_data_for_383_accessions.py
Related scripts can be found in the folder Data_preprocessing/03_methylomic_matrix/2_single_site_methylation_P_A
Only save methylated sites, which are sites with the sixth column at 1 in the downloaded *.tsv file.
python 02_only_save_methylated_sites.py inputFile
To make slurm jobs for the above script, you can use the script below, your_work_dir is the directory containing all the downloaded *.tsv files
python 03_write_slurm_jobs_for_03.py inputFile
Get the methylated single site list for all 383 accessions, and order the list
python 04_get_methylation_list.py
Rscript 05_order_methylation_sites.r
Combine all files with the single-site presence/absence state of methylation, and the output file is Methylation_genome_wide_383_accessions.csv
python 06_combine_all_methylation_files.py
To distinguish the un-methylated sites (value should be 0) from sites with missing information (values should be NA to be imputed in the future), we first converted all 0 values to NAs, and then filled the NAs using information from SNP matrix.
Get targeted methylated sites (all single sites which were methylated in at least one accession) for each accession. The input files are the downloaded *.tsv files, and the file Methylation_sites_listgenome_wide_383_accessions.txt is output from python 04_get_methylation_list.py
awk \'{print "Chr"$1"_"$2"_"$4"_"$3}\' < inputFile > inputFile_selected_columns
awk \'NR==FNR { lines[$0]=1; next } $0 in lines\' inputFile_selected_columns Methylation_sites_listgenome_wide_383_accessions.txt > inputFile_met.txt
To make slurm jobs for the above two awk commands, you can use the script below. Your_work_dir is the directory containing all the downloaded *.tsv files
python 07_get_targeted_sites.py your_work_dir
Get overlapping sites between the SNPs and the methylation sites
Rscript 08_get_overlapping_site_between_M_and_G.r
Get the SNP information for the overlapping sites betweem SNPs and methylation sites
python 09_get_ref_seq_for_overlapping_methylation_sites.py inputFile
To facilitate the job running, the Methylation_genome_wide_383_accessions.csv file was split into 100 small files
awk 'NR==1{header=$0; count=1; print header > "Methylation_genome_wide_383_accessions.csv_" count; next } !( (NR-1) % 173766){ count++; print header > "Methylation_genome_wide_383_accessions.csv_" count; } {print $0 > "Methylation_genome_wide_383_accessions.csv_" count }' Methylation_genome_wide_383_accessions.csv
Fill 0s with NAs, and then fill the NAs back with 0s for overlapping site between SNPs and methylation sites. The inputFile is one of the small files output from the above awk command.
python 10_fill_0_with_NaN.py inputFile
python 11_fill_NA_back_with_0_for_shared_sites.py inputFile
Count the number of accessions which have NAs for each methylation site, and the inputFile is one of the small files
python 12_check_methylation_data_NaN_proportions_and_drop_separate_files.py inputFile
To make slurm jobs for the above three python scripts, you can use the script below. your_work_dir is the directory containing all the small files
python 13_make_slurm_jobs_for_10_11_and_12.py your_work_dir
Concat all count files together
cat *count > Methylation_genome_wide_618_accessions_considering_same_site_NaN_count.txt
Find out the number of accessions with missing methylation data for sites at 90th, 75th and 50th percentiles. This is done in R
dat <- read.table('Methylation_genome_wide_618_accessions_considering_same_site_NaN_count.txt',head=F,sep='\t')
dat <- cbind(dat,618-dat[,2])
quantile(dat[,3], c(.50, .75, .90))
Distinguish methylation sites that are missing in <=2 (90th percentile), <=8 (75th percentile), <=38 (50th percentile) accessions
python 14_keep_50_75_90_percentile_MAF.py
Impute the 50th percentile files; fit on the training dataset, then transform it on the test dataset; filter with MAF (minor allele frequency); extract the 75th percentile and 90th percentile informations from these 50th percentile files. The file Test_for_383_accessions.txt can be found in the folder /Datasets
python 15_knn_imputation_for_0_1_training_fit_on_test.py 50per_file Test_for_383_accessions.txt
Make slurm jobs to concat the final imputed methylation files, using "cut" and "paste" shell commands
python 16_write_jobs_for_merging_methylation_data_using_paste.py
Split the methylation metrix into CG, CHG, CHH-type methylation files
python 17_split_methylation_to_CG_CHH_CHG.py inputFile
Related scripts can be found in the folder Data_preprocessing/03_methylomic_matrix/3_single_site_methylation_Prop
Calculate the methylation proportion for each C site, and save the methylated sites. Write the slurm jobs to parse the raw single-site methylation files separately. your_work_dir is the directory containing all the small files
python 01_write_slurm_jobs_for_methylation_proportion_and_save_methylated_site.py your_work_dir
Note: for individual files, you can run awk command lines, and the input file is the downloaded *.tsv file
awk \'{print "Chr"$1"_"$2"_"$4"_"$3"\\t"$5"/"$6"\\t"$5}\' < inputFile > inputFile_proportion
awk \'$3>0\' < inputFile_proportion > inputFile_meted.txt
Make dictionary for single-site methylation proportion, and the input file is *_meted.txt
python 02_make_dic_for_methylation_proportion.py input_file
Repeat what have been done for methylation P/A analysis. Or if you want to keep the same methylation sites for methy_prop and methy_P/A, you can just fill the corresponding values in methy_P/A matrix using methylation proportion before imputation. The inputFile is the *_50per file.
python 03_fill_single_base_methylation_as_proportion.py inputFile
Impute the methylation proportion matrix
python 04_knn_imputation_for_proportion_training_fit_on_test.py inputFile
Related scripts can be found in the folder Data_preprocessing/04_methylation_binning
Related scripts can be found in the folder Data_preprocessing/05_methylation_profile_clustering
Related scripts can be found in the folder Data_preprocessing/06_correlation_matrix_for_omics_data
Convert the SNP matrix to hmp format
python 01_genotype2hapmap.py SNP_binary_matrix_383_accessions_drop_all_zero_MAF_larger_than_0.05.csv SNP_383_accessions.hmp.txt
To get the kinship matrix, please follow the illustration of the software tassel, which can be found in https://www.maizegenetics.net/tassel
tassel/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile SNP_383_accessions.hmp.txt -outputFile SNP_383_accessions_order.hmp.txt -fileType Hapmap -Xmx100g -Xms2g
tassel/tassel-5-standalone/run_pipeline.pl -importGuess SNP_383_accessions_order.hmp.txt -KinshipPlugin -method Centered_IBS -endPlugin -export SNP_383_accessions_kinship.txt -exportType SqrMatrix -Xmx100g -Xms2g
Rscript 02_get_correlation_matrix_for_omics_data.r
3.3 Remove the potential confounding effects of K on mCor. You may want to replace input files in this script
Rscript 03_removing_confounding_effects_of_K_from_mCor.r
Example data can be found in the folder /Example_data_for_model_building.
For the RandomForest regression model building, we used the script from Azodi et al., 2020. Plant Cell paper. Here is the link to the script ML_regression.py. Example run as below:
python ML_regression.py -df SNP_383_accessions_kinship.csv_training.csv -df2 Phenotype_value_383_common_accessions_2017_Grimm.csv -y_name Length -sep ',' -alg RF -gs T -cv_set CVFs.csv -save RF_kinship_Length -test Test.txt -n_jobs 4 -n 10 -cv_num 5
For the rrBLUP model, we used the script from our another project. Here is the link to the script 13_rrBLUP_training_test_split_fread_predict_values.r. Example run as below:
Rscript 13_rrBLUP_training_test_split_fread_predict_values.r SNP_383_accessions_kinship.csv_training.csv Phenotype_value_383_common_accessions_2017_Grimm.csv all Length Test.txt 5 10 CVFs.csv Kinship
The scrips and example data can be found in the folder /SHAP
Note: the feature interaction value calculation is pretty slow for datasets with a large number of features. In our study, we only took the benchmark flowering time gene-related features to calculate the feature interactions.
Example run for the script SHAP_training_only_saving_interaction_figures_for_given_feature_list.py. Example input data can be found in the folder /SHAP/Example_data
python SHAP_training_only_saving_interaction_figures_for_given_feature_list.py -df Matrix_top_GTM_features_for_426_flowering_time_genes.csv -df2 Phenotype_value_383_common_accessions_2017_Grimm.csv -sep "," -y_name FT10_mean -test Test.txt -save SHAP_426_benchmark_top_GTM -model RF_top_GTM_426_benchmark_genes_model.pkl -n_jobs 16 -top 20 -interaction y -interaction_score y -feature_list Feature_list_selected_for_426_flowering_genes.txt
To summarize and order the feature interaction values, please run the script Sum_SHAP_interaction.py. The path contains all the interaction value files output from the above script. Example interaction files can be found in the folder /SHAP/Example_data/Interaction_files, where only interaction values among 30 features in two individuals were kept.
python Sum_SHAP_interaction.py -path ./ -save Summarized_and_ordered_interactions