MOTPEC trains a prediction model using lasso regression based on peripheral blood profiles to predict the gene expression of central tissues. The peripheral blood profiles include expression, eSNP, splicing, APA event, co-expression module and transcriptional factors. For each gene in each tissue, MOTPEC predicts its tissue-specific expression.The pearson's r between observed expression and predicted expression is employed to evaluate the accuracy of model. Integrating TWAS, compare DGE of observed expression and predicted expression to estimate the utility of prediction model.
Download data files from GTEx Portal and GENCODE https://www.gencodegenes.org/human/
- gene expression
- genetics variants
- splicing profiles
- APA events
- phenotype (BMI)
- demography variables
- transcriptional factors list
- loci file: gencode.v32.GRCh37.txt
workdir: ~/MOTPEC
raw_data: ~/MOTPEC/data/raw_data
input_data: ~/MOTPEC/data/input
output_data: ~/MOTPEC/data/output
- All these files should be kept in raw_data: ~/MOTPEC/data/raw_data:
- gene expression(normal)
normal exp - partial whole blood profiles
blood.pca, blood.tf.pca, modules_pca
colors and modules:
annotation/co_exp_module.Rdata - Transcriptional factors list
annotation/TF.txt - splicing
Whole_Blood.v8.leafcutter_phenotypes.bed.gz - APA event
Whole_Blood_All_PDUIs.zip - samples attributes
GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt - BMI phenotype
GTEx_Analysis_2017-06-05_v8_Annotations_SubjectPhenotypesDS.txt - demography variables
sample_ga.txt
- The loci file gencode.v32.GRCh37.txt should be kept in ~/MOTPEC/data/input/
- The script get_input_data.R is used to get all formatted data, and data will be stored in "~/MOTPEC/data/input", for example:
- normal expression:
file = fread("Adipose_Subcutaneous.v8.normalized_expression.bed")
file[1:5, 1:10]
chr | start | end | gene_id | GTEX-1117F | GTEX-111CU | GTEX-111FC | GTEX-111VG | GTEX-111YS | GTEX-1122O | |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr1 | 29552 | 29553 | ENSG00000227232.5 | 1.31353292 | -0.9007945 | -0.2926896 | -0.7324114 | -0.2747541 | -0.6990255 |
2 | chr1 | 135894 | 135895 | ENSG00000268903.1 | -0.39787592 | 0.5724601 | -1.0918158 | 2.1157713 | 0.1035509 | 0.1295693 |
3 | chr1 | 137964 | 137965 | ENSG00000269981.1 | 0.06033348 | 0.9953057 | -0.8440787 | 2.3148972 | 0.6187479 | 0.7045350 |
4 | chr1 | 173861 | 173862 | ENSG00000241860.6 | 0.22586574 | -0.8197281 | -0.2347111 | 1.0686590 | 0.2613606 | 1.1236339 |
5 | chr1 | 195410 | 195411 | ENSG00000279457.4 | 0.29268956 | -1.0023984 | 0.4025419 | -0.5422753 | -1.7767986 | -0.5522808 |
- splicing:
file = fread('Whole_Blood.v8.leafcutter_phenotypes.bed.gz')
file[1:5, 1:8]
#Chr | start | end | ID | GTEX-1LVAO | GTEX-1AX9K | GTEX-1GN73 | GTEX-RM2N | |
---|---|---|---|---|---|---|---|---|
1 | chr1 | 29552 | 29553 | chr1:14829:14970:clu_40978:ENSG00000227232.5 | 1.5627810 | 1.617492 | 1.8133847 | 0.003239502 |
2 | chr1 | 29552 | 29553 | chr1:15038:15796:clu_40978:ENSG00000227232.5 | -1.3877237 | -1.650869 | -1.7964716 | -0.004566534 |
3 | chr1 | 29552 | 29553 | chr1:15947:16607:clu_40980:ENSG00000227232.5 | -0.7163063 | -2.124633 | 0.4795690 | -0.005269084 |
4 | chr1 | 29552 | 29553 | chr1:16310:16607:clu_40980:ENSG00000227232.5 | 1.1117534 | 1.685548 | -0.4241061 | -0.014246458 |
5 | chr1 | 29552 | 29553 | chr1:17055:17233:clu_40981:ENSG00000227232.5 | 1.7357796 | 2.367227 | 0.6758256 | 0.156601906 |
- sample attributes:
file = fread('GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt')
file[1:5,1:9]
SAMPID | SMATSSCR | SMCENTER | SMPTHNTS | SMRIN | SMTS | SMTSD | SMUBRID | SMTSISCH | |
---|---|---|---|---|---|---|---|---|---|
1 | GTEX-1111F-1111-SM-58Q7G | NA | B1 | NA | Blood | Whole Blood | 0011111 | 1111 | |
2 | GTEX-1111F-1111-SM-5DWSB | NA | B1 | NA | Blood | Whole Blood | 0011111 | 1111 | |
3 | GTEX-1111F-1111-SM-6WBT7 | NA | B1 | NA | Blood | Whole Blood | 0011111 | 1111 | |
4 | GTEX-1111F-1111-R10a-SM-AHZ7F | NA | B1, A1 | NA | Brain | Brain - Frontal Cortex (BA9) | 0001111 | 1111 | |
5 | GTEX-1111F-1111-R10b-SM-CYKQ8 | NA | B1, A1 | 7.2 | Brain | Brain - Frontal Cortex (BA9) | 0001111 | 1111 |
- APA event:
file = fread(unzip('Whole_Blood_All_PDUIs.zip', "Whole_Blood_All_PDUIs.txt"))
file[1:5,1:10]
Gene | GTEX-1111 | GTEX-1112 | GTEX-1113 | GTEX-1114 | GTEX-1115 | GTEX-1116 | GTEX-1117 | GTEX-1118 | GTEX-1119 | |
---|---|---|---|---|---|---|---|---|---|---|
1 | NM_020524 | 0.68 | 0.58 | 0.57 | 0.63 | 0.56 | 0.71 | 0.61 | 0.61 | 0.79 |
2 | NM_053282 | NA | NA | 0.40 | 0.53 | 0.40 | 0.39 | 0.43 | 0.34 | 0.53 |
3 | NM_020978 | 0.17 | 0.19 | 0.21 | 0.25 | 0.24 | NA | 0.46 | 0.32 | NA |
4 | NM_001294339 | 0.67 | 0.55 | 0.56 | 0.65 | 0.57 | 0.80 | 0.58 | 0.68 | 0.53 |
5 | NM_024602 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 |
- demography variables:
file = fread('sample_ga.txt')
file[1:5,1:5]
SUBJID | COHORT | SEX | AGE | RACE | |
---|---|---|---|---|---|
1 | GTEX-1111F | ********** | 2 | 66 | 2 |
2 | GTEX-1111U | ********** | 1 | 57 | 3 |
3 | GTEX-1111C | ********** | 1 | 61 | 3 |
4 | GTEX-1111G | ********** | 1 | 63 | 3 |
5 | GTEX-1111S | ********** | 1 | 62 | 3 |
- BMI:
file = fread('GTEx_Analysis_2017-06-05_v8_Annotations_SubjectPhenotypesDS.txt')
file[1:5,1:10]
SUBJID | COHORT | SEX | AGE | RACE | ETHNCTY | HGHT | HGHTU | WGHT | WGHTU | |
---|---|---|---|---|---|---|---|---|---|---|
1 | GTEX-1111F | ********** | 2 | 66 | 2 | 0 | 66 | in | 199 | lb |
2 | GTEX-1111U | ********** | 1 | 57 | 3 | 0 | 70 | in | 234 | lb |
3 | GTEX-1111C | ********** | 1 | 61 | 3 | 0 | 73 | in | 190 | lb |
4 | GTEX-1111G | ********** | 1 | 63 | 3 | 0 | 69 | in | 200 | lb |
5 | GTEX-1111S | ********** | 1 | 62 | 3 | 0 | 72 | in | 227 | lb |
- loci:
loci <- read.csv('gencode.v32.GRCh37.txt', sep = '\t')
loci[1:5, 1:10]
chr | left | right | strand | geneid_full | geneid | genetype | genename | |
---|---|---|---|---|---|---|---|---|
1 | chr1 | 11869 | 14409 | + | ENSG00000223972.5_2 | ENSG00000223972 | transcribed_unprocessed_pseudogene | DDX11L1 |
2 | chr1 | 14404 | 29570 | - | ENSG00000227232.5_2 | ENSG00000227232 | unprocessed_pseudogene | WASH7P |
3 | chr1 | 29554 | 31109 | + | ENSG00000243485.5_6 | ENSG00000243485 | lncRNA | MIR1302-2HG |
4 | chr1 | 34554 | 36081 | - | ENSG00000237613.2_3 | ENSG00000237613 | lncRNA | FAM138A |
5 | chr1 | 52473 | 53312 | + | ENSG00000268020.3_4 | ENSG00000268020 | unprocessed_pseudogene | OR4G4P |
-
exp_matrix: a list of 49 tissues' expression(include Whole Blood), each element is a matrix
exp.matrix[['Whole_Blood']][1:5,1:5]
ENSG00000227232 | ENSG00000238009 | ENSG00000233750 | ENSG00000268903 | ENSG00000269981 | |
---|---|---|---|---|---|
GTEX-1111S | -1.2250236 | -0.6733178 | 0.3014908 | 0.4171043 | 0.8916543 |
GTEX-1111O | -0.8533908 | 0.1217088 | 0.8587798 | 0.3766841 | 0.3093189 |
GTEX-1111S | 0.4293614 | -0.6546895 | -0.4375687 | 0.8163377 | 1.1788697 |
GTEX-1111C | 0.8533908 | -1.1788697 | -1.5580346 | -2.6142683 | -2.6142683 |
GTEX-1111C | -0.7064943 | -0.8373702 | -0.7702820 | -0.9312701 | -0.8215617 |
- WB: a list include: blood.pca, blood.tf.pca, colors, modules_pca, modules
WB[['blood.pca']][1:5,1:5]
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
GTEX-1111S | -68.56067 | 18.53914 | -15.42782 | -3.938970 | 20.411347 |
GTEX-1111O | -102.96515 | -51.41632 | 29.38948 | -40.438315 | -13.367777 |
GTEX-1111S | 62.51361 | -22.31329 | 27.41144 | -65.461205 | 10.169594 |
GTEX-1111C | 16.41480 | -83.65493 | -104.79319 | -5.910071 | -1.902242 |
GTEX-1111C | 38.33161 | -12.52743 | 60.43982 | -9.061731 | 5.730263 |
WB[['blood.tf.pca']][1:5,1:5]
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
GTEX-1111S | 15.956726 | 8.756085788 | -4.649753 | -3.299114 | -4.446283 |
GTEX-1111O | 25.759200 | -5.709656529 | 11.203563 | -9.575776 | 1.353987 |
GTEX-1111S | -15.255152 | -7.904850477 | 1.738098 | -14.940524 | -5.970758 |
GTEX-1111C | 9.744112 | -36.092799242 | -23.263840 | 2.306993 | 1.778211 |
GTEX-1111C | -9.658695 | -0.004755886 | 12.989872 | -2.667919 | -2.465581 |
WB[['colors']][1:5]
ENSG00000227232 | ENSG00000238009 | ENSG00000233750 | ENSG00000268903 | ENSG00000269981 |
---|---|---|---|---|
"grey" | "grey" | "turquoise" | "turquoise" | "turquoise" |
WB[['modules']][['grey']][1:5]
ENSG00000227232 | ENSG00000238009 | ENSG00000241860 | ENSG00000279928 | ENSG00000279457 |
---|---|---|---|---|
1 | 2 | 7 | 8 | 9 |
WB[['modules_pca']][['grey']][1:5,1:5]
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
GTEX-1111S | -27.653046 | 3.697308 | -8.865022 | 2.724629 | -19.317787 |
GTEX-1111O | -32.551785 | -12.059564 | 32.652657 | 6.593311 | 1.591776 |
GTEX-1111S | 16.218895 | -3.304558 | 21.695663 | 34.882897 | -16.463806 |
GTEX-1111C | 5.248274 | 70.203751 | 25.648423 | -6.454238 | -5.834333 |
GTEX-1111C | 16.836411 | -36.001744 | 10.849349 | 4.437152 | -3.455231 |
-
splicing: a matrix
splicing[1:5,1:10]
#Chr | start | end | ID | GTEX-1LVAO | GTEX-1AX9K | GTEX-1GN73 | GTEX-RM2N | GTEX-111YS | GTEX-1R9PN | |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr1 | 29552 | 29553 | ENSG00000227232 | 1.5627810 | 1.617492 | 1.8133847 | 0.003239502 | 0.4766374 | 0.9620965 |
2 | chr1 | 29552 | 29553 | ENSG00000227232 | -1.3877237 | -1.650869 | -1.7964716 | -0.004566534 | -0.3512171 | -0.8436864 |
3 | chr1 | 29552 | 29553 | ENSG00000227232 | -0.7163063 | -2.124633 | 0.4795690 | -0.005269084 | 1.7039249 | -0.2537381 |
4 | chr1 | 29552 | 29553 | ENSG00000227232 | 1.1117534 | 1.685548 | -0.4241061 | -0.014246458 | -1.8304159 | 0.1309287 |
5 | chr1 | 29552 | 29553 | ENSG00000227232 | 1.7357796 | 2.367227 | 0.6758256 | 0.156601906 | 0.2407400 | 1.1691889 |
-
apa_pc_data: a matrix
apa_pc_data[1:5,1:5]
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
GTEX-1111S | -3.162204 | 0.365157 | -1.5807869 | -1.2272305 | 1.69888839 |
GTEX-1111O | -4.485739 | 2.712429 | 0.7774328 | 0.5957889 | -0.01354537 |
GTEX-1111S | 1.289923 | 1.588977 | 1.6681345 | 0.4693770 | 1.09205211 |
GTEX-1111C | -6.401152 | 1.260727 | 2.8176932 | -0.8861736 | -0.31765179 |
GTEX-1111C | 4.051234 | 2.184295 | 0.4758485 | -0.8010198 | 0.58414172 |
-
dummy.demo.info: a matrix
dummy.demo.info[1:5,1:5]
SEX | AGE | RACE.1 | RACE.2 | RACE.3 | |
---|---|---|---|---|---|
GTEX-1111F | 1 | 66 | 0 | 1 | 0 |
GTEX-1111U | 0 | 57 | 0 | 0 | 1 |
GTEX-1111C | 0 | 61 | 0 | 0 | 1 |
GTEX-1111G | 0 | 63 | 0 | 0 | 1 |
GTEX-1111S | 0 | 62 | 0 | 0 | 1 |
- bmi: a matrix
bmi[1:5,]
SUBJID | BMI | |
---|---|---|
GTEX-1111F | GTEX-1117F | 32.12 |
GTEX-1111U | GTEX-111CU | 33.57 |
GTEX-1111C | GTEX-111FC | 25.06 |
GTEX-1111G | GTEX-111VG | 29.53 |
GTEX-1111S | GTEX-111YS | 30.78 |
MOTPEC uses Lasso regression model under a 5-fold cross-validation to train prediction model and get prediction. The prediction accuracy for each gene are
evaluated using Pearson correlation test between the predicted expression and the ground truth.
The script prediction_model.R is used to realize above work, and functions.R saves some functions used in prediction_model.R. In prediction_model.R, gene expression is required. Splicing, APA event and genetic variants are optional.
The results will be saved in ~/MOTPEC/data/output/prediction_model_rs/ by tissue, including baseline accuracy, MOTPEC's accuracy, MOTPEC's prediction, MOTPEC's beta.
The script format_prediction.R will format the prediction model results, and pcc_2models, exp_pred, all_model_beta, sam_gene_time will be stored in ~/MOTPEC/data/output/
PrediXcan (https://www.nature.com/articles/ng.3367) is employed to do TWAS in this study.
The code is in folder do_TWAS(predixcan_r.r).
The required data should be kept in ~/MOTPEC/data/raw_data/ beforehand.
The TWAS results will be stored in ~/MOTPEC/data/output/twas_rs.
A linear regression equation after confonding factors adjusted is used to estimate the association between expression and BMI. MOTPEC does this on observed expression, predicted expression and baseline expression respectively. The script do_DGE.R is used to realize above work, and DGE results will be stored in ~/MOTPEC/data/output/b_beta.Rdata
The script get_predicted_expression.R is designed to get other tissues' expression by coefficents trained by us, which is executed by command line.
- necessary
- Gene expression data (Note: expression matrix variable must be named "blood.exp" and be saved into "blood_exp.Rdata" and be placed into input dir!)
- all_model_beta.Rdata (Provided by us: data/raw_data/annotation/all_model_beta.Rdata)
- co_exp_net_and_blood_pc.Rdata (Provided by us: data/raw_data/annotation/co_exp_net_and_blood_pc.Rdata)
- optional Splicing, APA event, genetic variants and demography variables (refer to get_input_file.R).
Here introduce the format of blood expression after processing:
load(paste0(input_dir, '/blood_exp.Rdata'))
blood.exp[1:5,1:5]
ENSG00000227232 | ENSG00000238009 | ENSG00000233750 | ENSG00000268903 | ENSG00000269981 | |
---|---|---|---|---|---|
GTEX-1111S | -1.2250236 | -0.6733178 | 0.3014908 | 0.4171043 | 0.8916543 |
GTEX-1111O | -0.8533908 | 0.1217088 | 0.8587798 | 0.3766841 | 0.3093189 |
GTEX-1111S | 0.4293614 | -0.6546895 | -0.4375687 | 0.8163377 | 1.1788697 |
GTEX-1111C | 0.8533908 | -1.1788697 | -1.5580346 | -2.6142683 | -2.6142683 |
GTEX-1111C | -0.7064943 | -0.8373702 | -0.7702820 | -0.9312701 | -0.8215617 |
--input_dir: the directory to place input files
--output_dir: the directory to place output files
--args: 1-48 represent 48 tissues
args | tissue |
---|---|
1 | "Adipose_Subcutaneous" |
2 | "Adipose_Visceral_Omentum" |
3 | "Adrenal_Gland" |
4 | "Artery_Aorta" |
5 | "Artery_Coronary" |
6 | "Artery_Tibial" |
7 | "Brain_Amygdala" |
8 | "Brain_Anterior_cingulate_cortex_BA24" |
9 | "Brain_Caudate_basal_ganglia" |
10 | "Brain_Cerebellar_Hemisphere" |
11 | "Brain_Cerebellum" |
12 | "Brain_Cortex" |
13 | "Brain_Frontal_Cortex_BA9" |
14 | "Brain_Hippocampus" |
15 | "Brain_Hypothalamus" |
16 | "Brain_Nucleus_accumbens_basal_ganglia" |
17 | "Brain_Putamen_basal_ganglia" |
18 | "Brain_Spinal_cord_cervical_c-1" |
19 | "Brain_Substantia_nigra" |
20 | "Breast_Mammary_Tissue" |
21 | "Cells_Cultured_fibroblasts" |
22 | "Cells_EBV-transformed_lymphocytes" |
23 | "Colon_Sigmoid" |
24 | "Colon_Transverse" |
25 | "Esophagus_Gastroesophageal_Junction" |
26 | "Esophagus_Mucosa" |
27 | "Esophagus_Muscularis" |
28 | "Heart_Atrial_Appendage" |
29 | "Heart_Left_Ventricle" |
30 | "Kidney_Cortex" |
31 | "Liver" |
32 | "Lung" |
33 | "Minor_Salivary_Gland" |
34 | "Muscle_Skeletal" |
35 | "Nerve_Tibial" |
36 | "Ovary" |
37 | "Pancreas" |
38 | "Pituitary" |
39 | "Prostate" |
40 | "Skin_Not_Sun_Exposed_Suprapubic" |
41 | "Skin_Sun_Exposed_Lower_leg" |
42 | "Small_Intestine_Terminal_Ileum" |
43 | "Spleen" |
44 | "Stomach" |
45 | "Testis" |
46 | "Thyroid" |
47 | "Uterus" |
48 | "Vagina" |
--demo_index: whether add demography profiles, default F, while T represents add
--APA_index: whether add APA event profiles, default F, while T represents add
--eSNP_index: whether add genetic profiles, default F, while T represents add
--splicing_index: whether add splicing profiles, default F, while T represents add
--plink: the installed directory of plink, the version of which is 1.9
--plink_wd: the working directory of plink, only support bed,fam,bim format, not support bgen
--plink_all_sam_name: file names of all exported sample genetic profiles
For example, you will get such a file named by tissue name in output directory.
file = fread(paste0(output_dir, 'Artery_Aorta.txt'))
file[1:5, 1:5]
V1 | ENSG00000228794 | ENSG00000142599 | ENSG00000171608 | ENSG00000130940 |
---|---|---|---|---|
GTEX-1111 | -0.035103603 | -0.04837770 | -0.13777566 | 0.0006240904 |
GTEX-1111Y | -0.169903070 | 0.08400447 | 0.08639159 | -0.0617938968 |
GTEX-1111Q | 0.210433138 | -0.03393886 | -0.12010460 | 0.0914501497 |
GTEX-1111W | 0.007376501 | -0.37130859 | -0.15527437 | 0.1009398230 |
GTEX-1111 | -0.042160986 | 0.31247051 | 0.14350070 | -0.0519353176 |
For the code the following R-packages needs to be installed data.table, WGCNA, readr, ggplot2, glmnet, caret, foreach, doMC
Predixcan(https://www.nature.com/articles/ng.3367, https://github.com/gamazonlab/MR-JTI/blob/master/model_training/predixcan/)