This blog records the progress of my thesis at XJTU.
Prof. Guo suggested that osteoarthritis might be a good research topic.
This article conducts a meta-analysis of GWAS. I might not do a meta-analysis myself, only a GWAS, so the goal might be a Learning Approach to Osteoarthritis Analysis.
For the theoretical basis, I searched for definitions of meta-analysis and for possible learning approaches in GWAS.
After discussion with Prof. Guo, there are three main problems I might focus on:
- Given the GEO data, analyze possible gene loci.
- Given the metabolism data, find possible pathways that can be regulated by drugs.
- Given the patient samples, predict the risk.
Hence I shall first look through the data and determine which one I should pick.
1. Selected 177517/826690 individuals for analysis
   - 2 EA groups and 11 European groups
2. Found 11897 SNVs
   - p < 1.3 × 10^-8
3. Phenotype conditional analysis identified 223 independent associations
4. 87/96 loci replicated
5. Phenotype-independent conditional analysis shows 100 associations
6. The lead SNV of each of the associations mentioned in 5 is selected
   - 6 coding SNVs, 59 SNVs residing within transcripts, 35 intergenic SNVs
   - Updated risk SNVs for different tissues
   - 6 rare SNVs are detected (discovered in Iceland)
7. Risk SNVs are also related to EA groups, with evidence in several phenotypes
8. Polygenic risk scores are related to some of the phenotypes
9. 3 female-only SNVs discovered
   - rs116112221 is interestingly located
10. Meta-analysis shows another risk variant
11. 60/100 SNVs related to phenotype
    - 40 weight-bearing only
    - 4 non-weight-bearing only
    - 42 both, which may contribute to pathology
12. Some SNVs are more involved in joint replacement than in osteoarthritis pathology, especially with pain
13. Identified 637 genes that could be effector genes
14. Identified 77 genes with higher potential based on various criteria
    - 4 supported by missense SNVs
    - 48 previously reported
    - 30 newly discovered
15. The 77 genes mentioned above are distributed across 6 groups
    - Skeletal development (63/77)
    - Joint degeneration
    - Neuronal function
    - Muscle function
    - Immune response
    - Adipogenesis
16. 205/637 genes are potential drug targets
    - 71/205 genes correspond to licensed drugs
17. 20/77 genes can be candidates
    - 7 newly discovered
32 studies contain metabolomic data
Tissue | Biospecimen | Phenotypes | Amount |
---|---|---|---|
Joint | Synovial | OA, RA | 13 |
Serum | Serum | OA, RA | 19 |
Phenotypes | Biospecimen | Methodology | Sample size | References |
---|---|---|---|---|
Knee OA, RA, and postmortem controls | SF | UPLC Q-TOF MS | ==OA 5, RA 3, Controls 5.== | Carlson A et al. [16] |
Early and later knee OA and controls (all postmortem) | SF | UPLC Q-TOF MS | ==Early OA 55; late OA 17; controls 7.== | Carlson A et al. [18] |
Knee OA and cadaveric controls | SF | 1H NMR and GC-MS | Knee OA 55; controls 13. | Mickiewicz B et al. [13] |
Knee OA, RA, postmortem controls | SF | ESI-MS/MS | ==Early OA 17; late OA 13; RA 18; controls 9.== | Kosinska M et al. [14] |
Knee OA vs. controls | SF | GC-TOF/MS | ==OA 49; controls 21.== | Zheng K et al. [17] |
Knee OA, gout, calcium pyrophosphate disease (CPPD), spondylarthritis, septic arthritis, and RA | SF | 1H NMR | OA 15; gout 18; CPPD 11; septic arthritis 4; RA 4; reactive arthritis 3; Crohn's disease 2; ankylosing spondylitis 1; psoriasis arthritis 1. | Hügle T et al. [22] |
Reactive arthritis and undifferentiated spondyloarthropathy; RA, and OA | SF | 1H NMR | ==OA 21; RA 25; and reactive arthritis 30.== | Muhammed H et al. [23] |
Knee OA severity | SF | GC/TOF MS | ==OA 15.== | Kim S et al. [12] |
Knee and hip OA | SF | 1H NMR | ==Hip 12; knee 12.== | Akhbari P et al. [19] |
Classification of OA | SF | | Hip and knee OA 80. | Zhang W. et al. |
Knee OA vs. controls and other forms of arthritis | Serum | GC-TOF MS/UPLC-QTOF MS | ==OA 27; RA 27; AS 27; gout 33, and controls 60.== | Jiang M et al. [25] |
OA, RA, and FM | Bloodspot | IRMS | OA 12; RA 15; FM 14. | Hackshaw KV et al. [26] |
Knee OA vs. controls | Plasma | GC/Q-TOF-MS | ==OA 12; controls 29.== | Huang Z et al. [27] |
OA vs. controls | Serum | LC/MS | ==Knee and hip OA 70; controls 82.== | Tootsi K. et al. [28] |
OA vs. controls | Serum | 1H NMR | ==OA 1556; controls 2125.== | Meessen, J. et al. [29] |
OA vs. controls | Serum | UPLC-TQ-MS | ==OA 32 and controls 35 in discovery cohort; OA 30 and controls 30 in replication cohort.== | Chen R. et al. [30] |
Obesity and non-obesity knee OA vs. controls | Serum | LC/Q-TOF/MS/MS | Obesity knee OA 14; non-obesity knee OA 14, and controls 15. | Senol O et al. [31] |
Knee OA and risk for TKR | Plasma and serum | HPLC-MS/MS | ==Knee OA 64 and control 45 in the discovery cohort; knee OA 72 and controls 76 in the replication cohort; 158 subjects in the longitudinal study.== | Zhang W. et al. [8] |
Knee cartilage volume loss over 2 years | Serum | HPLC-MS/MS | ==Knee OA 139.== | Zhai G et al. [33] |
Drug response in knee OA | Serum | HPLC-MS/MS | Knee OA 158. | Zhai G et al. [34] |
Knee OA | Plasma | HPLC-MS/MS | ==Knee OA 64 and controls 45 in the discovery cohort; knee OA 72 and controls 76 in the replication cohort.== | Zhang W. et al. [35] |
Knee OA | Serum | HPLC-MS/MS | ==Knee OA 123 and controls 299 in the discovery cohort; knee OA 76 and controls 100 in the replication cohort.== | Zhai G. et al. [36] |
Knee OA progression in 5 years | Serum | HPLC-MS/MS | ==Knee OA progressor 234; nonprogressor 322.== | Zhai G et al. [39] |
- OA patient and control groups might carry background bias due to the source of the samples
- 58/1233 metabolites varied in Carlson A et al.
  - Involved pathways:
    - NO production
    - Chondroitin sulfate degradation
    - Arg and Pro metabolism
  - Their study also included RA in SF; however, no differences were found
- 188/9903 metabolites varied in Carlson A et al. in a larger group
  - Involved pathways:
    - Extracellular matrix components metabolism
    - AA, fatty acid and lipid metabolism
    - Inflammation
    - Energy metabolism
    - Vitamin metabolism
  - Cluster results:
    - Increased inflammation
    - Oxidative stress
    - Structural deterioration
- Energy demand varies in Mickiewicz B et al.
- Sphingomyelin (SM) and ceramide were the most abundant among samples in Kosinska M et al.
- Three main molecules were found to differ in the study by Zheng K et al., which was validated in a replication cohort. They even differ between OA and RA
  - Glutamine
  - 1,5-anhydroglucitol
  - Gluconic lactone
- Two other studies report consistency between OA and RA, but they are limited by group size.
- In Xu Z et al., samples from different joints of the same patient help eliminate possible differences between individuals
  - Involved pathways:
    - Phenylalanine metabolism
    - Taurine and hypotaurine metabolism
    - Arg and Pro metabolism
- 68/469 metabolites were found to differ in Yang G. et al.
- 28/114 metabolites differ between early and late radiographic OA in Kim S et al.
- Knee OA and hip OA could differ according to Akhbari P et al.
- Metabolic syndrome might be used to cluster patients, according to Guangju Z et al.
- Jiang M et al. controlled for sex in their analysis. 6/30 metabolites were considered different, giving an AUC of 0.91. They also found differences between OA and RA.
- Huang Z et al. studied 12 knee OA patients and 20 healthy controls and identified three metabolites – succinic acid, xanthurenic acid, and tryptophan.
- Tootsi K et al. studied 70 knee and hip OA patients and 82 controls and found that glycine and arginine were independently associated with OA radiographic severity.
- Meessen J et al. studied a total of 227 metabolites assessed on an NMR platform in 2125 controls and 1556 OA cases. Optimal group size?
- Chen R et al. focused on amino acid differences across the population and found several related to OA.
- Senol O et al. and Zhang W et al. focused on phenotypes such as obesity and diabetes.
- Evolutionary learning
- Differential correlation network
- Meta Analysis?
Using UKB to apply final multimodal model for risk evaluation
Similar research has been performed as GWAS for osteo
However, after discussion with Prof. Guo, my plan to set up that three-source network was rejected, so I need to come up with a new strategy.
After screening the following articles and integrating Prof. Guo's advice, a new setup is to be made:
What is Few-Shot Learning? Methods & Applications in 2022
Low Data Drug Discovery with One-Shot Learning | ACS Central Science
A Comprehensive Survey on Graph Neural Networks | IEEE Journals & Magazine | IEEE Xplore
https://www.nature.com/articles/s41588-018-0327-1
```mermaid
graph LR
A[SNP] -->B{GNN-Network}
B--> E(Risk Prediction)
```
```mermaid
graph LR
A[SNP] -->B{GNN-Network}
B--> E(Risk Prediction)
B--> C[SNP interaction]
C--->A
```
```mermaid
graph LR
A[SNP] -->B{GNN-Network}
B--> E(Risk Prediction)
B--> C[SNP interaction]
C ---> A
D[Phenotype] --->E
```
```mermaid
graph LR
A[SNP] -->B{GNN-Network}
B--> E(Risk Prediction)
B--> C[SNP interaction]
C --> F(Genes and Pathways)
D[Phenotype] --->E
```
**Stage 5 plan: Introduce metabolites to GNN, increase (?) accuracy**
```mermaid
graph LR
A[SNP] -->B{GNN-Network}
B--> E(Risk Prediction)
B--> C[SNP interaction]
C --> F(Genes and Pathways)
D[Phenotype] --->E
H[Metabolites] ---> B
```
Starting from the TIID data I retrieved last year, I began developing a simple GNN based on the paper Design Space for Graph Neural Networks, following its main guidelines.
To test the network, I apply 3 convolutional GNN layers to process the graph and a pooling layer to read out, or simply to eliminate the unrelated nodes. The detailed structure is:
- GNN Layer with 128 channels 0.1 dropout
- GNN Layer with 64 channels 0.1 dropout
- GNN Layer with 32 channels 0.1 dropout
- Pooling layer by summing
- Dense Layer to make prediction
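The layer stack above can be sketched as a forward pass in plain Python. This is only an illustration and not the model I actually trained: it assumes mean aggregation over neighbours with self-loops, ReLU activations, random untrained weights, and no dropout (as at inference time). The names `gnn_layer`, `sum_pool` and `forward` are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_weights(n_in, n_out, rng):
    """Small random weight matrix; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def gnn_layer(features, adjacency, weights):
    """One message-passing layer: average each node with its neighbours, project, apply ReLU."""
    n = len(features)
    dim = len(features[0])
    aggregated = []
    for i in range(n):
        # Each node also listens to itself (self-loop).
        neighbours = [j for j in range(n) if adjacency[i][j] or j == i]
        aggregated.append(
            [sum(features[j][k] for j in neighbours) / len(neighbours) for k in range(dim)]
        )
    projected = matmul(aggregated, weights)
    return [[max(0.0, v) for v in row] for row in projected]

def sum_pool(features):
    """Readout: sum all node embeddings into one graph-level vector."""
    return [sum(col) for col in zip(*features)]

def forward(features, adjacency, channels=(128, 64, 32), seed=0):
    """Three GNN layers (128, 64, 32 channels), sum pooling, then a dense sigmoid output."""
    rng = random.Random(seed)
    h = features
    for out_dim in channels:
        h = gnn_layer(h, adjacency, random_weights(len(h[0]), out_dim, rng))
    pooled = sum_pool(h)
    dense = random_weights(len(pooled), 1, rng)
    logit = matmul([pooled], dense)[0][0]
    return pooled, 1.0 / (1.0 + math.exp(-logit))
```

Sum pooling keeps the readout independent of node order, which is why a single dense layer can follow it directly.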
The currently trained model shows good convergence of the training loss, but, limited by data quality, accuracy is still relatively low. Thus I might use the osteoarthritis genotype data from UKB to make further progress.
However, when I first selected the 100 SNP loci derived from the Nature article, forming a dataset with 17000 samples and 100 features, I failed to construct a reasonable classification model. I tried kNN, random forest and XGBoost, and the AUC was unacceptable (0.52).
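To make clear why 0.52 is unacceptable, here is a minimal sketch of what AUC measures: the probability that a randomly chosen case receives a higher score than a randomly chosen control, so 0.5 is pure chance. The function name `roc_auc` is hypothetical, and the pairwise loop is quadratic, so it is meant for illustration rather than for 17000 samples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count as 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

A classifier that gives every sample the same score lands exactly on 0.5, which is essentially where the first models were.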
According to the supplementary data of that article, these SNPs do not show good classification potential (AUC 0.5 by PRS). So I switched to another SNP source based on research on UKB data (UKB data branch on GitHub).
Shit always happens: the new dataset (SNP=77) gives only a small increase in AUC (0.52-0.54). So I still need to turn to the literature for solutions. One study reveals that increasing the number of input features improves performance, so I built another dataset with p<1e-5 and MAF>0.01 (SNP=8800).
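The filter itself is simple. A minimal sketch, assuming the summary statistics are available as a list of dicts with hypothetical keys `rsid`, `freq` (effect allele frequency) and `p`; the real file layout may differ.

```python
def select_snps(summary_stats, p_max=1e-5, maf_min=0.01):
    """Return rsIDs with p below p_max and minor allele frequency above maf_min."""
    selected = []
    for snp in summary_stats:
        # The minor allele frequency is the smaller of the two allele frequencies.
        maf = min(snp["freq"], 1.0 - snp["freq"])
        if snp["p"] < p_max and maf > maf_min:
            selected.append(snp["rsid"])
    return selected
```

Taking `min(freq, 1 - freq)` matters because the reported frequency refers to the effect allele, which is not always the minor one.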
BUT BUT BUT, the result is still bad (AUC 0.6 by XGBoost). HOW could it happen?! The literature tells me that the encoding scheme might matter, so I changed from ATCG one-hot encoding to the number of effect alleles observed at each SNP locus. AUC improved a little. Then I realized that I cannot feed that many features relative to the number of samples (8000:6000). So I ran a feature selection procedure based on both chi-square and linear SVC methods. Finally, the AUC of a simple model (decision tree) improved.
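The two encodings and a chi-square score can be sketched as follows. This is an illustration under assumptions: genotypes are two-letter strings such as `AG`, and `chi_square_score` is a plain contingency-table statistic (case status by dosage), which is not necessarily the exact statistic used in my feature selection. All function names are hypothetical.

```python
def one_hot(genotype, alphabet="ACGT"):
    """One-hot encode both alleles: 2 alleles x 4 bases = 8 features per SNP."""
    return [1 if allele == base else 0 for allele in genotype.upper() for base in alphabet]

def effect_allele_dosage(genotype, effect_allele):
    """Count copies (0, 1 or 2) of the effect allele: 1 feature per SNP."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype such as 'AG'")
    return sum(1 for allele in genotype.upper() if allele == effect_allele.upper())

def chi_square_score(dosages, labels):
    """Pearson chi-square statistic for the 2 x 3 table of case status vs. dosage."""
    table = {(y, d): 0 for y in (0, 1) for d in (0, 1, 2)}
    for d, y in zip(dosages, labels):
        table[(y, d)] += 1
    n = len(labels)
    stat = 0.0
    for y in (0, 1):
        row_total = sum(table[(y, d)] for d in (0, 1, 2))
        for d in (0, 1, 2):
            col_total = table[(0, d)] + table[(1, d)]
            expected = row_total * col_total / n
            if expected > 0:  # skip dosage values that never occur
                stat += (table[(y, d)] - expected) ** 2 / expected
    return stat
```

The dosage encoding cuts the feature count per SNP from 8 to 1, and ranking SNPs by the chi-square score gives one simple way to keep only the most informative ones.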
model_type | metric_type | metric_value |
---|---|---|
Baseline | auc | 0.5 |
Decision Tree | auc | 0.644074 |
Xgboost | auc | 0.714733 |
Neural Network | auc | 0.649612 |
Random Forest | auc | 0.681943 |
PRS | auc | 0.63 |
The baseline for the whole study is the PRS result, with an AUC of 0.63.
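For reference, a PRS in its simplest form is a weighted sum of effect-allele dosages. The sketch below assumes per-allele effect sizes keyed by rsID and dosages of 0, 1 or 2; the function name `polygenic_risk_score` is hypothetical, and the actual baseline may have used additional steps such as clumping or thresholding.

```python
def polygenic_risk_score(dosages, effects):
    """Sum of (effect size x effect-allele dosage) over the SNPs of one individual."""
    missing = set(dosages) - set(effects)
    if missing:
        raise KeyError(f"no effect size for: {sorted(missing)}")
    return sum(effects[rsid] * dosage for rsid, dosage in dosages.items())
```

Because the score is linear in the dosages, it cannot capture SNP interactions, which is the gap the GNN is meant to explore.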
For convenience, I selected Spektral as the GNN package and built the model as:
- Input
- 64-channel GNN
- 64-channel GNN
- Readout layer
- Dense
- Dense
The training results are as follows:
The optimal model is as follows:
Using the GNN explainer, I can retrieve the essential data contributing to the prediction task. Taking different thresholds on the subgraph, I can find something interesting.
For instance, for Case 4496347 in UKB, I got the following subgraph:
The SNPs have the following features:
RSID | FREQ | Effect | P |
---|---|---|---|
rs10912775 | 0.2448 | -0.0325 | 4.041E-07 |
rs332797 | 0.2496 | -0.0313 | 8.768E-07 |
rs12065527 | 0.2475 | -0.0328 | 5.929E-07 |
rs4652425 | 0.2452 | -0.0327 | 3.37E-07 |
rs112519723 | 0.2474 | -0.031 | 1.217E-06 |
rs2568096 | 0.2498 | -0.0309 | 1.228E-06 |
rs10912776 | 0.2448 | -0.0325 | 4.051E-07 |
rs12084948 | 0.2464 | -0.0327 | 3.115E-07 |
Relaxing the threshold, we get:
More SNPs are found to contribute to the prediction. Nevertheless, the detailed meaning of this graph remains unknown and awaits later interpretation.
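The thresholding step can be sketched like this, assuming the explainer yields one importance value per edge between SNPs. The function name `important_subgraph` and the dict layout are hypothetical and do not reflect the explainer's actual output format.

```python
def important_subgraph(edge_importance, threshold):
    """Keep edges whose importance is at least `threshold`, plus the nodes they touch."""
    edges = [edge for edge, weight in edge_importance.items() if weight >= threshold]
    nodes = sorted({node for edge in edges for node in edge})
    return nodes, edges
```

Lowering `threshold` can only add edges, so the subgraph obtained with a relaxed threshold always contains the stricter one.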