HallCircadian

Identifying marker genes to tell the circadian time using a single transcriptomic time point

We developed an ML-based pipeline to predict the circadian time (phase) at any single transcriptomic sampling time point, using gene expression data from a set of marker genes. The code uses an artificial neural network (implemented with TensorFlow). The training, test and validation datasets for our analysis can be found in the Data folder (as detailed below).

To try it out, the best place to start is the Jupyter notebook. This is what you need:

1. A working installation of Python and TensorFlow. A simple installation on your laptop can usually be done with pip (https://www.tensorflow.org/install/).
2. The data: we have provided this in the “Data” folder; it can be replaced in the notebook with your own data (see the loading sketch after this list).
	- X_train_raw is the Romanowski expression matrix before scaling and feature selection [Our training data]
	- X_valid_raw is the Yang expression matrix before scaling and feature selection [Our validation data]
	- X_test_raw_A/B are the Graf expression matrices for the two different ecotypes before scaling and feature selection [Our test data]
	- X_train_postprocessed is the Romanowski expression matrix after scaling and feature selection [Our training data]
	- X_valid_postprocessed is the Yang expression matrix after scaling and feature selection  [Our validation data]
	- X_test_postprocessed is the Graf expression matrix including the two different ecotypes after scaling and feature selection [Our test data]
	- X_train_clusters contains WGCNA clusters generated for the Romanowski data
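
For orientation, here is a minimal sketch of how the provided matrices might be loaded before running the notebook cells. The file names, extensions and layout (rows as sampling time points, columns as genes) are assumptions; adjust them to match the actual files in the Data folder.

```python
import pandas as pd

# Hypothetical file names and format: adjust to the files actually in Data/.
# Rows are assumed to be sampling time points, columns are genes.
X_train_raw = pd.read_csv("Data/X_train_raw.csv", index_col=0)    # Romanowski (training)
X_valid_raw = pd.read_csv("Data/X_valid_raw.csv", index_col=0)    # Yang (validation)
X_test_raw_A = pd.read_csv("Data/X_test_raw_A.csv", index_col=0)  # Graf, ecotype A (test)
X_test_raw_B = pd.read_csv("Data/X_test_raw_B.csv", index_col=0)  # Graf, ecotype B (test)

print(X_train_raw.shape, X_valid_raw.shape)
```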

The notebook takes as input the raw training, test and validation data (In[3]). After removing low-variance genes (In[4]), four metrics were applied to quantify the rhythmicity of each gene’s expression pattern in the training dataset: ARSER (Yang & Su, 2010), JTK-Cycle (Hughes et al., 2010), cross-correlation with circadian time, and autocorrelation. Although the metrics are highly correlated with each other, using several metrics should detect more rhythmic patterns than any single criterion. The scores from the metrics were combined using a Gaussian copula into a single score per gene, which allows the set of genes to be ranked by measured rhythmicity. The notebook includes the cross-correlation with circadian time and autocorrelation feature selection metrics that we used (‘CorrelationMetrics.py’ and In[6]). The top-ranking n genes were taken forward for model training and further feature selection, where n was adjusted to minimise validation error. After a ranked subset of n rhythmic genes was obtained, their expression levels were scaled with MinMaxScaler to the range 0 to 1, as this method of scaling gave the smallest error on the validation dataset (In[6]). The scaling transformation was fitted on the training set and applied to both the validation and test sets to prevent data leakage. Finally, we prioritized transcripts using feature selection so that the frequency distribution of sub-clusters generated by the WGCNA gene co-expression network analysis is uniform, and trained the neural network (‘CustomFeatureSelection.py’ and In[7]).
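As a rough illustration of the scaling step (not the repository’s exact code, which lives in the notebook and the helper scripts), the MinMaxScaler is fitted on the training expression matrix only and then reused for the validation and test matrices; the array shapes below are placeholders:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed layout: rows = time points, columns = the top-ranking rhythmic genes.
rng = np.random.default_rng(0)
X_train = rng.random((24, 100))  # placeholder for the selected training genes
X_valid = rng.random((12, 100))  # placeholder for the validation data

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
X_valid_scaled = scaler.transform(X_valid)      # apply the same transform: no leakage
```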

The training dataset was used to learn the circadian time in hours from the expression data at each transcriptomic time point individually, whilst the validation set was used to adjust hyperparameters and reduce overfitting. The test set was used to estimate the error on unseen data. Here n is our final set of transcripts, which is sufficient to allow prediction of the circadian time. Finally, we show the expected output line plots for training and validation predictions compared to ground-truth data, using our final subset of transcripts and optimized hyperparameters (In[8]). Because the target is cyclical (time 0–24 hours), standard regression loss functions were not suitable for this task. To quantify the error in the predictions, we defined the loss function as the squared angle between the actual and predicted circadian times after transforming them onto a unit circle (In[5]).
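For illustration, here is a minimal sketch of this kind of circular loss, assuming the true and predicted times are given in hours and mapped onto angles on the unit circle; the exact parameterisation used in In[5] may differ:

```python
import numpy as np
import tensorflow as tf

def circadian_loss(y_true_hours, y_pred_hours):
    """Mean squared angular difference (radians) between true and predicted circadian time."""
    # Map times in hours onto angles on the unit circle (24 h -> 2*pi radians).
    theta_true = 2.0 * np.pi * y_true_hours / 24.0
    theta_pred = 2.0 * np.pi * y_pred_hours / 24.0
    # Smallest signed angle between the two points on the circle,
    # so that e.g. 23 h and 1 h are only 2 h apart, not 22 h.
    diff = tf.atan2(tf.sin(theta_true - theta_pred), tf.cos(theta_true - theta_pred))
    return tf.reduce_mean(tf.square(diff))
```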

The ‘Results’ folder contains the final predictions generated by ‘FinalPredictions.py’, which is found in the repository.

Dependencies:
	- Python 3.7.5
	- Pandas 0.25.0
	- Numpy 1.16.6
	- Matplotlib 3.1.1
	- Sklearn 0.23.2
	- Seaborn 0.9.0
	- Tensorflow 2.0.0
	- Keras 2.2.4
	- Correlation Metrics
	- Tqdm 4.32.1