Machine learning project aimed at establishing a relationship between telomere length and survival in a population.

Predictive Senescence




Why choose Senescence?

In the early 1970s, Soviet theorist Alexei Olovnikov recognized that, according to the scientific models of the time, chromosomes are not able to replicate their ends. The ends of the chromosomes are called telomeres, and they have been studied ever since to figure out what the implications of this imperfect replication are. Over the years, it has been hypothesized that telomeres are associated with the process of aging, also called senescence. Studying senescence could lead to discoveries in advanced Age Therapy and, potentially, to the reversal of the aging process.

This project was an attempt to determine how health and environmental factors relate to telomere length, and vice versa. To do so, we needed a large dataset containing telomere measurements taken from individuals of a species throughout their lifetimes. This project used a study of the Seychelles warbler population on Cousin Island in the Seychelles, off the coast of Africa. For more context on this study, please see this document.

Description of the data source

Seychelles Warbler - Photo by Chong Boon Leong

The data came from this source. The supplemental information file contains the following description and sources:

"Detailed in (Sparks et al., 2021): “The Seychelles warbler is a small passerine endemic to the Seychelles archipelago (Komdeur et al., 1991). The entire population (~320 adult individuals in 115 territories) on Cousin island (29 ha; 04°20′S, 55°40′E) has been monitored intensively since 1985 (Hammers et al., 2019; Komdeur, 1992; Raj Pant et al., 2019; Richardson et al., 2007).”, “Since 1990, blood samples (~25 μl) have been taken and stored at room temperature in absolute ethanol, thus allowing …. telomere length measurement (Barrett et al., 2013).” “We used the telomere data set generated in Spurgin et al. (2018), which included birds caught and blood sampled between 1995 and 2014, when the data were most complete. RTL was estimated using qPCR (quantitative polymerase chain reaction; Barrett et al., 2013; Bebbington et al., 2016; Spurgin et al., 2018). DNA integrity (agarose gel) and 260/280 ratios were checked in all samples before running any qPCR, and any samples with signs of degradation were removed, reextracted and checked again.” "

The data set contains the data collected from all the aforementioned studies and includes the telomere length measurements we are interested in.



Hypotheses and questions to be answered with the data

Hypothesis

There is a positive correlation between telomere length and survival

Questions:

  • Does the telomere length decrease over time for each bird? If yes, is the rate similar?
  • Is there a correlation between telomere length and the age class of the birds?



Presentation

This project includes a Google Slides presentation, which can be found here: Predictive Senescence Presentation.

Environments

You can follow this tutorial to recreate this project's environment with Anaconda using the environment.yml file. Alternatively, this tutorial shows how to pip install the packages listed in the requirements.txt file.



Database

Our database is built in PostgreSQL using the following queries.

Database_ERD

If you wish to connect to the DB please do the following:

  1. Rename this text file to "config.py".
  2. Fill in the database password as instructed in the document.
  3. Be sure to change the psycopg2.connect arguments to match your RDS instance.
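
The steps above can be sketched roughly as follows. This is a minimal example, not the project's notebook code: the helper names, the default host, database, and user values, and the `db_password` variable name in config.py are all assumptions to replace with your own settings.

```python
def build_connection_params(password, host="localhost", port=5432,
                            dbname="postgres", user="postgres"):
    """Collect the keyword arguments that will be passed to psycopg2.connect.

    All defaults are placeholders; replace them with your RDS endpoint,
    database name, and user.
    """
    return {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "password": password,
    }

def open_connection(params):
    """Open a PostgreSQL connection using the given parameters."""
    import psycopg2  # imported here so the helper above works without the driver
    return psycopg2.connect(**params)

# Example usage (needs config.py and a reachable database):
# from config import db_password  # hypothetical variable name
# conn = open_connection(build_connection_params(db_password, host="<your RDS endpoint>"))
```

Keeping the password in config.py, rather than in the notebook itself, makes it easier to avoid committing credentials to the repository.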

For reference, here is a video showing how to connect the DB: DB Connection Video.



Machine Learning Models

The hierarchical cluster model uses telomere length to determine age class. It combines hierarchical clustering with KMeans and includes the class labels chick, juvenile, and adult.
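
To illustrate the KMeans part of this idea, here is a minimal plain-Python sketch that groups one-dimensional telomere values into three clusters. It is not the project's code, it does not cover the hierarchical step, and the function name `kmeans_1d` and the sample values are hypothetical. Mapping each cluster to an age class is a separate step that is not shown.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster 1-D values with Lloyd's algorithm; return (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return centroids, labels

# Hypothetical RTL values forming three loose groups (invented for illustration).
rtl = [1.45, 1.50, 1.55, 1.00, 1.05, 0.95, 0.50, 0.55, 0.60]
centroids, labels = kmeans_1d(rtl, k=3)
```

Because the clusters are found without using the age labels, comparing them with the recorded age classes afterwards shows how much of the age structure telomere length alone can recover.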

The categorical machine learning model uses telomere length to determine age class. It uses a Random Forest model with the class labels chick, juvenile, and adult, which outperformed the Logistic Regression model that used only the Young and Adult age classes.
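
A Random Forest classifier of this kind can be sketched with scikit-learn as shown below. The data are synthetic and deliberately easy to separate, the assignment of longer telomeres to younger classes is an assumption made only to generate the example, and the hyperparameters are illustrative rather than the ones used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic RTL values, invented for illustration: three well-separated groups.
# Which class has the longest telomeres is an assumption of this example.
rtl = np.concatenate([
    rng.normal(1.5, 0.05, 40),  # "chick"
    rng.normal(1.0, 0.05, 40),  # "juvenile"
    rng.normal(0.5, 0.05, 40),  # "adult"
])
age_class = np.array(["chick"] * 40 + ["juvenile"] * 40 + ["adult"] * 40)

X = rtl.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, age_class, test_size=0.25, random_state=0, stratify=age_class
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # fraction of test birds classified correctly
print(f"Test accuracy: {accuracy:.2f}")
```

Real telomere data overlap far more between age classes than this synthetic example, so accuracy on the actual dataset should be expected to be lower.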

The continuous machine learning model initially used the rate at which telomeres change in an individual to determine the amount of aging observed. It used a linear regression model with some feature engineering to improve the correlation. After we noticed a megaphone-shaped data distribution (the spread of the values widened across the range), the Box-Cox transformation was applied to better demonstrate the linear correlation between the rate of change of telomere length and bird age.
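
For reference, the Box-Cox transform itself is simple to write down. The sketch below applies it for a given λ in plain Python; the function name `box_cox` and the sample values are hypothetical. In practice λ is usually estimated from the data, for example `scipy.stats.boxcox` does so by maximum likelihood when no λ is supplied. The transform requires strictly positive inputs, so values that can be zero or negative would need to be shifted first, which is an assumption about preprocessing and not something this README specifies.

```python
import math

def box_cox(values, lmbda):
    """Apply the Box-Cox transform to strictly positive values.

    y = (x**lmbda - 1) / lmbda   if lmbda != 0
    y = log(x)                   if lmbda == 0
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]

# Hypothetical values whose spread grows with their size (invented for illustration).
values = [1.0, 10.0, 100.0]
log_scaled = box_cox(values, 0)     # lambda = 0 is the log transform
sqrt_scaled = box_cox(values, 0.5)  # lambda = 0.5 is a shifted, scaled square root
```

Values of λ below 1 compress large values more than small ones, which is what narrows a megaphone-shaped spread before fitting a linear model.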

Tableau

Dashboard

In order to effectively present findings, some key indicators were displayed in the following Tableau dashboard:

Predictive Senescence - Tableau Story

Interactive elements

Tableau allows the use of interactive elements that provide tools to further explore data without the need for new charts.

For example, in our Grouped Classes by Age dashboard, users can drag a cursor selection to narrow the data. This lets them zoom in on the scatter plot to further differentiate birds with high relative telomere length (RTL) and short average age. Similar selections are possible in all our Tableau visualizations, and some also include filters for individual characteristics such as sex.



Potential future research

To meet deadlines, this project was fairly limited in scope. Many questions remain for further investigation, including the following:

  • How do environmental factors change telomere length?
  • Does the presence of dominant birds influence survival rates of the other birds?
  • Does pedigree influence telomere length?



Contributors

Our team (emoji key):


James Bell

💻 👀🤔🔣🔬

Kermit Bravo

💻🎨🤔🔬

Robin Dassy

💻 👀 🤔🔬

Kari Hodge

💻🤔🔣🔬


Special Thanks:

Jakob Akhmerov
🤔
Artem Bordetskiy
🧑‍🏫🤔
Trent Little
🤔
Jackson Sheppard
🤔
Klaus Smit
🧑‍🏫🤔