
PROJECT MAYJ DSA4262 PROJECT REPOSITORY

Group Members:

Michael Yang, Amas Lua, Yong Sing Chua, Jing Yi Yan

Project Motivation

This project aims to build a classification model that detects post-transcriptional m6A modification in RNA transcript reads. Given the RNA-Seq short reads at a candidate transcript position, the model outputs the probability that the position carries an m6A modification.

m6A modification (m6A RNA methylation) is the most abundant of the many post-transcriptional modifications discovered so far, and it serves as a significant regulator of transcript expression.

During the modification, adenosine molecules are methylated, changing their chemical structure in ways that can be harmful to one's metabolism. When the rate of m6A modification is too high, oncoprotein expression may increase, triggering cancer cell proliferation and metastasis.

Hence, it is crucial to investigate means of detecting such modifications, to allow early detection of illnesses and open up the possibility of immunotherapy in cancer treatments.

image Image source: Spandidos Publications

Project Video:

Youtube Video Thumbnail

Overview of how we build our model:

image

  • Step 1: Parse the provided data.json file into a pandas DataFrame for further processing; a short parsing sketch follows the example below.

    Before:

    {"ENST00000000233":
        {"244":
            {"AAGACCA":[[0.00299,2.06,125.0,0.0177,10.4,122.0,0.0093,10.9,84.1],
                        [0.00631,2.53,125.0,0.00844,4.67,126.0,0.0103,6.3,80.9],
                        [0.00465,3.92,109.0,0.0136,12.0,124.0,0.00498,2.13,79.6]]
            }
        }
    }

    After:

    Note: We added a Read_Counts column, which corresponds to the number of reads for each transcript at each candidate m6A position.

    |   | Transcript | Position | Sequence | Read_Counts | dwelling_time(-1) | std_dev(-1) | mean_current(-1) | dwelling_time(0) | std_dev(0) | mean_current(0) | dwelling_time(+1) | std_dev(+1) | mean_current(+1) |
    |---|------------|----------|----------|-------------|-------------------|-------------|------------------|------------------|------------|-----------------|-------------------|-------------|------------------|
    | 0 | ENST00000000233 | 244 | AAGACCA | 185 | 0.00299 | 2.06 | 125.0 | 0.01770 | 10.40 | 122.0 | 0.00930 | 10.90 | 84.1 |
    | 1 | ENST00000000233 | 244 | AAGACCA | 185 | 0.00631 | 2.53 | 125.0 | 0.00844 | 4.67 | 126.0 | 0.01030 | 6.30 | 80.9 |
    | 2 | ENST00000000233 | 244 | AAGACCA | 185 | 0.00465 | 3.92 | 109.0 | 0.01360 | 12.00 | 124.0 | 0.00498 | 2.13 | 79.6 |
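
    A minimal parsing sketch, assuming data.json stores one JSON record per line in the nested shape shown above; the file name and column order follow this README, and the exact script we used lives in this repository:

    import json
    import pandas as pd

    # Column names follow the "After" table above: three signal features
    # for each of the three positions around the candidate site.
    FEATURE_COLS = [f'{name}({pos})'
                    for pos in ('-1', '0', '+1')
                    for name in ('dwelling_time', 'std_dev', 'mean_current')]

    rows = []
    with open('data.json') as f:            # assumption: one JSON record per line
        for line in f:
            record = json.loads(line)
            for transcript, positions in record.items():
                for position, sequences in positions.items():
                    for sequence, reads in sequences.items():
                        for read in reads:  # one DataFrame row per read
                            rows.append([transcript, int(position), sequence,
                                         len(reads)] + read)  # len(reads) = Read_Counts

    df = pd.DataFrame(rows, columns=['Transcript', 'Position', 'Sequence',
                                     'Read_Counts'] + FEATURE_COLS)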
  • Step 2: Train-test split by gene_id (found in data.info), ensuring that no gene appears in more than one split, which prevents data leakage; a split sketch follows this list.

    • Categorised the genes into 3 categories based on each gene's transcript count: Low, Medium and High
      • Low: fewer than 15 transcripts
      • Medium: 15 to 45 transcripts
      • High: more than 45 transcripts
    • Distribution of split:
      • Training set: 70% of the genes from each category
      • Test set: remaining 30% of the genes from each category
    • Concatenate the sampled gene_ids from all 3 categories into two lists: train_genes and test_genes
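
    A sketch of this per-category 70/30 split, assuming data.info is loaded with the columns shown in Step 3, and with an illustrative fixed seed:

    import numpy as np
    import pandas as pd

    info = pd.read_csv('data.info')  # gene_id, transcript_id, transcript_position, label

    # Count transcripts per gene and bin the counts into Low / Medium / High.
    counts = info.groupby('gene_id')['transcript_id'].nunique()
    category = pd.cut(counts, bins=[0, 14, 45, np.inf],
                      labels=['Low', 'Medium', 'High'])

    rng = np.random.default_rng(42)  # the seed is an assumption, for reproducibility
    train_genes, test_genes = [], []
    for cat in ('Low', 'Medium', 'High'):
        genes = counts[category == cat].index.to_numpy()
        rng.shuffle(genes)
        cut = int(0.7 * len(genes))  # 70% of each category goes to training
        train_genes.extend(genes[:cut])
        test_genes.extend(genes[cut:])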
  • Step 3: Feature extraction and data transformations

    • Split the Sequence column into 3 columns: first_base, last_base and middle_sequence. For the first_base and last_base columns, we converted the letters into numbers with a mapping dictionary:

      ATGC_mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    • Join the labels (response variable) from the data.info file into the dataframe.

    • data.info sample rows:

      gene_id,transcript_id,transcript_position,label
      ENSG00000004059,ENST00000000233,244,0
      ENSG00000004059,ENST00000000233,261,0
      ENSG00000004059,ENST00000000233,316,0
      
    • Combined DataFrame:

      |   | gene_id | transcript_id | transcript_position | first_base | last_base | middle_sequence | Read_Counts | dwelling_time(-1) | std_dev(-1) | mean_current(-1) | dwelling_time(0) | std_dev(0) | mean_current(0) | dwelling_time(+1) | std_dev(+1) | mean_current(+1) | label |
      |---|---------|---------------|---------------------|------------|-----------|-----------------|-------------|-------------------|-------------|------------------|------------------|------------|-----------------|-------------------|-------------|------------------|-------|
      | 0 | ENSG00000004059 | ENST00000000233 | 244 | 0 | 0 | AGACC | 185 | 0.00299 | 2.06 | 125.0 | 0.01770 | 10.40 | 122.0 | 0.00930 | 10.90 | 84.1 | 0 |
      | 1 | ENSG00000004059 | ENST00000000233 | 244 | 0 | 0 | AGACC | 185 | 0.00631 | 2.53 | 125.0 | 0.00844 | 4.67 | 126.0 | 0.01030 | 6.30 | 80.9 | 0 |
      | 2 | ENSG00000004059 | ENST00000000233 | 244 | 0 | 0 | AGACC | 185 | 0.00465 | 3.92 | 109.0 | 0.01360 | 12.00 | 124.0 | 0.00498 | 2.13 | 79.6 | 0 |
    • Using the gene_id lists from Step 2, we split the dataframe into a training set and a test set.

    • For the training set dataframe, we resampled the rows of the minority class to deal with the imbalanced dataset (m6A-modified transcripts are far fewer than unmodified ones).

    • After resampling, we dropped the identity columns, namely gene_id and transcript_id. The Read_Counts column was also dropped, as it is not logical to use it as a feature for predicting m6A modification.

    • For the middle_sequence column, we converted the categorical variable using scikit-learn's OneHotEncoder before fitting the model, since XGBoost cannot be fitted on character-type values. A condensed sketch of this whole step follows.
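
    The sketch below strings these sub-steps together, assuming df, info, train_genes and test_genes from the earlier sketches; the actual pipeline is in this repository:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.utils import resample

    ATGC_mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    df['first_base'] = df['Sequence'].str[0].map(ATGC_mapping)
    df['last_base'] = df['Sequence'].str[-1].map(ATGC_mapping)
    df['middle_sequence'] = df['Sequence'].str[1:-1]

    # Join the labels from data.info on transcript id and position.
    merged = df.merge(info, left_on=['Transcript', 'Position'],
                      right_on=['transcript_id', 'transcript_position'])

    # Split by the gene lists from Step 2 to avoid leakage.
    train = merged[merged['gene_id'].isin(train_genes)].copy()
    test = merged[merged['gene_id'].isin(test_genes)].copy()

    # Upsample the minority (modified) class to balance the training set.
    majority = train[train['label'] == 0]
    minority = resample(train[train['label'] == 1], replace=True,
                        n_samples=len(majority), random_state=42)
    train = pd.concat([majority, minority])

    # Drop identity columns and Read_Counts, then one-hot encode middle_sequence.
    train = train.drop(columns=['gene_id', 'transcript_id', 'Transcript',
                                'Sequence', 'Read_Counts'])
    encoder = OneHotEncoder(handle_unknown='ignore')
    middle_ohe = encoder.fit_transform(train[['middle_sequence']]).toarray()

    X_train = np.hstack([train.drop(columns=['middle_sequence', 'label']).to_numpy(),
                         middle_ohe])
    y_train = train['label'].to_numpy()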

  • Step 4: Build a baseline XGBoost Model with resampled training data from Step 3
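
    A minimal baseline sketch, assuming X_train and y_train from the Step 3 sketch; the parameter values are plain illustrative defaults, not our tuned ones:

    from xgboost import XGBClassifier

    baseline = XGBClassifier(
        n_estimators=100,     # illustrative values only
        max_depth=6,
        learning_rate=0.3,
        eval_metric='aucpr',  # PR-AUC suits the imbalanced labels
        random_state=42,
    )
    baseline.fit(X_train, y_train)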

  • Step 5: Experiment with different values of hyper-parameters such as max_depth, learning_rate, n_estimators, reg_alpha and reg_lambda, tracking the differences with the MLflow GUI. More information about how we used MLflow can be found in the model_training folder.
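
    One way such a run can be logged; a hedged sketch that assumes X_train/y_train plus a held-out X_test/y_test built the same way from the test genes, with illustrative parameter values:

    import mlflow
    from sklearn.metrics import average_precision_score, roc_auc_score
    from xgboost import XGBClassifier

    params = {'max_depth': 6, 'learning_rate': 0.1, 'n_estimators': 200,
              'reg_alpha': 0.1, 'reg_lambda': 1.0}  # illustrative values

    with mlflow.start_run(run_name='xgboost_tuning'):
        mlflow.log_params(params)
        model = XGBClassifier(**params).fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]
        mlflow.log_metric('roc_auc', roc_auc_score(y_test, scores))
        mlflow.log_metric('pr_auc', average_precision_score(y_test, scores))

    Running mlflow ui in the same directory afterwards and opening the local URL it prints lets you compare runs side by side.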

  • Step 6: Choose the best model. Among the parameter combinations we tried, we chose the one that yielded the largest improvement in the roc_auc and pr_auc metrics over our first model as our final model. A sketch of the threshold metrics follows the table below.

    | Model      | pr_auc | roc_auc | precision | recall | f1_score |
    |------------|--------|---------|-----------|--------|----------|
    | XGBoost_v1 | 0.31   | 0.837   | 0.128     | 0.768  | 0.22     |
    | XGBoost_v2 | 0.31   | 0.847   | 0.124     | 0.789  | 0.214    |
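
    The precision, recall and f1_score columns can be computed as below; a sketch assuming scores and y_test from the previous sketch, with a 0.5 decision threshold (an assumption, not necessarily the cut-off used for this table):

    from sklearn.metrics import f1_score, precision_score, recall_score

    preds = (scores >= 0.5).astype(int)  # 0.5 threshold is an assumption
    print('precision:', precision_score(y_test, preds))
    print('recall:   ', recall_score(y_test, preds))
    print('f1_score: ', f1_score(y_test, preds))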

How to use our model to get predictions:

  • Note: XGBoost_v1.pkl in this directory is the model we built for our first submission, while XGBoost_v2.pkl is the model we built for our second submission. XGBoost_v2.pkl is the model you will use to test our inference pipeline.

image

1. Provision an Ubuntu 20.04 Large Instance on Research Gateway.

We recommend an EBS Volume Size of 1500GB and an Instance Type of 4xlarge for faster results. This configuration can also handle the workload of predicting the probability of m6A modification of the short reads in the SG-NEx samples in reasonable time.

2. SSH into your instance to use the Linux terminal.

  • Your instance's IP address can be found by following the steps in the screenshot below:

    image

  • Alternative SSH Method 1: Using local Windows PowerShell

    You can use the following command format to SSH into the instance using Windows PowerShell:

    ssh -i '<path/to/pemfile.pem>' -L 8888:localhost:8888 ubuntu@<ip-address of Instance>

    image

  • Alternative SSH Method 2: Using Visual Studio Code's Remote-SSH extension

    • Download the Remote-SSH extension if you do not have it. You can refer to the screenshot below to check whether the extension is installed.

      image

    • Press the F1 key (or Fn + F1) to launch the command palette, where you can configure your SSH host details.

      image

    • Paste the following configuration into your config file, replacing 123.456.789.542 with your instance's IP address and ~\path\to\pem-file.pem with the path to your PEM file:

      Host 123.456.789.542
          HostName 123.456.789.542
          User ubuntu
          IdentityFile ~\path\to\pem-file.pem
          Port 22
      
    • Now, press the F1 key (or Fn + F1) again and connect to the SSH host you have just configured.

      image

    • A new VSCode window will be launched; select Linux when prompted to choose between Linux, Windows, or Mac.

3. CD into a working directory that your instance's volume is mounted to, where you want to clone our repository and run our model inference.

  • For example in terminal:

    cd <working_dir>
  • In VSCode:

    image

4. Clone our public repository into your working directory:

  • Using Git in terminal:
git clone --depth 1 https://github.com/jingyiyanlol/PROJ_MAYJ_DSA4262.git
  • If the above method does not work, you can try configuring Git with the commands below before running the git clone command above again:

    export GIT_TRACE_PACKET=1
    export GIT_TRACE=1
    export GIT_CURL_VERBOSE=1
    git config --global core.compression 0
  • If the above methods still do not work, you can create a GitHub personal access token (see GitHub's documentation on creating a personal access token) and try the next method. If you are still unable to clone this repository, feel free to raise an issue in this repository!

    • Insert your GitHub personal access token in place of <tokenhere> in the command below.
      git clone https://<tokenhere>@github.com/jingyiyanlol/PROJ_MAYJ_DSA4262.git

5. CD into the cloned repository:

  • Via terminal:

    cd PROJ_MAYJ_DSA4262
  • Launch Integrated Terminal in VSCode:

    image

6. Update your Ubuntu package lists and install Make using the commands below, so that you can run our Makefile:

sudo apt update
sudo apt install make

7. Run the command below to install the required packages to run our model:

make install_all
  • If you are prompted with the message Do you want to continue? [Y/n], type Y and press Enter.

  • If you would like to install our dependencies in a python virtual environment, run the commands below instead:

    make install
    python3 -m venv ~/.venv
    source ~/.venv/bin/activate
    make install_python_dependencies

8. Run our model's prediction on the small_test_data.json that we have provided for you in this repository:

make predictions_on_small_dataset

9. You should see output similar to the following in your terminal if the run is successful.

When you type the command ls, you should see a new directory called XGBoost_v2_predictions in your working directory. It should contain the file small_test_data_predictions.csv, which holds our model's predictions with the following columns: transcript_id, transcript_position and score.

image

10. Use our model to run predictions on a dataset that you have.

Type the following command in your terminal, replacing <path/to/data.json> with the path to your m6Anet-processed RNA sequence reads JSON file and <data_predictions.csv> with the desired name of the output CSV file.

python run_predictions.py XGBoost_v2.pkl <path/to/data.json> <data_predictions.csv>

You should be able to see the output CSV in the XGBoost_v2_predictions directory when the run completes.