Project repository for DSA4262 Sense-making Case Analysis: Health and Medicine.
This project aims to use machine learning to identify m6A RNA modifications from direct RNA-Seq data.
- Python 3.8.10
- For Amazon EC2, use instance type t3.xlarge or larger
- Clone the repository.
git clone https://github.com/HaotingS/dsa4262_project.git
cd dsa4262_project
- Make sure you are on the right branch, demo.
git checkout demo
- Create a folder for storing outputs.
mkdir outputs
- Install Python packages.
sudo apt install -y python3-pip
sudo pip install -r requirements.txt
- Download data.
wget -O data.tgz 'https://www.dropbox.com/s/j24g0e4fg7kqj43/data.tgz?dl=1'
- Extract the data and remove the archive.
tar -xzvf data.tgz data && rm data.tgz
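Before running the scripts, it can help to confirm that extraction produced the files the later commands refer to. The snippet below is a minimal sketch: the file list is collected from the commands in this README, and whether data.tgz actually contains every one of them is an assumption. The names `EXPECTED_FILES` and `missing_files` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

# Paths taken from the commands in this README. Whether data.tgz
# contains all of them is an assumption.
EXPECTED_FILES = [
    "data/data.json",
    "data/data.info",
    "data/dataset1.json",
    "data/dataset2.json",
    "data/dataset3.json",
]


def missing_files(root="."):
    """Return the expected data files that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

Run it from the project root; an empty result means every path used by the commands below is in place.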
The scripts below parse, train and predict on the full datasets, which may take a long time to run. We therefore provide sample data, sample_data.json and sample_data.info, at the project root so that you can test the scripts before running them on the full datasets.
Parse data.json into data.csv. data.csv is used only in some notebooks (only on main branch).
python3 scripts/parse_data.py -f data/data.json -s data/data.csv # full train dataset
python3 scripts/parse_data.py -f sample_data.json -s sample_data.csv # sample dataset
- -f data/data.json specifies the RNA-Seq data.
- -s data/data.csv specifies the resulting CSV file.
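To illustrate what a JSON-to-CSV step with these two flags could look like, here is a minimal sketch. It is not the code in scripts/parse_data.py. It assumes that each line of the input is a JSON object nested as transcript, then position, then sequence, then a list of per-read feature lists; that layout is an assumption and is not stated in this README. The function names are hypothetical, and only the -f and -s flags come from the commands above.

```python
import argparse
import csv
import json


def iter_rows(json_path):
    """Yield one flat row per read from a JSON-lines file (assumed layout)."""
    with open(json_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumed nesting: transcript -> position -> sequence -> reads.
            for transcript, positions in record.items():
                for position, sequences in positions.items():
                    for sequence, reads in sequences.items():
                        for read in reads:
                            yield [transcript, int(position), sequence, *read]


def parse_to_csv(json_path, csv_path):
    """Write all rows from json_path to csv_path and return the row count."""
    count = 0
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in iter_rows(json_path):
            writer.writerow(row)
            count += 1
    return count


def main(argv=None):
    """Parse -f (input JSON) and -s (output CSV), then convert."""
    parser = argparse.ArgumentParser(description="Flatten RNA-Seq JSON into CSV.")
    parser.add_argument("-f", required=True, help="input JSON file")
    parser.add_argument("-s", required=True, help="output CSV file")
    args = parser.parse_args(argv)
    return parse_to_csv(args.f, args.s)
```

Calling `main(["-f", "sample_data.json", "-s", "sample_data.csv"])` would mirror the sample command above, provided the input matches the assumed layout. Reading line by line keeps memory use low on large inputs.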
Train the model using data.json and data.info.
python3 scripts/train.py -d data/data.json -l data/data.info -s outputs/xgb.model # full train dataset
python3 scripts/train.py -d sample_data.json -l sample_data.info -s outputs/sample_xgb.model # sample dataset
- -d data/data.json specifies the RNA-Seq data.
- -l data/data.info specifies the labels.
- -s outputs/xgb.model specifies the resulting model.
Use the trained model to predict on dataset1.json, dataset2.json and dataset3.json.
python3 scripts/predict.py -d data/dataset1.json -m outputs/xgb.model -s outputs/teamgenono_dataset1.csv # full test dataset 1
python3 scripts/predict.py -d data/dataset2.json -m outputs/xgb.model -s outputs/teamgenono_dataset2.csv # full test dataset 2
python3 scripts/predict.py -d data/dataset3.json -m outputs/xgb.model -s outputs/teamgenono_dataset3.csv # full test dataset 3
python3 scripts/predict.py -d sample_data.json -m outputs/sample_xgb.model -s outputs/sample_dataset.csv # sample dataset
- -d data/dataset<n>.json specifies the n-th test dataset.
- -m outputs/xgb.model specifies the model to use for prediction.
- -s outputs/teamgenono_dataset<n>.csv specifies the n-th prediction output.
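The three full-dataset commands differ only in the dataset index, so they can be generated in a loop. The following is a minimal sketch of such a driver: the paths and flags are copied from the commands above, while `build_predict_command` and `predict_all` are hypothetical helper names that do not exist in the repository.

```python
import subprocess


def build_predict_command(n, model="outputs/xgb.model"):
    """Return the predict.py command from this README for test dataset n."""
    return [
        "python3", "scripts/predict.py",
        "-d", f"data/dataset{n}.json",
        "-m", model,
        "-s", f"outputs/teamgenono_dataset{n}.csv",
    ]


def predict_all(datasets=(1, 2, 3), dry_run=False):
    """Run (or, with dry_run, only print) the command for each dataset."""
    for n in datasets:
        command = build_predict_command(n)
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing run.
            subprocess.run(command, check=True)
```

Calling `predict_all(dry_run=True)` from the project root prints the three commands without executing them, which is a cheap way to confirm the paths before starting a long run.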
- Fork it!
- Create your own branch: git checkout -b my-new-branch
- Make your changes.
  - Add a comment describing what was changed: // Changed ArrayList to Array
  - Add a comment explaining why it was changed: // Saves memory
- Commit your changes: git commit -am 'added some feature'
- Push to the branch: git push origin my-new-branch
- Submit a pull request. 😄
DSA4262 Project is licensed under the MIT license.