Quadtree


Quadtree is a gradient-boosted decision tree model that predicts guanine quadruplexes in DNA sequences. It is built on top of the LightGBM Python library. Each base in a sequence is encoded according to a given encoding prescription. The model was trained to be used with a sliding window, so it analyses the whole sequence window by window. The model can be used as a Python script or through the preview website at quadtree.vercel.app.
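The encoding and sliding-window steps can be pictured with a minimal sketch. The README does not state the actual encoding prescription or window size, so the integer mapping in `ENCODING`, the `unknown` value, and the `window` and `step` arguments below are illustrative assumptions, not the values used by `quadtree.py`.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical encoding prescription (assumption, not taken from the repository).
ENCODING: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence: str, unknown: int = -1) -> List[int]:
    """Map each base to an integer; bases outside ENCODING become `unknown`."""
    return [ENCODING.get(base, unknown) for base in sequence.upper()]


def sliding_windows(
    encoded: List[int], window: int, step: int = 1
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (start, window_values) for every full window over the sequence."""
    for start in range(0, len(encoded) - window + 1, step):
        yield start, encoded[start:start + window]


# Each window would be one feature vector passed to the trained model.
windows = list(sliding_windows(encode("GGGAGGGTGGG"), window=4))
```

Because every window is scored independently, a sequence of any length can be processed this way.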

Repository structure

quadtree
    └─ web -> preview website source code
    └─ python
          └─ model -> lightgbm model params
          └─ train -> example files showing how training was performed
          └─ quadtree.py -> predictor

Requirements

  • lightgbm==3.3.2
  • numpy==1.21.2

Install dependencies

Before using the model, install the requirements:

  pip install -r requirements.txt

Usage

Create model instance

  from quadtree import Quadtree
  
  model = Quadtree()

Run analysis - algorithm inputs

  • sequence as a string (maximum length is not limited)
  • score threshold (recommended value is 0.2)
  • quadnet model file path

  result = model.analyse(
      sequence='ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACC...',
      model_path='/path/to/quadnet_model.txt',
      score_threshold=0.1
  )

Results are returned in a form that can be loaded into a pandas DataFrame.

  import pandas as pd

  df = pd.DataFrame(result)

|   | index | position | sequence | length |
|---|-------|----------|----------|--------|
| 0 | 0 | 907 | GCAACAATGGCTGATCCAGAAGGTACAGACGGGGAGGGCACGGGTTGTAACGGCTGGTTTTATGTACAAGCTATTGTAGACAAAAAAACAGGAGATGTAATATCA | 105 |
| 1 | 1 | 1184 | GAGGCAGCACAGAAAACAGTCCATTAGGGGAGCGGCTGGAGGTGGATACAGAGTTAAGTCCACGGTTACAAGAAATATCTTTAAATAGTGGGCAGA | 96 |
| 2 | 2 | 1389 | ATGTAGTGGCGGCAGTACGGAGGCTATAGACAACGGGGGCACAGAGGGCAACAACAGCAGTGTAGACGGTACAAGTGACAATAGCAATATAGAAAATGTAAATCCAC | 107 |
| 3 | 3 | 1635 | AGATTGGGTTACAGCTATATTTGGAGTAAACCCAACAATAGCAGAAGGATTTAAAACACTAATACAGCCATTTAT | 75 |
| 4 | 4 | 2229 | AATAGATGAAGGGGGAGATTGGAGACCAATAGTGCAATTCCTGCGATACCAACAAATAGAGTTTATAACATTTTTAG | 77 |
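
As a small illustration of working with the returned rows, the sketch below keeps only the longer hits and sorts them by length. It assumes the result is a list of dictionaries keyed by the column names shown above; that shape is an assumption, since the README only says the result can be loaded into a DataFrame. The values are copied from the example output, with sequences omitted for brevity. The helper `longest_hits` is hypothetical and not part of `quadtree.py`.

```python
from typing import Dict, List

# Assumed result shape: one dict per predicted region (values from the table above).
result: List[Dict[str, int]] = [
    {"index": 0, "position": 907, "length": 105},
    {"index": 1, "position": 1184, "length": 96},
    {"index": 2, "position": 1389, "length": 107},
    {"index": 3, "position": 1635, "length": 75},
    {"index": 4, "position": 2229, "length": 77},
]


def longest_hits(rows: List[Dict[str, int]], min_length: int) -> List[Dict[str, int]]:
    """Keep rows at least `min_length` long, longest first."""
    kept = [row for row in rows if row["length"] >= min_length]
    return sorted(kept, key=lambda row: row["length"], reverse=True)


top = longest_hits(result, min_length=96)
```

The same filter could be written on the DataFrame directly; plain Python is used here so the sketch runs without pandas.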

Model scheme

[Figure: Quadtree model scheme, left-to-right layout]

Training parameters

These parameters were used to train the LightGBM model:

| LGBM Classifier | value |
|-----------------|-------|
| colsample bytree | 0.817574864502621 |
| learning rate | 0.03744835808549148 |
| max bin | 127 |
| min child sample | 3 |
| number of estimators | 1000 |
| number of leaves | 74 |
| regularization alpha | 0.0033803043003857677 |
| regularization lambda | 0.7013136087939289 |
| objective | binary |
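
For readers who want to reuse these settings, the sketch below maps the table rows onto the keyword arguments of LightGBM's scikit-learn wrapper, `LGBMClassifier`. The mapping from the descriptive row names to argument names (for example "min child sample" to `min_child_samples`) is an assumption based on the LightGBM API, and `build_classifier` is a hypothetical helper, not code from the repository's `train` directory.

```python
# Values copied from the training-parameter table; keys are the assumed
# LGBMClassifier argument names corresponding to each row.
PARAMS = {
    "colsample_bytree": 0.817574864502621,
    "learning_rate": 0.03744835808549148,
    "max_bin": 127,
    "min_child_samples": 3,
    "n_estimators": 1000,
    "num_leaves": 74,
    "reg_alpha": 0.0033803043003857677,
    "reg_lambda": 0.7013136087939289,
    "objective": "binary",
}


def build_classifier():
    """Create an untrained classifier configured with PARAMS."""
    # Imported lazily so the parameter dict can be inspected without lightgbm.
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**PARAMS)
```

Calling `build_classifier()` requires the `lightgbm` package from the requirements list; the returned object still has to be fitted on training data before it can predict.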

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.