To accurately capture the codon distribution of the host genes, the codon optimization problem can be recast as a sequence labeling task in deep learning. Sequence labeling models are widely used in NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, and word segmentation. NCRF++ is a PyTorch-based framework with flexible choices of input features and output structures.
This is a training tool for the E. coli codon optimization model, built on NCRF++.
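To make the sequence labeling framing concrete, here is a minimal sketch of how a coding sequence could be split into tokens (amino acids) and labels (the codons the host used). The function name `cds_to_labeled_tokens` and the choice of amino acids as tokens and raw codons as labels are assumptions for illustration only; the repository's actual encoding may differ.

```python
# Illustrative sketch only: the token/label scheme below is an assumption,
# not necessarily the encoding used by this repository.

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def cds_to_labeled_tokens(cds):
    """Split a coding sequence into (amino_acid, codon) pairs.

    Each amino acid plays the role of a "word" and the codon observed in
    the host gene plays the role of its "tag", as in POS tagging or NER.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    pairs = []
    for start in range(0, len(cds), 3):
        codon = cds[start:start + 3]
        if codon not in CODON_TABLE:
            raise ValueError(f"unrecognized codon: {codon}")
        pairs.append((CODON_TABLE[codon], codon))
    return pairs


if __name__ == "__main__":
    for amino_acid, codon in cds_to_labeled_tokens("ATGGCTAAATAA"):
        print(amino_acid, codon)
```

Under this framing, a trained model receives only the amino acid column and predicts a codon for each position, which is how it can reproduce the codon usage pattern learned from host genes.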
You are welcome to star this repository!
Python: 3
PyTorch: 1.4 (versions 0.3 and earlier are currently not supported)
The program can run in two statuses (modes): training and decoding.
In training status:
python main.py --config demo.train.config
In decoding status:
python main.py --config demo.decode.config
The configuration file controls the network structure, I/O, training setting and hyperparameters.
Detailed configurations and explanations are listed here.
You can refer to the data format in CPA.
In sample_data, we have prepared train_set, dev_set, and test_set for training the model.
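For inspecting the sample files, the sketch below parses the column layout used by upstream NCRF++: one token per line with its label in the last column, and a blank line between sequences. Whether the files in sample_data follow exactly this layout is an assumption, and `parse_column_lines` is a hypothetical helper that is not part of this repository.

```python
# Illustrative sketch only: assumes the upstream NCRF++ column layout
# (token first, label last, blank line between sequences).

def parse_column_lines(lines):
    """Group column-format lines into sequences of (token, label) pairs."""
    sequences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sequence, if there is one.
            if current:
                sequences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the label;
        # any columns in between would be extra features.
        current.append((columns[0], columns[-1]))
    if current:
        sequences.append(current)
    return sequences


if __name__ == "__main__":
    demo = ["M ATG", "A GCT", "", "M ATG", "K AAA"]
    for sequence in parse_column_lines(demo):
        print(sequence)
```

To read a real file, pass an open file handle as `lines`, since a text file iterates line by line.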
You can also use our trained_model to decode.
codonToBox and boxToCodon are provided to transform data.