PlsChain: A C repository from RunpengLuo

Getting Started

git clone https://github.com/RunpengLuo/PlsChain.git
cd PlsChain && make
# create an index for the plasmid library with k=15
./plschain -i -k 15 -o lib_idx/ backbone.fa promotor.fa peptide.fa gene.fa terminal.fa terminator.fa
# classify the reads against the indexed library
./plschain -q lib_idx/ -o qry_res/ query.fastq.gz
# perform fuzzy match and group the classification
python scripts/plschain_postprocess.py qry_res/ lib_idx/

About PlsChain

PlsChain is an algorithm for classifying noisy Oxford Nanopore reads (~5% error rate) sequenced from plasmid mixtures. It does so by solving the co-linear chaining problem in a cyclic manner.
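
To give a feel for what chaining "in a cyclic manner" can mean, here is a minimal Python sketch. It is not PlsChain's implementation, which is written in C and is not described in this README. It assumes that each read has already been reduced to a list of library slot indices (the position of FILE1, FILE2, ... on the command line) in read order. The function names `longest_nondecreasing` and `best_cyclic_chain` are hypothetical.

```python
from bisect import bisect_right


def longest_nondecreasing(seq):
    """Length of the longest non-decreasing subsequence of seq."""
    tails = []  # tails[i] = smallest tail of a chain of length i + 1
    for x in seq:
        i = bisect_right(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def best_cyclic_chain(slots, n_slots):
    """Try every rotation of the slot order (assumed scheme, for illustration).

    Returns (chain_length, rotation), where rotation is the slot index
    treated as the start of the plasmid for the best-scoring chain.
    """
    best = (0, 0)
    for r in range(n_slots):
        rotated = [(s - r) % n_slots for s in slots]
        score = longest_nondecreasing(rotated)
        if score > best[0]:
            best = (score, r)
    return best
```

For a read whose hits visit slots 3, 4, 5, 0, 1, 2 of a six-slot library, a linear chain covers only three hits, while the rotation starting at slot 3 chains all six. Trying every rotation is the simplest way to handle a read that starts anywhere on a circular plasmid, at the cost of a factor of `n_slots` in running time.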

Installation

The program is designed for Unix-like systems (Linux and macOS). A C compiler, GNU make, and the zlib development files are required to compile it.

The post-processing script scripts/plschain_postprocess.py, which groups the results, runs in a Python 3 environment and requires no additional libraries.

Program Usage

Usage: plschain -i -k INT -o DIRECTORY FILE1 FILE2 FILE3 ...
       plschain -q DIRECTORY -o DIRECTORY <query.fa>
Options:
    -i            Indexing mode
    -q DIRECTORY  Query mode, index directory
    -k INT        k-mer size [15,32]
    -o DIRECTORY  output directory
    -h            show this message

The files FILE1 FILE2 ... make up the library of expression cassettes (with backbone removed). List them in the order in which they appear in the plasmid structure; a cyclic order is allowed, e.g., backbone.fa promotor.fa peptide.fa gene.fa terminal.fa terminator.fa.

$ python scripts/plschain_postprocess.py
scripts/plschain_postprocess.py <query_dir> <index_dir>

index_dir refers to the output directory produced by running PlsChain in -i indexing mode, and query_dir refers to the output directory produced by running PlsChain in -q query mode.
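
The three steps (index, query, post-process) can be chained from a small Python driver. The sketch below only uses the flags and argument order shown in the usage text above. The function names `build_commands` and `run_pipeline`, the default directory names, and the `./plschain` binary path are assumptions for illustration.

```python
import subprocess


def build_commands(library_files, reads, k=15,
                   index_dir="lib_idx/", query_dir="qry_res/",
                   binary="./plschain"):
    """Return the three argv lists for indexing, querying and post-processing."""
    if not 15 <= k <= 32:
        # The usage text gives the k-mer size range as [15,32].
        raise ValueError("k must be in [15, 32]")
    if not library_files:
        raise ValueError("at least one library FASTA file is required")
    index_cmd = [binary, "-i", "-k", str(k), "-o", index_dir, *library_files]
    query_cmd = [binary, "-q", index_dir, "-o", query_dir, reads]
    # The post-processing script takes <query_dir> first, then <index_dir>.
    post_cmd = ["python", "scripts/plschain_postprocess.py", query_dir, index_dir]
    return [index_cmd, query_cmd, post_cmd]


def run_pipeline(library_files, reads, **kwargs):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_commands(library_files, reads, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(["backbone.fa", "promotor.fa"], "query.fastq.gz")` from the repository root would run the same commands as in Getting Started, with a shorter library list.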

Program Output

<out_dir>/qry_total.csv and <out_dir>/qry_total.fuzzy.csv store the per-read classification results without and with fuzzy match operations, respectively. Each row consists of the read name followed by the ordered list of classified components. * indicates that the corresponding component could not be determined by PlsChain. fail indicates an unclassified record. contamination indicates an unclassified record that was filtered out as contamination based on read length.
<out_dir>/qry_total.group.csv and <out_dir>/qry_total.group.fuzzy.csv store the grouped results based on <out_dir>/qry_total.csv and <out_dir>/qry_total.fuzzy.csv, respectively.
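
As an illustration of how the per-read rows relate to a grouped summary, the following Python sketch counts reads per component combination. It assumes comma-separated rows with no header line, which the README does not state, and it is not the logic of scripts/plschain_postprocess.py. The function name `group_classifications` and the component labels in the usage note are hypothetical.

```python
import csv
from collections import Counter


def group_classifications(rows):
    """Count reads per ordered component combination.

    rows is any iterable of CSV lines: read name first, then components.
    Rows marked fail or contamination form their own one-element groups.
    """
    counts = Counter()
    for row in csv.reader(rows):
        if not row:
            continue  # skip blank lines
        components = tuple(row[1:])  # drop the read name
        counts[components] += 1
    return counts


def is_fully_resolved(components):
    """True if no component is undecided (*) and the read is classified."""
    return bool(components) and not (
        "*" in components or components[0] in ("fail", "contamination")
    )
```

Passing an open file handle for qry_res/qry_total.csv to `group_classifications` would yield one counter entry per distinct combination, for example `("bb1", "p2", "pep1")`, and `is_fully_resolved` can then separate complete classifications from partial or failed ones.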

Simulation

PlsChain also provides a simulation script, plschain_simulator.py, that simulates sequencing data from a library of expression cassettes. It operates in three modes: sub_sampling, all_sampling, and real_sampling. It takes a configuration file and generates an index and a FASTA file. An example configuration file is provided at scripts/sim_conf.txt. Please see the script and the provided example for a detailed explanation.