control_seq_gen

Overview

Generates a set of control sequences from a set of reference sequences. The goal is to preserve the statistical parameters of the reference sequences (i.e. nucleotide composition) without recapitulating the sites of interest.

How it works

All of the reference sequences are concatenated into one string. Units of length mark_param are then sampled at random from this combined string and concatenated to make each control sequence.
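The resampling idea described above can be sketched roughly as follows. This is a minimal illustration, not the repository's own code: the function name `resample_control` and the `seed` argument are hypothetical, and it assumes that units may start at any offset in the pooled string and that the final unit is trimmed so each control sequence has exactly the requested length.

```python
import random


def resample_control(reference_seqs, mark_param=5, length=1000, num_seq=100, seed=None):
    """Build control sequences by resampling fixed-length units from pooled references.

    Hypothetical helper for illustration; not the actual cs_gen() implementation.
    """
    if mark_param < 1:
        raise ValueError("mark_param must be at least 1")
    rng = random.Random(seed)
    pool = "".join(reference_seqs)  # all reference sequences as one string
    if len(pool) < mark_param:
        raise ValueError("pooled reference is shorter than the resampling unit")

    last_start = len(pool) - mark_param  # last index where a full unit still fits
    controls = []
    for _ in range(num_seq):
        units = []
        total = 0
        while total < length:
            start = rng.randint(0, last_start)  # randint is inclusive on both ends
            units.append(pool[start:start + mark_param])
            total += mark_param
        # Assumption: trim the last unit so the sequence has exactly `length` characters.
        controls.append("".join(units)[:length])
    return controls


refs = ["ACGTACGTGGCCTTAA", "TTGACCGATAGGCATC"]
controls = resample_control(refs, mark_param=5, length=12, num_seq=3, seed=0)
```

Because every unit is copied from the pooled references, short-range composition (up to mark_param nucleotides) is carried over, while longer sites are broken up by the random ordering of units.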

Usage

Method 1 - manually pass in parameters

cs_gen()

The parameters can be passed directly into the function, which returns the set of control sequences. The available parameters are:

| Parameter | Default | Description |
|---|---|---|
| fasta_files | input.fasta | Path to a FASTA file containing the set of reference sequences to pull from to make the control sequences. |
| output_file | data/cs_gen.fasta | Path to the output file where the generated control set will be written. |
| output_descrip | Control sequence | The description that is attached to each of the generated sequences in the output FASTA. |
| mark_param | 5 | Markovnikov parameter, the length of the resampling unit (e.g. 5 -> fragments of length 5 will be selected). |
| length | 1000 | The length of each individual control sequence. |
| num_seq | 100 | The number of control sequences to be returned. |

Method 2 - load parameters from file

cs_genj()

An input JSON file is parsed for the parameters. JSON file layout:

    {
        "fasta_files": ["data/input.fasta", "data/input2.fasta"],
        "output_file": "cs_gen.fasta",
        "output_descrip": "Control Sequence",
        "mark_param": 5,
        "length": 10,
        "num_seq": 10
    }
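
As a rough illustration of how a parameter file with this layout could be read, the sketch below parses the JSON and fills any missing keys from the defaults listed for cs_gen(). The helper `load_params` and the `DEFAULTS` dictionary are hypothetical names, and whether cs_genj() itself falls back to defaults or rejects unknown keys is an assumption, not something the source states.

```python
import json

# Defaults copied from the cs_gen() parameter list.
DEFAULTS = {
    "fasta_files": "input.fasta",
    "output_file": "data/cs_gen.fasta",
    "output_descrip": "Control sequence",
    "mark_param": 5,
    "length": 1000,
    "num_seq": 100,
}


def load_params(json_text):
    """Parse a JSON parameter string and fill missing keys from DEFAULTS.

    Hypothetical helper for illustration; not the actual cs_genj() implementation.
    """
    loaded = json.loads(json_text)
    unknown = set(loaded) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unrecognised parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **loaded}  # values from the file override the defaults
    # Assumption: accept either a single path or a list of paths.
    if isinstance(params["fasta_files"], str):
        params["fasta_files"] = [params["fasta_files"]]
    return params


example = """
{
    "fasta_files": ["data/input.fasta", "data/input2.fasta"],
    "mark_param": 5,
    "length": 10,
    "num_seq": 10
}
"""
params = load_params(example)
```

Reading from a file on disk would only change the first step, for example `json.load(open_file)` in place of `json.loads(json_text)`.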