/random_seqmask

Input a fasta of sequence, masks each sequence randomly according to input parameters. Allows identical masking across the input sequences or random masking.

Primary LanguagePythonGNU Affero General Public License v3.0AGPL-3.0

Input a fasta of sequence, masks each sequence randomly according to input parameters. Allows identical masking across the input sequences or random masking. See help for more details.

Example:

python random_seqmask.py -T identical -f fake.fa -p 0.01 -m 5 -o fake_mask.fa

usage: random_seqmask.py [-h] [-T [{identical,random}]] -f FASTA_INPUT
                         [-p PENALTY] [-m MISSING_FACTOR] -o OUT_FILE

optional arguments:
  -h, --help            show this help message and exit
  -T [{identical,random}], --mask_type [{identical,random}]
                        Apply identical masking across sequences, requires all
                        input sequence to be the same length, OR applies
                        random masking to each sequence, using the same
                        masking parameters.
  -f FASTA_INPUT, --fasta_input FASTA_INPUT
                        Input is alignment file in fasta format.
  -p PENALTY, --penalty PENALTY
                        Positive real. All else equal, determines how long
                        islands of missing bases will be. Smaller values means
                        shorter islands.
  -m MISSING_FACTOR, --missing_factor MISSING_FACTOR
                        Positive real. All else equal, determines how frequent
                        missing bases will be. Smaller values result in more
                        missingness.
  -o OUT_FILE, --out_file OUT_FILE
                        The file to write output to.

The repo includes two other helpful scripts: fake_seq.py and count_missing.py, which create a really simple fasta sequence to test the masking out on, and summarize the resulting run lengths of missing and non-missing bases, respectively.

Usage:

python fake_seq.py #modify size and number of sequence in script


#fake_seq.py write fake_mask.fa
python count_missing.py fake_mask.fa