/fatools

a python tool package for working with fasta sequences

Primary language: Python. License: GNU Lesser General Public License v3.0 (LGPL-3.0).

fatools: A python utility package for working with large-scale fasta sequences

A total of 30 utilities/options for common operations on fasta sequences are currently organized under 6 main subcommands. The utilities range from searching a large set of fasta sequences for specific entries (by ID, by a text string in the defline, or by a sequence motif) to splitting a large sequence file into small chunks, reporting summary statistics of a genome assembly, and filtering sequences by length, gap size, or redundancy based on ID or sequence. Furthermore, because fatools can read input from stdin and write output to stdout, multiple operations can be chained sequentially in a single command line using the pipe (|). Make sure you have python 2 installed to run fatools.

List of subcommands

Typing 'fatools' displays the list of subcommands; typing 'fatools <subcommand>' displays the detailed utilities/options for that subcommand.

convert

-r print sequence in reverse complement.

-N convert all non-ACGT letters to N.

-R remove all non-ACGT letters.

-U to upper case

-u to lower case
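
To make the -r and -N operations concrete, here is a minimal Python sketch of what they do conceptually. It is not taken from the fatools source: the names reverse_complement and mask_non_acgt are hypothetical, and the handling of lower-case letters and of non-ACGT letters during reverse complementing is an assumption.

```python
import re

# Hypothetical helpers for illustration; not part of fatools itself.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
               "a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base (conceptually like -r).

    Assumption: letters other than A, C, G, T are passed through unchanged.
    """
    return "".join(_COMPLEMENT.get(base, base) for base in reversed(seq))


def mask_non_acgt(seq):
    """Replace every letter other than A, C, G, T (either case) with N (like -N)."""
    return re.sub(r"[^ACGTacgt]", "N", seq)


print(reverse_complement("AAGCT"))   # AGCTT
print(mask_non_acgt("ACGTRYACGT"))   # ACGTNNACGT
```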


extract

-F N extract the first N fasta entries. If N is larger than the total number of entries, then print all entries.

-S N extract from the Nth entry to the last entry.

-L N extract last N sequence entries. If N is larger than the total number of entries, then print all entries.

Use -S N -F M for entries from N to M; use -F N and -L N to extract both the first and last N entries.

-f N extract first N bp, prints the entire sequence if N is larger than the total length.

-s N extract the sequence from the Nth bp to the end.

-l N extract last N bp, prints the entire sequence if N is larger than the total length.

Use -s N -f M for sequence from N to M bp; use -f N and -l N to extract both the first and last N bp as one sequence separated by a space.

Note: The -f, -s, and -l options were designed for working with a single long sequence, but they also work on multiple sequences by applying the same operation to each sequence.
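
The bp-level options map naturally onto string slicing. The sketch below illustrates the idea in Python; it is not the fatools implementation. The function names are hypothetical, and the 1-based inclusive coordinates used for the N-to-M range are an assumption.

```python
# Hypothetical helpers for illustration; not part of fatools itself.

def first_bp(seq, n):
    """First n bp; the whole sequence if n exceeds its length (like -f N)."""
    return seq[:n]


def last_bp(seq, n):
    """Last n bp; the whole sequence if n exceeds its length (like -l N)."""
    return seq[-n:] if n > 0 else ""


def bp_range(seq, start, end):
    """Bases start..end, assuming 1-based inclusive coordinates (like -s N -f M)."""
    return seq[max(start, 1) - 1:end]


def first_and_last_bp(seq, n):
    """First and last n bp joined by a single space (like -f N with -l N)."""
    return first_bp(seq, n) + " " + last_bp(seq, n)


seq = "ACGTACGTAC"
print(first_bp(seq, 4))           # ACGT
print(last_bp(seq, 3))            # TAC
print(bp_range(seq, 3, 6))        # GTAC
print(first_and_last_bp(seq, 2))  # AC AC
```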


filter

-g N skip sequences with N or more Ns.

-r 1/2 1: skip redundant entries based on ID; 2: keep redundant entries by adding a serial number to the identical IDs to make each ID unique.

-R 1/N skip redundant entries based on sequence. 1: use the entire sequence; N: use only the first and last N bases.

-l N skip sequences shorter than N bp.

-L N skip sequences longer than N bp.

Use -l N -L M for sequences with length from N to M bp (inclusive).

With any of these options, '-e' can be added to print the skipped entries to STDERR, which can be captured using 2>[skipped.fa].
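
The filtering logic can be pictured as a single pass that sorts each entry into a "kept" or a "skipped" pile, much as fatools separates STDOUT from STDERR when -e is given. The Python sketch below is illustrative only: filter_records is a hypothetical name, records are assumed to be (defline, sequence) pairs, and keeping the first occurrence of a duplicated ID is an assumption about how -r 1 behaves.

```python
# Hypothetical helper for illustration; not part of fatools itself.

def filter_records(records, min_len=None, max_len=None, unique_ids=False):
    """Split (defline, sequence) pairs into kept and skipped lists.

    min_len / max_len mimic -l / -L (bounds are inclusive).
    unique_ids mimics -r 1; assumption: the first entry with a given ID is kept.
    """
    kept, skipped, seen = [], [], set()
    for defline, seq in records:
        parts = defline.split(None, 1)
        seq_id = parts[0] if parts else ""
        too_short = min_len is not None and len(seq) < min_len
        too_long = max_len is not None and len(seq) > max_len
        duplicate = unique_ids and seq_id in seen
        if too_short or too_long or duplicate:
            skipped.append((defline, seq))
        else:
            kept.append((defline, seq))
            seen.add(seq_id)
    return kept, skipped


records = [("a first", "ACGT"), ("b", "AC"), ("a second", "ACGTAC"), ("c", "ACGTACGTAC")]
kept, skipped = filter_records(records, min_len=3, max_len=6, unique_ids=True)
print(kept)          # [('a first', 'ACGT')]
print(len(skipped))  # 3
```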


report

-f print fasta entries as in the input.

-F print fasta entries with all sequence in one line.

-n print sequences without the defline.

-d print deflines in short form (part before the first space).

-D print deflines in the original form.

-c print the total number of fasta entries in the input.

-l print short defline +[\t] length.

-L print original defline +[\t] length.

-s print sequence summary statistics including N50.

-S print sequence summary statistics plus detailed gap info.
Use -h with -s and -S to disable the header above the output.
Use -H to print parameters in human-friendly form.
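
For readers unfamiliar with N50, the sketch below shows the commonly used definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total bases. This is an illustration in Python, not the fatools code. The names n50 and summarize are hypothetical, the input lengths are made up, and whether fatools uses exactly this tie-breaking rule and rounding is an assumption.

```python
# Hypothetical helpers for illustration; not part of fatools itself.

def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Basic summary statistics of the kind printed by 'report -s'."""
    if not lengths:
        return {"count": 0, "total_bp": 0, "mean": 0, "min": 0, "max": 0, "n50": 0}
    total = sum(lengths)
    return {"count": len(lengths), "total_bp": total,
            "mean": total // len(lengths),  # assumption: mean is truncated to an integer
            "min": min(lengths), "max": max(lengths), "n50": n50(lengths)}


# Made-up lengths: total 1000 bp, so the half-way point is 500 bp.
print(n50([100, 200, 300, 400]))  # 300
```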


search

-s string: search for entries containing "string" in the sequence.

-d string: search for entries containing "string" in the defline. The default is an exact match; use "/string" to search for entries with "string" as part of the ID.

-F file: search for sequences based on a list of IDs in the file (one ID/line).
Use -D to specify the delimiter in the defline (default: space, '|', or end of line);
use -i to specify the field number (default: 1).

-1 print only the 1st match for -d and -s.

-v use with -s, -d or -F to negate the search.
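
The ID-list search (-F, with -D, -i, and -v) boils down to extracting one field from each defline and testing it against a set of wanted IDs. The Python sketch below illustrates that idea. It is not the fatools implementation: defline_id and search_by_ids are hypothetical names, records are assumed to be (defline, sequence) pairs, and splitting on every space or '|' is an assumption about how the default delimiter is applied.

```python
import re

# Hypothetical helpers for illustration; not part of fatools itself.

def defline_id(defline, delimiter=None, field=1):
    """Return the chosen field of a defline (conceptually like -D and -i).

    With no delimiter given, split on space or '|' (the documented default).
    """
    text = defline.lstrip(">")
    fields = re.split(r"[ |]", text) if delimiter is None else text.split(delimiter)
    return fields[field - 1] if 0 < field <= len(fields) else ""


def search_by_ids(records, wanted_ids, invert=False):
    """Keep entries whose ID is in wanted_ids; invert=True negates the search (like -v)."""
    wanted = set(wanted_ids)
    return [(d, s) for d, s in records if (defline_id(d) in wanted) != invert]


records = [(">id1 some description", "AAA"), (">id2|x", "CCC"), (">id3", "GGG")]
print(search_by_ids(records, ["id1", "id3"]))
print(search_by_ids(records, ["id1", "id3"], invert=True))
```

Using a set for the wanted IDs keeps each lookup constant-time, which matters when the ID list and the fasta file are both large.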


split

-G N split each sequence in the input file into non-gap fragments.
"N" is the number of consecutive N bases that defines a gap; default is 1.
Use -G N with -t to print just the gap positions.

-n N split the input sequences into chunks, each containing N fasta entries (the last chunk may contain fewer).

-N N split the input sequences into N chunks, each containing an equal number of entries (the last one may be smaller).

-M N split the input sequences into chunks at ~N MB (million bp) in size (last chunk may be smaller).

-o file: prefix for output files (serial numbers added to prefix; required).
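
Gap-based splitting (-G) can be sketched with a regular expression that finds runs of N and returns the stretches in between. The Python example below is a minimal illustration, not the fatools source. The name split_at_gaps is hypothetical, and two details are assumptions: runs of at least min_gap Ns (upper or lower case) count as a gap, and positions are reported as 1-based inclusive coordinates.

```python
import re

# Hypothetical helper for illustration; not part of fatools itself.

def split_at_gaps(seq, min_gap=1):
    """Return (start, end, fragment) tuples for the non-gap stretches of seq.

    Assumptions: a gap is a run of at least min_gap N/n characters, and
    start/end are 1-based inclusive positions in the original sequence.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    fragments = []
    start = 0
    for match in gap.finditer(seq):
        if match.start() > start:
            fragments.append((start + 1, match.start(), seq[start:match.start()]))
        start = match.end()
    if start < len(seq):
        fragments.append((start + 1, len(seq), seq[start:]))
    return fragments


print(split_at_gaps("ACGTNNACNGT", min_gap=2))
# [(1, 4, 'ACGT'), (7, 11, 'ACNGT')]
print(split_at_gaps("ACGTNNACNGT", min_gap=1))
# [(1, 4, 'ACGT'), (7, 8, 'AC'), (10, 11, 'GT')]
```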


Making fatools executable and available from any directory

Linux

  1. Open the terminal and navigate to where you have downloaded fatools
  2. Find where python2 is installed on your system by running which python. Usually you would get something like /usr/bin/python
  3. Add this path as the first line of the fatools script, in the form #!/usr/bin/python
  4. Run the following command to make the script executable: chmod +x fatools
  5. Add fatools to your bin directory or any other directory included in your $PATH
  6. You should be able to run fatools from anywhere!

Windows

  1. Type 'control panel' in the Windows search bar
  2. Go to System and Security > System > Advanced System settings > Environment variables
  3. Under system variables, select 'Path' and click 'Edit' and then 'New'
  4. Add the path of where you have fatools located and click OK
  5. You should be able to run fatools from anywhere!

Examples

Navigate to the exampleFiles directory in this repository. It contains a fasta file (exampleFasta.fa) and a file with a list of IDs (IDlist.txt) taken from exampleFasta.fa.

To extract the fasta sequences from exampleFasta.fa based on the list of IDs:
fatools search -F IDlist.txt exampleFasta.fa

>NP_001245510.1 notch, isoform B [Drosophila melanogaster]
NNMQSQRSRRRSRAPNTWICFWINKMHAVASLPASLPLLLLTLAFANLPNTVRGTDTALVAASCTSVGCQNG
GTCVTQLNGKTYCACDSHYVGDYCEHRNPCNSMRCQNGGTCQVTFRNGRPGISCKCPLGFDESLCEIAVP
NACDHVTCLNGGTCQLKTLEEYTCACANGYTGERCETKNLCASSPCRNGATCTALAGSSSFTCSCPPGFT... 

>DY343456.1 Macropus rufus BRCA1 (BRCA1) gene, partial cds
CAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAGTCTGGATGAAAGTAAGGAAATATGTAGTGCTGGA
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGTAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCA 

>FY343456.1 Macropus rufus BRCA1 (BRCA1) gene, partial cds
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGCAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAG
TCTGGATGAAAGTAAGGAAATATGTAGTGCTGGAAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCAT 

To report summary statistics:
fatools report -s exampleFasta.fa

Total 2,868 bps from qualified 5 sequences (5 total); length average: 573 (210-1262) bp; N50: 698 bp

To keep only the fasta sequences up to a specific maximum length:

python fatools filter -L250 exampleFasta.fa

>DY343456.1
CAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAGTCTGGATGAAAGTAAGGAAATATGTAGTGCTGGA
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGTAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCA

>FY343456.1
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGCAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAG
TCTGGATGAAAGTAAGGAAATATGTAGTGCTGGAAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCAT 

You can combine multiple utilities using the pipe "|". Let's say you want to see the short defline and the length of the first 3 fasta sequences in a fasta file:

fatools extract -F3 sequenceTesting2.txt | fatools report -l -

AY211956.1Macropus(BRCA1)gene,partialcds       698 
NP_001245510.1  1262 
DY343456.1      210

Or extract the sequences for a list of IDs from a large sequence set and then search those for entries containing a specific sequence motif:

fatools search -F IDlist.txt exampleFasta.fa | fatools search -s AAATAAA -