/seqExtractor

a python tool to extract multiple fasta sequence records from multiFASTA file based on a list of record ids

Primary LanguagePythonMIT LicenseMIT

seqExtractor

A python tool to extract multiple fasta sequence records from multiFASTA file based on a list of record ids or a substring in the sequence header.
Written by: Li Chuin Chong and Yeo Keat Ee


Installation

  1. Download only the python code

    wget https://raw.githubusercontent.com/ChongLC/seqExtractor/master/seqExtractor.py
    
  2. Download the entire repo

To use the seqExtractor, the following packages are needed to be installed:

  • Biopython
  • argparse

You can install these packages by using following command:

pip install -r requirements.txt

Usage

seqExtractor.py [-h] -i INPUT (-l ID_LIST | -s SUBSTRING) -o OUTPUT [-t THREADS] [-c]

Extract fasta sequence records from multiFASTA file based on a list of record ids or a substring in the sequence header

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Filename include extension of original FASTA file
  -l ID_LIST, --id_list ID_LIST
                        Filename include extension of the sequence ID list
  -s SUBSTRING, --substring SUBSTRING
                        Substring to search for in the sequence header
  -o OUTPUT, --output OUTPUT
                        Filename include extension of output FASTA file
  -t THREADS, --threads THREADS
                        Number of threads to use (default: 1)
  -c, --case_insensitive
                        Make the substring search case insensitive (default: False)
  -e, --exclude
                        Make the output FASTA file only contains excluded result (default: False)

There are two ways to use seqExtractor tool:

  • Extract sequences based on a list of sequence IDs:

    python seqExtractor.py -i input.fasta -l id_list.txt -o output.fasta -t 4
    
  • Extract sequences based on a substring in the sequence header:

    • case sensitive (by default)

      python seqExtractor.py -i input.fasta -s Belgium -o output.fasta -t 4
      
    • case insensitive

      python seqExtractor.py -i input.fasta -s belgium -o output.fasta -t 4 --case-insensitive
      

Motivation and goal

Inspiration

Inspired by faSomeRecords from Santiago Sanchez-Ramirez, which firstly created by kentUtils in C++ version.


Found a bug?

Or would like a feature added? Or maybe drop some feedback? Just open a new issue or send an email to us (lichuinchong@gmail.com).