Python scripts for bioinformatics data manipulation

mitodownloader.py

Downloads all RefSeq mitogenome records available for a given taxon

usage: mitodownloader.py [-h] [-f] TAXON_NAME

positional arguments:
  TAXON_NAME        Taxon name

optional arguments:
  -h, --help   show this help message and exit
  -f, --fasta  Downloads records in fasta format (default: genbank)

extract_large_contigs.py

Gets contig information from a multifasta file. Has to be used with one of three options (-c, -a, -r):

usage: python3 extract_large_contigs.py [-h] [-c | -a  | -r ] infile

-c, --count    Get a list of all contigs and their size
-a , --acc     Get a single contig by ID (please provide description line without '>')
-r , --range   Get sequence of all contigs inside a min-max length. Please provide the lower and upper limits such as '12000-18000'
-h, --help     show help message and exit
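
As an illustration of what the -c and -r options describe, here is a minimal sketch in plain Python. It is not the script's actual code: the function names are hypothetical, and treating both limits of the range as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep the description line without the leading '>'
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_sizes(lines):
    """Return a list of (description, length) for every contig."""
    return [(header, len(seq)) for header, seq in read_fasta(lines)]


def contigs_in_range(lines, range_spec):
    """Return contigs whose length lies within 'min-max' (assumed inclusive)."""
    low, high = (int(value) for value in range_spec.split("-"))
    return [(h, s) for h, s in read_fasta(lines) if low <= len(s) <= high]


example = [">contig_1 demo", "ACGTACGT", "ACGT", ">contig_2 demo", "ACG"]
print(contig_sizes(example))
print([header for header, _ in contigs_in_range(example, "10-20")])
```

The same functions accept an open file handle in place of the example list, since they only iterate over lines.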

gb_to_fa.py

Converts a single genbank file to fasta, printing its output to the screen.

usage: python3 gb_to_fa.py sequence.gb
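
The conversion can be pictured with the following sketch, which uses only the standard library. It is an assumption-laden illustration rather than the script's implementation: the function name genbank_to_fasta is hypothetical, the FASTA header is taken from the LOCUS name, the sequence is upper-cased, and lines are wrapped at 70 characters.

```python
def genbank_to_fasta(text, width=70):
    """Convert the first record of a GenBank flat file to a FASTA string."""
    name, parts, in_origin = "unknown", [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]
        elif line.startswith("ORIGIN"):
            # Sequence lines follow the ORIGIN keyword
            in_origin = True
        elif line.startswith("//"):
            # End of the record
            break
        elif in_origin:
            # Drop the position numbers and the spaces between blocks
            parts.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(parts).upper()
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


record = (
    "LOCUS       DEMO01        12 bp    DNA\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
)
print(genbank_to_fasta(record), end="")
```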

generate_phylip_from_multifasta.py

Aligns a multifasta file using Clustal Omega (at the moment, the clustalo-1.2.4-Ubuntu-x86_64 binary must be on $PATH) and converts the alignment into a relaxed phylip alignment (sequence identifiers may exceed 10 characters) with no line wrapping.

The phylip alignment output can be used for the generation of phylogenetic/phylogenomic trees using PartitionFinder2.

This script only works with sequences shorter than 1 Gbp.

usage: python3 generate_phylip_from_multifasta.py [-h] [-t] multifasta.fa

optional arguments:
  -h, --help    show this help message and exit
  -t , --type   Type of data: {Protein, RNA or DNA(default)}
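
To make the output format concrete, the sketch below writes an already aligned set of sequences as relaxed phylip with each sequence on a single line. It does not call Clustal Omega and is not the script's code: the function name to_relaxed_phylip is hypothetical, and replacing spaces in identifiers with underscores is an assumption.

```python
def to_relaxed_phylip(alignment):
    """Format a list of (name, aligned_sequence) as relaxed, unwrapped phylip."""
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for _, seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Pad names to the longest identifier instead of truncating to 10 characters
    width = max(len(name) for name, _ in alignment)
    # Header line: number of sequences and alignment length
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment:
        safe_name = name.replace(" ", "_")  # identifiers must not contain spaces
        lines.append(f"{safe_name.ljust(width)}  {seq}")
    return "\n".join(lines) + "\n"


aligned = [("Species_one", "ACGT-A"), ("Sp2", "ACGTTA")]
print(to_relaxed_phylip(aligned), end="")
```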

mitos_to_artemis.py

Converts .gff files generated by MitosWebServer to a modified .gff that can be exported to the Artemis Annotation tool.

usage: python3 mitos_to_artemis.py filename.gff

remove_score_seqin.py

Removes the "score" values present in MitosWebServer annotations. Removing these values from seqin files is necessary in order to submit mitochondrial sequences to GenBank.

usage: python3 remove_score_seqin.py annotated_sequence.seqin

sam_to_fastq.py

Extracts reads (in fastq format) from a sam file.

usage: python3 sam_to_fastq.py [-h] [-P] file.sam

optional arguments:
  -h, --help    show this help message and exit
  -P, --paired  Generates two paired-end data files (unpaired reads included)
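
The core of such an extraction is turning one SAM alignment line into a fastq record. The sketch below shows one way to do it and is not taken from the script: the function name is hypothetical, and both restoring reverse-strand reads to their original orientation and skipping secondary/supplementary alignments are assumptions about the desired behaviour.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line):
    """Return (mate, fastq_record) for an alignment line, or None if skipped.

    mate is 1 or 2 for paired reads and 0 for unpaired reads.
    """
    if line.startswith("@"):
        return None  # header line (@HD, @SQ, ...)
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x900:
        return None  # secondary (0x100) or supplementary (0x800) alignment
    if flag & 0x10:
        # Read is stored reverse-complemented; restore the original orientation
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
    return mate, f"@{name}\n{seq}\n+\n{qual}\n"


sam_line = "read1\t83\tchr1\t100\t60\t4M\t=\t50\t-54\tAACG\tIIH#\n"
mate, record = sam_line_to_fastq(sam_line)
print(mate)
print(record, end="")
```

With a paired option, records with mate 1 and mate 2 would go to separate files, which is where the mate value returned here would be used.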

sra_download.py

Downloads a list of datasets in sra file format.

The sra_download.py script reads a text file (the list of sra datasets) that should contain two tab-separated columns, accession number and species name, as shown below:

ERR1306022      Species1
ERR7295165      Species2
ERR1306034      Species3
SRR4409513      Species4
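
Reading a list in this layout can be sketched as follows. This is an illustration only: the function name parse_dataset_list is hypothetical, and rejecting lines that do not have exactly two tab-separated fields is an assumption, not documented behaviour of the script.

```python
def parse_dataset_list(lines):
    """Return a list of (accession, species) tuples from tab-separated lines."""
    datasets = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 tab-separated columns")
        accession, species = (part.strip() for part in parts)
        datasets.append((accession, species))
    return datasets


listing = ["ERR1306022\tSpecies1\n", "\n", "SRR4409513\tSpecies4\n"]
print(parse_dataset_list(listing))
```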

At the moment, the wget Python package is required. Please install it before running the script:

pip install wget

Script usage:

python3 sra_download.py dataset_list.txt

split_multigenbank.py

Splits a multigenbank file into individual records, generating a genbank file (name_of_species.gb) for each.
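
The splitting step can be sketched as below, relying on the fact that every GenBank record ends with a line containing only "//". This is not the script's code: the function names are hypothetical, and deriving the file name from the ORGANISM line with spaces replaced by underscores is an assumption.

```python
def split_genbank(text):
    """Split multi-record GenBank text into a list of single-record strings."""
    records, current = [], []
    for line in text.splitlines():
        if not current and not line.strip():
            continue  # skip blank lines between records
        current.append(line)
        if line.strip() == "//":
            # '//' terminates a record
            records.append("\n".join(current) + "\n")
            current = []
    return records


def species_filename(record):
    """Build 'Genus_species.gb' from the ORGANISM line (assumed naming scheme)."""
    for line in record.splitlines():
        if line.strip().startswith("ORGANISM"):
            name = line.split("ORGANISM", 1)[1].strip()
            return name.replace(" ", "_") + ".gb"
    return "unknown.gb"


multi = (
    "LOCUS       A\n  ORGANISM  Homo sapiens\nORIGIN\n        1 acgt\n//\n"
    "LOCUS       B\n  ORGANISM  Mus musculus\nORIGIN\n        1 ttga\n//\n"
)
for rec in split_genbank(multi):
    print(species_filename(rec))
```

Writing each record to disk is then a matter of opening species_filename(rec) for writing; note that two records from the same species would overwrite each other under this naming scheme.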