FASTAr (FASTA Processor)

Primary language: Python. License: MIT.

Used to make simple modifications to FASTA-formatted files (generally provided by UCSC).

Written by Jake Bradford. Check out our lab website: biomedicaldatascience.com



For collapsing multi-line FASTA blocks into single-line blocks. A FASTA comment will cause a new block to start.
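The collapsing idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in fastar.py: the function name `collapse_fasta` is hypothetical, and it assumes that a FASTA comment is a line starting with `>`.

```python
def collapse_fasta(lines):
    """Join the sequence lines of each FASTA block into one line.

    Assumption: a line starting with '>' is a FASTA comment/header.
    It closes the current block and is emitted unchanged.
    """
    out = []
    buffer = []  # sequence fragments of the block being built
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if buffer:
                out.append("".join(buffer))
                buffer = []
            out.append(line)
        else:
            buffer.append(line)
    if buffer:
        out.append("".join(buffer))  # flush the last block
    return out


example = [">chr19", "ACGT", "TTGA", ">chrX", "GG", "CC"]
print(collapse_fasta(example))
# ['>chr19', 'ACGTTTGA', '>chrX', 'GGCC']
```

For a whole chromosome you would stream the file rather than hold every fragment in a list, but the block logic stays the same.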


For extracting a sequence within a file at specified start and end points. Ignores FASTA comments.
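A rough sketch of position-based extraction follows. The function name `extract_region` is hypothetical, and the coordinate convention (0-based start, exclusive end) is an assumption made for the example; check fastar.py itself for the convention it actually uses.

```python
def extract_region(lines, start, end):
    """Return bases [start, end), counted over sequence lines only.

    Assumptions: 0-based start, exclusive end; lines starting with
    '>' are FASTA comments and do not consume coordinates.
    """
    pieces = []
    position = 0  # coordinate of the first base on the current line
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(">"):
            continue
        line_end = position + len(line)
        if line_end > start and position < end:
            lo = max(start - position, 0)
            hi = min(end - position, len(line))
            pieces.append(line[lo:hi])
        position = line_end
        if position >= end:
            break  # nothing further can overlap the range
    return "".join(pieces)


print(extract_region([">chr19", "ACGT", "TTGA"], 2, 6))
# GTTT
```

Because comments are skipped, the same call gives the same result on a multi-line file and on its collapsed version.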


Provides some basic statistics about a FASTA file.
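The source does not list which statistics are reported, so the sketch below picks a few plausible ones (comment count, sequence line count, base counts, line lengths). The function name `fasta_stats` and the returned keys are assumptions for illustration only.

```python
from collections import Counter


def fasta_stats(lines):
    """Collect simple counts from FASTA lines (illustrative choice of stats)."""
    headers = 0
    sequence_lines = 0
    base_counts = Counter()
    line_lengths = Counter()  # line length -> number of lines of that length
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):  # assumed FASTA comment marker
            headers += 1
        else:
            sequence_lines += 1
            base_counts.update(line.upper())
            line_lengths[len(line)] += 1
    return {
        "headers": headers,
        "sequence_lines": sequence_lines,
        "total_bases": sum(base_counts.values()),
        "base_counts": dict(base_counts),
        "line_lengths": dict(line_lengths),
    }


stats = fasta_stats([">chr19", "ACGT", "acgn"])
print(stats["sequence_lines"], stats["total_bases"])
# 2 8
```

In a collapsed file, one would expect the number of sequence lines to equal the number of blocks, which is one way such counts can confirm that collapsing worked.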


Realigns FASTA annotations according to an offset. Useful after extracting a sequence from a FASTA file when you want the annotation to stay aligned with the extract.


Extracts gene names between a start and end position using the given reference.
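As an illustration of the lookup, the sketch below assumes the reference is a tab-separated export of the UCSC refGene table, where column 3 is the chromosome, columns 5 and 6 are txStart and txEnd, and column 13 is the gene name (name2). The function name `genes_in_range` is hypothetical, and "between" is interpreted here as "lying entirely inside the range", which may differ from what fastar.py does.

```python
import csv
import io


def genes_in_range(tsv_text, chrom, start, end):
    """List gene names whose transcript lies fully inside [start, end]."""
    names = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip short rows and any header row (non-numeric txStart).
        if len(row) < 13 or not row[4].isdigit() or not row[5].isdigit():
            continue
        if row[2] != chrom:
            continue
        tx_start, tx_end = int(row[4]), int(row[5])
        if tx_start >= start and tx_end <= end and row[12] not in names:
            names.append(row[12])
    return names


foxb2 = "\t".join([
    "713", "NM_008023", "chr19", "-", "16872315", "16873830",
    "16872353", "16873640", "1", "16872315,", "16873830,",
    "0", "Foxb2", "cmpl", "cmpl", "0,",
])
print(genes_in_range(foxb2, "chr19", 16767421, 20818303))
# ['Foxb2']
```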


Converts a single multi-FASTA formatted file into multiple single FASTA files.
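A minimal sketch of the splitting step is shown below. The function name `split_multi_fasta`, the `.fa` extension, and the rule of naming each file after the first word of its comment line are all assumptions; names containing path separators would need sanitising in real use.

```python
import tempfile
from pathlib import Path


def split_multi_fasta(lines, out_dir):
    """Write each FASTA record to its own file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        for raw in lines:
            line = raw.rstrip("\n")
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                tokens = line[1:].split()
                # Hypothetical naming rule: first word of the comment line.
                name = tokens[0] if tokens else f"record{len(written) + 1}"
                path = out_dir / f"{name}.fa"
                handle = path.open("w")
                written.append(path)
            if handle is not None and line:
                handle.write(line + "\n")  # lines before any header are ignored
    finally:
        if handle is not None:
            handle.close()
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = split_multi_fasta([">chr1 mouse", "ACGT", ">chr2", "GG", "CC"], tmp)
    print([p.name for p in paths])
    # ['chr1.fa', 'chr2.fa']
```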

Files Required


Obtain fastar.py by cloning this repository: git clone https://github.com/jakeb1996/FASTAr.git

FASTA sequence

Obtained from UCSC FTP server (help).

Server: ftp://hgdownload.cse.ucsc.edu (user: anonymous pass: <your-email>)

Chromosome example: /apache/htdocs/goldenPath/mm10/chromosomes/chr19.fa.gz

Annotation file

Obtained from UCSC MySQL server (help)

MySQL server: genome-mysql.soe.ucsc.edu (username: genome, password: left empty)

Find mm10 database. Find refGene table. Export as tab separated file.

This is a simple task using the Table Data Export Wizard in MySQL Workbench (help).


1) Understand the context of this method

View GRCm38/mm10 chr19:16,767,421-20,818,303 in the UCSC Genome Browser. It will show the Foxb2 gene. UCSC Browser - GRCm38/mm10 Chr19

This verification method provides an example of:

  • Collapsing a FASTA file into a single line (so that line breaks are ignored, as some CRISPR-related tools do anyway)

  • Extracting a sequence from the collapsed file (e.g. you may only want a sample file that is 10 million base pairs in length)

  • Updating an annotation file so that the annotations still align with your extract

Note: this method uses the Foxb2 gene as an example (the gene is very short)

2) Obtain Foxb2 gene info

In the UCSC Genome Browser, right click on Foxb2 and choose "Get DNA for Foxb2" (this sequence is used again in step 8).

3) Collapse chr19 data

Run fastar.py in collapse mode to collapse chr19 of mm10 (this collapses the multi-line DNA sequence into a single-line DNA sequence).

If you need CHR19:

1) UCSC data: `ftp://hgdownload.cse.ucsc.edu`

2) Navigate to: `/apache/htdocs/goldenPath/mm10/chromosomes`

3) Download: `chr19.fa.gz`
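The download is gzip-compressed, while the commands in this walkthrough operate on `chr19.fa`. Any gunzip tool will do; as one option, a small Python sketch is given here. The helper name `gunzip` and the paths in the usage comment are placeholders.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress a .gz file to dst, streaming so large files fit in memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Example usage (placeholder paths):
# gunzip("chr19.fa.gz", "chr19.fa")
```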

$ python fastar.py -m collapse -f "C:\FASTAs\mm10-ucsc-mod\chr19\chr19.fa"

You can verify the file was collapsed using the analyse mode

$ python fastar.py -m analyse -f "C:\FASTAs\mm10-ucsc-mod\chr19\chr19.fa"

4) Extract large portion of Chr19

Run fastar.py in extract mode to extract positions 10,000,000 to 20,000,000 (bp) from the collapsed chr19 UCSC genome data.

$ python fastar.py -m extract -s 10000000 -e 20000000 -f "C:\FASTAs\mm10-ucsc-mod\chr19\chr19.fa.collapse"

5) Adjust the annotation file

Because we just took an extract from the chromosome file, the annotations are now out of alignment.

Run fastar.py in refadjust mode to adjust the UCSC genome annotation file.

Obtain annotation:

1) MySQL server: `genome-mysql.soe.ucsc.edu` (username: `genome`, password: `<your-email>`)
2) Find `mm10` database. Find `refGene` table. Export as tab separated file. 
3) This is a relatively simple task using MySQL Workbench (use Table Data Export Wizard)

In our example, we have named this TSV file refGene.txt

$ python fastar.py -m refadjust -f "C:\FASTAs\mm10-ucsc-mod\chr19\refGene.txt" -o 10000000 -l 50

The -o flag (offset) should equal the -s flag used in step 4.

The -l flag indicates the length of each line in the original UCSC genome data file.

6) Find Foxb2 in the adjusted annotation file

Note: each coordinate in "adjusted" is 10,000,000 less than in "original".

Original: 713 NM_008023 chr19 - 16872315 16873830 16872353 16873640 1 16872315, 16873830, 0 Foxb2 cmpl cmpl 0,

Adjusted: 713 NM_008023 chr19 - 6872315 6873830 6872353 6873640 1 6872315, 6873830, 0 Foxb2 cmpl cmpl 0,
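The arithmetic behind these two rows can be reproduced with a short sketch. It assumes the standard UCSC refGene column order (txStart, txEnd, cdsStart and cdsEnd in columns 5 to 8; exonStarts and exonEnds in columns 10 and 11 as comma-terminated lists). The function name `shift_row` is hypothetical, and the sketch does not model how fastar.py uses the -l flag or how it treats genes outside the extract.

```python
OFFSET = 10_000_000  # the -s value used in step 4

original = [
    "713", "NM_008023", "chr19", "-", "16872315", "16873830",
    "16872353", "16873640", "1", "16872315,", "16873830,",
    "0", "Foxb2", "cmpl", "cmpl", "0,",
]


def shift_row(fields, offset):
    """Subtract offset from every coordinate column of a refGene row."""
    out = list(fields)
    for i in (4, 5, 6, 7):  # txStart, txEnd, cdsStart, cdsEnd
        out[i] = str(int(out[i]) - offset)
    for i in (9, 10):  # exonStarts, exonEnds (comma-terminated lists)
        out[i] = "".join(f"{int(v) - offset}," for v in out[i].split(",") if v)
    return out


adjusted = shift_row(original, OFFSET)
print(" ".join(adjusted))
# 713 NM_008023 chr19 - 6872315 6873830 6872353 6873640 1 6872315, 6873830, 0 Foxb2 cmpl cmpl 0,
```

Non-coordinate columns such as the bin, exon count and score are left untouched, which is why only the position values differ between the two rows.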

7) Extract gene from custom chr19 file

Run fastar.py in extract mode and extract Foxb2 from the output generated in step 4

Note: -s and -e flags match the adjusted gene start and end values from step 6

Refer to the MySQL table structure for what each column represents

$ python fastar.py -m extract -f "C:\FASTAs\mm10-ucsc-mod\chr19\chr19.fa.collapse.extract" -s 6872315 -e 6873830

8) Verify sequences match

Verify that the sequence provided by "Get DNA for Foxb2" (step 2) in the UCSC Genome Browser matches the extract produced in step 7.


fastar.py process: TCTCTCGACA...GTCCCCGCAA
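Comparing two long sequences by eye is error-prone, so a small helper can do it. The sketch below is an illustration under assumptions: the helper names are hypothetical, the comparison ignores case and line breaks, and the reverse-complement check is included only in case the two sources report opposite strands, which the source does not state.

```python
def normalise(seq):
    """Drop FASTA comment lines and whitespace, and upper-case the bases."""
    kept = [ln for ln in seq.splitlines() if not ln.startswith(">")]
    return "".join("".join(kept).split()).upper()


def reverse_complement(seq):
    """Reverse-complement a DNA string (A<->T, C<->G, N unchanged)."""
    return normalise(seq).translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def compare(a, b):
    """Return 'same', 'reverse complement' or 'different'."""
    na, nb = normalise(a), normalise(b)
    if na == nb:
        return "same"
    if na == reverse_complement(nb):
        return "reverse complement"
    return "different"


browser = ">example\ntctctc\nGACA"  # wrapped, mixed case, with a comment line
extract = "TCTCTCGACA"
print(compare(browser, extract))
# same
```

To use it on real data, read the two files into strings and pass them to `compare`.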

Note: if you followed this process precisely, the final file to inspect is: <dirs>/chr19.fa.collapse.extract.extract

9) Verification complete



  • See LICENSE