ASCIIGenome

Text Only Genome Viewer!

Primary language: Java. License: MIT.

Description

ASCIIGenome is a command-line genome browser that runs in a terminal window and is based solely on ASCII characters. Since ASCIIGenome does not require a graphical interface, it is particularly useful for quickly visualizing genomic data on remote servers. The idea is to make ASCIIGenome the Vim of genome viewers.

As far as I know, the closest program to ASCIIGenome is samtools tview but ASCIIGenome offers much more flexibility, similar to popular GUI viewers like IGV.

Some key features:

  • Command line input and interaction, no graphical interface, minimal installation and requirements
  • Can load multiple files in various formats
  • Can access remote files via URL or ftp address
  • Easy navigation and searching of features and sequence motifs and filtering options
  • Support for BS-Seq alignment

Requirements and Installation

Installation quick start

In the commands below replace version number with the latest from releases:

wget https://github.com/dariober/ASCIIGenome/releases/download/v0.2.0/ASCIIGenome-0.2.0.zip
unzip ASCIIGenome-0.2.0.zip

cd ASCIIGenome-0.2.0/
chmod a+x ASCIIGenome
cp ASCIIGenome.jar /usr/local/bin/ # Or ~/bin/
cp ASCIIGenome /usr/local/bin/     # Or ~/bin/ 

Installation through Homebrew

ASCIIGenome can also be installed through brew / Linux Brew, although it is not yet an official package:

brew install https://raw.githubusercontent.com/dariober/ASCIIGenome/master/install/brew/asciigenome.rb

A little more detail

ASCIIGenome.jar requires Java 1.7+, which should be the only requirement. Virtually no installation is needed, as ASCIIGenome is pure Java and should work on most (all?) platforms. Download the zip file ASCIIGenome-x.x.x.zip from releases, unzip it and execute the jar file with:

java -jar /path/to/ASCIIGenome.jar --help

To avoid typing java -jar ... every time, you can put both the helper script ASCIIGenome and the jar file ASCIIGenome.jar in the same directory in your PATH and execute with:

ASCIIGenome [options]

Note that the helper is a bash script. To set the amount of memory available to Java, use the -Xmx option, e.g. java -Xmx1500m -jar ....

If for some reason the text formatting misbehaves, disable it with the -nf option. I have developed ASCIIGenome on MacOS, Ubuntu and CentOS with bash 4.1 and a white terminal background.

Usage examples

These examples show just some of the functionality, to give an idea of what ASCIIGenome can do.

Minimal example

Open an indexed bam file, as simple as:

ASCIIGenome aln.bam

Open with a reference genome (reference must be indexed, see Supported input):

ASCIIGenome -fa genome.fa aln.bam

Open and browse

Open some peak and bigWig files from ENCODE. Note that opening remote bigwig files is a little slow (IGV seems equally slow) and it might not work with some proxy settings (see also issue#6):

encode=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs

ASCIIGenome -g hg19 \
    $encode/wgEncodeSydhTfbsGm10847NfkbTnfaIggrabPk.narrowPeak.gz \
    $encode/wgEncodeSydhTfbsGm10847NfkbTnfaIggrabSig.bigWig \
    $encode/wgEncodeSydhTfbsGm12892Pol2IggmusPk.narrowPeak.gz \
    $encode/wgEncodeSydhTfbsGm12892Pol2IggmusSig.bigWig

Find the first feature in the first file, then change the colour of one of the tracks, reset the y axes to span 0 to 50, and finally save as png to the default file name:

[h] for help: next #1
[h] for help: colorTrack magenta wgEncodeSydhTfbsGm12892Pol2IggmusSig
[h] for help: ylim 0 50
[h] for help: save .png

Result on terminal screen should look like this:

Saved file is chr1_996137-1003137.png (currently the png output doesn't include colours though).

Finding & filtering stuff

Once started, ASCIIGenome makes it easy to browse the genome. The picture below shows the distribution of transcripts on chromosome 36 of Leishmania major. It is clearly visible how transcripts in Leishmania tend to be grouped in blocks transcribed in the same direction (blue: forward strand, pink: reverse strand). Note how overlapping features are stacked on top of each other.

This screenshot has been produced by first loading the L. major GTF file:

ASCIIGenome ftp://ftp.ensemblgenomes.org/pub/release-31/protists/gtf/leishmania_major/Leishmania_major.ASM272v2.31.gtf.gz

At command prompt issue the following commands:

[h] for help: goto 36:1-2682151
[h] for help: filter \ttranscript\t
[h] for help: trackHeight 100

Now return to the start of the chromosome, find the first feature containing LmjF.36.TRNAGLN.01 and print it to screen:

[h] for help: 1
[h] for help: find_first LmjF.36.TRNAGLN.01
[h] for help: print 

Now showing:

Chaining commands

Commands need not be executed one at a time; they can be chained with the && operator (as in Bash). This is more convenient than executing commands one by one, and it is also faster as tracks are processed only once. For example, the commands above could be executed in one pass as:

goto 36:1-2682151 && filter \ttranscript\t && trackHeight 100

In addition, the same could be achieved at the start via the --exec/-x option:

ASCIIGenome -x 'goto 36:1-2682151 && filter \ttranscript\t && trackHeight 100' \
    ftp://ftp.ensemblgenomes.org/pub/release-31/protists/gtf/leishmania_major/Leishmania_major.ASM272v2.31.gtf.gz

Note that if the first option passed to --exec/-x starts with - you need to add a space between the opening quote and the option itself. For example, use ASCIIGenome -x ' -F 16' ... instead of ASCIIGenome -x '-F 16' ....

Supported input

File name extensions matter, as file types are usually recognized by their extension (matched case-insensitively).

  • bam files should be sorted and indexed, e.g. with samtools sort and samtools index. Paths to remote URLs are supported but painfully slow (IGV seems to suffer from the same issue).
  • bigWig recognized by extension .bw or .bigWig. Remote URLs supported.
  • bedGraph recognized by extension .bedGraph or .bedgraph
  • bed, gtf, gff recognized by respective extensions. Remote URLs supported.
  • tdf This is very useful for quickly displaying very large intervals, like tens of megabases or entire chromosomes; see tdf
  • vcf Supported but not too sophisticated representation. URL should be supported but it appears ftp from 1000genomes doesn't work (same for IGV).
  • All other extensions (e.g. txt, narrowPeak) will be treated as bed files, provided the format is actually bed!

All plain text formats (bed, bedgraph, etc) can be read as gzipped and there is no need to decompress them.
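
The extension rules above can be summarised in a short Python sketch. This is only an illustration of the rules as listed, not ASCIIGenome's actual code: the function name guess_track_type and the mapping table are hypothetical, and the .tdf and .vcf extensions are assumed from the format names.

```python
import os

# Hypothetical mapping from lower-cased extension to track type, assembled
# from the list above. The ".tdf" and ".vcf" entries are assumptions.
EXTENSION_TO_TYPE = {
    ".bam": "bam",
    ".bw": "bigWig",
    ".bigwig": "bigWig",
    ".bedgraph": "bedGraph",
    ".bed": "bed",
    ".gtf": "gtf",
    ".gff": "gff",
    ".tdf": "tdf",
    ".vcf": "vcf",
}


def guess_track_type(path):
    """Guess the track type from a file name or URL.

    Matching ignores case and an optional trailing .gz; any unknown
    extension falls back to "bed", as described in the list above.
    """
    name = os.path.basename(path).lower()
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    extension = os.path.splitext(name)[1]
    return EXTENSION_TO_TYPE.get(extension, "bed")
```

For example, guess_track_type("peaks.narrowPeak.gz") falls through to "bed", while guess_track_type("signal.BigWig") gives "bigWig".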

Bedgraph files should be sorted by position; a sort -k1,1 -k2,2n will do. Unindexed bedGraph files are first bgzipped and indexed to temporary files, which are deleted on exit. This can take time for large files, so consider creating the index once and for all with tabix, e.g.

bgzip my.bedgraph &&
tabix -p bed my.bedgraph.gz
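
Before indexing, it can help to check that a bedGraph file is already in the order that sort -k1,1 -k2,2n would produce. The following Python sketch does that for a list of lines. It is an illustration only: the function names are hypothetical, the file is assumed to be tab-separated with no track or browser header lines, and chromosome names are compared by code point, which corresponds to sort under LC_ALL=C.

```python
def position_key(line):
    """Return (chromosome, start) for a tab-separated bedGraph line."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], int(fields[1]))


def is_position_sorted(lines):
    """True if records are ordered by chromosome name, then numeric start.

    Blank lines and lines starting with '#' are skipped.
    """
    keys = [
        position_key(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

A file that fails this check should be sorted before running bgzip and tabix.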

Bed and gtf files are not required to be sorted or indexed, but in that case they are loaded in memory. To save memory and time for large files you can again index them as above. Loading in memory is typically fast for files of up to about half a million records.

For input format specs see also UCSC format and for guidelines on the choice of format see IGV recommendations.

Fasta reference: The reference sequence should be uncompressed and indexed, with e.g. samtools faidx:

samtools faidx genome.fa

Notable formats currently not supported: cram, bigBed.

bigBed files can be converted to bgzipped bed format with bigBedToBed from the UCSC utilities and then indexed with tabix. For example:

bigBedToBed input.bb /dev/stdout | bgzip > input.bed.gz
tabix -p bed input.bed.gz

Genome option

An optional genome file can be passed to option -g/--genome to give a set of allowed sequences and their sizes so that browsing is constrained to the real genomic space. The genome file is also used to represent the position of the current window on the chromosome, which is handy to navigate around.

There are three ways to pass a genome file:

  • A tag identifying a built-in genome, e.g. hg19. See genomes for available genomes

  • A local file, tab separated with columns chromosome name and length. See genomes for examples.

  • A bam file with suitable header.

Note that if the input list of files contains a bam file, the --genome option is effectively ignored as the genome dictionary is extracted from the bam header.
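
For the local-file option, one way to obtain the two-column table is to derive it from a fasta index. The Python sketch below assumes the .fai layout written by samtools faidx (sequence name in column 1, sequence length in column 2) and assumes the genome file needs no header line. The function name fai_to_genome is hypothetical.

```python
def fai_to_genome(fai_text):
    """Convert the text of a .fai index to 'name<TAB>length' lines.

    Only the first two columns of each index row are kept; empty rows
    are ignored.
    """
    rows = []
    for row in fai_text.splitlines():
        if not row.strip():
            continue
        fields = row.split("\t")
        name, length = fields[0], int(fields[1])
        rows.append(f"{name}\t{length}\n")
    return "".join(rows)
```

The returned text could be written to a file, for example my.genome, and passed to -g.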

Formatting of reads and features

When aligned reads are shown at single base resolution, read bases follow the same convention as samtools: upper case letters and . for reads aligned to the forward strand, lower case and , otherwise; second-in-pair reads are underlined; grey-shaded reads have mapping quality of <=5.

GTF/GFF features are coded according to the feature column as shown below. For forward strand features the colour blue and upper case are used; for reverse strand features the colour is pink and the case is lower. Features with no strand information are in grey.

Feature Symbol
exon E
cds C
start_codon A
stop_codon Z
utr U
3utr U
5utr W
gene G
transcript T
mrna M
trna X
rrna R
mirna I
ncrna L
lncrna L
sirna S
pirna P
snorna O

If available, the feature name is shown on the feature itself. The feature name has a trailing underscore to separate it from the rest of the feature representation. The last character of the feature is always the feature type. For example, the feature named myGene appears as:

myGene_EEEEEEEEE ## Enough space for the full name
myGenE           ## Not enough space, name truncated and last char is E
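
The naming rule can be sketched in Python as follows. This is an illustration that reproduces the two cases above, not the actual rendering code: the function name render_feature is hypothetical, and the behaviour when the name fits exactly (no room for both underscore and symbol) is an assumption.

```python
def render_feature(name, symbol, width):
    """Render a feature as text of exactly `width` characters.

    If there is room, the name is followed by an underscore and the
    feature symbol repeated to fill the space. Otherwise the name is
    truncated and the last character is the feature symbol.
    """
    if width <= 0:
        return ""
    if not name:
        return symbol * width
    if len(name) + 1 < width:
        return name + "_" + symbol * (width - len(name) - 1)
    # Assumption: when the name does not fit, truncate it and end with the symbol.
    return name[: width - 1] + symbol
```

With this sketch, render_feature("myGene", "E", 16) gives the first line above and render_feature("myGene", "E", 6) gives the second. Lower case for reverse strand features is left out for brevity.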

For BED features, name is taken from column 4, if available. Default for GTF/GFF is to take name from attribute Name, if absent try: ID, transcript_name, transcript_id, gene_id, gene_name. To choose an attribute see command gffNameAttr.

Read coverage tracks at single base resolution show the consensus sequence obtained from the underlying reads. If the reference fasta file is present, the = symbol denotes a match. Heterozygous bases or variants are shown using the IUPAC ambiguity codes for up to two variants (N otherwise). Variants are called with a not-too-sophisticated heuristic: only base qualities >= 20 are considered, and an alternative allele is called if it is supported by at least 3 reads and makes up at least 1% of the total reads. The first and second alleles together must make up at least 98% of the total reads, otherwise the base is N (see PileupLocus.getConsensus() for the exact implementation). Insertions and deletions are currently not considered.
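
To make the thresholds concrete, here is a Python sketch of the heuristic as described in the paragraph above. It is not a port of PileupLocus.getConsensus(): the function name, the input format (a list of (base, quality) pairs) and the handling of an empty pileup are assumptions, and the 98% rule is read as applying to the two most frequent alleles regardless of whether the second one is called.

```python
from collections import Counter

# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(pileup, ref_base=None, min_qual=20, min_alt_reads=3,
                   min_alt_fraction=0.01, min_top2_fraction=0.98):
    """Return the consensus symbol for one position.

    pileup: list of (base, base_quality) pairs for the reads at this position.
    ref_base: reference base, or None if no reference is loaded.
    """
    counts = Counter(base.upper() for base, qual in pileup if qual >= min_qual)
    total = sum(counts.values())
    if total == 0:
        return " "  # Assumption: nothing is drawn when no base passes the filter.

    # Rank alleles by count; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    first_base, first_count = ranked[0]
    second_base, second_count = ranked[1] if len(ranked) > 1 else (None, 0)

    # The two most frequent alleles must account for almost all reads.
    if (first_count + second_count) / total < min_top2_fraction:
        return "N"

    alt_called = (
        second_base is not None
        and second_count >= min_alt_reads
        and second_count / total >= min_alt_fraction
    )
    if alt_called:
        return IUPAC_PAIRS.get(frozenset((first_base, second_base)), "N")
    if ref_base is not None and first_base == ref_base.upper():
        return "="
    return first_base
```

For instance, seven A and three G bases of good quality give R in this sketch, while eight A and two G give A because the second allele has fewer than 3 supporting reads.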

Saving screenshots

Screenshots can be saved to file with the command save. Output format is either ASCII text or png, depending on the file name extension. For example:

[h] for help: save mygene.txt ## Save to mygene.txt as text
[h] for help: save            ## Save to chrom_start-end.txt as text
[h] for help: save .png       ## Save to chrom_start-end.png as png
[h] for help: save mygene.png ## Save to mygene.png as png

Without arguments, save writes to file named after the current genomic position e.g. chr1_1000-2000.txt. The ANSI formatting (i.e. colours) is stripped before saving so that files can be viewed on any text editor (use a monospace font like courier).
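
The two behaviours described here, the default file name and the removal of colour codes, can be sketched in Python. The function names are hypothetical, and the regular expression only covers ANSI SGR (colour and style) sequences of the form ESC[...m, which is an assumption about what the formatting consists of.

```python
import re

# Matches ANSI SGR sequences such as "\x1b[34m" (colour) or "\x1b[0m" (reset).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Remove ANSI colour/style escape sequences from text."""
    return ANSI_SGR.sub("", text)


def default_save_name(chrom, start, end, extension="txt"):
    """Build a file name like chr1_1000-2000.txt from the current position."""
    return f"{chrom}_{start}-{end}.{extension}"
```

For example, default_save_name("chr1", 1000, 2000) gives chr1_1000-2000.txt.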

Tips, gotchas and miscellanea

  • Performance Alignment files are typically accessed very quickly but ASCIIGenome becomes slow when the window size grows above a few hundred kilobases. Annotation files (bed, gff, gtf) are loaded in memory unless they are indexed with tabix.

  • Regular expression Use the (?i) modifier to match in case insensitive mode, e.g. '(?i).actb.'

  • When displaying bam files, ASCIIGenome is hardcoded to disable the coverage and read tracks if the window size is >100,000 bp. This is to prevent browsing from becoming horribly slow. To display such large windows consider the bigWig or tdf file formats.

  • When opening bam files, the first chromosome is often the mitochondrial chromosome chrM (or chrMT) which often has very high read depth (say 10,000x). This can make the opening slow. Consider using the -r option in these cases. E.g. ASCIIGenome -r chr1 file1.bam file2.bam ...

Interactive commands

The description of each interactive command is here

As there is no GUI, everything is handled through the command line. Once ASCIIGenome is started, enter a command and press ENTER to execute it.

Some features of Unix console are enabled:

  • Arrow keys UP and DOWN scroll previous commands.
  • TAB auto-completes commands.
  • ENTER without any argument repeats the previous command.

Examples:

[h] for help: ff <ENTER>   ## Move forward
[h] for help: <ENTER>      ## Move forward again...
[h] for help: <ENTER>      ## ... and again
[h] for help: col <TAB>    ## Is expanded to colorTrack
[h] for help: <ARROW UP>   ## Shows previous command
[h] for help: h <ENTER>    ## Show help.

When track names are passed as arguments, it is not necessary to give the full name as partial matching is enabled. This is handy since track names have an ID appended as suffix which can be used in place of the full name, e.g. next myLongfileName.bed#1 can be also typed as next #1.
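
A rough Python sketch of partial matching, assuming it behaves like a plain substring test (the real matching rule may differ, and the function name match_tracks is hypothetical):

```python
def match_tracks(pattern, track_names):
    """Return the track names that contain `pattern` as a substring."""
    return [name for name in track_names if pattern in name]
```

Under this assumption, #1 selects myLongfileName.bed#1, but it would also select a track with suffix #10, so a longer fragment of the name is safer when many tracks are loaded.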

Credits

  • Bam processing is mostly done with the samtools/htsjdk library.
  • Bigwig and tdf are processed with classes from IGV source code.
  • Block compression and indexing done using jvarkit.
  • Brew installation thanks to dalloliogm.