Can be used for modelling and manipulating
- sequences
- probes
- taxonomy
- alignments
Sequences can be created in several ways. Here are a few.
Simply pass a String or StringBuilder object to an appropriate method of SequenceFactory. The factory method will determine the sequence3 type based on the raw sequence.
String raw = "atggatctttctttcactctttcggtcgtgtcggccatcctcgccatcactgctgtgattgctgtatttattgtgat";
Sequence sequence = SequenceFactory.createSequence(raw);
System.out.println("created sequence = " + sequence);
will output
created sequence = UNNAMED SEQUENCE
ATGGATCTTTCTTTCACTCTTTCGGTCGTGTCGGCCATCCTCGCCATCACTGCTGTGATTGCTGTATTTATTGTGAT
Or if you know the sequence type, use the overloaded method
String raw = "atggatctttctttcactctttcggtcgtgtcggccatcctcgccatcactgctgtgattgctgtatttattgtgat";
Sequence sequence = SequenceFactory.createSequence(testSeqDNAone, SequenceType.DNA);
sequence.setSequenceName("New Sequence");
System.out.println(sequence);
will output
New Sequence
ATGGATCTTTCTTTCACTCTTTCGGTCGTGTCGGCCATCCTCGCCATCACTGCTGTGATTGCTGTATTTATTGTGAT
Of course, you can also use the type specific constructors, but then you'll have to make sure you have legal (e.a. correct alphabet), uppercase String variables:
DnaSequence dnaSeq = new DnaSequence(testSeqDNA);
RnaSequence rnaSeq = new RnaSequence(testSeqRNA);
Sequences are immutable for their sequence property, so you can only change sequences by creating a modified copy. Here are two ways to get a subsequence.
sequence.setSequenceName("Mighty big sequence");
Sequence subSeq = sequence.getSubSequence(0, 10);
//or, with a coordinates object
Sequence subSeq2 = sequence.getSubSequence(new SequenceCoordinates(0, 10));
System.out.println(subSeq);
will output
Mighty big sequence [FROM 0 TO 10]
ATGGATCTTT
Note that subsequencing follows Java substringing conventions: zero-based and from--to-but-not-including.
Reading from file always involves the SequenceReader
class. It can determine
the file type (sequence format) dynamically, or -preferred- you specify the sequence type
through the constructor (which is less error-prone of course).
Reading sequences from file will always return SequenceObject instances.
Reading can be done in two ways: streaming and batch (whole file).
When registering a SequenceReaderListener
, streaming will be used.
When no listener has been registered, bacth-processing is assumed.
The general form is this
SequenceReader reader = new SequenceReader(SequenceObjectType.YOUR_FORMAT, YOUR_FILE(S));
reader.read();
Currently, only Fasta, GenBank and GFF3 are supported.
All sequence formats support streaming processing. Simply register a listener and get notified whenever a new sequence is ready.
SequenceReader reader = new SequenceReader(fastaFile);
reader.addSequenceReaderListener(new SequenceReaderListener() {
@Override
public void sequenceRead(SequenceObject sequenceObject) {
System.out.println("sequence = " + (Sequence)sequenceObject);
}
@Override
public void sequenceReadingFinished() {
System.out.println("finished");
}
});
reader.read();
}
Reading Fasta sequences will return SequenceObject
instances that can be directly cast to Sequence
instances.
File fastaFile = new File("sample_data/dna_sequences_4.fa");
assertTrue(fastaFile.exists() && fastaFile.canRead());
SequenceReader reader = new SequenceReader(fastaFile);
reader.read();
ArrayList<SequenceObject> sequenceObjects = reader.getSequenceList();
for (SequenceObject seq : sequenceObjects) {
//since Fasta has been read, a direct cast to sequence is possible
System.out.println((Sequence)seq);
}
The Fasta description line, which has this form: gi|15215093|gb|AAH12662.1| Fhit protein [Mus musculus]
can be processed into an Annotations
object, using a static method of the Annotations class:
String line = ">gi|15215093|gb|AAH12662.1|TAXID|123456| Fhit protein [Mus musculus]";
Annotations a = Annotations.fromFastaDescriptionLine(line);
System.out.println(a);
will give
Annotations{annotations={description=[Fhit protein], organism=[Mus musculus], accession number=[gi|15215093, gb|AAH12662.1, TAXID|123456]}}
The annotations can be processed while reading from file by setting this flag on the SequenceReader object: reader.setParseFastaHeader(true);
GenBank entries are also read
SequenceReader reader = new SequenceReader(SequenceObjectType.GENBANK_SEQUENCE, genbankFile);
reader.read();
//I know there is only one GenBank entry in the file
GenbankEntry gbEntry = (GenbankEntry)(reader.getSequenceList().get(0));
System.out.println("GB Sequence entry:\n" + gbEntry);
System.out.println("\nnumber of genes = " + gbEntry.getElementList(SequenceElementType.GENE).size());
System.out.println("\nnumber of CDSs = " + gbEntry.getElementList(SequenceElementType.CDS).size());
System.out.println("\nnumber of rRNAs = " + gbEntry.getElementList(SequenceElementType.RRNA).size());
//Iterate the features
for (SequenceElementType type : gbEntry.getSequenceElementTypeList()) {
for (SequenceElement element : gbEntry.getElementList(type)) {
System.out.println("TYPE=" + type + "; CLASS=" + element.getClass().getSimpleName() + " ELEMENT=" + element);
}
}
Nucleic acid sequences can be complemented and revers complemented. Ambiguous nucleotides are supported.
NucleicAcidSequence dnaSeq = new DnaSequence("atggatctttcttaaa".toUpperCase());
System.out.println("dnaSeq = " + dnaSeq);
NucleicAcidSequence compl = dnaSeq.complement();
System.out.println("compl = " + compl);
DnaSequence ambiguousDna = new DnaSequence("ARWYGCTAN");
System.out.println("ambiguousDna = " + ambiguousDna);
DnaSequence ambRevCompl = (DnaSequence)ambiguousDna.reverseComplement();
System.out.println("ambRevCompl = " + ambRevCompl);
will output
dnaSeq = UNNAMED SEQUENCE
ATGGATCTTTCTTAAA
compl = UNNAMED SEQUENCE
TACCTAGAAAGAATTT
ambiguousDna = UNNAMED SEQUENCE
ARWYGCTAN
ambRevCompl = UNNAMED SEQUENCE
NTAGCRWYT
With nucleotides you can either find Open Reading Frames (ORFs) or translate them.
To do ORF finding, you need to pass a GeneAnalysisOptions object together with the sequence to be analysed.
Call GeneAnalysisOptions.getDefaults()
to get an object with default values.
GeneAnalysisOptions geneAnalysisOptions = new GeneAnalysisOptions();
geneAnalysisOptions.setMinimumOrfSize(20);
geneAnalysisOptions.setStrandSelection(SequenceStrand.BOTH);
geneAnalysisOptions.setOrfDefinition(OrfDefinition.STOP_TO_STOP);
OrfFinder orfFinder = new OrfFinder(dnaOne, geneAnalysisOptions);
orfFinder.start();
List<OpenReadingFrame> orfs = orfFinder.getOrfList();
for (OpenReadingFrame orf : orfs) {
System.out.println(orf);
}
To translate a single sequence between two positions:
StringBuilder translation = OrfFinder.doTranslation(1, dnaOne, 8, 56, "");
System.out.println("translation = " + translation);
To get a 3- or 6-frame translation, follow this use case
String[] translation = OrfFinder.doSixFrameTranslation(dnaOne);
for (String line : translation) {
System.out.println(line);
}
M D L S F T L S V V S A I L A I T A V I A V F I * K D
W I F L S L F R S C R P S S P S L L * L L Y L F K K
G S F F H S F G R V G H P R H H C C D C C I Y L K R
ATGGATCTTTCTTTCACTCTTTCGGTCGTGTCGGCCATCCTCGCCATCACTGCTGTGATTGCTGTATTTATTTAAAAAGAT
TACCTAGAAAGAAAGTGAGAAAGCCAGCACAGCCGGTAGGAGCGGTAGTGACGACACTAACGACATAAATAAATTTTTCTA
H I K R E S K R D H R G D E G D S S H N S Y K N L F I
S R E K V R E T T D A M R A M V A T I A T N I * F S
P D K K * E K P R T P W G R W * Q Q S Q Q I * K F L