/FALDO-paper


This GitHub repository is used to draft a scientific manuscript describing FALDO, a formal ontology for Feature Annotation Locations in RDF: https://github.com/JervenBolleman/FALDO

FALDO was started at the BioHackathon 2012 meeting in Japan: https://github.com/dbcls/bh12/wiki/Feature-annotation-locations-in-RDF

Citation

A preprint of this work is now available to be cited as follows:

Jerven Bolleman, Christopher J. Mungall, Francesco Strozzi, Joachim Baran, Michel Dumontier, Raoul J. P. Bonnal, Robert Buels, Robert Hoehndorf, Takatomo Fujisawa, Toshiaki Katayama, Peter J. A. Cock (2014) FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation. bioRxiv http://dx.doi.org/10.1101/002121 http://biorxiv.org/content/early/2014/01/31/002121

Formal journal submission is planned shortly.

LaTeX

We are currently writing the manuscript in LaTeX using a BMC journal template (files named bmc_article.*). The primary file is location.tex, which includes the sections as separate child files:

  • abstract.tex - Abstract
  • background.tex - Background
  • implementation.tex - Implementation
  • results.tex - Results
  • discussion.tex - Discussion
  • conclusions.tex - Conclusions
  • avareq.tex - Availability and requirements

To produce the whole PDF file, use LaTeX and BibTeX:

$ pdflatex location.tex
$ bibtex location
$ pdflatex location.tex

FALDO

FALDO is the Feature Annotation Location Description Ontology. It is a simple ontology to describe sequence feature positions and regions as found in GFF3, DDBJ, EMBL, GenBank files, UniProt, and many other bioinformatics resources.

The aim of this ontology is to describe the position of a sequence region or a feature. It does not aim to describe the features or regions themselves, but instead depends on resources such as the Sequence Ontology or the UniProt core ontology.

Examples

The examples are written in Turtle and omit the prefix declarations for space reasons.

Known positions

faldo:Region: a genomic region located on the reference genome sequence. In this example the begin is known exactly, while the end is a fuzzy position that lies between positions 3 and 7:

<_:1> a faldo:Region ;
           faldo:begin <_:1b> ;
           faldo:end <_:1e> .

<_:1b> a faldo:Position ; 
           a faldo:ExactPosition ;
           a faldo:ForwardStrandPosition ;
            faldo:position "1"^^xsd:integer ;
            faldo:reference ddbj:XXXDSDS .

<_:1e> a faldo:Position ;
           a faldo:FuzzyPosition ;
           a faldo:ForwardStrandPosition ;
           faldo:begin <_:1ea> ;
           faldo:end <_:1eb> ;
           faldo:reference ddbj:XXXDSDS .

<_:1ea> a faldo:Position ;
        a faldo:ExactPosition ;
        a faldo:ForwardStrandPosition ;
           faldo:position "3"^^xsd:integer ;
           faldo:reference ddbj:XXXDSDS .

<_:1eb> a faldo:Position ;
        a faldo:ExactPosition ;
        a faldo:ForwardStrandPosition ;
           faldo:position "7"^^xsd:integer ;
           faldo:reference ddbj:XXXDSDS .

A genomic region where the begin is on one contig and the end on another:

<_:2> a faldo:Region ;
           faldo:begin <_:2b> ;
           faldo:end <_:2e> .
<_:2b> a faldo:Position ; 
            a faldo:ExactPosition ;
            faldo:position "1"^^xsd:integer ;
            faldo:reference <_:contig17> .
<_:2e> a faldo:Position; 
           a faldo:ExactPosition ;
           faldo:position "4"^^xsd:integer ;
           faldo:reference <_:contig29> .
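
One consequence of letting each position carry its own reference is that a region's length cannot always be computed by subtracting coordinates. The Python sketch below illustrates this with plain data classes. The names Position, Region and spans_references are illustrative assumptions and are not part of FALDO or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical stand-in for a faldo:ExactPosition."""
    reference: str  # identifier of the sequence the coordinate counts along
    position: int   # one-based coordinate on that reference


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a faldo:Region."""
    begin: Position
    end: Position


def spans_references(region: Region) -> bool:
    """Return True when begin and end are counted on different references."""
    return region.begin.reference != region.end.reference


# Mirrors the contig example: begin on contig17, end on contig29.
region_2 = Region(
    begin=Position(reference="contig17", position=1),
    end=Position(reference="contig29", position=4),
)
print(spans_references(region_2))  # True
```

When spans_references returns True, the two coordinates belong to different sequences, so a difference such as end minus begin has no meaning.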

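A rather crucial difference from most begin and end conventions is that here they are the biological begin and end, not the smallest coordinate as start and the largest as end. For a feature on the reverse strand, the begin position is therefore numerically larger than the end position.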

----->increasing count of position
123456789012345678901234567890
actgacgactagatcgatcgatcgactagt

tgactgctgatctagctagctagctgatca
     <----- direction of transcription 
     |    |--transcription on reverse strand begins here
     |--transcription on reverse strand ends here      

For example, the cheY gene in Escherichia coli str. K-12 substr. MG1655 is described in the INSDC feature table as complement(1965072..1965461), which is 390 base pairs using inclusive one-based counting. In FALDO:

<_:geneCheY> a <http://purl.obolibrary.org/obo/SO_0000704> ; # A gene as defined by the Sequence Ontology
           rdfs:label "cheY" ;
           faldo:location <_:example> .

uniprot:P0AE67 up:encodedBy <_:geneCheY> .

<_:example> a faldo:Region ;
           faldo:begin <_:example_b> ;
           faldo:end <_:example_e> .

<_:example_b> a faldo:Position ,
                faldo:ExactPosition ,
                faldo:ReverseStrandPosition ;
            faldo:position "1965461"^^xsd:integer ; #see the end is smaller than the begin
            faldo:reference refseq:NC_000913.2 .


<_:example_e> a faldo:Position ,
                faldo:ExactPosition ,
                faldo:ReverseStrandPosition ;
            faldo:position "1965072"^^xsd:integer ; #see the end is smaller than the begin
            faldo:reference refseq:NC_000913.2 .
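
The mapping from an INSDC location to biological begin and end values can be sketched in a few lines of Python. This is a minimal illustration, not a FALDO tool: the function names insdc_to_faldo and region_length are hypothetical, and only the two simplest location forms, a..b and complement(a..b), are handled.

```python
import re

# Matches only "a..b" and "complement(a..b)"; joins, fuzzy bounds and
# other INSDC operators are deliberately out of scope for this sketch.
_LOCATION = re.compile(r"^(?:complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+))$")


def insdc_to_faldo(location):
    """Return the biological begin, end and strand class for a location."""
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    is_reverse = match.group(1) is not None
    if is_reverse:
        low, high = int(match.group(1)), int(match.group(2))
        # On the reverse strand the feature begins at the higher coordinate.
        return {"begin": high, "end": low,
                "strand": "faldo:ReverseStrandPosition"}
    low, high = int(match.group(3)), int(match.group(4))
    return {"begin": low, "end": high,
            "strand": "faldo:ForwardStrandPosition"}


def region_length(begin, end):
    """Length under inclusive one-based counting, whatever the strand."""
    return abs(end - begin) + 1


chey = insdc_to_faldo("complement(1965072..1965461)")
print(chey["begin"], chey["end"], region_length(chey["begin"], chey["end"]))
```

For the cheY location this yields a begin of 1965461, an end of 1965072 and a length of 390, matching the values in the example.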

Fuzzy positions

Assume we have a protein amino acid sequence "ACK" and a mass spectrometry experiment says that either amino acid A or C is glycosylated, but we don't know which of the two it is. We do know it is not "K".

<_:glycosylatedAminoAcid> a glycan:glycol:glycosylated_AA ; # The glycan ontology is used here
            faldo:location <_:fuzzyPosition> .
<_:fuzzyPosition> a faldo:FuzzyPosition ,
                faldo:InRangePosition ;
            faldo:begin <_:exactBegin> ;
            faldo:end <_:exactEnd> .
<_:exactBegin> a faldo:ExactPosition ;
            faldo:position 1 ;
            faldo:reference <_:sequence> .
<_:exactEnd> a faldo:ExactPosition ;
            faldo:position 2 ;
            faldo:reference <_:sequence> .
<_:sequence> a uniprot:Sequence ;
           rdf:value "ACK" .

In the above example, the glycan and uniprot prefixes refer to the glycoprotein and UniProt schemas, respectively.
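
To make the meaning of an in-range position concrete, the following Python sketch enumerates the residues that a range of one-based positions can refer to. The helper names candidate_positions and candidate_residues are assumptions made for illustration only and do not come from FALDO.

```python
def candidate_positions(begin, end):
    """Return every one-based position the fuzzy position may occupy."""
    if begin > end:
        raise ValueError("begin must not exceed end for this sketch")
    return list(range(begin, end + 1))


def candidate_residues(sequence, begin, end):
    """Return the residues at the candidate positions of a sequence."""
    # Positions are one-based, Python string indices are zero-based.
    return [sequence[p - 1] for p in candidate_positions(begin, end)]


sequence = "ACK"
residues = candidate_residues(sequence, 1, 2)
print(residues)  # ['A', 'C']
```

With begin 1 and end 2 on the sequence "ACK", the candidates are A and C, and K is excluded, as in the glycosylation example.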

Probabilistic fuzzy positions

Here we have a begin position that could be one of two nucleotides. This case uses a probabilistic model that denotes that the feature could start at either position 1 or position 2. Position 1 has a likelihood of 0.1 and position 2 has a likelihood of 0.9.

<_:3> a    faldo:Region ;
           faldo:begin <_:3b> ;
           faldo:end <_:3e> .

<_:3b> a   faldo:ProbablePosition ;
           faldop:posibilities ( <_:3bp1> <_:3bp2> ) .

<_:3bp1> a faldop:ProbablePosition ;
           faldop:probability "0.1"^^xsd:double ;
           faldop:location <_:3bb1> .

<_:3bp2> a faldop:ProbablePosition ;
           faldop:probability "0.9"^^xsd:double ;
           faldop:location <_:3bb2> .
<_:3bb1> a faldo:Position ,
           faldo:ExactPosition ;
           faldo:position "1"^^xsd:integer ;
           faldo:reference <_:1Strand> .

<_:3bb2> a faldo:Position ,
           faldo:ExactPosition ;
           faldo:position "2"^^xsd:integer ;
           faldo:reference <_:1Strand> .
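
A consumer of such data would probably want to check that the likelihoods form a distribution and then pick a position. The Python sketch below shows one way to do that. It is an assumption about how the values might be used rather than anything FALDO prescribes, and the names is_valid_distribution, most_likely_position and expected_position are hypothetical.

```python
import math

# (position, likelihood) pairs taken from the example above.
candidates = [(1, 0.1), (2, 0.9)]


def is_valid_distribution(cands, tolerance=1e-9):
    """Check that every likelihood is in [0, 1] and that they sum to 1."""
    if any(p < 0.0 or p > 1.0 for _, p in cands):
        return False
    total = sum(p for _, p in cands)
    return math.isclose(total, 1.0, abs_tol=tolerance)


def most_likely_position(cands):
    """Return the position with the highest likelihood."""
    return max(cands, key=lambda pair: pair[1])[0]


def expected_position(cands):
    """Return the likelihood-weighted mean position."""
    return sum(pos * p for pos, p in cands)


print(is_valid_distribution(candidates))
print(most_likely_position(candidates))
```

For the two candidates in the example, the most likely begin is position 2, and the weighted mean position is approximately 1.9.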

License

This work is licensed under a Creative Commons Attribution 3.0 Unported License.