The Annotated Data Model (ADM) is a family of Java classes that represent a text and a set of annotations on the text. The text, which would typically derive from a document, is represented as an ordinary Java String of UTF-16. The annotations are a collection of objects. Some annotations relate to the entire text; others to a span of text identified by a range of character offsets.
The ADM provides a representation for the annotations that are used by Basis Technology's Rosette platform.
It is not intended as a generalized representation, such as those defined by UIMA or GATE. There are always
tradeoffs between generality and efficiency. The ADM is faster and easier to work with because it does need
to allow for arbitrary annotations, either via a compiled schema or via complete runtime construction.
The ADM is defined as a set of Java classes, together with a Jackson module that provides a standard Json serialization representation.
The top-level object of the ADM is the AnnotatedText
object. An AnnotatedText
object represents a text.
This object contains the text itself, optional document metadata, and the annotations. In the ADM, the
annotations of the text are grouped into 'attributes'. As above, some attributes are data about the entire
text, and others concern spans.
The attributes are available in a Map<String, BaseAttribute>
. The AttributeKey
enum provides constants
for the defined attributes. In addition, there are convenience methods to retrieve the individual well-known
attributes.
The document metadata is a Map<String, List<String>>
, intended to model typical MIME or 'Dublin Core'
information.
Most of the annotations are lists of items that span ranges of text. For example, the TOKEN
annotation
describes the tokens of the text. It is implemented as a ListAttribute
. List attributes implement
java.util.List<Item extends BaseAttribute>
,
but they also guarantee that their contents are all of the same type.
All of the attributes inherit from BaseAttribute
. A base attribute consists only of a map that allows for
compatibility and extensibility. The Attribute
subclass adds in start and end offsets. Most of the
defined attributes inherit from it.
ADM classes are defined to be immutable. They include no APIs that modify their contents, and their implementations use immutable data structures. This allows arbitrary multi-threaded traversal.
To construct new ADM items, an application must use the provided Builder classes. A common pattern is for some module to accept an ADM data structure and return a new ADM data structure with additional annotations. To facilitate this, the Builders can be constructed over existing instances of their corresponding classes. Of course, since the objects are immutable, the Builders are free to reuse the input as part of the output.
Base Noun Phrases are spans of text, aligned with token boundaries.
Entities are the output of REX. Entities are a result of document-level grouping of mentions that refer to the same thing. Mentions are spans of characters, not necessarily aligned with tokens. See the REX documentation for the information delivered with an entity mention.
Relationship Mentions are the output of RELAX which identifies how entities and other noun phrases are connected to each other in the text. THIS IS AN ALPHA FEATURE OF THE ANNOTATED DATA MODEL.
This annotation provides for splitting a text into regions by language. For each region, the data structure provides details on the language detection.
This annotation provides language-detection results for the the entire text.
This annotation provides information on how the text can be divided into regions according to script, as defined in ISO15924.
This annotation describes the sentences of the text. By convention, the sentences cover the entire text. Whitespace at the end of each sentences is incorporated into that sentence.
The token annotations identify the tokens of the text and annotate them with the results of Rosette Base Linguistics.
This annotation provides a translation for the entire text into a language and script specified by the domain.
The translated tokens annotations provides a itemList of translations for each of the tokens in the text. Each annotation contains a domain identifying the language and script of the translations and a itemList of translations corresponding to the tokens of the text.
The adm-json artifact provides support for reading and writing Json via the Jackson library. Custom serialization and deserialization provides type-safe lists and accomplishes polymorphism with a minumum of additional space.
The artifact provides two Jackson modules. One writes the ADM as a graph of Json objects. The second avoids all the key names by using the 'array shape' feature of Jackson. In most cases, you can get space/time performance from the normal object model by using the Jackson 'SMILE' format or some other compression mechanism, but the array variation reduces overhead.
The library has a concept of the version of the serialized form. When
serializing, it adds a version
property, and when deserializing it
checks this version for compatibility. As of now, the format is at
version 1.1.0. The deserialization code accepts either '1.x.y' or
nothing for a version. Future versions may include the ability to read
and convert old formats.
This project builds several Maven artifacts.
The data model proper is in adm-model. This provides the classes of the model, and has dependencies, notably on Guava.
<dependency>
<groupId>com.basistech</groupId>
<artifactId>adm-model</artifactId>
<version>2.0.0</version>
</dependency>
adm-json provides classes for reading and writing the ADM with Jackson. It required Jackson 2.6.2, and declares dependencies on Jackson.
<dependency>
<groupId>com.basistech</groupId>
<artifactId>adm-json</artifactId>
<version>2.0.0</version>
</dependency>
mvn site site:stage
mvn scm-publish:publish-scm
Typically, this would be after a release:
mvn release:perform
cd target/checkout
mvn site site:stage
mvn scm-publish:publish-scm