TeXoo – A Zoo of Text Extractors

TeXoo is a framework for Deep Learning based text analytics in Java, developed at DATEXIS, Beuth University of Applied Sciences Berlin. TeXoo comes with an NLP-style document model and a zoo of Deep Learning extraction models, which you can access in the texoo-models module. Here is a brief overview:

Features

  • Java Framework for language-independent text extraction
  • Language-independent document model
  • Convenient document readers with tokenization and sentence splitting
  • Named Entity Recognition
  • Named Entity Linking
  • Topic Classification and Segmentation

Getting Started

These instructions will get you a copy of TeXoo up and running on your local machine for development and testing purposes. If you only use TeXoo as a Maven dependency, you can skip this step.

Prerequisites

TeXoo comes with a Dockerfile that contains all software necessary to run on most systems, including the CUDA 10 toolkit for GPUs.

The following dependencies are required if you are planning to run TeXoo locally. They are already contained in the Dockerfile:

Installation

First we need to build a docker image with all dependencies (including CUDA 10.1):

  • run docker build -t texoo .

And then we're ready to build TeXoo from source:

  • run bin/run-docker texoo-build

Usage

Command Line

The bin/ directory contains several run scripts. You can start them directly in the Docker container, e.g. to run all JUnit tests:

  • run bin/run-docker texoo-test, or
  • run bin/run-docker-cuda texoo-test

See the Modules Overview for more examples.

Maven Dependency

To use TeXoo NER in your Java project, just add the following dependencies to your pom.xml:

<dependency>
  <groupId>de.datexis</groupId>
  <artifactId>texoo-core</artifactId>
  <version>1.3.3</version>
  <type>jar</type>
</dependency>
<dependency>
  <groupId>de.datexis</groupId>
  <artifactId>texoo-entity-recognition</artifactId>
  <version>1.3.3</version>
  <type>jar</type>
</dependency>

To enable CUDA support, add the following dependencies to your project:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-9.2-platform</artifactId>
  <version>${dl4j.version}</version>
</dependency>
<!-- DL4j cuDNN -->
<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-cuda-9.2</artifactId>
  <version>${dl4j.version}</version>
</dependency>
<!-- DL4j CUDA + cuDNN binaries -->
<dependency>
  <groupId>org.bytedeco.javacpp-presets</groupId>
  <artifactId>cuda</artifactId>
  <version>9.2-7.1-1.4.2</version>
  <classifier>linux-x86_64-redist</classifier>
</dependency>

To enable AVX512 CPU optimizations, add the following dependency to your project:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native</artifactId>
  <classifier>linux-x86_64-avx512</classifier>
</dependency>

See the examples module for some implementation examples.

texoo-core – Document Model and Core Library

Package / Class         Description
de.datexis.model        TeXoo Document model (see below)
de.datexis.encoder      Implementations of Bag-of-words, Word2Vec, Trigrams, etc.
DocumentFactory         Factory to create Document objects from text
RawTextDatasetReader    Reader to create Datasets from files
AnnotatorFactory        Factory to create and load models from the zoo
ObjectSerializer        Helper methods to import/export JSON

texoo-entity-recognition – Named Entity Recognition (NER)

This module contains Annotators for Named Entity Recognition (NER). The underlying deep learning model is robust and can be trained with only 4000-5000 sentences. It is based on a bidirectional LSTM with letter-trigram encoding; see http://arxiv.org/abs/1608.06757.
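
To give an idea of what a letter-trigram encoding operates on, here is a minimal, self-contained sketch that splits a word into overlapping three-letter windows. It is not TeXoo code: the class name LetterTrigramSketch, the '#' boundary marker and the lowercasing are assumptions made for illustration, and the encoder in de.datexis.encoder may differ in all of these details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative letter-trigram splitter; hypothetical, not part of the TeXoo API. */
final class LetterTrigramSketch {

  /** Boundary marker added around each word (an assumption of this sketch). */
  static final char BOUNDARY = '#';

  /** Returns all overlapping 3-character windows of the lowercased, padded word. */
  static List<String> trigrams(String word) {
    String padded = BOUNDARY + word.toLowerCase(Locale.ROOT) + BOUNDARY;
    List<String> result = new ArrayList<>();
    for (int i = 0; i + 3 <= padded.length(); i++) {
      result.add(padded.substring(i, i + 3));
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [#be, ber, erl, rli, lin, in#]
    System.out.println(trigrams("Berlin"));
  }
}
```

Because the windows overlap, misspelled or previously unseen words still share most of their trigrams with known words, which is one plausible reason such an encoding copes well with idiosyncratic vocabulary.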

Command Line Usage:

  • run bin/run-docker texoo-annotate-ner
usage: texoo-annotate-ner -i <arg> [-o <arg>]
TeXoo: run pre-trained MentionAnnotator model
 -i,--input <arg>    path or file name for raw input text
 -o,--output <arg>   path to create and store the output JSON, otherwise dump to stdout
  • run bin/run-docker texoo-train-ner
usage: texoo-train-ner -i <arg> [-l <arg>] -o <arg> [-t <arg>] [-u] [-v
       <arg>]
TeXoo: train MentionAnnotator with CoNLL annotations
 -i,--input <arg>        path to input training data (CoNLL format)
 -l,--language <arg>     language to use for sentence splitting and
                         stopwords (EN or DE)
 -o,--output <arg>       path to create and store the model
 -t,--test <arg>         path to test data (CoNLL format)
 -u,--ui                 enable training UI (http://127.0.0.1:9000)
 -v,--validation <arg>   path to validation data (CoNLL format)
  • run bin/run-docker texoo-train-ner-seed
usage: texoo-train-ner-seed -i <arg> -o <arg> -s <arg> [-u]
TeXoo: train MentionAnnotator with seed list
 -i,--input <arg>    path and file name pattern for raw input text
 -o,--output <arg>   path to create and store the model
 -s,--seed <arg>     path to seed list text file
 -u,--ui             enable training UI (http://127.0.0.1:9000)
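
The -i, -t and -v options of texoo-train-ner expect data in CoNLL format. As a rough orientation, the following sketch reads a CoNLL-style string with one token per line and a blank line between sentences. It assumes the CoNLL-2003 convention of the token in the first column and the label in the last column; the class ConllSketch is hypothetical and is not the TeXoo CoNLLDatasetReader, which may expect a different column layout.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative CoNLL-style reader; hypothetical, not the TeXoo CoNLLDatasetReader. */
final class ConllSketch {

  /** A token together with the label found in the last column. */
  static final class TaggedToken {
    final String text;
    final String tag;

    TaggedToken(String text, String tag) {
      this.text = text;
      this.tag = tag;
    }
  }

  /** Splits the input into sentences at blank lines and reads one token per line. */
  static List<List<TaggedToken>> parse(String conll) {
    List<List<TaggedToken>> sentences = new ArrayList<>();
    List<TaggedToken> current = new ArrayList<>();
    for (String line : conll.split("\\R")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("-DOCSTART-")) {
        // A blank line or document marker closes the current sentence.
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      // Assumption: first column is the token, last column is the label.
      String[] columns = trimmed.split("\\s+");
      current.add(new TaggedToken(columns[0], columns[columns.length - 1]));
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    String sample = "Beuth B-ORG\nUniversity I-ORG\nis O\nin O\nBerlin B-LOC\n\nTeXoo B-MISC\nruns O\n";
    for (List<TaggedToken> sentence : parse(sample)) {
      StringBuilder sb = new StringBuilder();
      for (TaggedToken token : sentence) {
        sb.append(token.text).append('/').append(token.tag).append(' ');
      }
      System.out.println(sb.toString().trim());
    }
  }
}
```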

Java Classes:

Package / Class            Description / Reference
MentionAnnotator           Named Entity Recognition
GenericMentionAnnotator    Pre-trained models for English and German
MatchingAnnotator          Gazetteer that uses Lists to annotate Documents
CoNLLDatasetReader         Reader for CoNLL files
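
The gazetteer idea behind list-based annotation can be sketched in a few lines: look up every term of a list in the text and report the character spans where it occurs as a whole word. The following example is a simplified illustration only. GazetteerSketch and its Span class are hypothetical names, matching is exact and case-sensitive, and the actual MatchingAnnotator may work quite differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative list-based matcher; hypothetical, not the TeXoo MatchingAnnotator. */
final class GazetteerSketch {

  /** Character span [begin, end) of a matched term. */
  static final class Span {
    final int begin;
    final int end;
    final String text;

    Span(int begin, int end, String text) {
      this.begin = begin;
      this.end = end;
      this.text = text;
    }
  }

  /** Finds all whole-word occurrences of the given terms, ordered by position. */
  static List<Span> match(String text, List<String> terms) {
    List<Span> spans = new ArrayList<>();
    for (String term : terms) {
      if (term.isEmpty()) {
        continue;
      }
      int from = 0;
      int begin;
      while ((begin = text.indexOf(term, from)) >= 0) {
        int end = begin + term.length();
        // Accept the hit only if it is not embedded in a longer word.
        if (isBoundary(text, begin - 1) && isBoundary(text, end)) {
          spans.add(new Span(begin, end, term));
        }
        from = begin + 1;
      }
    }
    spans.sort(Comparator.comparingInt((Span s) -> s.begin));
    return spans;
  }

  /** True if the index lies outside the text or points at a non-alphanumeric character. */
  private static boolean isBoundary(String text, int index) {
    return index < 0 || index >= text.length()
        || !Character.isLetterOrDigit(text.charAt(index));
  }

  public static void main(String[] args) {
    String text = "Berlin is home to Beuth University. Berliners like Berlin.";
    for (Span span : match(text, List.of("Berlin", "Beuth University"))) {
      System.out.println(span.begin + "-" + span.end + " " + span.text);
    }
  }
}
```

The word-boundary check keeps "Berlin" from matching inside "Berliners"; overlapping terms are all reported, so a real annotator would still need a policy such as preferring the longest match.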

Cite

If you use this module for research, please cite:

Sebastian Arnold, Felix A. Gers, Torsten Kilias, Alexander Löser: Robust Named Entity Recognition in Idiosyncratic Domains. arXiv:1608.06757 [cs.CL] 2016 https://arxiv.org/abs/1608.06757

texoo-entity-linking – Named Entity Linking (NEL)

This module contains the Annotators for Named Entity Linking (NEL) and is currently under development. No model is included, but you can use the Knowledge Base and Annotators with your own datasets; see https://www.aclweb.org/anthology/C/C16/C16-2024.pdf.

Package / Class         Description / Reference
NamedEntityAnnotator    Named Entity Linking used in TASTY
ArticleIndexFactory     Knowledge Base implemented as local Lucene Index which imports Wikidata entities

If you use this module for research, please cite:

Sebastian Arnold, Robert Dziuba, Alexander Löser: TASTY: Interactive Entity Linking As-You-Type. COLING (Demos) 2016: 111–115

texoo-sector – Topic Classification and Segmentation (SECTOR)

This module contains Annotators for SECTOR models trained on the WikiSection dataset.

Package / Class    Description / Reference
SectorAnnotator    Topic Segmentation and Classification for Long Documents

If you use this module for research, please cite:

Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers and Alexander Löser. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics 2019 Vol. 7, 169-184

Command Line Usage:

  • run bin/run-docker texoo-train-sector
usage: texoo-train-sector -i <arg> -o <arg> [-u]
TeXoo: train SectorAnnotator from WikiSection dataset
 -i,--input <arg>    file name of WikiSection training dataset
 -o,--output <arg>   path to create and store the model
 -u,--ui             enable training UI (http://127.0.0.1:9000)

About TeXoo

Frameworks used in TeXoo

Contributors

Sebastian Arnold – core developer https://prof.beuth-hochschule.de/loeser/people/sebastian-arnold/

Rudolf Schneider https://prof.beuth-hochschule.de/loeser/people/rudolf-schneider/

License

Copyright 2015-2020 Sebastian Arnold, Alexander Löser, Rudolf Schneider

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.