The pipeline is composed of several Docker containers:
- Sentence splitter
- Word tokenizer
- Part-of-speech tagger
- Named-entity chunker
Each container runs a single process: a server that implements the Concrete Thrift service Annotator on port 9090. Annotator supports a method annotate :: Communication -> Communication, and we call a Docker container like this an Analytic. Analytics accept Communications and return them with some annotation added. In principle, any of these analytics could be used in isolation by passing Communications directly to it, but the analytics depend on preexisting annotations: the tagger has to know what the tokens are, the tokenizer has to know what the sentences are, and so forth. Encoding these annotations in Concrete objects by hand is tedious, so we'll create one more Docker container.
This container also implements the Annotator interface, but its only job is to accept minimal Communications (e.g. where only the text field is filled in) and pass them through the other containers in the appropriate order. We'll only expose this top-level container as a service, so from other applications' perspectives it appears as a single analytic that provides four types of annotation.
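The ordering constraint described above can be sketched in plain Python. This is only an illustration under stated assumptions: the Communication class here is a hypothetical stand-in for the Concrete type, the stage functions are toy placeholders rather than the NLTK analytics, and the names make_stage and make_pipeline are invented for this sketch. No Thrift or Docker is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Communication:
    """Hypothetical stand-in for a Concrete Communication (not the real type)."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# annotate :: Communication -> Communication
Annotate = Callable[[Communication], Communication]


def make_stage(provides: str, requires: Set[str],
               compute: Callable[[Communication], object]) -> Annotate:
    """Wrap a toy computation as an analytic that checks its prerequisites."""
    def annotate(comm: Communication) -> Communication:
        missing = requires.difference(comm.annotations)
        if missing:
            raise ValueError(f"{provides} needs annotations: {sorted(missing)}")
        comm.annotations[provides] = compute(comm)
        return comm
    return annotate


# Toy placeholder computations; the real analytics use NLTK.
splitter = make_stage(
    "sentences", set(),
    lambda c: [s.strip() for s in c.text.split(".") if s.strip()])
tokenizer = make_stage(
    "tokens", {"sentences"},
    lambda c: [s.split() for s in c.annotations["sentences"]])
tagger = make_stage(
    "pos", {"tokens"},
    lambda c: [[(t, "UNK") for t in sent] for sent in c.annotations["tokens"]])
chunker = make_stage(
    "entities", {"pos"},
    lambda c: [t for sent in c.annotations["pos"] for (t, _) in sent
               if t[0].isupper()])


def make_pipeline(stages: List[Annotate]) -> Annotate:
    """The top-level analytic: same interface, runs the stages in order."""
    def annotate(comm: Communication) -> Communication:
        for stage in stages:
            comm = stage(comm)
        return comm
    return annotate


pipeline = make_pipeline([splitter, tokenizer, tagger, chunker])
result = pipeline(Communication(text="John went home. He slept."))
```

Because the pipeline has the same signature as each stage, a caller cannot tell it apart from a single analytic. Calling tokenizer directly on a Communication that has only its text filled in raises an error in this sketch, which mirrors why the individual analytics are awkward to use in isolation.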
Finally, there is a trivial web interface with a text box where the user can submit communications to the pipeline; it then displays all named entities the pipeline finds.
Assuming you have Docker, Docker Compose, and Git installed, building and running the pipeline is trivial:
git clone https://gitlab.hltcoe.jhu.edu/lippincott/docker-nltk.git
cd docker-nltk
docker-compose build --force-rm
docker-compose up
Now, you should have the top-level analytic listening on port 9090.
The file scripts/concrete_annotator_client.py implements the Client aspect of the Annotator interface, and can be used to connect to the pipeline:
python scripts/concrete_annotator_client.py -p 9090
This will let you enter text, run it through the pipeline, and print out any named entities that were found:
Write some text > John went to Philadelphia. While there, he met Sue.
GSP Philadelphia
PERSON John
PERSON Sue
Note that this isn't showing the full annotated communication, which now has sentence, token, part-of-speech, and named-entity information (this would be an ugly object), but you could easily modify scripts/concrete_annotator_client.py to see it. Or to run in batch mode over a database of communications. Or...
This is a tremendously simple example: each container is actually the same Docker image, just running different server code. This is because the analytics are all the default solutions from NLTK, and so the dependencies are identical. However, since the analytics are decoupled, it would be easy to swap in a different analytic, e.g. another part-of-speech tagger that uses the same tag inventory.
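To make the swapping point concrete, here is a minimal sketch in which two interchangeable tagger analytics share one signature and one tag inventory. Everything in it is an assumption made for illustration: the dict-based Communication, the three-tag TAGSET, and both taggers are toy stand-ins, not the NLTK tagger or the Concrete types.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a Communication: a plain dict of annotations.
Communication = Dict[str, list]
Annotate = Callable[[Communication], Communication]

# Toy tag inventory shared by every tagger in this sketch.
TAGSET = {"NOUN", "VERB", "OTHER"}


def default_tagger(comm: Communication) -> Communication:
    """Baseline tagger: labels every token OTHER."""
    out = dict(comm)
    out["pos"] = ["OTHER" for _ in comm["tokens"]]
    return out


def suffix_tagger(comm: Communication) -> Communication:
    """Replacement tagger: crude suffix and capitalization rules."""
    out = dict(comm)
    out["pos"] = [
        "VERB" if tok.endswith("ed")
        else "NOUN" if tok[:1].isupper()
        else "OTHER"
        for tok in comm["tokens"]
    ]
    return out


def uses_inventory(tagger: Annotate, sample_tokens: List[str]) -> bool:
    """Check that a tagger emits one tag per token, all from TAGSET."""
    tags = tagger({"tokens": sample_tokens})["pos"]
    return len(tags) == len(sample_tokens) and set(tags) <= TAGSET


def run(stages: List[Annotate], comm: Communication) -> Communication:
    """Apply each stage in order, as the top-level container does."""
    for stage in stages:
        comm = stage(comm)
    return comm


sample = {"tokens": ["John", "walked", "home"]}
before = run([default_tagger], sample)
after = run([suffix_tagger], sample)  # same call, different analytic
```

Downstream stages only rely on the "pos" annotation and its inventory, so in this sketch replacing default_tagger with suffix_tagger requires no other change.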
The code for each analytic server (in the scripts/ directory) has a ton of duplication that could be factored out. However, if your analytics were more diverse, this would be less the case: for example, in this pipeline none of the analytic servers need to be passed any arguments or model paths, but real-life analytics will need more involved initialization, might need special data manipulation, etc. At the end of the day though, the Annotator interface is all the end-user should need to understand.