Custom-built full text geocoding.
This software was donated to the Open Event Data Alliance by Caerus Associates. See Releases for the 2015-2016 production version of Mordecai.
Mordecai accepts text and returns structured geographic information extracted from it. It does this in several ways:
- It uses MITIE to extract placenames from the text. In the default configuration, it uses the out-of-the-box MITIE models, but these can be changed out for custom models when needed.
- It uses word2vec's models, with gensim's awesome Python wrapper, to infer the country focus of an article given the word vectors of the article's placenames.
- It uses a country-filtered search of the geonames gazetteer in Elasticsearch (with some custom logic) to find the lat/lon for each place mentioned in the text.
It runs as a Flask-RESTful service.
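The country-inference idea described above can be illustrated with a toy sketch: average the vectors of the extracted placenames and pick the country whose vector is closest by cosine similarity. This is only one plausible reading of the approach, not Mordecai's actual code. The vectors, the `infer_country` function, and the country table are all invented for illustration; a real deployment would look vectors up in a trained word2vec model.

```python
import math

# Toy 3-dimensional "word vectors", invented purely for illustration.
TOY_VECTORS = {
    "Tikrit": [0.9, 0.1, 0.0],
    "Baghdad": [0.8, 0.2, 0.1],
    "Paris": [0.0, 0.9, 0.2],
}

# Hypothetical country vectors keyed by country code.
TOY_COUNTRIES = {
    "IRQ": [1.0, 0.0, 0.0],
    "FRA": [0.0, 1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def infer_country(placenames, vectors=TOY_VECTORS, countries=TOY_COUNTRIES):
    """Return the country code closest to the mean placename vector, or None."""
    known = [vectors[name] for name in placenames if name in vectors]
    if not known:
        return None  # no placename had a vector, so nothing can be inferred
    mean = [sum(column) / len(known) for column in zip(*known)]
    return max(countries, key=lambda code: cosine(mean, countries[code]))
```

Placenames without a vector are skipped rather than treated as errors, since a named entity recognizer will often return strings that a word vector model has never seen.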
Standing up the service is as simple as installing `docker` and `docker-compose`. If you're using Ubuntu, this gist is a good place to start. Once these components are installed, start the service by running `sudo docker-compose up` from inside the `mordecai` directory. To run the service in the background, use `docker-compose up -d`. This will pull the Elasticsearch docker image with the Geonames gazetteer already stored as an index. It will also build the `mordecai` docker image and link it to the Elasticsearch image. Elasticsearch requires a fair amount of resources, specifically RAM, so expect poor performance on a small machine. Our production deployment keeps the Geonames gazetteer in an Elasticsearch cluster with a few nodes.
Please note that many of the required components for mordecai, such as the word2vec and MITIE models, are rather large, so downloading and loading them takes a while.
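Because the models take a while to load, a client script may want to wait until the service accepts connections before sending requests. The following is a minimal sketch under assumptions: the helper names `port_is_open` and `wait_for_service` are hypothetical, and port 5000 is taken from the request examples in this README. An open port is only a rough signal that the service is ready.

```python
import socket
import time


def port_is_open(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_service(probe=port_is_open, max_wait=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or `max_wait` seconds have passed."""
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_for_service()` with the defaults polls `localhost:5000` every five seconds for up to five minutes. The `probe`, `sleep`, and `clock` parameters exist so the loop can be exercised without a running service.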
The software currently assumes that an Elasticsearch instance is running with the Geonames gazetteer as the index. This should be taken care of if you used `docker-compose`.
- `/country`
  - In: text
  - Out: An ISO country code that best matches the country focus of the text (used as input to later searches). In the future, this will be a list of country codes.
- `/places`
  - In: text, list of country codes
  - Out: A list of dictionaries, one per placename found in the text, each with its lat/lon. The keys are "lat", "lon", "placename", "searchterm", and "countrycode".
- `/osc`
  - In: text
  - Out: placenames and lat/lon, customized for OSC stories
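As a rough illustration of how a client might assemble requests for these endpoints, the sketch below builds the URL and JSON body without sending anything. The `build_request` helper is hypothetical. The `text` and `country` keys and the `http://localhost:5000` address are taken from the request examples in this README; treat anything beyond that as an assumption.

```python
import json

BASE_URL = "http://localhost:5000"  # address used in this README's examples
ENDPOINTS = ("country", "places", "osc")


def build_request(endpoint, text, country=None):
    """Return (url, json_body) for one of the documented endpoints."""
    if endpoint not in ENDPOINTS:
        raise ValueError("unknown endpoint: %r" % endpoint)
    payload = {"text": text}
    if country is not None:
        # Only /places is documented as accepting a country restriction.
        if endpoint != "places":
            raise ValueError("country is only accepted by /places")
        payload["country"] = country
    return BASE_URL + "/" + endpoint, json.dumps(payload)
```

The returned pair can be handed to any HTTP client together with a `Content-Type: application/json` header.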
curl -XPOST -H "Content-Type: application/json" --data '{"text":"(Reuters) - The Iraqi government claimed victory over Islamic State insurgents in Tikrit on Wednesday after a month-long battle for the city supported by Shiite militiamen and U.S.-led air strikes, saying that only small pockets of resistance remained. State television showed Prime Minister Haidar al-Abadi, accompanied by leaders of the army and police, the provincial governor and Shiite paramilitary leaders, parading through Tikrit and raising an Iraqi flag. The militants captured the city, about 140 km (90 miles) north of Baghdad, last June as they swept through most of Iraqs Sunni Muslim territories, swatting aside a demoralized and disorganized army that has now required an uneasy combination of Iranian and American support to get back on its feet."}' 'http://localhost:5000/places'
Or if you know this text is about Iraq:
curl -XPOST -H "Content-Type: application/json" --data '{"text":"(Reuters) - The Iraqi government claimed victory over Islamic State insurgents in Tikrit on Wednesday after a month-long battle for the city supported by Shiite militiamen and U.S.-led air strikes, saying that only small pockets of resistance remained. State television showed Prime Minister Haidar al-Abadi, accompanied by leaders of the army and police, the provincial governor and Shiite paramilitary leaders, parading through Tikrit and raising an Iraqi flag. The militants captured the city, about 140 km (90 miles) north of Baghdad, last June as they swept through most of Iraqs Sunni Muslim territories, swatting aside a demoralized and disorganized army that has now required an uneasy combination of Iranian and American support to get back on its feet.", "country": "IRQ"}' 'http://localhost:5000/places'
Returns:
[{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", "lon": 43.67861, "countrycode": "IRQ"}, {"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", "lon": 43.67861, "countrycode": "IRQ"}, {"lat": 33.32475, "placename": "Baghdad", "seachterm": "Baghdad", "lon": 44.42129, "countrycode": "IRQ"}]
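The sample response lists Tikrit twice because it is mentioned twice in the text. If only distinct places are needed, a client could collapse repeats. The sketch below is a hypothetical post-processing helper (`unique_places` is not part of Mordecai) and uses the sample response copied verbatim from above.

```python
import json

# Sample response copied verbatim from the example above.
SAMPLE = (
    '[{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", '
    '"lon": 43.67861, "countrycode": "IRQ"}, '
    '{"lat": 34.61581, "placename": "Tikrit", "seachterm": "Tikrit", '
    '"lon": 43.67861, "countrycode": "IRQ"}, '
    '{"lat": 33.32475, "placename": "Baghdad", "seachterm": "Baghdad", '
    '"lon": 44.42129, "countrycode": "IRQ"}]'
)


def unique_places(places):
    """Drop repeated places, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for place in places:
        key = (place["placename"], place["lat"], place["lon"])
        if key not in seen:
            seen.add(key)
            unique.append(place)
    return unique


distinct = unique_places(json.loads(SAMPLE))
```

Keying on name plus coordinates keeps two different places that happen to share a name.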
### Python
import json
import requests
headers = {'Content-Type': 'application/json'}
data = {'text': """(Reuters) - The Iraqi government claimed victory over Islamic State insurgents in Tikrit on Wednesday after a month-long battle for the city supported by Shiite militiamen and U.S.-led air strikes, saying that only small pockets of resistance remained. State television showed Prime Minister Haidar al-Abadi, accompanied by leaders of the army and police, the provincial governor and Shiite paramilitary leaders, parading through Tikrit and raising an Iraqi flag. The militants captured the city, about 140 km (90 miles) north of Baghdad, last June as they swept through most of Iraqs Sunni Muslim territories, swatting aside a demoralized and disorganized army that has now required an uneasy combination of Iranian and American support to get back on its feet."""}
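# Serialize the payload, since the service expects a JSON body
data = json.dumps(data)
out = requests.post('http://localhost:5000/places', data=data, headers=headers)
# Parse the JSON response into a list of place dictionaries
places = out.json()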
Mordecai is meant to be easy to customize. There are a few ways to do this.
- Change the MITIE named entity recognition model. This is a matter of changing one line in the configuration file, assuming that the custom-trained MITIE model returns entities tagged as "LOCATION".
- Custom place-picking logic. See the `/osc` endpoint for an example. Prior knowledge about the place a text is about, and about the vocabulary the text uses to describe place names, can be hard-coded into a special endpoint for a particular corpus.
- If a corpus is known to be about a specific country, that country can be passed to `/places` to limit the search to places in that country.
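To make the idea of corpus-specific place-picking concrete, here is a minimal sketch of a filter that could be applied to a list of place dictionaries. It is not the `/osc` implementation. The `pick_places` function, the expected country, and the ignore list are all hypothetical; only the "placename" and "countrycode" keys come from this README.

```python
# Hypothetical prior knowledge about a corpus, hard-coded for illustration only.
EXPECTED_COUNTRY = "IRQ"
IGNORED_PLACENAMES = {"Reuters"}  # terms a recognizer might wrongly tag as places


def pick_places(places, expected_country=EXPECTED_COUNTRY,
                ignored=IGNORED_PLACENAMES):
    """Keep places in the expected country whose name is not on the ignore list."""
    return [
        place for place in places
        if place.get("countrycode") == expected_country
        and place.get("placename") not in ignored
    ]
```

A real custom endpoint would likely encode richer rules, but the shape is the same: fixed knowledge about the corpus narrows the candidates that the generic search returns.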
`mordecai` currently includes a few unit tests. To run the tests:
cd resources
py.test