Geo-referencing of text

GeoExtract

GeoExtract is a web service and Python package for extracting locations (addresses, street names, points of interest) from free-form text.

Background

Extracting locations and addresses from free-form text is a difficult task, since addresses can take many different forms. In an international context, almost any combination of words and numbers may represent an address. Even if the target region is more constrained geographically, a text often contains parts that look like genuine references to locations but do not describe a place that actually exists.

Therefore, some way of validating potential locations is required. How that is achieved depends heavily on the use case and the available data. For example, you might have a database with all valid addresses (e.g. from OpenAddresses), in which case you can perform a very detailed validation. In other cases, you might only have a list of valid street names and will have to validate house numbers heuristically.
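To make the second case concrete, here is a minimal sketch of heuristic validation against a list of street names. It is not part of GeoExtract: the names KNOWN_STREETS, HOUSE_NUMBER, normalize and is_plausible_address, the sample streets, and the rule for what counts as a plausible house number are all assumptions made for illustration.

```python
import re

# Hypothetical sample data; in practice this comes from your own data source.
KNOWN_STREETS = {"kaiserstrasse", "karlstrasse", "marktplatz"}

# Assumed heuristic: 1 to 4 digits without a leading zero, optionally
# followed by a single letter (for example "12a").
HOUSE_NUMBER = re.compile(r"^[1-9]\d{0,3}[a-z]?$")


def normalize(value):
    """Lower-case a string and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def is_plausible_address(street, house_number):
    """Return True if the street is known and the number looks plausible."""
    if normalize(street) not in KNOWN_STREETS:
        return False
    return HOUSE_NUMBER.match(normalize(house_number)) is not None
```

With a full address database you would replace the regular expression with a lookup of the exact street and house number pair.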

GeoExtract helps you find potential locations and filter and organize validated candidates. It is not a turnkey solution but instead provides a framework on which you can build a solution for your use case.

Installation

GeoExtract relies on NumPy and SciPy, which are cumbersome to install from source. We therefore suggest using your system's package manager to install them from pre-built packages. For example, on Ubuntu you would use:

sudo apt-get install python-numpy python-scipy

We also recommend using a virtualenv for installing GeoExtract. Make sure to pass the --system-site-packages option so that the virtualenv picks up the system-wide installations of NumPy and SciPy:

virtualenv -p python2 --system-site-packages my_virtualenv
source my_virtualenv/bin/activate

Installing GeoExtract is then easy using pip:

pip install git+https://github.com/stadt-karlsruhe/geoextract.git

Usage

GeoExtract provides a pipeline that organizes the extraction process: preparing the input text, then extracting, validating and consolidating locations from it. The default implementation of each step can be configured or replaced by your own variant.
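The following sketch illustrates the four steps as plain functions. It does not use GeoExtract's API and is not how GeoExtract implements them: every name in it (KNOWN_STREETS, CANDIDATE, prepare, extract, validate, consolidate, run_pipeline) and the sample data are hypothetical, chosen only to show how the stages feed into each other.

```python
import re

# Hypothetical sample data standing in for a real list of valid streets.
KNOWN_STREETS = {"kaiserstrasse", "karlstrasse"}

# Assumed candidate pattern: a word followed by a house number,
# for example "kaiserstrasse 12" or "karlstrasse 3a".
CANDIDATE = re.compile(r"([a-z]+)\s+(\d+[a-z]?)")


def prepare(text):
    """Normalize the input: lower-case it and collapse whitespace."""
    return " ".join(text.lower().split())


def extract(text):
    """Find everything that looks like 'word number', valid or not."""
    return [(m.group(1), m.group(2)) for m in CANDIDATE.finditer(text)]


def validate(candidates):
    """Keep only candidates whose street name is known."""
    return [c for c in candidates if c[0] in KNOWN_STREETS]


def consolidate(locations):
    """Drop duplicates while keeping the order of first appearance."""
    seen = set()
    result = []
    for location in locations:
        if location not in seen:
            seen.add(location)
            result.append(location)
    return result


def run_pipeline(text):
    """Chain the four steps: prepare, extract, validate, consolidate."""
    return consolidate(validate(extract(prepare(text))))
```

Keeping each step a separate function mirrors the idea that any single step can be swapped out without touching the others.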

See the example directory for a detailed example of using GeoExtract. The script takes a text file and extracts the locations from it. Because of the built-in validation, this only works for locations that the script knows about, so a sample input file is also included:

python example/geoextract_example.py example/sample_input.txt

If no parameter is given, the example script starts a web server that provides location extraction as a web service:

python example/geoextract_example.py

To use the web service, send a POST request to /api/v1/extract. The request must include a parameter named text that contains the UTF-8 encoded text. For example, using the excellent HTTPie client:

http -f post http://localhost:5000/api/v1/extract text@example/sample_input.txt
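If you prefer to call the service from Python, the sketch below builds a comparable request with the standard library only. It assumes Python 3 on the client side and mirrors the HTTPie command, where text@file uploads the file as a multipart form field named text. The function names build_extract_request and extract_locations and the placeholder filename input.txt are made up for this example, and since the response format is not described here, the body is returned as raw bytes.

```python
import uuid
from urllib.request import Request, urlopen


def build_extract_request(text, base_url="http://localhost:5000"):
    """Build a multipart POST request with a single field named 'text'."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        ("--" + boundary).encode("ascii"),
        # "input.txt" is an arbitrary placeholder filename.
        b'Content-Disposition: form-data; name="text"; filename="input.txt"',
        b"Content-Type: text/plain; charset=utf-8",
        b"",
        text.encode("utf-8"),
        ("--" + boundary + "--").encode("ascii"),
        b"",
    ])
    return Request(
        base_url + "/api/v1/extract",
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )


def extract_locations(text, base_url="http://localhost:5000"):
    """Send the text to a running service and return the raw response body."""
    with urlopen(build_extract_request(text, base_url)) as response:
        return response.read()
```

Calling extract_locations requires the example web server to be running on the given host and port.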

Deployment

To deploy GeoExtract as a web service, construct an instance of geoextract.Pipeline (see the example in the example directory) and turn it into a Flask app via the create_app method. You can then deploy that app using the usual approaches for deploying Flask applications.

Development

First clone the repository:

git clone https://github.com/stadt-karlsruhe/geoextract.git
cd geoextract

Make sure you have NumPy and SciPy installed. For example, on Ubuntu:

sudo apt-get install python-numpy python-scipy

Create a virtualenv:

virtualenv -p python2 --system-site-packages venv
source venv/bin/activate

Install GeoExtract in development mode:

python setup.py develop

History

See CHANGELOG.md.

License

Copyright (c) 2016-2017, Stadt Karlsruhe (www.karlsruhe.de)

Distributed under the MIT license, see the file LICENSE for details.