Fast Entitiy Extraction Tool
Complete documentation coming soon!
Feet is a tool for extracting entities from a text according to dictionaries. A dictionary gathers terms that are representative of an entity type, e.b. city, country, people or company.
Feet takes benefits of the distributed in-memory database system Redis. So It's fast and scalable.
The underlying natural language processing is provided by Python NLTK package. Japanese has been implemented thanks to the Mecab POS tagger.
On-going work for next version:
- better project packaging and documentation
- more testing
- implement various technics for disambiguition of entities by semantics analysis including the use of thesauri and machine learning.
- compliant to JSON API interface
- definition of relationships between entities according to grammatical patterns including but not only the definition of intents, facts etc.
- additional languages support: Chinese and Korean
As a bonus, there is a function to extract dates from text in Japanese. Next version may feature a multi-lingual dates extraction function.
Last but not the least this is WIP so rely on it at your own risk. We should get a reliable version within a few weeks, along with new exciting features.
feet supports Python versions 2.7+ and pypy.
Install Python by downloading an installer appropriate for your system from https://www.python.org/downloads/ and running it.
If you're using a recent version of Python, pip is most likely installed by default. However, you may need to upgrade pip to the lasted version:
$ pip install --upgrade pip
If you need to install [pip] for the first time, download [get-pip.py]. Then run the following command to install it:
$ python get-pip.py
Check your locals
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Setup Redis
Use of Redis is mandatory and v3+ is recommended.
Setup Mecab for Japanese
Setup MeCaB
You can either use ipadic-utf8 dictionary or neologd dictionary:
apt-get install mecab-ipadic-utf8
- Setup neologd dictionary
Setup Feet
Download latest version of Feet repository.
$ git clone https://github.com/Altarika/feet.git
$ cd feet
$ python setup.py install
or from pip (TODO):
$ pip install feet
For development
Create a virtualenv and install dependencies
$ virtualenv venv
$ source venv/bin/activate
$ echo $(pwd) > project_venv/lib/python2.7/site-packages/feet.pth
Install Python dependencies
$ pip install -r requirements.txt
Check the changelog:
$ git log --oneline --decorate --color
Setup NLTK
Depending on the languages you want to use you have to download the necessary data for NLTK to support them. Please see the documentation of NLTK. We are using NLTK version 3.0.
$ python -m nltk.downloader averaged_perceptron_tagger maxent_ne_chunker maxent_treebank_pos_tagger punkt words
Configuration
Setup your environment parameters in feet.yaml after copying config/feet_example.yaml
You can locate the configuration file at: /etc/feet.yml ~/.feet.yaml conf/feet.yaml
Content of configuration file:
# Basic Flags
debug: true
# Logging Information
logfile: 'feet.log'
loglevel: 'DEBUG'
# Timeout processing
timeout: 180
# Redis database Information
database:
host: localhost
port: 6379
prefix: feet
# API Server
server:
host: 127.0.0.1
port: 8000
# Japanese POS Tagger MeCab
mecab:
mecabdict= 'usr/local/lib/mecab/dic/mecab-ipadic-neologd'
You can also use a .env file:
MECAB_DICT=/usr/local/lib/mecab/dic/mecab-ipadic-neologd
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DICT_DB=0
REDIS_PREFIX=feet
DEBUG=True
SERVER_HOST=localhost
SERVER_PORT=8888
LOG_FILE=feet.log
LOG_LEVEL=INFO
TIMEOUT=180
If both a YAML and a .env file are available, YAML file will override common environment variables.
Testing
Run the tests to make sure everything is OK.
$ pip install ./requirements/test.txt
$ nosetests -v --with-coverage --cover-package=feet --cover-inclusive --cover-erase tests ./feet/www/tests/
or with Tornado testing tools:
$ python -m tornado.testing discover
or with setuptools
$ python setup.py test
You can generate a documentation with mkdocs tool.
Install the mkdocs package using pip:
$ pip install mkdocs
Install Yeti theme from mkdocs-bootswatch:
$ pip install mkdocs-bootswatch
You should now have the mkdocs command installed on your system. Run mkdocs --version to check that everything worked okay.
$ mkdocs --version
mkdocs, version 0.15.2
Launch the documentation server:
$ mkdocs serve
Then open your browser on http://127.0.0.1:8000/
Follow the Quick Start instructions. Make sure a redis-server is running. Make and update your favorite configuration file (.env or yml).
From feet directory you can use $ ./bin/feet
command or you can directly
call $ feet
if you have been through complete installation via setuptools.
Load entity file with CLI:
$ feet load --registry=my_registry --entity=country --csv=./tests/test_data/countries_en.csv
Drop list of terms for company with CLI:
$ feet drop --registry=my_registry --entity=country
Extract entities from a text:
$ feet extract --registry=my_registry --entity=country --grammar="NE : {<NNP|NNPS|NN>*<DT>?<NNP|NNPS|JJ|NNS|NN>+}" --path=./tests/test_data/english_text_long.txt
or
$ feet extract --registry=my_registry --entity=country --grammar="NE : {<NNP|NNPS|NN>*<DT>?<NNP|NNPS|JJ|NNS|NN>+}" --text="I want to buy flight tickets for Japan"
Follow the Quick Start instructions. Make sure a redis-server is running. Make and update your favorite configuration file (.env or yml).
From feet directory you can use $ ./bin/feet
command or you can directly
call $ feet
if you have been through complete installation via setuptools.
Feet run !
Launch the server:
$ feet run --host=127.0.0.1 --port=8888
Feet proposes a Restful API. It's easy to guess where you are from the url. In order to manage your entities you can group them into different databases, prefixes and registries. The database identifier follows Redis rule. A prefix and a registry can be any alphanumeric urlencoded characters strings.
Get list of registries
curl -H "Accept: application/json" -X GET http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/
--> 200 {"registries":["registry1","registry2"]}
Add a registry
curl -H "Content-Type: application/json" -X POST http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/
--> 200 OK
Delete a registry
curl -H "Content-Type: application/json" -X DELETE http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/
--> 200 OK
Get list of registered entities
curl -H "Accept: application/json" -X GET http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/
Example:
http://localhost:8888/database/0/prefix/geographic/registry/my_registry/entities/
--> 200 {"entities":["city","country"]}
Add an entity
curl -H "Content-Type: application/json" -X POST http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/
--> 200 OK
Delete an entity
curl -H "Content-Type: application/json" -X DELETE http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/
--> 200 OK
Getting an entity provides the list of languages supported by the entity
curl -H "Accept: application/json" -X GET http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/
--> 200 {"languages":["english"]}
Add a language for an entity
curl -H "Content-Type: application/json" -X POST http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/
--> 200 OK
The list of language identifiers that we use comes from NLTK. You must follow this list if you want feet to branch correctly into NLTK.
Delete a language for an entity
curl -H "Content-Type: application/json" -X DELETE http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/
--> 200 OK
Geting a language provides the number of terms for this language
curl -H "Accept: application/json" -X GET http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/
--> 200 {"count":249}
This allows to get the total number of terms for a language.
Change the name of language
curl -H "Content-Type: application/json" -X PUT -d '{"new_name": "english"}'
http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/
--> 200 OK
Create/add list of terms
curl -H "Content-Type: application/json" -X POST/PUT -d '{"entities": ["Paris","Tokyo"]}' http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/terms/
--> 200 OK
Get list of terms for an entity and a language
curl -H "Accept: application/json" -X GET -d '{"page":0,"count":10}' http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/terms/?page=0&count=10
--> 200 {"terms":["Paris","Tokyo"]}
Delete all terms for an entity and a language
curl -H "Content-Type: application/json" -X DELETE -d '{"entities": ["Paris"]}' http://localhost:8888/database/<database>/prefix/<prefix_name>/entities/<entity_name>/languages/<lang_id>/terms/
--> 200 OK
Get a term
curl -H "Accept: application/json" -X GET http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/terms/Tokyo/
--> 200 Tokyo
Add a term
curl "Content-Type: application/json" -X POST http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/terms/Tokyo/
--> 200 OK
Delete a term
curl "Content-Type: application/json" -X DELETE
http://localhost:8888/database/<database>/prefix/<prefix_name>/registries/<registry>/entities/<entity_name>/lang/<lang_id>/terms/Tokyo/
--> 200 OK
Extract entities from a text
curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"test":"I am looking for tickets for Tokyo in Japan", "grammar":"NE : {<NNP|NNPS|NN>*<DT>?<NNP|NNPS|JJ|NNS|NN>+}"}'
http://localhost:8888/database/0/prefix/geographic/entities/city/languages/english/extract/
--> 200 {"result":["Tokyo"]}
TODO: Search for entities
curl -H "Accept: application/json" -H "Content-Type: application/json" -X GET
http://localhost:8888/database/0/prefix/geographic/entities/city/languages/english/search?q=Tok&page=0&count=10
--> 200 {"result":["Tokyo"]}
Follow the quick start installation.
Using Feet relies on three classes:
- Registry
- Dictionary
- Extractor
(Don't forget to run a Redis server)
Before talking about the classes it's important to explain the concept of key prefix. As we are using a Redis database to store all data they can be organized by a key prefix. It means that all keys used to store information about registries, dictionaries etc. will be prefixed. It provides you with a first level of organisation for your data.
+------------------------------------+
| +-------------------------------+ |
| | +--------------------------+ | |
| | | | | |
| | | Term | | |
| | | | | |
| | | Entity/Dictionary | | |
| | +--------------------------+ | |
| |Registry | |
| +-------------------------------+ |
|Key prefix |
+------------------------------------+
Each registry manages a list of dictionaries. It's a second level of organisation after key prefixes.
Get the list of registries for a specific key prefix:
from feet.entities.registry import Registry
registries = Registry.list(key_prefix='my_prefix')
Add a registry:
from feet.entities.registry import Registry
registry = Registry.find_or_create('my_registry',
key_prefix='my_prefix',
redis_host='127.0.0.1',
redis_port=6379,
redis_db=0)
Get the list of dictionaries in a registry:
registry.dictionaries()
Add/get a dictionary in a registry:
# dictionary is ot type Dictionary by default
dictionary = registry.get_dict('entity_name')
Delete a dictionary:
registry.del_dict('entity_name')
Delete a registry:
# This will automatically delete all related dictionaries
registry.delete()
from __future__ import print_function
from feet.entities.registry import Registry
# Get an entity dictionary
registry = Registry.find_or_create('my_registry',
key_prefix='test_feet',
redis_host='127.0.0.1',
redis_port=6379,
redis_db=0)
entity = registry.get_dict('events')
# Load a file that contains a list of entities
count = entity.load_file('./test_data/events_ja.txt')
print('%d entities imported' % count)
where 'events_ja.txt' file contains a list of marketing events with each event name on a different line.
load_file method returns the number of terms that have been succesfully imported.
If you want to deal with different file formats you can inherit from Dictionary and override get_entities method. It could be as simple as:
from __future__ import print_function
from feet.entities.registry import Registry
from feet.entities.dictionary import Dictionary
class MyDictionary(Dictionary):
def get_entities(self, entities_file):
handle = open(entities_file, "r")
for entity in handle.readlines():
yield entity
# Get an entity dictionary
registry = Registry.find_or_create('my_registry',
dict_class=MyDictionary,
key_prefix='test_feet',
redis_host='127.0.0.1',
redis_port=6379,
redis_db=0)
entity = registry.get_dict('my_entity')
# Load a file that contains a list of entities
count = entity.load_file('my_formatted_file.my')
print('%d entities imported' % count)
This will allow you to build your own importers for different formats e.g. JSON, CSV, XML, RDF etc.
A CSV import tool is available as CSVDictionary in feet.entities.dictionary:
from __future__ import print_function
from feet.entities.dictionary import CSVDictionary
registry = Registry.find_or_create('my_registry',
dict_class=CSVDictionary)
cities = registry.get_dict('cities')
count = cities.load_file('./test_data/world-cities.csv')
from __future__ import print_function
from feet.entities.extractor import Extractor
from feet.entities.dictionary import CSVDictionary
registry = Registry.find_or_create('my_registry',
dict_class=CSVDictionary)
cities = registry.get_dict('cities')
count = cities.load_file('./test_data/world-cities.csv')
# Get an entity extractor engine
# Specify a grammar to extract chunks of candidates for 'NE' Named Entities
extractor = Extractor(
ref_dictionary=cities,
grammar='NE : {<NNP|NNPS|NN>*<DT>?<NNP|NNPS|JJ|NNS|NN>+}')
# Ask for entities from a text (it must be UTF8 encoded)
result = extractor.extract(utf8_encoded_text)
# Display results
entities = []
for element in results[0]:
if element['entity_found'] == 1:
entities = list(set(entities).union(element['entity_candidates']))
print(entities)
from feet.entities.jptools import JAParser
dates = JAParser().extract_dates(utf8_encoded_text)
Make sure the REDIS_HOST in your configuration file (see Quick start configuration section) is set to the name of redis service defined in the docker_compose.yml file. For example:
REDIS_HOST=redis
Double check that SERVER_PORT matches the open port of your container as defined in docker-compose.yml file. For example:
SERVER_PORT=5000
Start a container for development (version of docker-compose > v1.9.0):
$ docker-compose -f docker-compose.yml -f docker-compose.dev.yml up
The Dockerfile describes an image that includes Neologd dictionary. Installation of Neologd takes time and space, however it will provide the best quality in terms of entity extraction. You can also rely only on macacb-ipadic-utf8 by commenting this line in Dockerfile:
RUN git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git && cd mecab-ipadic-neologd && ./bin/install-mecab-ipadic-neologd -y
Make sure that either .env and/or feet.yaml configuration files specify the right path to access the dictionary:
In feet.yaml:
# Japanese POS Tagger MeCab
mecab:
mecabdict= '/usr/lib/mecab/dic/mecab-ipadic-neologd'
In .env:
MECAB_DICT=/usr/local/lib/mecab/dic/mecab-ipadic-neologd
To get help with feet, please contact Romary on romary.dupuis@altarika.com