A library for finding locations in Norwegian texts. During development, the focus has been on processing online news articles.
clj-egsiona currently only supports the Linux platform, because of the platforms supported by The-Oslo-Bergen-Tagger. You can get around this by using clj-obt-service, which exposes the functionality of clj-obt as a web service. This makes it possible to use clj-egsiona in a Windows setting, with the service running either on an external server or locally in a virtual machine running Linux.
During development, this project was evaluated with a corpus of 113 articles restricted mainly to the region of Hordaland. The metrics achieved were a recall of 93.6%, a precision of 69.1% and an f-measure of 76.8%. It has not been tested thoroughly on other data sets, so your mileage may vary.
clj-egsiona is on Clojars:
[clj-egsiona "0.1.4"]
At the very least you need to configure OBT, but it's recommended to use a database for caching.
To use a local OBT installation, call set-obt from clj-egsiona.core with the installation path:
(set-obt "/home/ogrim/bin/The-Oslo-Bergen-Tagger")
To use the web service as hosted by clj-obt-service, pass its address instead:
(set-obt "10.0.0.2:8085")
A database is used to persist tagged texts. Use set-db in clj-egsiona.core to configure the database settings. Nothing database-specific is done, so you can use SQLite, PostgreSQL, MySQL, etc. A PostgreSQL configuration looks like this:
(set-db {:classname "org.postgresql.Driver"
         :subprotocol "postgresql"
         :subname "//localhost:5432/database-name"
         :user "username"
         :password "password"})
Or if you are using SQLite, it will look like this:
(set-db {:classname "org.sqlite.JDBC"
         :subprotocol "sqlite"
         :subname "database.db"})
The first time you use a database, call create-tables to create the required table:
(create-tables)
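Putting the configuration steps together, a minimal setup sketch might look like the following. It assumes the three functions can be referred from clj-egsiona.core as named above and that a suitable JDBC driver is available on the classpath. The namespace example.setup is hypothetical, and the path and file name are the placeholder values from the examples above.

```clojure
(ns example.setup ; hypothetical namespace, for illustration only
  (:use [clj-egsiona.core :only [set-obt set-db create-tables]]))

;; Point clj-egsiona at a local OBT installation
;; (or pass "host:port" to use clj-obt-service instead).
(set-obt "/home/ogrim/bin/The-Oslo-Bergen-Tagger")

;; Persist tagged texts in a local SQLite file.
(set-db {:classname "org.sqlite.JDBC"
         :subprotocol "sqlite"
         :subname "database.db"})

;; Only needed the first time a given database is used.
(create-tables)
```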
Call process-text to get the result as a plain list of location names:
(process-text "Vennligst finn ut om Stavanger eller Sandnes er lokasjoner. De er begge byer i Rogaland.")
=> ("sandnes" "rogaland" "stavanger")
If you want finer granularity, use process-locations to get more data:
(process-locations "Vennligst finn ut om Stavanger eller Sandnes er lokasjoner. De er begge byer i Rogaland.")
=> {:address ("Sandnes"), :counties ("rogaland"), :countries (), :regions (), :eu-route (),
    :grammar ({:tags ["subst" "prop" "<*>"], :lemma "Stavanger", :word "Stavanger", :i 5}
              {:tags ["subst" "prop" "<*land>" "<*>"], :lemma "Rogaland", :word "Rogaland", :i 16})}
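The returned map can be picked apart with ordinary Clojure functions. The sketch below does not call the library. It copies the result shown above into a literal and defines three hypothetical helpers (non-empty-categories, grammar-lemmas and all-names, none of which are part of clj-egsiona) to illustrate one way of working with that shape. It assumes the keys and value types are always as in this single example.

```clojure
(require '[clojure.string :as str])

;; Result map copied from the process-locations example above.
(def result
  {:address '("Sandnes"), :counties '("rogaland"), :countries '(),
   :regions '(), :eu-route '(),
   :grammar '({:tags ["subst" "prop" "<*>"], :lemma "Stavanger", :word "Stavanger", :i 5}
              {:tags ["subst" "prop" "<*land>" "<*>"], :lemma "Rogaland", :word "Rogaland", :i 16})})

;; Hypothetical helper: the category keys (everything except :grammar)
;; that contain at least one hit.
(defn non-empty-categories [result]
  (for [[k v] (dissoc result :grammar)
        :when (seq v)]
    k))

;; Hypothetical helper: the lemmas of the entries under :grammar.
(defn grammar-lemmas [result]
  (map :lemma (:grammar result)))

;; Hypothetical helper: every name in the map, lower-cased and de-duplicated.
(defn all-names [result]
  (set (map str/lower-case
            (concat (mapcat result [:address :counties :countries :regions :eu-route])
                    (map :word (:grammar result))))))
```

For the example map, non-empty-categories yields :address and :counties, grammar-lemmas yields "Stavanger" and "Rogaland", and all-names yields a set containing "sandnes", "rogaland" and "stavanger".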
Copyright (C) 2011-2012 Aleksander Skjæveland Larsen
Distributed under the Eclipse Public License, the same as Clojure.