Finnish Geoparser (Finger): creation of test corpora and evaluation
The Basics
This repo contains files related to the creation of geoparsing corpora (sing. corpus) and the subsequent testing of the Finnish geoparser, Finger (see its repository here). A corpus is a collection of texts, often with additional information marked on the texts, called annotations. Geoparsing means recognizing place names in text and geocoding them, so these corpora are annotated with toponyms (place names) and their locations as coordinate pairs.
The corpora
This repo has two Finnish language geoparsing corpora, which I share as openly as possible:
- Finger-news: 42 Wikinews articles from the year 2011.
- Finger-tweets: 980 random tweets collected in August 2021 from the Twitter search API.
The exact collection and annotation processes are described in my upcoming master's thesis, which I will link here once it's done and publicly available.
The corpora are in the folder input_data. They are shared as XML files in the format defined by Hu and Wang in the EUPEG project (fingernews_corpus.xml | fingertweets.corpus.xml). Alternatively, use the CSV files generated by Finger, which can be read back in as Pandas dataframes (fingernews_gold_df.csv | fingertweets_gold_df.csv).
Please note! Twitter's terms of service don't allow the public sharing of tweet texts. The public version of Finger-tweets therefore contains tweet IDs in place of the actual texts. These IDs can be hydrated to retrieve the tweets, provided they haven't been deleted. See tweet_ids.txt for an ordered list of the IDs of the tweets used in this work.
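Hydration tools and lookup endpoints usually accept IDs in batches, so a first step is to read the ID list and split it into chunks. The sketch below is a minimal example, not part of this repo's notebooks. It assumes tweet_ids.txt holds one ID per line, and the function names and the default batch size of 100 are illustrative assumptions; check the limit of whichever hydration tool or endpoint you use.

```python
from pathlib import Path
from typing import Iterator, List


def read_tweet_ids(path: str) -> List[str]:
    """Return tweet IDs in file order, skipping blank lines.

    Assumes one ID per line. IDs are kept as strings so they are
    passed on to the hydration step verbatim.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def batched(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Because both functions preserve the order of the file, the hydrated tweets can be matched back to the corpus entries by position, and any IDs missing from the response mark tweets that are no longer available.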
The code
The repo contains three Jupyter notebooks, which walk through the process from collection to evaluation.
- Acquiring_tweets_and_news.ipynb: Scraping the texts from Twitter's API and Wikiuutiset URLs.
Manual annotation happened at this point using Label Studio. This step involved no code.
- Formatting gold corpora.ipynb: Formatting the annotations and input texts into the formats described previously.
- Evaluating Finger.ipynb: Processing the corpora with Finger and evaluating its performance in comparison to the gold standard annotations. Outputs of the evaluation are stored in ./eval_outputs.
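This README does not list the metrics the evaluation notebook computes. As a hedged illustration of what comparing geoparser output to gold annotations can involve, the sketch below shows two measures commonly used in geoparsing evaluation: precision, recall and F1 over recognized toponym spans, and the great-circle (haversine) distance between a predicted and a gold coordinate pair. The function names and the span representation as (start, end) character offsets are assumptions made for this example, not Finger's API.

```python
import math
from typing import Set, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets of a toponym


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score exact span matches between gold and predicted toponyms."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in decimal degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))
```

Exact span matching is the strictest option; a real evaluation may also count partial overlaps, and distance errors are often summarized as a mean, a median, or the share of predictions within a fixed radius of the gold location.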
Licensing
The Finnish Wikinews texts are shared under CC BY 2.5.