/Annif_data_exp

Automatic subject assignment for KB ebooks using Annif.

Primary LanguageJupyter Notebook

Experiments using Annif to suggest suitable Brinkman subjects

Introduction

Annif is an opensource tool for automated subject indexing and classification developed at the National Library of Finland. Annif uses a user given controlled vocabulary (e.g. thesaurus) and pre-labeled data to train models that can than be used to assign subjects to a new input text. This repository contains files created while researching the use of Annif with ebook data from the KB using the Brinkman thesaurus as controlled vocabulary.

Results

  • annif_uitkomsten.xlsx contains all Annif evaluation outcomes of experiments using different backends and settings. Each tab contains outcomes using a different subset of the original dataset (datasets not on github):

    • subset 1 ggc1: subset with summaries from all ebooks in original dataset.
    • subset 2 ggc2: subset with summaries, titles and subtitles from all ebooks in original dataset.
    • subset 3 ggc_zaaktrefwoorden: subset with summaries, titles and subtitles from ebooks which got assigned one or more Brinkman subjects refering to content.
    • subset 4 ggc_vormtrefwoorden: subset with summaries, titles and subtitles from ebooks which got assigned one or more Brinkman subjects refering to form.
  • Annif aantekeningen folder contains documentation as tex/pdf.

Generate document corpus for use in Annif

  • generate_dataset_annif.ipynb is a Jupyter Notebook file to generate a document corpus usable by Annif from raw GGC data.

Initial analysis of Brinkman subjects in GGC-dataset

  • Initial_analysis_ggc_dataset.ipynb is a Jupyter Notebook file to get a feel for the Brinkman subjects available in the source data.

Preliminary investigation on similarities between Thema and Brinkman thesaurus

Files located in thema_thes folder.

  • Generate_thema_tsv.py is a Python file to convert Thema thesaurus XML file into a TSV file to be used as Annif vocab.
  • compare_Brinkman_and_Thema.ipynb contains some exploratory investigation on similarities between Brinkman and Thema thesaurus.
  • brinkman_thema_overlap.tsv contains Brinkman subjects which are also found in Thema.