/qSKOS

Vocabulary quality assessment tools

Primary LanguageJavaGNU General Public License v3.0GPL-3.0

qSKOS

qSKOS is a tool for finding quality issues in SKOS vocabularies. It can be used as command line tool or API. For release information please see the changelog.

qSKOS is also the basis for the PoolParty SKOS Quality Checker, an online service that lets you check your vocabularies without needing to locally install qSKOS.

Currently we work on integrating qSKOS into the PoolParty Thesaurus Server, a commercially available Web-based development tool for SKOS vocabularies. In it’s final stage we combine qSKOS’ scientifically-grounded quality checks with the easy-to-use graphical interface of the PoolParty software. Integrated in the development workflow, quality checks can work as an assistive technology, just like syntax-checkers in development environments or spell-checkers in word processing applications.

Parts of qSKOS’ functionality have also been integrated into the rsine (Resource Subscription and Notification Service) project. It enables you to monitor changes to SKOS vocabularies as soon as they occur and send out immediate notifications (e.g., by email) if some potential quality problems are detected. Rsine is also published as open source and hosted on GitHub (here).

Using the Online Service

To use the service, you need to:

  • Navigate to the PoolParty SKOS Quality Checker in your browser

  • Login with your existing accounts at the networks of Google, LinkedIn, Xing or Twitter

  • If you want to perform a quality check on a SKOS vocabulary, you first need to provide a name for the vocabulary in the text box on the top of the page

  • Upload the SKOS vocabulary into the newly created vocabulary section and the checking process starts

If the check takes longer, you can also leave the page and come back later - the checks will continue running and, once finished, the report is available for download on the vocabulary overview page which is presented directly after you log in.

Please note

The intention behind requiring the user to give a name for uploaded vocabularies is to “group” different versions of the same vocabulary. For example, if you develop a thesaurus MyThesaurus_Version1.rdf, you will give it the name MyThesaurus on the online service and upload it. As development of the vocabulary progresses, you might want to upload subsequent versions of the vocabulary, e.g. MyThesaurus_Ver2.rdf. You are advised to upload these improved vocabularies also under the name MyThesaurus. In future versions of the service we may be able to exploit this information by sending you a report on how your vocabulary improved over time.

Local Installation

Use the latest released version

You can find the released versions on the qSKOS GitHub releases page and download the latest .jar files. Alternatively you can

Build from Source

Requirements:

  • Verify that Java v.1.7 or greater is installed: javac -version

  • Make sure Maven v.3.0 or greater is installed: mvn -v

  • Make sure you have the current version of the git version control system installed on your system

1) Get the source code

You have two options:

  • Clone the project (git clone https://github.com/cmader/qSKOS.git) to your system using git.

  • Download the latest release of the project and extract it to a properly named directory, e.g., qSKOS.

2) Build the tool

  • Change into the newly created qSKOS directory and build the application: mvn clean package. Wait for tests and build to complete.

  • Two jar files are now located in the qSKOS/target directory:

    • qSKOS-cmd.jar: The executable jar file that can directly be used for vocabulary evaluation

    • qSKOS-[version].jar: The library to integrate qSKOS’ functionality into other applications

Using the qSKOS command line tool

General Usage

  • Change to the qSKOS/target directory

  • Run the tool using java -jar qSKOS-cmd.jar

  • A synopsis on the application’s parameters is displayed.

Examples

The following examples demonstrate typical qSKOS use cases. For demonstration purposes we use the IPSV vocabulary available from the qSKOS-data repository: Download IPSV vocabulary. In the examples below we assume that the vocabulary file is placed in the same directory than the qSKOS-cmd.jar file.

1) Retrieving basic vocabulary statistics

Basic statistical properties (e.g., number of concepts, semantic relations or concept schemes) can be retrieved and stored into the file report.txt by issuing the command:

java -jar qSKOS-cmd.jar summarize ipsv_skos.rdf -o report.txt

2) Finding quality issues

To perform an analysis for quality issues, use the following command:

java -jar qSKOS-cmd.jar analyze -dc mil,bl ipsv_skos.rdf -o report.txt

Please keep in mind that a full analysis can take quite some time, depending on the vocabulary size and structure. Especially link checking sometimes can take hours, so it is often useful to analyze only a subset of all issues. In the following examples you’ll learn how this can be done. In the example above, checking for missing inlinks and broken links has been disabled to speed up the checking process (using the parameter -dc mil,bl).

3) Output a list of supported statistical properties and quality issues

By starting the evaluation using either the summarize or analyze command and omitting the vocabulary filename, you get an overview about the supported statistical properties and quality issues:

java -jar qSKOS-cmd.jar summarize

or

java -jar qSKOS-cmd.jar analyze

Here’s an excerpt from the output:

ID: chr
Name: Cyclic Hierarchical Relations
Description: Finds all hierarchy cycle containing components

Every property/issue is identified by an ID string, has a name and a description. For more detailed information on the quality issues see the qSKOS wiki page.

4) Testing for a specific issue or a subset of issues

Specific issues can be tested by passing the -c parameter followed by one or more (comma-separated) issue IDs (see example above). Keep in mind, that the -c parameter has to be placed between between the analyze command and the vocabulary file like this:

java -jar qSKOS-cmd.jar analyze -c ol,oc ipsv_skos.rdf -o report.txt

The command above triggers analysis of the “Overlapping Labels” and “Orphan Concepts” issues. In a very similar way it is possible to explicitly exclude issues from testing. For example, the command

java -jar qSKOS-cmd.jar analyze -dc mil ipsv_skos.rdf -o report.txt

checks for all issues except “Missing In-Links”.

FAQ

What are “Authoritative Concepts”?

Every concept in a SKOS vocabulary is a resource and should be identified by an URI to be referenced from other vocabularies on the Web. However, when using qSKOS, for some issues it is required to distinguish between concepts that are originally specified (authoritative) in the vocabulary that’s about to be analyzed, and concepts (implictly or explicitly) defined in other vocabularies somewhere on the Web.

qSKOS is to some extent able to perform this distinction by examining the host part of the concept’s URIs. Depending on the vocabulary’s structure in some cases it might be needed to pass an “Authoritative resource identifier” (command line argument -a) to qSKOS. This is a substring of an URI that identifies a concept as authoritative.

What version of qSKOS do I use?

Simply pass the command line switch -v like this:

java -jar qSKOS-cmd.jar -v

The version will be printed in the first line of the output, directly before the usage information.

Using the qSKOS API

The QSkos class serves as facade for calculating all criteria. For each criterion it provides a corresponding public method. Please read the Javadoc for further infos. Here is an example:

// instantiation
qskos = new QSkos(new File("stw.rdf"));
qskos.setAuthResourceIdentifier("zbw.eu");

// the fun part
Issue orphanConcepts = qSkos.getIssues("oc").iterator().next();
long numberOfOrphans = issue.getResult().occurrenceCount());

Publications

A subset of the quality issues qSKOS supports (including an analysis of several existing vocabularies) have been published in our paper Finding Quality Issues in SKOS Vocabularies.

@inproceedings{cs3444,
     booktitle = {TPDL 2012 Therory and Practice of Digital Libraries},
         month = {May},
         title = {Finding Quality Issues in SKOS Vocabularies},
        author = {Christian Mader and Bernhard Haslhofer and Antoine Isaac},
       address = {Germany},
          year = {2012},
           url = {http://arxiv.org/pdf/1206.1339v1},
      abstract = {The Simple Knowledge Organization System (SKOS) is a standard model for controlled vocabularies on the Web. However, SKOS   vocabularies often differ in terms of quality, which reduces their applicability across system boundaries. Here we investigate how we can support taxonomists in improving SKOS vocabularies by pointing out quality issues that go beyond the integrity constraints defined in the SKOS specification. We identified potential quality issues and formalized them into computable quality checking functions that can find affected resources in a given SKOS vocabulary. We implemented these functions in the qSKOS quality assessment tool, analyzed 15 existing vocabularies, and found possible quality issues in all of them.}
}

We performed a survey among experts in the field of vocabulary development in order to get feedback about the quality issues checked by qSKOS. The paper Perception and Relevance of Quality Issues in Web Vocabularies reports on a subset of these issues. It furthermore sets them into relation to common usage scenarios for controlled vocabularies on the Web.

@inproceedings{cs3720,
     booktitle = {I-SEMANTICS 2013},
         title = {Perception and Relevance of Quality Issues in Web Vocabularies},
        author = {Christian Mader and Bernhard Haslhofer},
       address = {Graz, AUT},
          year = {2013},
           url = {http://eprints.cs.univie.ac.at/3720/},
      abstract = {Web vocabularies provide organization and orientation in information environments and can facilitate resource discovery and retrieval. Several tools have been developed that support quality assessment for the increasing amount of vocabularies expressed in SKOS and published as Linked Data. However, these tools do not yet take into account the users' perception of vocabulary quality. In this paper, we report the findings from an online survey conducted among experts in the field of vocabulary development to study the perception and relevance of vocabulary quality issues in the context of real-world application scenarios. Our results indicate that structural and labeling issues are the most relevant ones. We also derived design recommendations for vocabulary quality checking tools.}
}

Presentation slides are available for the I-SEMANTICS 2013 talk as well as for my ISKO 2013 presentation.

In order to get an impression about the impact of using qSKOS for improving vocabularies, we set up a case study with students in a teaching context. They were given the task of creating a thesaurus, checking it’s quality with qSKOS and submit a revised version of the vocabulary after being confronted with the quality report. We published details of this case study at the HSWI2014 workshop. The paper can be retrieved online.

@inproceedings{cs4045,
     booktitle = {Workshop on Human-Semantic Web Interaction (HSWI?14)},
         month = {May},
         title = {Supporting Web Vocabulary Development by Automated Quality Assessment: Results of a Case Study in a Teaching Context},
        author = {Christian Mader and Christian Wartena},
          year = {2014},
        series = {CEUR Workshop Proceedings},
           url = {http://eprints.cs.univie.ac.at/4045/},
      abstract = {Constructing controlled Web vocabularies such as thesauri for search and retrieval tasks is a widely intellectual process that relies on human experts. Thus, errors can occur which decrease the overall quality of the vocabulary. qSKOS is a tool that automatically checks Web vocabularies for potential quality problems and generates quality reports. In this paper we present the results of a case study designed to evaluate the impact of integrating qSKOS in the vocabulary creation process. It was carried out among students skilled in construction of controlled vocabularies. We collected in total 13 vocabularies in two versions and detected 15 different kinds of quality problems. For 11 of these problems we observed reduced occurrences after the vocabularies were revised by the participants based on the results of the generated quality report.}
}

Contributors

How can I contribute?

  • Fork, add and/or improve, and send merge requests

  • File issues and/or feature requests

Copyright © 2011-2017 Christian Mader. See LICENSE.txt for details