
WordNet Similarity for Java provides an API for several Semantic Relatedness/Similarity algorithms


WordNet Similarity for Java

WS4J provides a pure Java API for several published semantic relatedness/similarity algorithms that works, in theory, with any WordNet instance. You can immediately use WS4J on Princeton's English WordNet 3.0 lexical database through the MIT Java WordNet Interface 2.4.0, which is the fastest Java library for interfacing with WordNet.

The codebase is mostly a Java re-implementation of WordNet::Similarity written in Perl, using the same data files as seen in src/main/resources, with some test cases for verifying the same logic. WS4J is designed to be thread-safe.

Relatedness/Similarity Algorithms

The semantic relatedness/similarity metrics available are:

  • HSO: Hirst & St-Onge, 1998 - The Hirst & St-Onge measure is based on the idea that two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that "does not change direction too often":

HSO(s1, s2) = const_C - path_length(s1, s2) - const_k * num_of_changes_of_directions(s1, s2);

  • LCH: Leacock & Chodorow, 1998 - The Leacock & Chodorow measure relies on the length of the shortest path between two synsets for their measure of similarity:

LCH(s1, s2) = -Math.log_e(LCS(s1, s2).length / (2 * max_depth(pos)));

  • LESK: Banerjee & Pedersen, 2002 - Lesk (1985) proposed that the relatedness of two words is proportional to the extent of overlaps in their dictionary definitions. Banerjee and Pedersen (2002) extended this notion to use WordNet as the dictionary for the word definitions, and this measure is based on their adapted Lesk:

LESK(s1, s2) = sum_{s1' in linked(s1), s2' in linked(s2)}(overlap(s1'.definition, s2'.definition));

  • WUP: Wu & Palmer, 1994 - The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS:

WUP(s1, s2) = 2 * dLCS.depth / (min_{dlcs in dLCS}(s1.depth - dlcs.depth) + min_{dlcs in dLCS}(s2.depth - dlcs.depth)), where dLCS(s1, s2) = argmax_{lcs in LCS(s1, s2)}(lcs.depth);

  • RES: Resnik, 1995 - Resnik defined the similarity between two synsets to be the information content of their lowest super-ordinate (most specific common subsumer):

RES(s1, s2) = IC(LCS(s1, s2));

  • PATH - The Path measure computes the semantic relatedness of word senses by counting the number of nodes along the shortest path between the senses in the 'is-a' hierarchies of WordNet:

PATH(s1, s2) = 1 / path_length(s1, s2);

  • JCN: Jiang & Conrath, 1997 - The Jiang & Conrath measure uses the notion of information content but in the form of the conditional probability of encountering an instance of a child synset given an instance of a parent synset:

JCN(s1, s2) = 1 / jcn_distance(s1, s2), where jcn_distance(s1, s2) = IC(s1) + IC(s2) - 2 * IC(LCS(s1, s2)); when this is 0, jcn_distance(s1, s2) = -Math.log_e((freq(LCS(s1, s2).root) - 0.01) / freq(LCS(s1, s2).root)), so that the distance is non-zero and the similarity does not become infinite;

  • LIN: Lin, 1998 - The Lin measure follows the same idea as JCN, with a small modification:

LIN(s1, s2) = 2 * IC(LCS(s1, s2)) / (IC(s1) + IC(s2)).
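
The three information-content measures (RES, JCN, LIN) differ only in how they combine the same IC values. The following is a minimal sketch of that arithmetic, not the WS4J implementation: the class name IcMeasures and its method names are hypothetical, IC is assumed to be the usual -ln(freq(s) / freq(root)), and returning 0 from LIN when both IC values are 0 is an assumption made here to avoid dividing by zero.

```java
// Illustrative sketch only; IcMeasures is a hypothetical name, not a WS4J class.
final class IcMeasures {

    private IcMeasures() {}

    // Assumed definition: IC(s) = -ln(freq(s) / freq(root)).
    static double ic(double freq, double rootFreq) {
        return -Math.log(freq / rootFreq);
    }

    // RES(s1, s2) = IC(LCS(s1, s2))
    static double res(double icLcs) {
        return icLcs;
    }

    // LIN(s1, s2) = 2 * IC(LCS(s1, s2)) / (IC(s1) + IC(s2))
    static double lin(double icS1, double icS2, double icLcs) {
        double denominator = icS1 + icS2;
        // Assumption: report 0 instead of dividing by zero.
        return denominator == 0.0 ? 0.0 : 2.0 * icLcs / denominator;
    }

    // JCN(s1, s2) = 1 / jcn_distance(s1, s2)
    static double jcn(double icS1, double icS2, double icLcs, double rootFreq) {
        double distance = icS1 + icS2 - 2.0 * icLcs;
        if (distance == 0.0) {
            // Fallback from the formula above: a tiny non-zero distance
            // derived from the root frequency keeps the result finite.
            distance = -Math.log((rootFreq - 0.01) / rootFreq);
        }
        return 1.0 / distance;
    }

    public static void main(String[] args) {
        // Made-up IC values, chosen only to exercise the formulas.
        System.out.println("RES = " + res(3.0));
        System.out.println("LIN = " + lin(4.0, 6.0, 3.0));
        System.out.println("JCN = " + jcn(4.0, 6.0, 3.0, 1000.0));
    }
}
```

With these made-up inputs the JCN distance is 4 + 6 - 2 * 3 = 4. When both synsets and their LCS share the same IC the distance is 0, so the fallback branch is taken and the score becomes very large but stays finite.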

The descriptions above are extracted either from each paper or from WordNet-Similarity CPAN documentation.
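
The path-based measures (PATH, LCH, HSO) can be sketched in the same way once the shortest path has been found. This is an illustration under stated assumptions rather than the WS4J code: the class PathMeasures and its methods are hypothetical, the path length is assumed to be counted in nodes (so identical synsets have length 1), and the constants and taxonomy depth passed in main are illustrative values that this README does not specify.

```java
// Illustrative sketch only; PathMeasures is a hypothetical name, not a WS4J class.
final class PathMeasures {

    private PathMeasures() {}

    private static void requirePositive(int pathLength) {
        if (pathLength < 1) {
            throw new IllegalArgumentException("path length must be >= 1");
        }
    }

    // PATH(s1, s2) = 1 / path_length(s1, s2)
    static double path(int pathLength) {
        requirePositive(pathLength);
        return 1.0 / pathLength;
    }

    // LCH(s1, s2) = -ln(path_length / (2 * max_depth(pos)))
    static double lch(int pathLength, int maxDepth) {
        requirePositive(pathLength);
        return -Math.log(pathLength / (2.0 * maxDepth));
    }

    // HSO(s1, s2) = const_C - path_length - const_k * changes_of_direction
    static double hso(int pathLength, int directionChanges, int constC, int constK) {
        return constC - pathLength - constK * directionChanges;
    }

    public static void main(String[] args) {
        // Illustrative inputs: a 3-node path, a taxonomy of depth 20,
        // and constants C = 8, k = 1 (assumed, not taken from this README).
        System.out.println("PATH = " + path(3));
        System.out.println("LCH  = " + lch(3, 20));
        System.out.println("HSO  = " + hso(3, 1, 8, 1));
    }
}
```

PATH and LCH both shrink as the path grows, but LCH also scales by the depth of the taxonomy for the part of speech, so its values are only comparable within the same part of speech.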

Prerequisites

By default, the requirements for compilation are:

  • JDK 8+
  • Maven

Any WordNet instance can be used in WS4J if it implements the ILexicalDatabase interface.

Built with Maven

To create a jar file with dependencies including resource files:

$ mvn install assembly:single

Using WS4J

Then start playing with the facade WS4J API:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/WS4J.java

and a simple demo class:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/demo/SimilarityCalculationDemo.java

which can be run through jar-with-dependencies from the root folder by typing into the terminal:

$ java -jar target/ws4j-1.0.2-jar-with-dependencies.jar

To use the WS4J jar package from another project, add the JitPack repository to your POM file:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

and declare this GitHub repo as a dependency:

<dependencies>
    <dependency>
        <groupId>com.github.dmeoli</groupId>
        <artifactId>WS4J</artifactId>
        <version>x.y.z</version>
    </dependency>
</dependencies>

Running the tests

To run JUnit test cases:

$ mvn test

The expected results from the test cases are compatible with the original WordNet::Similarity.

Initial Work

The original author is Hideki Shima.

License

This software is released under the GNU GPL v3 license. See the LICENSE file for details.