WS4J provides a pure Java API for several published semantic relatedness/similarity algorithms for, in theory, any WordNet instance. You can immediately use WS4J on Princeton's English WordNet 3.0 lexical database through the MIT Java WordNet Interface (JWI) 2.4.0, which is the fastest Java library for interfacing with WordNet.
The codebase is mostly a Java re-implementation of WordNet::Similarity, which is written in Perl. It uses the same data files (see src/main/resources) and includes test cases that verify the same logic. WS4J is designed to be thread safe.
The semantic relatedness/similarity metrics available are:
- HSO: Hirst & St-Onge, 1998 - The Hirst & St-Onge measure is based on an idea that two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that "does not change direction too often":
HSO(s1, s2) = const_C - path_length(s1, s2) - const_k * num_of_changes_of_directions(s1, s2);
- LCH: Leacock & Chodorow, 1998 - The Leacock & Chodorow measure relies on the length of the shortest path between two synsets for their measure of similarity:
LCH(s1, s2) = -Math.log_e(LCS(s1, s2).length / (2 * max_depth(pos)));
- LESK: Banerjee & Pedersen, 2002 - Lesk (1985) proposed that the relatedness of two words is proportional to the extent of the overlap of their dictionary definitions. Banerjee and Pedersen (2002) extended this notion to use WordNet as the dictionary for the word definitions, and this measure is based on their adapted Lesk:
LESK(s1, s2) = sum_{s1' in linked(s1), s2' in linked(s2)}(overlap(s1'.definition, s2'.definition));
- WUP: Wu & Palmer, 1994 - The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS:
WUP(s1, s2) = 2 * dLCS.depth / (min_{dlcs in dLCS}(s1.depth - dlcs.depth) + min_{dlcs in dLCS}(s2.depth - dlcs.depth)), where dLCS(s1, s2) = argmax_{lcs in LCS(s1, s2)}(lcs.depth);
- RES: Resnik, 1995 - Resnik defined the similarity between two synsets to be the information content of their lowest super-ordinate (most specific common subsumer):
RES(s1, s2) = IC(LCS(s1, s2));
- PATH - The Path measure computes the semantic relatedness of word senses by counting the number of nodes along the shortest path between the senses in the 'is-a' hierarchies of WordNet:
PATH(s1, s2) = 1 / path_length(s1, s2);
- JCN: Jiang & Conrath, 1997 - The Jiang & Conrath measure uses the notion of information content, but in the form of the conditional probability of encountering an instance of a child-synset given an instance of a parent synset:
JCN(s1, s2) = 1 / jcn_distance(s1, s2), where jcn_distance(s1, s2) = IC(s1) + IC(s2) - 2 * IC(LCS(s1, s2)); when this distance is 0, jcn_distance(s1, s2) = -Math.log_e((freq(LCS(s1, s2).root) - 0.01) / freq(LCS(s1, s2).root)) is used instead, so that the distance is non-zero and the similarity stays finite instead of becoming infinite;
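- LIN: Lin, 1998 - The Lin measure scales the information content of the LCS by the sum of the information content of the two synsets:
LIN(s1, s2) = 2 * IC(LCS(s1, s2)) / (IC(s1) + IC(s2)).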
The descriptions above are extracted either from each paper or from WordNet-Similarity CPAN documentation.
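To make the arithmetic behind some of these formulas concrete, here is a minimal sketch that evaluates PATH, LCH, HSO, RES, LIN and JCN on plain numbers. It is not part of the WS4J API: the class name `MeasureSketch`, its method names and all input values are hypothetical, and the path lengths, depths, frequencies and information content (IC) values that WS4J would derive from WordNet are passed in by hand. Returning 0 from `lin` when the denominator is 0 is an assumption of this sketch, not documented behavior.

```java
// Illustrative only: not part of the WS4J API. All inputs are toy numbers
// supplied by the caller, not values read from WordNet.
final class MeasureSketch {

    private MeasureSketch() {}

    /** PATH: inverse of the shortest is-a path length, counted in nodes. */
    public static double path(int pathLength) {
        return 1.0 / pathLength;
    }

    /** LCH: -ln(length / (2 * maxDepth)), with maxDepth the taxonomy depth for the POS. */
    public static double lch(int length, int maxDepth) {
        return -Math.log(length / (2.0 * maxDepth));
    }

    /** HSO: C - pathLength - k * directionChanges; c and k are caller-supplied constants. */
    public static double hso(double c, double k, int pathLength, int directionChanges) {
        return c - pathLength - k * directionChanges;
    }

    /** RES: the information content of the LCS. */
    public static double res(double icLcs) {
        return icLcs;
    }

    /** LIN: 2 * IC(LCS) / (IC(s1) + IC(s2)). */
    public static double lin(double icS1, double icS2, double icLcs) {
        double denominator = icS1 + icS2;
        // Assumption of this sketch: report 0 instead of dividing by zero.
        return denominator == 0.0 ? 0.0 : 2.0 * icLcs / denominator;
    }

    /** JCN: 1 / distance, with the fallback distance used when the plain distance is 0. */
    public static double jcn(double icS1, double icS2, double icLcs, double rootFreq) {
        double distance = icS1 + icS2 - 2.0 * icLcs;
        if (distance == 0.0) {
            // Fallback from the JCN formula above; it needs rootFreq > 0.01.
            distance = -Math.log((rootFreq - 0.01) / rootFreq);
        }
        return 1.0 / distance;
    }

    public static void main(String[] args) {
        System.out.println("PATH = " + path(4));
        System.out.println("LCH  = " + lch(4, 20));
        System.out.println("HSO  = " + hso(8.0, 1.0, 3, 1));
        System.out.println("RES  = " + res(2.5));
        System.out.println("LIN  = " + lin(4.0, 6.0, 2.5));
        System.out.println("JCN  = " + jcn(4.0, 6.0, 2.5, 100.0));
        // Identical IC values give a zero distance and trigger the fallback.
        System.out.println("JCN (zero distance) = " + jcn(3.0, 3.0, 3.0, 100.0));
    }
}
```

In the last call the plain distance is 0, so the fallback yields a small positive distance and therefore a large but finite score.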
By default, the requirements for compilation are:
- JDK 8+
- Maven
Any WordNet instance can be used in WS4J if it implements the ILexicalDatabase interface.
To create a jar file with dependencies including resource files:
$ mvn install assembly:single
Then start playing with the facade WS4J API:
src/main/java/edu/uniba/di/lacam/kdde/ws4j/WS4J.java
and a simple demo class:
src/main/java/edu/uniba/di/lacam/kdde/ws4j/demo/SimilarityCalculationDemo.java
which can be run from the jar-with-dependencies by typing the following into a terminal from the project root folder:
$ java -jar target/ws4j-1.0.1-jar-with-dependencies.jar
When using the WS4J jar package from other projects, add the JitPack repository to your POM file:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
and declare this GitHub repo as a dependency:
<dependencies>
    <dependency>
        <groupId>com.github.DonatoMeoli</groupId>
        <artifactId>WS4J</artifactId>
        <version>x.y.z</version>
    </dependency>
</dependencies>
To run JUnit test cases:
$ mvn test
The expected results from the test cases are compatible with the original WordNet::Similarity.
The original author is Hideki Shima.
This software is released under the GNU GPL v3 license. See the LICENSE file for details.