String comparison made easy

CompareString2

Maven Central | License: MIT | Javadocs

This library is a wrapper around tdebatty's java-string-similarity. It provides many methods to perform String comparison with various algorithms.

Get CompareString2

Gradle

dependencies {
    implementation 'it.andreuzzi:CompareString2:1.0.8'
}

Maven

<dependency>
  <groupId>it.andreuzzi</groupId>
  <artifactId>CompareString2</artifactId>
  <version>1.0.8</version>
</dependency>

How to

Single comparison

String s1 = "mystring";
String s2 = "muswrinh";
float result = Utils.compare(s1, s2, AlgMap.NormDistAlg.NGRAM);

However, when you need to perform many comparisons with the same algorithm, it's recommended to get an instance of that algorithm and pass it to the method compare:

Algorithm ngram = AlgMap.NormDistAlg.NGRAM.buildAlg();
float result = Utils.compare(s1, s2, ngram, AlgMap.NormDistAlg.NGRAM);

Some algorithms need or allow one or more parameters in order to be built properly. These are usually values that depend on the use case. For instance, the algorithm NGRAM allows you to pass an int value:

int n = 3;
Algorithm ngram = AlgMap.NormDistAlg.NGRAM.buildAlg(n);

You can check the Javadoc page for the algorithm NGRAM.

List comparison

Sorting order

Every array returned by CompareString2 is sorted in descending order. The order is based on the score each entry obtains when compared with s1 using the given algorithm.
Note that the sorting order isn't always the same, because the meaning of a score depends on the algorithm category. For distance algorithms, the result 0 means that the strings are equal, while for similarity algorithms the result 0 means that they are totally different.
For more information about this topic, check this section of the GitHub page of java-string-similarity.

String s1 = "wahssapp";
String[] ss = new String[] {"Facebook", "Instagram", "Snapchat", "Twitter", "WhatsApp", "Reddit"};
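
The effect of the category on ordering can be sketched in plain Java. This snippet does not use CompareString2: `SortOrderSketch`, `levenshtein` and `similarity` are hypothetical helpers written for illustration, and the similarity is simply assumed to be `1 - distance / maxLength`. It shows that "best match first" means smallest value first for a distance and largest value first for a similarity.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortOrderSketch {

    // Classic dynamic-programming Levenshtein distance (0 means equal).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Similarity derived from the distance (1 means equal, 0 means totally different).
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Distance scores: the best match has the smallest value, so sort ascending.
    static String[] sortByDistance(String s1, String[] ss) {
        String[] out = ss.clone();
        Arrays.sort(out, Comparator.comparingInt((String s) -> levenshtein(s1, s.toLowerCase())));
        return out;
    }

    // Similarity scores: the best match has the largest value, so sort descending.
    static String[] sortBySimilarity(String s1, String[] ss) {
        String[] out = ss.clone();
        Arrays.sort(out, Comparator.comparingDouble((String s) -> similarity(s1, s.toLowerCase())).reversed());
        return out;
    }

    public static void main(String[] args) {
        String s1 = "wahssapp";
        String[] ss = {"Facebook", "Instagram", "Snapchat", "Twitter", "WhatsApp", "Reddit"};
        System.out.println(Arrays.toString(sortByDistance(s1, ss)));
        System.out.println(Arrays.toString(sortBySimilarity(s1, ss)));
    }
}
```

Both calls put WhatsApp first, even though the raw scores run in opposite directions.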

Best match

String bM = CompareStrings.bestMatch(s1, ss, AlgMap.NormSimAlg.JAROWRINKLER);

Top n matches

This method returns an array of min(n, ss.length) elements.

String[] topN = CompareStrings.topNmatches(s1, ss, AlgMap.NormSimAlg.JACCARD, 4);
// '4' is an optional argument of the algorithm "Jaccard"
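
The min(n, ss.length) rule can be illustrated with a small stand-alone sketch. It is not the library's code: `TopNSketch`, `topN` and the `score` parameter are hypothetical names, with `score` standing in for "similarity to s1" under any algorithm where a higher value is better.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.ToDoubleFunction;

class TopNSketch {

    // Returns the min(n, ss.length) highest-scoring entries, best first.
    static String[] topN(String[] ss, int n, ToDoubleFunction<String> score) {
        String[] sorted = ss.clone(); // leave the caller's array untouched
        Arrays.sort(sorted, Comparator.comparingDouble(score).reversed());
        int size = Math.min(Math.max(n, 0), sorted.length);
        return Arrays.copyOf(sorted, size);
    }

    public static void main(String[] args) {
        String[] ss = {"a", "ccc", "bb"};
        // String length is used as a toy score here.
        System.out.println(Arrays.toString(topN(ss, 2, String::length)));
        System.out.println(topN(ss, 5, String::length).length);
    }
}
```

Asking for more matches than there are candidates simply returns every candidate, sorted.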

Let's redefine s1 and ss. You will notice that, while s1 needs to be a String object, ss can be any Iterable<? extends StringableObject>. StringableObject is an interface which comes with CompareString2. The library uses the method getLowercaseString() to obtain a comparable String. You also need to implement the method getString(), which is used for testing purposes (you can return null if you don't need it). Let's see an example:

String s1 = "jonn";
List<Contact> ss = Arrays.asList(new Contact[] {new Contact("John", "Doe"),
    new Contact("Mario", "Rossi"), new Contact("Santa", "Claus")});

.
.
.

class Contact implements StringableObject {
  .
  .
  .
  @Override
  public String getString() {
    return name + " " + surname;
  }

  @Override
  public String getLowercaseString() {
    return getString().toLowerCase();
  }
}

With deadline

float deadline = 0.55f;
Algorithm sordice = AlgMap.NormSimAlg.SORENSENDICE.buildAlg();
Contact[] aboveDeadline = CompareObjects.withDeadline(Contact.class, s1, ss.size(), ss, deadline, sordice, AlgMap.NormSimAlg.SORENSENDICE);

Since AlgMap.NormSimAlg.SORENSENDICE is in the category normalized similarity, 0 means totally different, and 1 means equal. So a bigger result means a higher similarity, and this gives the sorting order of the returned array. Check here for more details.
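
As a rough illustration of what a deadline does with a normalized similarity, here is a minimal plain-Java sketch. It does not call CompareString2: `DeadlineSketch`, `bigrams`, `dice` and `withDeadline` are hypothetical helpers, and the score is a hand-written Sørensen-Dice coefficient on sets of character bigrams, which may differ from what the library's SORENSENDICE computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DeadlineSketch {

    // Set of character bigrams of s, e.g. "jonn" -> {jo, on, nn}.
    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) out.add(s.substring(i, i + 2));
        return out;
    }

    // Sorensen-Dice coefficient on bigram sets: 2|A ∩ B| / (|A| + |B|), in [0, 1].
    static double dice(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) return a.equals(b) ? 1.0 : 0.0;
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    // Keeps the candidates whose similarity to s1 is at least the deadline, best first.
    static List<String> withDeadline(String s1, List<String> candidates, double deadline) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (dice(s1, c.toLowerCase()) >= deadline) kept.add(c);
        }
        kept.sort((p, q) -> Double.compare(dice(s1, q.toLowerCase()), dice(s1, p.toLowerCase())));
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("John Doe", "Mario Rossi", "Santa Claus");
        System.out.println(withDeadline("jonn", names, 0.15));
        System.out.println(withDeadline("jonn", names, 0.55));
    }
}
```

With this toy scorer, "jonn" shares only the bigram "jo" with "john doe" and scores 0.2, so a deadline of 0.15 keeps John Doe while a deadline of 0.55 keeps nothing. The deadline should be chosen with the score range of the algorithm in mind.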

Splitter

Let's redefine s1 once more, replace ss with a Set called files, and introduce a new String[] object called splitter:

// let's suppose that MyFile implements StringableObject
String[] splitter = new String[] {"-", "_"};
String s1 = "values";
Set<MyFile> files = new HashSet<>(Arrays.asList(new MyFile[] {new MyFile("xml-entries.xml"),
    new MyFile("json_elements.json"), new MyFile("csv-values.xml"), new MyFile("JSON_values.xml")}));

If you call any CompareString2 method with s1 and files, you will likely get an unwanted result. We're interested in csv-values.xml, but since string comparison algorithms work through their inputs linearly, the csv- part is considered before the values.xml part, and the outcome is worse than we would like.
You can avoid this problem by passing a String[] object as a splitter. The following is a pseudocode version of the process:

double result = veryBadResult;
for(Object obj : files) {
  for(String spl : splitter) {
    for(String s : obj.toString().split(spl)) {
      // keep the better result
      result = keepBetter(result, Compare.compare(s, s1, alg));
    }
  }
}
return result;

You can use the splitter feature with every method in the List comparison section.
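
The splitter idea can also be written as a small runnable sketch. This is an illustration under stated assumptions, not the library's implementation: `SplitterSketch`, `bestScore` and `prefixSimilarity` are hypothetical names, and `prefixSimilarity` is a deliberately simple stand-in for a real similarity algorithm. The sketch also scores the whole candidate before splitting it, which is a choice made here.

```java
import java.util.function.ToDoubleBiFunction;
import java.util.regex.Pattern;

class SplitterSketch {

    // Toy similarity in [0, 1]: length of the common prefix over the longer length.
    static double prefixSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) return 1.0;
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return (double) i / max;
    }

    // Best score of s1 against the whole candidate and against each of its pieces.
    static double bestScore(String s1, String candidate, String[] splitters,
                            ToDoubleBiFunction<String, String> similarity) {
        double best = similarity.applyAsDouble(s1, candidate);
        for (String spl : splitters) {
            // Pattern.quote stops characters such as "." from being read as a regex.
            for (String piece : candidate.split(Pattern.quote(spl))) {
                best = Math.max(best, similarity.applyAsDouble(s1, piece));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] splitter = {"-", "_"};
        String[] names = {"xml-entries.xml", "json_elements.json", "csv-values.xml", "json_values.xml"};
        for (String name : names) {
            double score = bestScore("values", name, splitter, SplitterSketch::prefixSimilarity);
            System.out.println(name + " -> " + score);
        }
    }
}
```

Without a splitter, "values" and "csv-values.xml" share no prefix and score 0 under this toy measure. With the splitter, the piece "values.xml" is compared on its own and the score rises.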

Deadline + Top N

This method returns a MyFile[] object which contains only MyFile objects whose comparison result with s1 is greater than or equal to deadline. The length of the array will be between 0 and n.

float deadline = 0.5f; // normalized similarity results lie between 0 and 1
int n = 2;
Algorithm sordice = AlgMap.NormSimAlg.SORENSENDICE.buildAlg();
MyFile[] objs = CompareObjects.topMatchesWithDeadline(MyFile.class, s1, files.size(), files, n, deadline, splitter, sordice, AlgMap.NormSimAlg.SORENSENDICE);

Algorithms

Please refer to tdebatty/java-string-similarity for a detailed description of each algorithm.
Some algorithms are listed two or three times. This means that they come in more than one version (Normalized distance, Normalized similarity, ...).

| Category | Algorithm | Needed args | Optional args |
|---|---|---|---|
| Distance | LCS | / | / |
| Distance | OSA | / | / |
| Distance | QGRAM | / | int |
| Distance | WLEVENSHTEIN | CharacterSubstitutionInterface | CharacterInsDelInterface |
| Normalized distance | COSINE | / | int |
| Normalized distance | JACCARD | / | int |
| Normalized distance | JAROWRINKLER | / | int |
| Normalized distance | METRICLCS | / | / |
| Normalized distance | NGRAM | / | int |
| Normalized distance | NLEVENSHTEIN | / | / |
| Normalized distance | SORENSENDICE | / | int |
| Normalized similarity | COSINE | / | int |
| Normalized similarity | JACCARD | / | int |
| Normalized similarity | JAROWRINKLER | / | int |
| Normalized similarity | NLEVENSHTEIN | / | / |
| Normalized similarity | SORENSENDICE | / | int |
| Metric distance | DAMERAU | / | / |
| Metric distance | JACCARD | / | int |
| Metric distance | LEVENSHTEIN | / | / |
| Metric distance | METRICLCS | / | / |

Result ranges

| Category | Equals | Different |
|---|---|---|
| Distance | 0 | +Infinity |
| Normalized distance | 0 | 1 |
| Normalized similarity | 1 | 0 |
| Metric distance | 0 | +Infinity |

Known users

Please let me know if you use CompareString2 in your project :D