Click here to see a small ppt on the potential use of n gram graphs for text classification
The JINSECT toolkit is a Java-based toolkit and library that supports and demonstrates the use of n-gram graphs within Natural Language Processing applications, ranging from summarization and summary evaluation to text classification and indexing. This repository has parts of the collaborative work that Ayush Pareek did with Dr. George Giannakopoulos on the upcoming 2nd version of the tool. It also contains a rudimentary version of the toolkit in Python with functional N-gram Graph operations and AutoSumENG, MeMoG, NPower algorithms for multilingual summary evaluation.
- Create an n-gram graph from a string:
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
...
// The string we want to represent
String sTmp = "Hello graph!";
// The default document n-gram graph with min n-gram size
// and max n-gram size set to 3, and dist parameter set to 3
DocumentNGramGraph dngGraph = new DocumentNGramGraph();
// Create the graph
dngGraph.setDataString(sTmp);
- Create an n-gram graph from a file
...
import gr.demokritos.iit.jinsect.documentModel.comparators.NGramCachedGraphComparator;
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.structs.GraphSimilarity;
import gr.demokritos.iit.jinsect.utils;
import java.io.IOException;
...
// The filename of the file the contents of which will form the graph
String sFilename = "ayush_file.txt";
DocumentNGramGraph dngGraph = new DocumentNGramGraph();
// Load the data string from the file, also dealing with exceptions
try {
dngGraph.loadDataStringFromFile(sFilename);
} catch (IOException ex) {
ex.printStackTrace();
}
- Output graph to DOT file
import
gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.utils;
...
// create the n-gram graph
String sData = "Hello there, graph world!";
DocumentNGramGraph dngGraph = new DocumentNGramGraph();
dngGraph.setDataString(sData);
/* The following command gets the first n-gram graph level (with the
minimum n-gram size) and renders it, using the utils package,
as a DOT string */
System.out.println(utils.graphToDot(dngGraph.getGraphLevel(0), true));
- Compare two graphs, extracting their similarity
...
import gr.demokritos.iit.jinsect.documentModel.comparators.NGramCachedGraphComparator;
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.structs.GraphSimilarity;
import gr.demokritos.iit.jinsect.utils;
import java.io.IOException;
...
String sTmp = "Hello graph!";
DocumentNGramGraph dngGraph = new DocumentNGramGraph();
dngGraph.setDataString(sTmp);
String sTmp2 = "Hello other graph!";
DocumentNGramGraph dngGraph2 = new DocumentNGramGraph();
dngGraph2.setDataString(sTmp2);
// Create a comparator object
NGramCachedGraphComparator ngc = new NGramCachedGraphComparator();
// Extract similarity
GraphSimilarity gs = ngc.getSimilarityBetween(dngGraph, dngGraph2);
// Output similarity (all three components: containment, value and size)
System.out.println(gs.toString());
- Merge two graphs
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
...
// create the two graphs
String sTmpA = "Hello graph A!";
String sTmpB = "Hello graph B!";
DocumentNGramGraph dngGraphA = new DocumentNGramGraph();
DocumentNGramGraph dngGraphB = new DocumentNGramGraph();
dngGraphA.setDataString(sTmpA);
dngGraphB.setDataString(sTmpB);
// perform merging with weight factor 0.5 (averaging)
// result is on dngGraphA
dngGraphA.mergeGraph(dngGraphB, 0.5);
- Load and save a graph to a file
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.storage.INSECTFileDB;
...
// string to be represented
String sTmp = "Hello there, I am an example string!";
DocumentNGramGraph dngGraph = new DocumentNGramGraph();
INSECTFileDB<DocumentNGramGraph> db = new INSECTFileDB<DocumentNGramGraph>();
// if the file already exists
if (db.existsObject("test", "graph")) {
dngGraph = db.loadObject("test", "graph");
}
else {
// Create the graph
dngGraph.setDataString(sTmp);
// save object to file
db.saveObject(dngGraph, "test", "graph");
}
- The n-gram graphs (NGG) representations.
- The NGG operators update/merge, intersect, allNotIn, etc.
- The AutoSummENG summary evaluation family of methods.
- INSECTDB storage abstraction for object serialization.
- A very rich (and useful!) utils class which one must consult before trying to work with the graphs.
- Tools for the estimation of optimal parameters of n-gram graphs
- Support for DOT language representation of NGGs. ...and many many side-projects that are hidden including a chunker based on something similar to a language model, a semantic index that builds upon string subsumption to determine meaning and many others. Most of these are, sadly, not documented or published.
This library version:
- Has increased modularity compared to the previous version i.e. JInsect 1.0
- supports efficient multi-threaded execution
- contains examples of application for classification
- contains examples of application for clustering
- contains command-line application for language-neutral summarization