/Ngram-Graphs

RESEARCH [NLP]: Analysis of n-gram graphs and their applications in text classification and extraction-based summarization

Primary language: Java. License: Apache License 2.0.

N-gram graphs and their applications in NLP

Report

[Figures: report page images 0 to 17]

Click here to see a short presentation on the potential use of n-gram graphs for text classification.

JInsect

The JINSECT toolkit is a Java-based toolkit and library that supports and demonstrates the use of n-gram graphs in Natural Language Processing applications, ranging from summarization and summary evaluation to text classification and indexing. This repository contains parts of the collaborative work by Ayush Pareek and Dr. George Giannakopoulos on the upcoming second version of the tool. It also contains a rudimentary Python version of the toolkit, with functional n-gram graph operations and the AutoSummENG, MeMoG, and NPowER algorithms for multilingual summary evaluation.
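
To make the representation concrete, here is a minimal, self-contained sketch of how a character n-gram graph can be built. It does not use JInsect: the class `NGramGraphSketch`, the method `build`, and the `"a -> b"` edge-key format are hypothetical names chosen for illustration. The windowing (each n-gram is linked to the `dist` n-grams that follow it, and the edge weight counts how often that happens) is a simplifying assumption and may differ from what `DocumentNGramGraph` does internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not part of the JInsect API.
final class NGramGraphSketch {

    // Hypothetical separator used to encode a directed edge as a map key.
    static final String SEP = " -> ";

    /**
     * Builds a simplified character n-gram graph.
     * Vertices are the n-grams of the text; an edge links each n-gram to
     * the n-grams starting at most `dist` positions after it. The edge
     * weight is the number of times that pair co-occurs.
     */
    static Map<String, Double> build(String text, int n, int dist) {
        Map<String, Double> edges = new LinkedHashMap<>();
        int count = text.length() - n + 1; // number of n-grams in the text
        for (int i = 0; i < count; i++) {
            String from = text.substring(i, i + n);
            for (int j = i + 1; j <= i + dist && j < count; j++) {
                String to = text.substring(j, j + n);
                // Increase the weight of the edge, creating it if needed
                edges.merge(from + SEP + to, 1.0, Double::sum);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Print every edge of the graph for a short string
        build("Hello graph!", 3, 3)
            .forEach((edge, weight) -> System.out.println(edge + " : " + weight));
    }
}
```

Storing the graph as a map from edge to weight keeps the sketch short; a real implementation would typically use a proper graph structure and one graph level per n-gram size.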

Code Snippets

  • Create an n-gram graph from a string:
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;

...

// The string we want to represent
String sTmp = "Hello graph!";

// The default document n-gram graph with min n-gram size 
// and max n-gram size set to 3, and dist parameter set to 3
DocumentNGramGraph dngGraph = new DocumentNGramGraph();

// Create the graph
dngGraph.setDataString(sTmp);
  • Create an n-gram graph from a file
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import java.io.IOException;

...

	// The filename of the file the contents of which will form the graph
	String sFilename = "ayush_file.txt";
	DocumentNGramGraph dngGraph = new DocumentNGramGraph(); 
	// Load the data string from the file, also dealing with exceptions
	try {
		dngGraph.loadDataStringFromFile(sFilename);
	} catch (IOException ex) {
		ex.printStackTrace();
	}
  • Output graph to DOT file
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.utils;

...

	// create the n-gram graph
	String sData = "Hello there, graph world!";
	DocumentNGramGraph dngGraph = new DocumentNGramGraph();
	dngGraph.setDataString(sData);

	/* The following command gets the first n-gram graph level (with the
	minimum n-gram size) and renders it, using the utils package, 
	as a DOT string */
	System.out.println(utils.graphToDot(dngGraph.getGraphLevel(0), true));
  • Compare two graphs, extracting their similarity
import gr.demokritos.iit.jinsect.documentModel.comparators.NGramCachedGraphComparator;
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.structs.GraphSimilarity;

...

    String sTmp = "Hello graph!";
    DocumentNGramGraph dngGraph = new DocumentNGramGraph(); 
    dngGraph.setDataString(sTmp);
    String sTmp2 = "Hello other graph!";
    DocumentNGramGraph dngGraph2 = new DocumentNGramGraph(); 
    dngGraph2.setDataString(sTmp2);

    // Create a comparator object
    NGramCachedGraphComparator ngc = new NGramCachedGraphComparator();
    // Extract similarity
    GraphSimilarity gs = ngc.getSimilarityBetween(dngGraph, dngGraph2);
    // Output similarity (all three components: containment, value and size)
    System.out.println(gs.toString());
  • Merge two graphs
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;

...

	// create the two graphs
	String sTmpA = "Hello graph A!";
	String sTmpB = "Hello graph B!";
	DocumentNGramGraph dngGraphA = new DocumentNGramGraph();
	DocumentNGramGraph dngGraphB = new DocumentNGramGraph();
	dngGraphA.setDataString(sTmpA);
	dngGraphB.setDataString(sTmpB);

	// perform merging with weight factor 0.5 (averaging)
	// result is on dngGraphA
	dngGraphA.mergeGraph(dngGraphB, 0.5);
  • Load and save a graph to a file
import gr.demokritos.iit.jinsect.documentModel.representations.DocumentNGramGraph;
import gr.demokritos.iit.jinsect.storage.INSECTFileDB;

...

		// string to be represented
		String sTmp = "Hello there, I am an example string!";
		DocumentNGramGraph dngGraph = new DocumentNGramGraph();
		INSECTFileDB<DocumentNGramGraph> db = new INSECTFileDB<DocumentNGramGraph>();
		
		// if the file already exists
		if (db.existsObject("test", "graph")) { 
			dngGraph = db.loadObject("test", "graph");
		}
		else {
			// Create the graph
			dngGraph.setDataString(sTmp);

			// save object to file
			db.saveObject(dngGraph, "test", "graph");
		}
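
The comparison snippet above reports three similarity components: containment, value and size. As a rough illustration of what such measures can look like, the sketch below computes them over plain edge-weight maps, following the definitions commonly given for n-gram graph comparison. `GraphSimilaritySketch` and its methods are hypothetical and are not JInsect API; the actual `NGramCachedGraphComparator` may differ in details. The sketch assumes all edge weights are strictly positive.

```java
import java.util.Map;

// Illustrative sketch only; not part of the JInsect API.
final class GraphSimilaritySketch {

    /** Fraction of shared edges, relative to the smaller graph. */
    static double containment(Map<String, Double> g1, Map<String, Double> g2) {
        if (g1.isEmpty() || g2.isEmpty()) return 0.0;
        long shared = g1.keySet().stream().filter(g2::containsKey).count();
        return (double) shared / Math.min(g1.size(), g2.size());
    }

    /** Ratio of the smaller edge count to the larger edge count. */
    static double size(Map<String, Double> g1, Map<String, Double> g2) {
        if (g1.isEmpty() || g2.isEmpty()) return 0.0;
        return (double) Math.min(g1.size(), g2.size())
                / Math.max(g1.size(), g2.size());
    }

    /**
     * Like containment, but each shared edge contributes the ratio of its
     * smaller weight to its larger weight, normalised by the larger graph.
     * Assumes strictly positive weights.
     */
    static double value(Map<String, Double> g1, Map<String, Double> g2) {
        if (g1.isEmpty() || g2.isEmpty()) return 0.0;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : g1.entrySet()) {
            Double w2 = g2.get(e.getKey());
            if (w2 != null) {
                double w1 = e.getValue();
                sum += Math.min(w1, w2) / Math.max(w1, w2);
            }
        }
        return sum / Math.max(g1.size(), g2.size());
    }
}
```

Containment ignores weights entirely, so two graphs with the same edges but very different weights still score 1.0 on it; the value component is the one that reflects weight differences.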

This version of the library implements the following features:

  • The n-gram graph (NGG) representations.
  • The NGG operators update/merge, intersect, allNotIn, etc.
  • The AutoSummENG summary evaluation family of methods.
  • INSECTDB storage abstraction for object serialization.
  • A very rich (and useful!) utils class, which one should consult before trying to work with the graphs.
  • Tools for the estimation of optimal parameters of n-gram graphs.
  • Support for DOT language representation of NGGs.

...and many, many hidden side projects, including a chunker based on something similar to a language model and a semantic index that builds upon string subsumption to determine meaning. Most of these are, sadly, not documented or published.
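
The NGG operators named in the list above (update/merge, intersect, allNotIn) can be pictured as set-like operations on weighted edges. The following is a minimal sketch over plain edge-weight maps, not the JInsect implementation: `GraphOperatorsSketch` and its method signatures are hypothetical, and the weighting rules are assumptions. In particular, merging is assumed to move a shared edge's weight towards the other graph's weight by the given factor, and intersection is assumed to average the two weights. Unlike `mergeGraph` in the snippet earlier, which stores its result in the receiving graph, these methods return a new map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not part of the JInsect API.
final class GraphOperatorsSketch {

    /**
     * Merge (update): union of edges. For an edge present in both graphs
     * the weight moves from a's value towards b's value by `factor`
     * (0.5 gives the average). Edges found only in b keep b's weight.
     */
    static Map<String, Double> merge(Map<String, Double> a,
                                     Map<String, Double> b,
                                     double factor) {
        Map<String, Double> out = new LinkedHashMap<>(a);
        b.forEach((edge, wb) ->
            out.merge(edge, wb, (wa, w) -> wa + (w - wa) * factor));
        return out;
    }

    /** Intersect: edges present in both graphs, with averaged weights. */
    static Map<String, Double> intersect(Map<String, Double> a,
                                         Map<String, Double> b) {
        Map<String, Double> out = new LinkedHashMap<>();
        a.forEach((edge, wa) -> {
            Double wb = b.get(edge);
            if (wb != null) {
                out.put(edge, (wa + wb) / 2.0);
            }
        });
        return out;
    }

    /** allNotIn: edges of a that do not appear in b, weights unchanged. */
    static Map<String, Double> allNotIn(Map<String, Double> a,
                                        Map<String, Double> b) {
        Map<String, Double> out = new LinkedHashMap<>(a);
        out.keySet().removeAll(b.keySet());
        return out;
    }
}
```

With these three operations one can, for example, merge the graphs of several documents of a class into one class graph, or strip from a document graph the edges it shares with a background graph.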

This library version:

  • has increased modularity compared to the previous version, i.e. JInsect 1.0
  • supports efficient multi-threaded execution
  • contains examples of application to classification
  • contains examples of application to clustering
  • contains a command-line application for language-neutral summarization