ActiveWatch: A Java repository from prohippo

ActiveWatch (AW) is a set of Java software modules for building various statistical text processing capabilities. It is unusual in its indexing of text according to a finite set of selected lexical features instead of whole words. It represents individual text items as N-dimensional numerical vectors, where N will be on the order of 10⁴.

For a full AW User Guide, see the file AWug.pdf in this GitHub repository. For a mathematical and historical description of finite and other kinds of text indexing in general, see the repository file HowtoIndex.pdf.

Finite indexing lets us reliably estimate the probability that a given lexical feature occurs in a text item of a given length. A multinomial model can then be applied to compute a statistically scaled inner-product similarity measure between pairs of vectors. This offers an alternative to the normalized, but unscaled, cosine similarity of Gerard Salton.

AW lexical features for indexing text currently fall into three types: (1) all alphanumeric 2-grams, like TH, F1, 2X, or 00; (2) selected alphabetic n-grams (for n > 2), like QUE, REVE, and CLASS; and (3) a fixed number of user-defined alphanumeric word beginnings and endings, like -000000000, THERMO-, and -MOTHER.

Indexing with word fragments will tend to be noisier than with whole words. For example, if text item has a long word like CONFABULATE that is missing as a indexing feature, AW will break it into overlapping fragments like CONF, NFA, FAB, ABU, ULAT, and LATE to represent the word in a finite indexing vector. This may seem outrageous, but any crossword puzzle fan knows that word fragments do carry ihformation.

So, how big would our finite index set have to be to support useful text analysis? The ActiveWatch demonstration makes the case that 10⁴ should be enough in English for automatic clustering of short text items by content or for detecting highly unusual content in a dynamic text stream. You should look elsewhere, however, if you just want to find all documents containing a specific word.

The advantage of a finite vector representation of text is that it lets us organize information processing at a level of abstraction that simplifies the computations cwa system must carry out. Once we encode text as vectors, it should not matter where these vectors came from. We care only that they are convenient to work with and carry enough information for the needs and purposes of information users.

Vector data of finite dimensionality makes a statistically scaled measure of similarity possible. Such scaling makes a measure easier to interpret and allows a text processing system to make decisions reliably on its own. Human users must otherwise hover around like a helicopter parent for quality control. This is especiallly critical in real-time systems with dynamic data; they become more manageable and more resiliant in unexpected situations.

AW will score similarity by the number of standard deviations that a raw vector inner product similarity score falls above the mean of a theoretical noise distribution. This noise will be roughly Gaussian, so that an AW scaled similarity of 3 standard deviations should be significant at about p = .003. With actual text data, AW should typically work with scaled similarity well above 6 standard deviations.

Some index tuning is needed to achieve such performance. This will mainly involve adjustments of the indexing features defined by a users for particular target text data. Automatic stemming and stopword deletion also allows AW users to exclude purely grammatical instances of n-grams like ING, MENT, or ATION when indexing text for content.

AW was first written in C around 1982 for information discovery in unfamiliar text data. The current Java version dates back to around 1999, but has some recent tweaks in its linguistic analysis and addition of 4- and 5-letter word fragments for more precise indexing. Only 2- and 3-letter fragments, plus user-defined indices, were built into AW previously.

The modules included in the AW GitHub repository mainly provide support for simple clustering of text items by content. The code is organized functionally Java packages. It was originally written on Apple home computers running versions 7., 8., or 9.* of the Macintosh OS. This was when Java was still a somewhat new programming language.

Java AW eventually evolved to support many kinds of statistical natural language processing, but this GitHub repository includes only a small subset of modules to demonstrate automatic clustering of text items in particular. This software should give you a good overall idea of what you can do with AW finite indexing and statistically scaled similarity between pairs of finite item vectors.

The latest AW release includes fifteen prebuilt AW modules. These might support military intelligence operations or the tagging of news streams for resale to commercial businesses or the organization of text documents obtained by legal discovery. The modules are in separate runnable jar files in the jars subdireectory of this repository.

All Java source code is included in the GitHub repository. You can build out all the AW modules by running the 'build' shell script included with the AW GitHub download. The script is for macOS Darwin Unix and should be edited for your own computing platform. You will have to install a Java JDK if you do not have one already. Everything in the AW demonstration still has to run from a command line.

AW software is free for all uses and is released under BSD licensing.

Release History:

v0.1    16jun2021  Initial upload of original AW Java source code.

v0.2    10ju12021  Clean up and reorganize code for SEGMTR module
                   Add code to dump AW output files
                   Collect news data set for demonstration

v0.3    13jul2021  Clean up and debug code in AW table building modules
                   Add unit testing

v0.4    16jul2021  Clean up and reorganize code for INDEXR module
                   Add code to dump AW output files
                   Update to expect UTF-8 text input, not ASCII

v0.5    06aug2021  Clean up code for SEGMTR module
                   Fix problems in UTF-8 handling
                   Add diagnostic tools
                   Build initial versions of AW clustering modules to test

v0.6    12aug2021  Clean up SEGMTR, UPDATR, SEQNCR, SQUEZR, SUMRZR, KEYWDR modules
                   Fix problems in UTF-8 handling, subsegmenting long text items
                   Add and extend diagnostic tools
                   Clean up text data sample for clustering demonstration
                   Edit and update documentation

v0.6.1  14aug2021  Fix mishandling of multi-segment items in AW clustering

v0.7    21aug2021  add and extend diagnostic tools
                   remove phonetic indexing
                   add 4- and 5-gram indices
                   reorganize startup of n-gram analysis
                   clean up problems with deprecation and type casting

v0.7.1  25aug2021  fix bug in LexicalGram in breaking out of extraction loop
                   replace incorrect 4- and 5-grams in GramMap lists
                   clean up literal file to align with current 4- and 5-grams
                   fix inconsistent stemming rules in suffix file
                   simply Updater command line arguments
                   update Dprb for v0.7 changes in n-gram initialization
                   update documentation

v0.7.2  30aug2021  fix integration of AW-defined and user-defined indices
                   update documentation

v0.7.3  04sep2021  add suffix rule to fix stemming glitch
                   add general writeup on AW and finite indexing
                   upload jars of compiled and linked AW modules
                   update documentation

v0.8    20sep2021  update to latest inflectional stemming logic
                   update implementation of inflectional stemmer
                   update documentation

v0.9    30sep2021  expand morphological stemming
                   fix longstanding problems in stemming code
                   clean up and and extend code commentary
                   update documentation

v0.9.1  07oct2021  fix problems in n-gram extraction
                   fix problems in morphological stemming rule extension
                   fix typo in AW 5-gram table
                   update documentation

v1.0    14oct2021  clean up definitions of n-gram limits
                   clean up literal n-grams and morphological stemming
                   extend builtin 4-and 5-grams
                   fix a lookup bug for 4- and 5-grams
                   update documentation

v1.0.1  22oct2021  extend built-in 4-grams by 200
                   add built-in 5-grams not chosen by general frequency
                   edit default literal to reduce indexing redundancy
                   update documentation

v1.1    27oct2021  expand WATCHR module to help monitor residuals
                   make index vector operations more transparent
                   update documentation

v1.1.1  07nov2021  expand 4- and 5-grams to reduce indexing noise
                   clean up file defining default literal n-grams
                   fix problem with build file for WATCHR
                   update documentation

v1.1.2  11nov2021  expand 4-grams to reduce indexing noise
                   extend and clean up default literal n-grams
                   clean up various source files for readability
                   clean up clustering source code, add comments
                   update documentation

v1.1.3  16nov2021  expand 4- and 5-grams to reduce indexing noise
                   extend and clean up default literal n-grams
                   add tools to identify where frequent n-grams come from
                   comment out diagnostic print statements in SQUEZR
                   update documentation

v1.1.4  19nov2021  expand 4- and 5-grams to reduce indexing noise
                   clean up poorly formatted source files
                   update documentation

v1.1.5  21nov2021  fix bug in arguments for WATCHR
                   add diagnostic tools
                   clean up source files
                   update documentation

v1.1.6  24nov2021  expand 4- and 5-grams to reduce indexing noise
                   clean up literals
                   add diagnostic tools
                   add scripting to build AW tools
                   update documentation

v1.1.7  26nov2021  expand 4- and 5-grams to reduce indexing noise
                   clean up literals
                   add diagnostic tools
                   add scripting to build AW tools
                   update documentation

v1.1.8  01dec2021  expand 4- and 5-grams to reduce indexing noise
                   clean up literals
                   update documentation

v1.1.9  10dec2021  fix bug on prioriority of leading, trailing literals
                   expand 4- and 5-grams to reduce indexing noise
                   clean up literals
                   update documentation
                   add AW User Guide

v1.1.10 20dec2021  expand 4- and 5-grams to reduce indexing noise
                   update documentation 

v1.1.11 23dec2021  expand 4- and 5-grams to reduce indexing noise
                   fix typo in AW banner
                   update documentation

v1.1.12 30dec2021  expand 4-grams to reduce indexing noise
                   update documentation

v1.2    01jan2022  expand 4- and 5-grams to reduce indexing noise
                   add diagnostic tools for profiles and match lists
                   update documentation

v1.2.1  06jan2022  expand 4- and 5-grams to reduce indexing noise
                   allow indexing to stop at 3- or 4-grams
                   update documentation

v1.2.2  07jan2022  expand 4- and 5-grams to reduce indexing noise
                   update documentation

v1.3    11jan2022  expand 4- and 5-grams to reduce indexing noise
                   add new AW modules PROFLR and EXMPLR
                   add missing source files
                   update documentation

v1.3.1  15jan2022  expand 4- and 5-grams to reduce indexing noise
                   clean up literals
                   clean up and extend stemming
                   update documentation, fix this file for MD formatting

v1.3.2  17jan2022  expand 4- and 5-grams to reduce indexing noise
                   update documentation

v1.3.3  19jan2022  expand 4-grams to reduce indexing noise
                   add literals to reduce noise
                   update documentation

v1.3.4  20jan2022  expand 4-grams to reduce indexing noise
                   add literals to reduce noise
                   update documentation

v1.4    25jan2022  add PHRASR module to AW complement
                   update documentation

v1.4.1  26jan2022  move Lines.java and Inputs.java to aw package

v1.4.2  02feb2022  add ANALZR module to AW complement
                   clean up CharArray source code
                   clean up phrase exrracrion source code
                   update documentation

v1.4.3  10feb2022  add PATBLD and ENDBLD support modules for phrase analysis
                   add language processing tables for phrase analysis
                   add reporting on failure to load language processing tables
                   add bannder to ANALZR and PHRASR modules
                   clean up source files
                   update documentation

v1.4.4  20feb2022  extensive reworking of ANALZR module
                   fix incomplete LexicalAtomStream
                   clean up CharArray class
                   update documentation

v1.4.5  04mar2022  extensive reworking of Start and Parse classes
                   reorganize AW special hash tables
                   clean up source files for entity type classes
                   rework ANALZR and PHRASR modules
                   add DXPH and DSPH tools
                   update documentation

v1.5    15jul2022  clean up AW CharArray classes
                   clean up text char normalization
                   clean up and simplify AW hash table code
                   debug text lining methods
                   improve source code commentary
                   replace typo in 4-gram index list
                   update documentation

v2.0    26sep2022  change parsing data structures for longer text
                   clean up reparsing for phrase extraction
                   simplify signatures for phrase selection
                   rework code and comments in ANALZR and PHRASR
                   add diagnostic tools to check AW phrase analysis
                   fix bug in token scoring for KEYWDR and PHRASR
                   fix bug in word hash table for KEYWDR and PHRASR
                   fix bugs for phrase scoring
                   update documentation

v2.1	28oct2022  expand builtin 4-grams to 2,500
                   remove POLY- and MONO- from default literals
                   fix bug in building stopword table
                   fix bug in reading in syntactic type definitions
                   fix bugs in loading rewriting rules
                   fix ByteTool bug not keeping upper and lower case
                   fix bug in syntax symbol lookup
                   debug, clean up, and simplify syntax symbol table
                   clean up DPRO output for content profiles
                   fix problems in feature coding for phrase analysis
                   clean up and test joining and splitting in Reparser
                   make rules file for Reparser self-documenting
                   update documentation

v2.1.1  01nov2022  let users try n-gram index sets with different n
                   update documentation

v2.2    08nov2022  fix bug with stopwords having periods and apostrophes 
                   update documentation

v2.3    20nov2022  expand builtin 4-grams past previous upper limit
                   change numbering of n-gram indices to be more logical
                   fix bug in squeezing of index vectors for clustering
                   update documentation

v2.4    06dec2022  fix bugs in buffering UTF-8 text input
                   clean up and simplify Unicode conversion of UTF-8
                   clean up AW segmentation code
                   update documentation

v2.5    17dec2022  fix display of match lists to indicate text subsegments
                   improve reporting on cluster analysis
                   add command line control of subsegmentation
                   document command line control of cluster profile generation
                   clean up and extend default literal n-grams for English
                   add 4-grams to round up total to 2,600
                   clean up indentation in source code
                   update documentation

v2.6	28dec2022  allow up to 16,384 vectors to be clustered in a batch
                   allow for more clusters
                   clean up indentation in source code
                   fix subsegment designations in tools output
                   update documentation 

v2.6.1  10jan2023  fix bugs in inflectional stemming and test
                   fix bug  in token substitution rule file name
                   add TKNZR and DINF tools to AW repository
                   clean build file for tools
                   update documentation

v2.7    18jan2023  add PLOTTR, RANKER, HUBBER analysis modules
                   add general class to select items by degree of linkage
                   move LinkMatrix to object package
                   clean up clustering source file formatting
                   update documentation

v2.7.1  22jan2023  add DSRV, DQBK, DQBE tools for command line search
                   clean up source code
	           update documentation

v2.7.2  31jan2023  improve DSRV output
                   clean up source code
                   add DSMX and DSIM tools for testing
                   update documentation

v2.7.3  07mar2023  add DKYW tool to aid profile building
                   add command line check for DQBE
                   add command line option for DPRO
                   simplify profile output
                   clean up indentation of DQBK and DLST source
                   update documentation

v2.8    21mar2023  add 100 alphabetic 4-grams to built-in index featkures
                   update documentation

v2.8.1  28jun2023  add 60 alphabetic 4-grams
                   add DCMS tool to get cluster assignments
                   change DNGM tool for user-defined n-grams
                   update documentation

v2.8.2  19aug2023  add 40 alphabetic 4-grams
                   fix bug with n-gram ranges for types
                   upgrade DNGM, DLST, DLSS, DLNK tools
                   add and update documentation	 

v2.8.3  15sep2023  add 20 4-grams, 50 r-grams
                   enlarge LinkMatrix link buffer
                   upgrade DSSG for multiple batches
                   upgrade DSRV output, clean up code
                   add DKYS, DKTG, DTOP tools
                   upgrade WATCHR module
                   update documentation

v2.8.4  20sep2003  add 40 4-grqmw, 10 5-grqmw
                   add DPLT tool
                   upgrqde DSRV tool
                   clean up test source code
                   update documentation

v2.9    09oct2023  add 40 4-grams, 10 5-grams
                   simplify and clean up profile generation
                   make DCMS work with sorted item lists
                   improve output or TGx
                   edit and expand suffix table
                   update documentation

v2.9.1  31dec2023  add 20 4-grams, 10 5-grams
                   update documentation

v2.9.2  30jan2024  add 40 4-grams, 10 5-gramks
                   update documentation

v2.9.2.1 6feb2024  make IOUS 4-gram, take out OXYL
                   update documentaion

v2.9.3  20feb2024  add 20 4-grams, mainly chemistry
                   adjust output of DPRB
                   update documentatyon

prohippo/ActiveWatch