java-ml-text-utils

A collection of Java utilities useful in Machine Learning NLP tasks

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

Overview

This project implements a number of Java utilities which might be useful in Machine Learning NLP tasks.

Namely it allows for:

  • laying out a training/test corpus into the file system
  • NLP preprocessing the documents (tokenization, stemming, POS filtering...)
  • extracting features (e.g. TfIdf), and converting them to LibSVM file format
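To make the last step concrete, here is a minimal sketch of a TfIdf weight and of one line in LibSVM format (`<label> <index>:<value> ...`, with ascending 1-based feature indices). The class and method names are hypothetical and not part of this project. The weighting shown is the plain `tf * ln(N / df)` variant, which is an assumption; the project may use a different one.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not a class of this project.
public class TfIdfLibSvmSketch {

    // Plain tf * ln(N / df); assumed variant, other weightings exist.
    static double tfIdf(int termFrequency, int documentFrequency, int totalDocuments) {
        return termFrequency * Math.log((double) totalDocuments / documentFrequency);
    }

    // Formats one document as "<label> <index>:<value> ..." with ascending indices.
    static String toLibSvmLine(int label, Map<Integer, Double> features) {
        StringBuilder line = new StringBuilder(Integer.toString(label));
        // TreeMap sorts by feature index, as LibSVM expects.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            if (e.getValue() != 0.0) { // sparse format: zero-valued features are omitted
                line.append(' ').append(e.getKey()).append(':')
                    .append(String.format(Locale.ROOT, "%.4f", e.getValue()));
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(3, tfIdf(2, 1, 4)); // rare term: occurs twice, in 1 of 4 documents
        features.put(1, tfIdf(1, 4, 4)); // term present in every document: weight 0, omitted
        System.out.println(toLibSvmLine(1, features)); // prints "1 3:2.7726"
    }
}
```

In this sketch the feature index stands for the position of a term in the terms dictionary, and the label for the numeric id of the document's class.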

This project is featured in the essay "What is the best method for Automatic Text Classification?".

WARNING: for the time being, this project must not be considered production ready due to the lack of adequate automated testing.

Pipeline example

  1. lay out the corpus into the file system
    • have a look at this example from another related project
  2. NLP preprocess
    • class com.ml_text_utils.shell.PreProcessCorpusShell
    • parameters --corpusFolderRoot <corpus_root> --preprocessedCorpusFolderRoot <corpus_root>_preprocessed --iso6391Language it
  3. TfIdf export to LibSVM
    1. Build Lucene Index
      • class com.ml_text_utils.shell.BuildCorpusWordsStatisticsShell
      • parameters --corpusFolderRoot <corpus_root>_preprocessed --luceneIndexFolder <corpus_root>_preprocessed_lucene
    2. Export "Terms Dictionary" from Lucene Index
      • class com.ml_text_utils.shell.JSONExportTermsDictionaryFromLuceneShell
      • parameters --luceneIndexFolder <corpus_root>_preprocessed_lucene --termsDictionaryOutputJSONFile <corpus_root>_preprocessed_lucene_terms.json --maxTerms 10000
        • --maxTerms is optional
    3. Compute TfIdf and export to LibSVM
      • class com.ml_text_utils.shell.FormatCorpusAsLibSVMShell
      • parameters --corpusFolderRoot <corpus_root>_preprocessed --libSVMExportFilePrefix <corpus_name> --libSVMExportFolder <libsvm_files_output_folder> --luceneIndexFolder <corpus_root>_preprocessed_lucene --termsDictionaryJSONFile <corpus_root>_preprocessed_lucene_terms.json
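The way each step's output folder feeds the next step can be sketched as a small Java helper that derives all paths from a single corpus root and assembles the argument list for each shell class. The class names and flags are the ones listed above. The helper itself, the example paths, and the assumption that each shell class is launched through a standard `main` method are illustrative only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of this project.
public class PipelineArgsSketch {

    // Returns shell class name -> argument list, in pipeline order.
    static Map<String, List<String>> pipelineSteps(
            String corpusRoot, String corpusName, String libSvmFolder, String language) {
        // Derived paths follow the naming used in the steps above.
        String preprocessed = corpusRoot + "_preprocessed";
        String luceneIndex = preprocessed + "_lucene";
        String termsJson = luceneIndex + "_terms.json";

        Map<String, List<String>> steps = new LinkedHashMap<>();
        steps.put("com.ml_text_utils.shell.PreProcessCorpusShell", Arrays.asList(
                "--corpusFolderRoot", corpusRoot,
                "--preprocessedCorpusFolderRoot", preprocessed,
                "--iso6391Language", language));
        steps.put("com.ml_text_utils.shell.BuildCorpusWordsStatisticsShell", Arrays.asList(
                "--corpusFolderRoot", preprocessed,
                "--luceneIndexFolder", luceneIndex));
        steps.put("com.ml_text_utils.shell.JSONExportTermsDictionaryFromLuceneShell", Arrays.asList(
                "--luceneIndexFolder", luceneIndex,
                "--termsDictionaryOutputJSONFile", termsJson,
                "--maxTerms", "10000")); // --maxTerms is optional
        steps.put("com.ml_text_utils.shell.FormatCorpusAsLibSVMShell", Arrays.asList(
                "--corpusFolderRoot", preprocessed,
                "--libSVMExportFilePrefix", corpusName,
                "--libSVMExportFolder", libSvmFolder,
                "--luceneIndexFolder", luceneIndex,
                "--termsDictionaryJSONFile", termsJson));
        return steps;
    }

    public static void main(String[] args) {
        // Example paths are placeholders; <classpath> must point to the built project.
        pipelineSteps("/data/corpus", "corpus", "/data/libsvm", "it")
                .forEach((shellClass, shellArgs) -> System.out.println(
                        "java -cp <classpath> " + shellClass + " " + String.join(" ", shellArgs)));
    }
}
```

Running it only prints the four command lines in order; it does not execute the pipeline.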

Customize NLP for languages other than Italian

You basically just have to implement, for your language, the equivalents of the classes located in the package com.ml_text_utils.nlp.impl.italian.

Export to Google Cloud AutoML CSV

  • class com.ml_text_utils.shell.ExportCorpusToGoogleAutoMLCSVShell
  • parameters --corpusFolderRoot <corpus_root> --googleAutoMlCsvFile <corpus>.csv --googleCloudStorageFolderUri gs://<your bucket>/<your folder path if any>
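As a rough illustration of what such a CSV can look like, the sketch below builds one row that pairs the Cloud Storage URI of a document with its label. This is one of the layouts accepted by AutoML text classification imports. The exact rows written by ExportCorpusToGoogleAutoMLCSVShell are not documented here, so the helper, the `<label>/<file>.txt` layout, and the bucket name are all assumptions.

```java
// Hypothetical illustration only: not a class of this project.
public class AutoMlCsvRowSketch {

    // Builds "<gcs uri of the document>,<label>" for one document.
    // Assumes neither the path nor the label contains commas or quotes.
    static String csvRow(String gcsFolderUri, String relativeFilePath, String label) {
        String base = gcsFolderUri.endsWith("/") ? gcsFolderUri : gcsFolderUri + "/";
        return base + relativeFilePath + "," + label;
    }

    public static void main(String[] args) {
        // Assumed corpus layout: one sub-folder per label.
        System.out.println(csvRow("gs://my-bucket/corpus", "sports/doc1.txt", "sports"));
        // prints "gs://my-bucket/corpus/sports/doc1.txt,sports"
    }
}
```

With this layout the documents themselves have to be uploaded to the same Cloud Storage folder that the URIs point to.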