Overview

This project implements a number of Java utilities that can be useful in Machine Learning NLP tasks.

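Namely, it allows for NLP preprocessing of a text corpus, TfIdf export of a corpus to LibSVM format, and export of a corpus to a Google AutoML CSV file.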

WARNING: for the time being, this project must not be considered production ready due to the lack of adequate automated testing.

Pipeline example

Lay out the corpus on the file system
- have a look at this example from another related project
NLP preprocess
- class com.ml_text_utils.shell.PreProcessCorpusShell
- parameters --corpusFolderRoot <corpus_root> --preprocessedCorpusFolderRoot <corpus_root>_preprocessed --iso6391Language it
TfIdf export to LibSVM
1. Build Lucene Index
  - class com.ml_text_utils.shell.BuildCorpusWordsStatisticsShell
  - parameters --corpusFolderRoot <corpus_root>_preprocessed --luceneIndexFolder <corpus_root>_preprocessed_lucene
2. Export "Terms Dictionary" from Lucene Index
  - class com.ml_text_utils.shell.JSONExportTermsDictionaryFromLuceneShell
  - parameters --luceneIndexFolder <corpus_root>_preprocessed_lucene --termsDictionaryOutputJSONFile <corpus_root>_preprocessed_lucene_terms.json --maxTerms 10000
    - --maxTerms is optional
3. Compute TfIdf and export to LibSVM
  - class com.ml_text_utils.shell.FormatCorpusAsLibSVMShell
  - parameters --corpusFolderRoot <corpus_root>_preprocessed --libSVMExportFilePrefix <corpus_name> --libSVMExportFolder <libsvm_files_output_folder> --luceneIndexFolder <corpus_root>_preprocessed_lucene --termsDictionaryJSONFile <corpus_root>_preprocessed_lucene_terms.json
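
To make the last step more concrete, here is a minimal sketch of what "compute TfIdf and export to LibSVM" amounts to for a single document. It is not the code of FormatCorpusAsLibSVMShell: the class name TfIdfLibSvmSketch, the method names, and the weighting formula (raw term frequency times log(N / df), one common TfIdf variant) are assumptions made for illustration. Only the LibSVM line layout, a label followed by ascending index:value pairs with zero-valued features omitted, is taken from the LibSVM format itself.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfLibSvmSketch {

    // Assumed weighting: raw term frequency times log(N / df).
    // The project may use a different TfIdf variant.
    public static double tfIdf(int termFrequency, int documentFrequency, int totalDocuments) {
        if (termFrequency == 0 || documentFrequency == 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocuments / documentFrequency);
    }

    // Formats one document as a LibSVM line: "<label> <index>:<value> ...".
    // Feature indices are written in ascending order, hence the TreeMap copy.
    public static String toLibSvmLine(int label, Map<Integer, Double> weightsByTermIndex) {
        StringBuilder line = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Double> e : new TreeMap<>(weightsByTermIndex).entrySet()) {
            if (e.getValue() != 0.0) { // sparse format: zero-valued features are omitted
                line.append(' ').append(e.getKey()).append(':')
                    .append(String.format(Locale.ROOT, "%.4f", e.getValue()));
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical document of class 1 containing two dictionary terms.
        Map<Integer, Double> weights = new TreeMap<>();
        weights.put(7, tfIdf(1, 10, 100)); // term 7: once here, in 10 of 100 documents
        weights.put(3, tfIdf(2, 50, 100)); // term 3: twice here, in 50 of 100 documents
        System.out.println(toLibSvmLine(1, weights)); // prints "1 3:1.3863 7:2.3026"
    }
}
```

In this sketch the term index would come from the position of the term in the exported terms dictionary, which is why the dictionary JSON file is passed to the final step.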

To support a language other than Italian, you basically just have to implement, for your language, the classes located in the package com.ml_text_utils.nlp.impl.italian.

Google AutoML CSV export
- class com.ml_text_utils.shell.ExportCorpusToGoogleAutoMLCSVShell
- parameters --corpusFolderRoot <corpus_root> --googleAutoMlCsvFile <corpus>.csv --googleCloudStorageFolderUri gs://<your bucket>/<your folder path if any>
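
As a rough illustration of how the --googleCloudStorageFolderUri value might end up in the CSV, the sketch below builds one row per document as a Cloud Storage URI followed by a label. This is an assumption, not the documented behaviour of ExportCorpusToGoogleAutoMLCSVShell: the class AutoMlCsvRowSketch, its methods, the example bucket, and the idea that the label is the name of the document's folder are all hypothetical.

```java
public class AutoMlCsvRowSketch {

    // Builds one CSV row "<gcs_uri>,<label>" for a document.
    // Assumes the file keeps its corpus-relative path inside the bucket folder.
    public static String toCsvRow(String gcsFolderUri, String relativePath, String label) {
        String base = gcsFolderUri.endsWith("/") ? gcsFolderUri : gcsFolderUri + "/";
        return quote(base + relativePath) + "," + quote(label);
    }

    // Quotes a field only when it contains a comma, a double quote or a newline,
    // doubling any embedded double quotes.
    public static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        // Hypothetical bucket, path and label.
        System.out.println(toCsvRow("gs://my-bucket/corpus", "sports/doc1.txt", "sports"));
        // prints "gs://my-bucket/corpus/sports/doc1.txt,sports"
    }
}
```

Under this assumption, the corpus files themselves would still have to be uploaded to the same bucket folder so that the URIs in the CSV resolve.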