A machine learning tool for the analysis of large document collections

MIT License


A project funded by: FEDER/Ministerio de Ciencia, Innovación y Universidades – Agencia Estatal de Investigación/Proyecto TEC2017-83838-R.

This project collects machine learning tools for the analysis of large document collections. A brief description of each tool is given below; each one also has its own documentation.

The available repositories, grouped by task, are:

Spark implementation of parallel algorithms:

  • DistributedCWLM: distributed, interpretable regression algorithm in Spark.
  • SVM_spark: distributed, non-linear, semiparametric SVM implementation in Spark.
  • pysparkMVA: distributed implementation of Deep Learning for Margin Valuation Adjustment with regularization.

Topic Models tools:

  • TMtweets: topic modeling pipeline. Designed to work with tweets, but can be extended to other documents.
  • labelFactory: a generic application for labelling a subset of sites in a web site collection, or a subset of docs in a text collection.
  • one_def_classification: dataless text classification using definitions as labeled data.
  • WeakLabelModel: a library for training multiclass classifiers with weak labels.

Integration of modules:

  • PTL_data: ETL & lemmatization tool for PTL.
  • PdbManager: manager for MySQL databases (replaces the older DB_expl).
  • dbManager: Python class for managing MySQL or SQLite databases.
  • menuNavigator: a generic template application to generate command-line menus.

Feature Selection and Feature Extraction:

  • regMVA: Deep Learning for Margin Valuation Adjustment with regularization.
  • SSHIBA: generalized Bayesian approach to feature extraction for heterogeneous data.
  • KSSHIBA: kernelized observation SSHIBA.

Graph synthesis, processing and analysis:

  • supergraph: a software package for the synthesis, analysis and processing of large graphs.


To clone the project with all its submodules:

git clone --recurse-submodules https://github.com/ML4DS/ML4BIHECOL

If you have already cloned the project without its submodules, using

git clone https://github.com/ML4DS/ML4BIHECOL

you can fetch the submodules afterwards with

git submodule update --init --recursive

Alternatively, you can use specific submodules only. The best way to work with a specific submodule is to clone the GitHub project associated with it:

git clone https://github.com/URL/of/the/project
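
If you would rather stay inside the umbrella clone, the sketch below shows one possible way to list the registered submodules and initialise only the ones you need. The helper names `list_submodules` and `init_submodules` are hypothetical and not part of the project; they only wrap standard git commands, and the actual submodule paths must be read from the repository's `.gitmodules` file.

```shell
# Hypothetical helper: print the path of every submodule registered
# in the .gitmodules file of the clone located at "$1".
list_submodules() {
  git config --file "$1/.gitmodules" --get-regexp '^submodule\..*\.path$' \
    | cut -d' ' -f2-
}

# Hypothetical helper: initialise only the named submodule paths
# inside the existing clone located at "$1".
init_submodules() {
  repo_dir="$1"
  shift
  for sub in "$@"; do
    git -C "$repo_dir" submodule update --init --recursive -- "$sub"
  done
}
```

For example, after a plain `git clone` of the project, `list_submodules ML4BIHECOL` prints the available paths, and `init_submodules ML4BIHECOL <path>` fetches only the chosen one, where `<path>` is one of the printed values.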