
Run Metanome algorithms from the command line

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

Metanome CLI

The main purpose of this project is to provide a command-line interface for Metanome to allow for easy automation of data profiling tasks, e.g., to conduct experiments or to profile datasets batchwise. Besides that, this project integrates Metanome with

  • Metacrate, a storage and analytics tool for data profiles, and
  • ProfileDB, a tiny tool to collect and store experimental data.

Furthermore, HDFS is supported as an input source.

Installation

Either obtain a Metanome CLI release or build it yourself. Concretely, the project can be built with Maven:

.../metanome-cli$ mvn package -Pdistro

This command creates a "fatjar" (target/metanome-cli-0.1-SNAPSHOT.jar or similar) that contains Metanome and the Metanome CLI along with all their dependencies. Metanome algorithms are not included in it.
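Because the exact jar name depends on the version being built, a script that wraps the build may want to look the file up instead of hard-coding it. The following is a minimal sketch under the assumption that the jar matches the pattern metanome-cli-*.jar inside the target directory; the function name find_fatjar is made up for this example. If several jars match, it prints the first one in glob order, which is not guaranteed to be the fatjar.

```shell
# Print the first file matching metanome-cli-*.jar in the given directory.
# Returns non-zero if no such file exists.
# Assumption: the fatjar follows this naming pattern (see the text above).
find_fatjar() {
  for jar in "$1"/metanome-cli-*.jar; do
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  return 1
}

# Example: look in ./target and report on stderr if nothing was built yet.
find_fatjar target || echo "no fatjar found; run 'mvn package -Pdistro' first" >&2
```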

Note that this project might depend on unstable snapshot versions of Metanome, Metacrate, and ProfileDB. In case of build errors related to these projects, you might need to clone, build, and install (i.e., mvn install) them yourself. Then, re-run the build with

.../metanome-cli$ mvn package -Pdistro --offline

Usage

Once you have obtained the fatjar described above, put it on the Java classpath along with your algorithm jar files and run it like any other Java application. As an example, assume you have an algorithm jar file called my-algorithm.jar with the main algorithm class com.example.MyAlgorithm. Then you can execute it via

$ java -cp metanome-cli.jar:my-algorithm.jar de.metanome.cli.App --algorithm com.example.MyAlgorithm <parameters...>
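To make the shape of such a call easier to reuse in scripts, here is a small sketch that assembles the command line from its parts and prints it instead of running it. The helper name metanome_cmd, the configuration key INPUT_FILES, and the file data.csv are placeholders invented for this example; the key that your algorithm actually expects has to be taken from the algorithm itself. The options --file-key and --files are the ones the CLI lists in its help output.

```shell
# Print a Metanome CLI invocation for a single input file.
# Arguments: CLI jar, algorithm jar, algorithm class, input key, input file.
# All concrete values used below are hypothetical placeholders.
metanome_cmd() {
  cli_jar=$1; algo_jar=$2; algo_class=$3; input_key=$4; input_file=$5
  echo java -cp "${cli_jar}:${algo_jar}" de.metanome.cli.App \
    --algorithm "$algo_class" \
    --file-key "$input_key" \
    --files "$input_file"
}

# Dry run: shows the command without starting Java.
metanome_cmd metanome-cli.jar my-algorithm.jar com.example.MyAlgorithm INPUT_FILES data.csv
```

Note that the ":" classpath separator applies to Unix-like systems; on Windows the JVM expects ";" instead.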

To learn about the various parameters of the Metanome CLI, you can also execute it without any parameters (including --algorithm) and get an output like the following:

Usage: <main class> [options]
  Options:
  * -a, --algorithm
       name of the Metanome algorithm class
    --algorithm-config
       algorithm configuration parameters (<name>:<value>)
       Default: []
    --db-connection
       a PGPASS file that specifies the database connection; if given, the
       inputs are treated as database tables
    --db-type
       the type of database as it would appear in a JDBC URL
    --escape
       escapes special characters
       Default:
    --header
       first row is a header
       Default: false
    --ignore-leading-spaces
       ignore leading white spaces in each field
       Default: false
  * --file-key, --input-key, --table-key
       configuration key for the input files/tables
  * --files, --inputs, --tables
       input file/tables to be analyzed and/or files list input files/tables
       (prefixed with 'load:')
       Default: []
    --null
       representation of NULLs
       Default: <empty string>
    -o, --output
       how to output results (none/print/file[:run-ID]/crate:file:scope)
       Default: file
    --profiledb
       location of a ProfileDB to store a ProfileDB experiment at
    --profiledb-conf
       additional configuration to store with a ProfileDB experiment
       Default: []
    --profiledb-key
       experiment key to store a ProfileDB experiment
    --profiledb-tags
       tags to store with a ProfileDB experiment
       Default: []
    --quote
       delimits fields in the input file
       Default: "
    --separator
       separates fields in the input file
       Default: ;
    --skip
       numbers of lines to skip
       Default: 0
    --skip-differing-lines
       skip lines with incorrect number of fields
       Default: false
    --strict-quotes
       enforce strict quotes
       Default: false
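As an illustration of how these options might be combined for batchwise profiling, the sketch below prints one invocation per CSV file in a directory. It is not taken from the Metanome CLI documentation: the jar names, the algorithm class, and the key INPUT_FILES are placeholders, and it assumes that --header is a plain switch that takes no value. The commands are only printed, so they can be reviewed before anything is run.

```shell
# Print one Metanome CLI command per *.csv file in the given directory.
# Assumptions: placeholder jar names, algorithm class, and input key;
# --header is used as a switch without a value; input files are
# comma-separated, and results are printed rather than written to a file.
batch_cmds() {
  dir=$1
  for csv in "$dir"/*.csv; do
    # Skip the unexpanded pattern when the directory has no CSV files.
    [ -f "$csv" ] || continue
    echo java -cp metanome-cli.jar:my-algorithm.jar de.metanome.cli.App \
      --algorithm com.example.MyAlgorithm \
      --file-key INPUT_FILES --files "$csv" \
      --separator "," --header -o print
  done
}

# Dry run over the current directory.
batch_cmds .
```

Printing the commands keeps the sketch free of side effects. The printed lines are not safe to paste back into a shell if file paths contain spaces or other special characters; in that case, replace echo with a direct call to java.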

In case of problems, feel free to file an issue.