/DataSetTransformer

Converts any Data Format into any (other) Data Format

Primary LanguageJava

DataSetTransformer
------------------

DataSetTransformer is a tool that can read any data that can be organized in tables(rows and columns)
can apply transformations and safe the transformed data.


Contents
--------
1. Overview
2. License
3. Usage
4. Creating Job Files
5. Example Job File


1. Overview
   --------
   The Transformation splits up into 5 Steps.
    - Importing
    - Import Selection
    - Data Filtering
    - Export Style
    - Exporting

   The Importer reads the Data stream.
   The import Filter then extracts the Data from the Stream.
   This Data is then filtered.
   The filtered Data is then converted into the Export format.
   Finally the data in export format is written out.

   currently supported input formats:
    - CSV (character separated values)
    - HTML
    - SQL Database (MySql,...)

   currently supported data filters:
    - combine fields
    - apply regular expression
    - remove empty lines

   currently supported output formats:
    - CSV (character separated values)
    - HTML
    - SQL Database (MySql,...)


2. License
   -------
   Copyright (C) 2011  Lars Pötter <Lars_Poetter@gmx.de>
   All Rights Reserved.

   This program is free software; you can redistribute it and/or
   modify it under the terms of the GNU General Public License version 2
   as published by the Free Software Foundation.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License along
   with this program; if not, see <http://www.gnu.org/licenses/>


3. Usage
   -----
   Start the DataSetTarnsformer by doing "java -jar DataSetTransformer.jar jobfile.job".
   The Job File contains all the settings needed to do the transformation.


4. Creating Job Files
   ------------------
   Job Files are simple Text Files. Job Files contain Sections and Settings.
   A Section starts with a Line "[Setting Name]" and ends with the end of the File or the next Section.
   Settings are lines like "Setting Name = Value".

   A Job File must have the following sections:
    - "Importer"
    - "Import Selector"
    - "Exporter"
    - "Export Style"
   A Job File may also have the following sections:
    - "Data Filter", "Data Filter_0", "Data Filter_1", ... "Data Filter_99"
   These sections directly correspond to the 5 steps needed for the transformation. Each of these steps can be thought
   of as a separate module. Each section must have at least on setting. That setting is the "type"-setting. The type
   selects the kind of module used in this step. The different kind of modules may have different settings. Refer to
   the following sections to find which kinds of modules are available and which settings they need.

4.1. Importer
     --------

     Description
     -----------
     The Importer opens the data source and gives the Import Selector access to the raw data.
     There must be exactly one Importer in each Job.

     currently available Importers:

4.1.1. File Importer
       ------------
       type = FileImporter

       Description
       -----------
       Reads bytes from a file in the local file system.

       Parameters
       ----------
       FileName = path to file with input data.


4.1.2. URL Importer
       -----------
       type = UrlImporter

       Description
       -----------
       Reads bytes from a resource identified by a Unified Resource Locator (URL).
       This module creates a folder cache where copies of the imported resources are kept. The second import then uses
       the cached copy and will not request the resource again. If you want to request the resource again, then remove
       the cached copy from the cache folder.

       Parameters
       ----------
       SourceUrl = URL from that the bytes will be read.

4.1.4. URL List Importer
       -----------------
       type = UrlListImporter

       Description
       -----------
       Reads URL's from the specified file and import all of them using the URL Importer.

       Parameters
       ----------
       source = path to text file that contains one URL per line.

4.1.5. Paged URL Importer
       ------------------
       type = PagedUrlImporter

       Description
       -----------
       Uses URL Importer to import bytes from the specified URL. If the resulting Data that comes out of the import
       filter has found a data element named Next with a valid URL as value, then this URL is also imported. This
       repeats until no Next data element is found or the next element does not contain a valid URL.

       Parameters
       ----------
       SourceUrl = URL from that the bytes will be read.

4.1.6. JDBC Importer
       -------------
       type = JdbcImporter

       Description
       -----------
       Opens the Connection to a JDBC Database.

       Parameters
       ----------
       dbDriver = the class name of the database driver. (example: com.mysql.jdbc.Driver)
       dbUrl =  the URL needed to acces the database. (example: jdbc:mysql://localhost/javatestdb)
       dbUser = the username needed to access the database. May be empty or missing.
       dbPassword = the password needed to access the database. May be empty or missing.

4.2. Import Selector
     ---------------

     Description
     -----------
     The Import Selector reads the raw data and extracts the data sets.
     There must be exactly one Import Selector in each Job.

     currently available Import Selectors:

4.2.1. CSV Import Selector
       -------------------
       type = CsvImportSelector

       Description
       -----------
       Interprets the imported bytes as text containing data elements separated by the separator character.
       With one Data Set per line.

       Parameters
       ----------
       separator = character that splits the data elements.
       mapping_0 = name of the first data element in each row.
       mapping_1 = name of the second data element in each row.
       mapping_2 = name of the third data element in each row.
       mapping_. = name of the ... data element in each row.
       mapping_. = name of the ... data element in each row.
       mapping_. = name of the ... data element in each row.
       mapping_N = name of the (N+1)th data element in each row.


4.2.2. Tree Import Selector
       --------------------
       type = TreeImportSelector

       Description
       -----------
       Interprets the imported bytes as HTML File. A Location and a name need to be provided for each data element
       that shall be extracted. Locations are defined as follows:
       If a location starts with "@html@" then the HTML code contained in the defined tag will be the value of the
       data element. Without this the text defined by the tag will be the value of the data element.
       If only a value of a attribute in the tag is needed then this can be achieved with @attributeName. a Location
       ending with "a@href" will create a data element that contains the URL in the specified link.
       The Location presents the Path to the tag. In HTML this will always start with the tag "html". This is then
       followed by either "head" or "body". If a tag has more than one child tags of the same type an attribute that
       has to match can be supplied. "div(id:main)" matches only the div Tag that has the attribute "id" with the
       value "main".
       An alternative is the tag name "*". That matches all Tags also on several levels. The "*" can also be used with
       parts of the tag name, then is mean the tag name must contain the supplied part. This means "ca*" would match
       "cake" and "captain" but not "general".
       The tags in the location are separated by "/".
       Example: "html/head/link(type:text/css)@href" applied on a HTML data like this:

                <html>
                <head>
                <link href="styles/theme.css" rel="stylesheet" type="text/css">
                </head>
                <body>
                </body
                </html>

       will create a data element that contains "styles/theme.css".

       Parameters
       ----------
       Mapping_Name_0 = name of the first data element.
       Mapping_Name_1 = name of the second data element.
       Mapping_Name_N = name of the (N+1)th data element.
       Mapping_Position_0 = position of the first data element.
       Mapping_Position_1 = position of the second data element.
       Mapping_Position_N = position of the (N+1)th data element.

4.2.3. SQL Import Selector
       --------------------
       type = SqlImportSelector

       Description
       -----------
       Executes the SQL Query and creates data sets for each returned row.
       The name of the data element is the column name.

       Parameters
       ----------
       SqlQuery = the SQL Query that will be executed to get the Data. (example: SELECT * FROM testtable;)


4.3. Data Filter
     -----------

     Description
     -----------
     The data filter reads the data sets and applies the filter. The result are the changed/ filtered data sets.
     There can be zero to 100 data filters in each job. The data filters will be executed according their number
     starting with 0 and going up to 99. A data filter section without number will be executed before data filter 0.

     currently available Data Filters:

4.3.1. Concatenate Filter
       ------------------
       type = ConCatFilter

       Description
       -----------
       This filter will return only one Data Set. The filter checks that in all Data Sets that are provided to the
       filter the data elements specified in "FieldThatMustBeEqual" have always the same value. If that is the case
       then the values of all that Data elements specified in "FieldsThatWillBeConCatenated" will be concatenated with
       the defined separator.

       Parameters
       ----------
       ConCatSeparator = string that will be inserted between concatenated values.
       FieldThatMustBeEqual_0 = name of data element that shall be the same on all data sets.
       FieldThatMustBeEqual_1 = name of data element that shall be the same on all data sets.
       FieldThatMustBeEqual_N = name of data element that shall be the same on all data sets.
       FieldsThatWillBeConCatenated_0 = name of data element that shall be the same on all data sets.
       FieldsThatWillBeConCatenated_1 = name of data element that shall be the same on all data sets.
       FieldsThatWillBeConCatenated_N = name of data element that shall be the same on all data sets.

4.3.2. Regular Expression Filter
       -------------------------
       type = RegExpFilter

       Description
       -----------
       This filter only returns those data set from the provided data sets that contain the data element specified by
       "NameOfAtom" and where the value of that data element also matched the specified regular expression defined in
       "RegularExpression"

       Parameters
       ----------
       NameOfAtom = Name of the data element that will be processed.
       RegularExpression = the regular expression that must be matched.

4.3.3. Remove Empty Lines Filter
       -------------------------
       type = RemoveEmptyLinesFilter

       Description
       -----------
       Returns from the provided data set only those that have at least one data element.

       Parameters
       ----------
       no parameters available

4.3.4. Search and Replace Filter
       -------------------------
       type = ReplaceFilter

       Description
       -----------
       replaces all occurrences of the searched for string with the replacement in the specified data element off all
       data sets.

       Parameters
       ----------
       Element to change = Name of the data element to change
       Search for = string that will be removed if found
       Replace with = replacement for the found string

4.3.5. Add Index data element Filter
       -----------------------------
       type = AddIndexDataElementFilter

       Description
       -----------
       adds a data element to each data set that contains the index/ row number of the data set.
       Counting starts with 0.

       Parameters
       ----------
       Index element name = name of the added data element

4.4. Exporter
     --------
     currently available Exporters:

4.4.1. File Exporter
       -------------
       type = FileExporter

       Description
       -----------
       Writes the bytes produced by the Export Style module into that file specified in "target".

       Parameters
       ----------
       target = File Name of result file.

4.4.1. One File per Row Exporter
       -------------------------
       type = OneFilePerRowExporter

       Description
       -----------
       creates a file for each data set.
       The filenames for these files are created by prefix + <value of data element> + postfix.

       Parameters
       ----------
       filename prefix = the prefix
       filename postfix = the postfix
       filename data element = the name of the data element used to construct the filename

4.4.3. JDBC Exporter
       -------------
       type = JdbcExporter

       Description
       -----------
       Opens the Connection to a JDBC Database.

       Parameters
       ----------
       dbDriver = the class name of the database driver. (example: com.mysql.jdbc.Driver)
       dbUrl =  the URL needed to acces the database. (example: jdbc:mysql://localhost/javatestdb)
       dbUser = the username needed to access the database. May be empty or missing.
       dbPassword = the password needed to access the database. May be empty or missing.

4.5. Export Style
     ------------
     currently available Export Styles:

4.5.1. CSV Export Style
       ----------------
       type = CsvExportStyle

       Description
       -----------
       creates a CSV byte stream from the provided data sets. Each data set results in one row. The mapping defines
       the sequence of the date element of the data sets. The separator defines the used separator between the data
       elements.

       Parameters
       ----------
       separator = character used to separate the data atoms of the data set.
       mapping_0 = name of first data element.
       mapping_1 = name of second data element.
       mapping_N = name of (N+1)th data element.

4.5.2. HTML Export Style
       -----------------
       type = HtmlExportStyle

       Description
       -----------
       Creates a byte stream by interleaving the data from the data sets with the Global and Row Style definition.
       Example:
       GlobalStyleDefinition = <html><head></head><body>%ROW%</body></html>
       RowStyleDefinition = party: %n% had %c% cakes
       Data = [n = Birthday c = 5][n = Christmas c = 0]
       Resulting HTML = <html><head></head><body>party: Birthday had 5 cakes
                                                 party: Christmas had 0 cakes </body></html>

       Parameters
       ----------
       GlobalStyleDefinition = specifies the HTML tags before and after the data.
       RowStyleDefinition = specifies the HTML tags used for each data set row.

4.5.3. SQL Export Style
       ----------------
       type = SqlExportStyle

       Description
       -----------
       Inserts the Data into a SQL Database Table. It does a "" for each data set

       Parameters
       ----------
       tableName = Name of the SQL table that shall receive the data.
       fieldNames = Names of all fields/columns of the table that shall receive data elements.
       mapping = list of data elements that will be written into the table. "mapping_0" will be inserted into the
                 column specified by the first column name in the fieldNames parameter.


5. Example Job File
   ----------------

[Importer]
type = FileImporter
FileName = testData/small.csv

[Import Selector]
type = CsvImportSelector
separator = ,
mapping_0 = number
mapping_1 = german
mapping_2 = english

[Data Filter_0]
type = RemoveEmptyLinesFilter

[Data Filter_1]
type = ReplaceFilter
Element to change = english
Search for = pi
Replace with = pie

[Exporter]
type = FileExporter
target = exported.csv

[Export Style]
type = CsvExportStyle
separator = ,
mapping_0 = number
mapping_1 = english
mapping_2 = german