/table-boxformat

Parsing the tabular data format generated by database SELECTs

Primary LanguagePerlOtherNOASSERTION

NAME

Table::BoxFormat - Parsing the tabular data format generated by database SELECTs

VERSION

Version 0.02

SYNOPSIS

use Table::BoxFormat;
# Reading input from a "dbox" temp file
my $dbx = Table::BoxFormat->new( input_file => '/tmp/select_result.dbox' );
my $data = $self->data;  # array of arrays, header in first row

# Input dbox from a string
my $dbx = Table::BoxFormat->new( input_data => $dboxes_string );
my $data = $self->data;  # array of arrays, header in first row

# input from dbox file, output directly to a tsv file
my $dbx = Table::BoxFormat->new();
$dbx->output_to_tsv( '/tmp/select_result.dbox', '/tmp/select_result.tsv' );

# input dbox from a string, output directly to a tsv file
$dbx = Table::BoxFormat->new( input_data => $dbox_string );
$dbx->output_to_tsv( $output_tsv_file );

DESCRIPTION

Table::BoxFormat is a module to work with data in the tabular text format(s) commonly used in database client shells (postgresql's "psql", mysql's "mysql", or sqlite's "sqlite3"), where a SELECT will typical display data in a form such as this (mysql):

+-----+------------+---------------+-------------+
| id  | date       | type          | amount      |
+-----+------------+---------------+-------------+
|  11 | 2010-09-01 | factory       |   146035.00 |
|  15 | 2011-01-01 | factory       |   191239.00 |
|  16 | 2010-09-01 | marketing     |   467087.00 |
|  17 | 2010-10-01 | marketing     |   409430.00 |
+-----+------------+---------------+-------------+

Or this (postgresql's "ascii" form):

 id |    date    |   type    | amount
----+------------+-----------+--------
  1 | 2010-09-01 | factory   | 146035
  4 | 2011-01-01 | factory   | 191239
  6 | 2010-09-01 | marketing | 467087
  7 | 2010-10-01 | marketing | 409430

These formats are human-readable, but not suitable for other purposes such as feeding to a graphics program, or inserting into another database table.

This code presumes these text tables of "data boxes" are either stored in a string or saved to a file. When stored in a file, I suggest using the extension ".dbox".

This code works with at least three different formats: mysql, psql and unicode psql.

implementation notes

The main method here is read_dbox, which works by first looking for a horizontal ruler line near the top of the data, for example:

+-----+------------+---------------+-------------+
----+------------+-----------+--------
────┼────────────┼───────────┼────────

These ruler lines are used to identify the boundary columns, after which the header and data lines are treated as fixed-width fields. Leading and trailing whitespace are stripped from each value.

An earlier (now deprecated) method named read_simple takes an opposite approach, ignoring the horizontal rules entirely and doing regular expression matches looking for data delimiters on each line. In comparison, the read_dbox should run faster and be able to handle strings with delimiter characters embedded in them.

METHODS

  • new

    Creates a new Table::BoxFormat object.

    Takes a list of attribute/setting pairs as an argument.

    • input_encoding

      Default's to "UTF-8". Change to suit text encoding (e.g. "ISO-8859-1"). Must work as a perl ":encoding(...)" layer.

    • output_encoding

      Like input_encoding. Default: "UTF-8".

    • input_file

      File to input data from. Can be supplied later, e.g. when read_dbox is called. Only required if input_data was not defined directly. (( TODO change this: make it required ? ))

    • input_data

      SQL SELECT output in the fixed-width-plus-delimiter form discussed above.

    • the parsing regular expressions (type: RegexpRef)

      • separator_rule

        The column separators (vertical bar)

      • ruler_line_rule

        Matches the Horizontal ruler lines (typically just under the header line)

      • cross_rule

        Match cross marks the horizontal bars typically use to mark column boundaries.

      • left_edge_rule

        Left border delimiters (we strip these before processing).

      • right_edge_rule

        Right border delimiters (we strip these before processing).

  • slurp_input_data

    Example usage:

      $self->slurp_input_data( $input_file_name );
    
  • read_dbox

    Given data in tabular boxes from a multiline string, convert it into an array of arrays.

      my $data =
            $bxs->read_dbox();
    

    Converts the boxdata from the object's input_data into an array of arrays, with the field names included in the first row.

    As a side-effect, copies the header (first row of returned data) in the object's header, and puts some format metadata in the object's meta.

  • analyze_ruler

    Internal method that analyzes the given ruler line and location to determine column widths and the dbox format.

    Returns an ordered list like so:

      format:
        'mysql', 'postgres', 'postgres_unicode', 'sqlite'
    
      header location:
        a row number: 0 or 1
    
      first_data:
        the row number where data begins:  2 or 3
    
      positions:
        a list of column boundary positions
    

    Example usage:

      ( $format, $header_loc, $first_data, @pos ) = $self->analyze_ruler( $line, $i );
    
  • read_simple

    This is DEPRECATED. See read_dbox.

    Given data in tabular boxes from a multiline string, convert it into an array of arrays.

      my $data =
            $bxs->read_simple();
    

    Goes through the boxdata slurped into the object field input_data, returns it as an array of arrays, including the field names in the first row.

    As a side-effect, stores the header (first row of boxdata) in the object's header.

  • output_to_tsv

    A convenience method that runs read_dbox and writes the data to a tsv file specified by the given argument.

    Returns a reference to the data (array of arrays).

    Example usage:

      $dbx->output_to_tsv( $input_dbox_file, $output_tsv_file );
    

    Or:

      $dbx = Table::BoxFormat->new( input_file => $input_dbox_file );
      $dbx->output_to_tsv( $output_tsv_file );
    

    Or:

      $dbx = Table::BoxFormat->new( input_data => $dbox_string );
      $dbx->output_to_tsv( $output_tsv_file );
    
  • output_to_csv

    A convenience method that runs read_dbox and writes the data to a csv file specified by the given argument.

    Example usage:

      $dbx->output_to_csv( $input_dbox_file, $output_csv_file );
    

    Or:

      $dbx = Table::BoxFormat->new( input_file => $input_dbox_file );
      $dbx->output_to_csv( $output_csv_file );
    

    Or:

      $dbx = Table::BoxFormat->new( input_data => $dbox_string );
      $dbx->output_to_csv( $output_csv_file );
    

SEE ALSO

See Graphics::Skullplot, which uses emacs lisp code to pick out a box format table from a field of text, e.g. a database shell window.

AUTHOR

Joseph Brenner, doom@kzsu.stanford.edu, 05 Jun 2016

LIMITATIONS

memory limited

As implemented, this presumes the entire data set can be held in memory. Future versions may be more stream-oriented: there's no technical reason this couldn't be done.

what you get is what you get

This code is only guaranteed to cover input formats from mysql, psql and some from sqlite3. It may work with other databases, but hasn't been tested.

At present it is not easily extensible (implementing a plugin system ala DBI/DBD seemed like overkill).

sqlite3

This code does not support the default output from sqlite3, it requires the use of these settings:

.header on
.mode column

While sqlite3 is very flexible, unfortunately the default output does not seem very useable:

SELECT * from expensoids;
|2010-09-01|factory|146035.0
|2010-11-01|factory|218866.0
|2011-01-01|factory|191239.0
|2010-10-01|marketing|409430.0

This is separated by the traditional ascii vertical bar, but without the usual bracketing spaces, and without any attempt at using fixed width columns. Somewhat oddly, the left edge has a vertical bar, but the right edge does not, but even worse there's no header with column labels.

If you use the sqlite settings indicated above, you get a more conventional tabular text format:

id          date        type        amount
----------  ----------  ----------  ----------
1           2010-09-01  factory     146035.0
2           2010-10-01  factory     208816.0
3           2010-11-01  factory     218866.0

That's very similar to the psql format using "\pset border 0" (though that has one space column breaks instead of two): both are supported by read_dbox using the analyze_ruler routine.

COPYRIGHT AND LICENSE

Copyright (C) 2016 by Joseph Brenner

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.