NAME
Table::BoxFormat - Parsing the tabular data format generated by database SELECTs
VERSION
Version 0.02
SYNOPSIS
use Table::BoxFormat;
# Reading input from a "dbox" temp file
my $dbx = Table::BoxFormat->new( input_file => '/tmp/select_result.dbox' );
my $data = $self->data; # array of arrays, header in first row
# Input dbox from a string
my $dbx = Table::BoxFormat->new( input_data => $dboxes_string );
my $data = $self->data; # array of arrays, header in first row
# input from dbox file, output directly to a tsv file
my $dbx = Table::BoxFormat->new();
$dbx->output_to_tsv( '/tmp/select_result.dbox', '/tmp/select_result.tsv' );
# input dbox from a string, output directly to a tsv file
$dbx = Table::BoxFormat->new( input_data => $dbox_string );
$dbx->output_to_tsv( $output_tsv_file );
DESCRIPTION
Table::BoxFormat is a module to work with data in the tabular text format(s) commonly used in database client shells (postgresql's "psql", mysql's "mysql", or sqlite's "sqlite3"), where a SELECT will typical display data in a form such as this (mysql):
+-----+------------+---------------+-------------+
| id | date | type | amount |
+-----+------------+---------------+-------------+
| 11 | 2010-09-01 | factory | 146035.00 |
| 15 | 2011-01-01 | factory | 191239.00 |
| 16 | 2010-09-01 | marketing | 467087.00 |
| 17 | 2010-10-01 | marketing | 409430.00 |
+-----+------------+---------------+-------------+
Or this (postgresql's "ascii" form):
id | date | type | amount
----+------------+-----------+--------
1 | 2010-09-01 | factory | 146035
4 | 2011-01-01 | factory | 191239
6 | 2010-09-01 | marketing | 467087
7 | 2010-10-01 | marketing | 409430
These formats are human-readable, but not suitable for other purposes such as feeding to a graphics program, or inserting into another database table.
This code presumes these text tables of "data boxes" are either stored in a string or saved to a file. When stored in a file, I suggest using the extension ".dbox".
This code works with at least three different formats: mysql, psql and unicode psql.
implementation notes
The main method here is read_dbox, which works by first looking for a horizontal ruler line near the top of the data, for example:
+-----+------------+---------------+-------------+
----+------------+-----------+--------
────┼────────────┼───────────┼────────
These ruler lines are used to identify the boundary columns, after which the header and data lines are treated as fixed-width fields. Leading and trailing whitespace are stripped from each value.
An earlier (now deprecated) method named read_simple takes an opposite approach, ignoring the horizontal rules entirely and doing regular expression matches looking for data delimiters on each line. In comparison, the read_dbox should run faster and be able to handle strings with delimiter characters embedded in them.
METHODS
-
new
Creates a new Table::BoxFormat object.
Takes a list of attribute/setting pairs as an argument.
-
input_encoding
Default's to "UTF-8". Change to suit text encoding (e.g. "ISO-8859-1"). Must work as a perl ":encoding(...)" layer.
-
output_encoding
Like input_encoding. Default: "UTF-8".
-
input_file
File to input data from. Can be supplied later, e.g. when read_dbox is called. Only required if input_data was not defined directly. (( TODO change this: make it required ? ))
-
input_data
SQL SELECT output in the fixed-width-plus-delimiter form discussed above.
-
the parsing regular expressions (type: RegexpRef)
-
separator_rule
The column separators (vertical bar)
-
ruler_line_rule
Matches the Horizontal ruler lines (typically just under the header line)
-
cross_rule
Match cross marks the horizontal bars typically use to mark column boundaries.
-
left_edge_rule
Left border delimiters (we strip these before processing).
-
right_edge_rule
Right border delimiters (we strip these before processing).
-
-
-
slurp_input_data
Example usage:
$self->slurp_input_data( $input_file_name );
-
read_dbox
Given data in tabular boxes from a multiline string, convert it into an array of arrays.
my $data = $bxs->read_dbox();
Converts the boxdata from the object's input_data into an array of arrays, with the field names included in the first row.
As a side-effect, copies the header (first row of returned data) in the object's header, and puts some format metadata in the object's meta.
-
analyze_ruler
Internal method that analyzes the given ruler line and location to determine column widths and the dbox format.
Returns an ordered list like so:
format: 'mysql', 'postgres', 'postgres_unicode', 'sqlite' header location: a row number: 0 or 1 first_data: the row number where data begins: 2 or 3 positions: a list of column boundary positions
Example usage:
( $format, $header_loc, $first_data, @pos ) = $self->analyze_ruler( $line, $i );
-
read_simple
This is DEPRECATED. See read_dbox.
Given data in tabular boxes from a multiline string, convert it into an array of arrays.
my $data = $bxs->read_simple();
Goes through the boxdata slurped into the object field input_data, returns it as an array of arrays, including the field names in the first row.
As a side-effect, stores the header (first row of boxdata) in the object's header.
-
output_to_tsv
A convenience method that runs read_dbox and writes the data to a tsv file specified by the given argument.
Returns a reference to the data (array of arrays).
Example usage:
$dbx->output_to_tsv( $input_dbox_file, $output_tsv_file );
Or:
$dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_tsv( $output_tsv_file );
Or:
$dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_tsv( $output_tsv_file );
-
output_to_csv
A convenience method that runs read_dbox and writes the data to a csv file specified by the given argument.
Example usage:
$dbx->output_to_csv( $input_dbox_file, $output_csv_file );
Or:
$dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_csv( $output_csv_file );
Or:
$dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_csv( $output_csv_file );
SEE ALSO
See Graphics::Skullplot, which uses emacs lisp code to pick out a box format table from a field of text, e.g. a database shell window.
AUTHOR
Joseph Brenner, doom@kzsu.stanford.edu, 05 Jun 2016
LIMITATIONS
memory limited
As implemented, this presumes the entire data set can be held in memory. Future versions may be more stream-oriented: there's no technical reason this couldn't be done.
what you get is what you get
This code is only guaranteed to cover input formats from mysql, psql and some from sqlite3. It may work with other databases, but hasn't been tested.
At present it is not easily extensible (implementing a plugin system ala DBI/DBD seemed like overkill).
sqlite3
This code does not support the default output from sqlite3, it requires the use of these settings:
.header on
.mode column
While sqlite3 is very flexible, unfortunately the default output does not seem very useable:
SELECT * from expensoids;
|2010-09-01|factory|146035.0
|2010-11-01|factory|218866.0
|2011-01-01|factory|191239.0
|2010-10-01|marketing|409430.0
This is separated by the traditional ascii vertical bar, but without the usual bracketing spaces, and without any attempt at using fixed width columns. Somewhat oddly, the left edge has a vertical bar, but the right edge does not, but even worse there's no header with column labels.
If you use the sqlite settings indicated above, you get a more conventional tabular text format:
id date type amount
---------- ---------- ---------- ----------
1 2010-09-01 factory 146035.0
2 2010-10-01 factory 208816.0
3 2010-11-01 factory 218866.0
That's very similar to the psql format using "\pset border 0" (though that has one space column breaks instead of two): both are supported by read_dbox using the analyze_ruler routine.
COPYRIGHT AND LICENSE
Copyright (C) 2016 by Joseph Brenner
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.