csv parser, optimized for performance

CSV-file parser
***************

Copyright (C) 2012, Dmitry Kolesnikov

   This file is free documentation; unlimited permissions are given to copy,
distribute and modify the documentation.


   This library is free software; you can redistribute it and/or modify
it under the terms of the 3-clause BSD License (the "License"),
as published at http://www.opensource.org/licenses/BSD-3-Clause.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                                                   !!!
!!!                            WARNING                                !!!
!!!                 The library is not supported.                     !!!
!!!        Use CSV feature of https://github.com/fogfish/feta         !!!
!!!                                                                   !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      
      
Introduction
============

    A simple CSV-file parser based on an event model. The parser generates an
    event/callback when a CSV line is parsed. The parser supports both
    sequential and parallel parsing. The major goal is the performance of the
    intake procedure, with a parsing target of 3 - 4 microseconds per line on
    the reference hardware.

                           Acc
                       +--------+
                       |        |
                       V        |
                  +---------+   |
    ----Input---->| Parser  |--------> AccN
          +       +---------+
         Acc0          |
                       V
                   Event Line 

    The parser takes as input a binary stream, an event handler function and
    an initial state/accumulator. The event function is evaluated against the
    current accumulator and the parsed line of the csv-file. Note: the
    accumulator allows application-specific state to be carried through
    successive calls of the event function.
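
    The accumulator-threading idea can be sketched as a plain fold over
    lines. This is an illustration only and not the library's implementation:
    the module name csv_fold_sketch is hypothetical, quoting is not handled,
    and fields are passed in their original order (the interface example in
    this file reverses the list it receives, which suggests the real parser
    delivers fields reversed).

```erlang
-module(csv_fold_sketch).
-export([parse/3, count_lines/1]).

%% Illustration only: a naive sequential fold over the lines of a binary.
%% Each non-empty line is split on commas and handed to Fun as {line, Fields}
%% together with the current accumulator; Fun returns the next accumulator.
parse(Bin, Fun, Acc0) when is_binary(Bin), is_function(Fun, 2) ->
    Lines = binary:split(Bin, <<"\n">>, [global]),
    lists:foldl(
        fun(<<>>, Acc) ->
                Acc;                                  % skip empty lines
           (Line, Acc) ->
                Fields = binary:split(Line, <<",">>, [global]),
                Fun({line, Fields}, Acc)              % event + accumulator
        end,
        Acc0,
        Lines).

%% Example handler: the accumulator is a plain integer line counter.
count_lines(Bin) ->
    parse(Bin, fun({line, _Fields}, N) -> N + 1 end, 0).
```

    The value returned by the last call of the event function is the final
    accumulator (AccN in the diagram above).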

Compile and build
=================

   The library source code is available at the git repository

   git clone https://github.com/fogfish/pts.git
   
   Briefly, the shell command `./configure; make; make install' should
configure, build, and assemble the distribution package. The following
instructions are specific to this package; see the `INSTALL' file for
instructions specific to GNU build tools.

   The `configure' shell script attempts to guess the dependencies and system
configuration required to build the library. The following build-time option
exists:

   --with-erlang={prefix_to_otp} supplied to `./configure' binds the library
                                 to the chosen Erlang runtime; use it if you
                                 have multiple Erlang environments available
                                 on the build machine
                                 
   A high-performance version of the library should be built with native targets
   
   make BUILD=native
   
Interface
=========   
   Briefly, the sequence of operations for data parsing/intake is as follows;
   see the src/csv.erl file for the detailed interface specification and/or
   the example parser at priv/csv_example.erl
   
   %% Define an event function that takes two arguments: the line value and
   %% the accumulator. The function shall return a new accumulator state.
   %% The structure of the accumulator is application specific; it may
   %% vary from an integer to a complex record.
   %% Assumes a record such as -record(my_record, {count = 0}). is defined.
   Fun = fun({line, L}, #my_record{count = C} = Acc0) ->
      do_my_intake_to_somewhere(lists:reverse(L)),
      Acc0#my_record{count = C + 1}
   end.
   
   %%
   %% A sequential parse: parses the whole data stream in the client process.
   csv:parse(CSV, Fun, #my_record{}).
   
   %%
   %% A parallel parse: splits the CSV into multiple chunks and
   %% spawns multiple processes (one process per chunk);
   %% the results are aggregated in the client process.
   csv:parse(CSV, 20, Fun, #my_record{}).
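
   The chunk/spawn/aggregate pattern described for the parallel parse can be
   sketched as follows. This is a self-contained illustration and not the
   code in src/csv.erl: the module name csv_parallel_sketch and its helpers
   are hypothetical, chunks are cut at line breaks by a simple search, and
   the work done per chunk is just counting non-empty lines. That the second
   argument of csv:parse/4 (20 above) controls the number of chunks is an
   assumption.

```erlang
-module(csv_parallel_sketch).
-export([count_lines/2]).

%% Illustration only: count the non-empty lines of Bin with worker processes,
%% one process per chunk, and sum the partial counts in the calling process.
count_lines(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    Chunks = chunks(Bin, max(1, byte_size(Bin) div N)),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, count(Chunk)} end),
                Ref
            end || Chunk <- Chunks],
    %% Aggregate the partial results in the client process.
    lists:sum([receive {Ref, Count} -> Count end || Ref <- Refs]).

%% Cut Bin into pieces of roughly Size bytes, each ending at a line break.
%% The number of pieces is therefore only approximately N.
chunks(<<>>, _Size) ->
    [];
chunks(Bin, Size) when byte_size(Bin) =< Size ->
    [Bin];
chunks(Bin, Size) ->
    %% Search for the first newline at or after byte offset Size.
    Scope = {Size, byte_size(Bin) - Size},
    case binary:match(Bin, <<"\n">>, [{scope, Scope}]) of
        {Pos, 1} ->
            {Chunk, Rest} = split_binary(Bin, Pos + 1),
            [Chunk | chunks(Rest, Size)];
        nomatch ->
            [Bin]
    end.

%% Per-chunk work: the number of non-empty lines in the chunk.
count(Chunk) ->
    length([L || L <- binary:split(Chunk, <<"\n">>, [global]), L =/= <<>>]).
```

   Cutting at line breaks keeps every line whole inside a single chunk, so
   each worker can parse its chunk independently of the others.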
   
Performance
===========

   Reference platform: 
     * MacMini, Lion Server
     * 1x Intel Core i7 (2 GHz), 4x cores
     * L2 Cache 256KB per core
     * L3 Cache 6MB
     * Memory 4GB 1333 MHz DDR3
     * Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
     * Erlang R15B + native build of the library
   
   The data set has the following pattern: key, date, time, float numbers and
   a zz suffix
     * key{1..300 000},2012-03-25,23:26:15.543,166.280,...,zz
   
   The number of keys is 300,000, and the number of float fields is 8,
   24 or 40 in the reference data. The reference data set is generated by
   the command
   
   make example
   
   or
   
   perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt
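
   For readers who want a feel for the line layout without running the
   generator, one line in the pattern above can be built as sketched here.
   This is not a port of priv/gen_set.pl: the module name csv_line_sketch is
   hypothetical, the date and time are fixed to the sample values shown
   above, and the "key" ++ number form of the key is an assumption.

```erlang
-module(csv_line_sketch).
-export([line/2]).

%% Build one line shaped like the reference data:
%%   key<N>,<date>,<time>,<float>,...,zz
%% Floats must be a list of floats; each is printed with 3 decimals.
line(N, Floats) when is_integer(N), is_list(Floats) ->
    Fields = ["key" ++ integer_to_list(N), "2012-03-25", "23:26:15.543"]
             ++ [io_lib:format("~.3f", [F]) || F <- Floats]
             ++ ["zz"],
    iolist_to_binary([lists:join(",", Fields), "\n"]).
```

   Calling csv_line_sketch:line/2 with 8, 24 or 40 floats gives lines with
   the field counts used in the reference data.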
   
   
   version 0.0.1
   
   E/Parse         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     350.000         1.16
   300K, 24 flds     50.42      489.303     697.739         2.33
   300K, 40 flds     77.43      780.296     946.003         3.15
   
   
   ET/hash         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     384.598         1.28
   300K, 24 flds     50.42      489.303     761.414         2.54
   300K, 40 flds     77.43      780.296    1047.329         3.49
   
   
   ET/tuple        Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     228.306         0.76
   300K, 24 flds     50.42      489.303     601.025         2.00
   300K, 40 flds     77.43      780.296     984.676         3.28
   
   
   ETL/ets         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722    1489.543         4.50
   300K, 24 flds     50.42      489.303    2249.689         7.50
   300K, 40 flds     77.43      780.296    2519.401         8.39
   
   ETL/pts         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
   -------------------------------------------------------------------
   300K,  8 flds     23.41       91.722     592.886         1.98
   300K, 24 flds     50.42      489.303    1190.745         3.97
   300K, 40 flds     77.43      780.296    1734.898         5.78
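
   The Per Line column appears to be the Handle time divided by the 300,000
   lines of the data set; for example 697.739 ms / 300,000 is about 2.33 us.
   This reading is inferred from the numbers and is not stated by the
   author. A minimal sketch of the conversion, and of timing a run with
   timer:tc/1, is given below; the module name csv_timing_sketch is
   hypothetical.

```erlang
-module(csv_timing_sketch).
-export([per_line_us/2, measure/2]).

%% Convert a total handling time in milliseconds into microseconds per line.
per_line_us(HandleMs, Lines) when Lines > 0 ->
    HandleMs * 1000 / Lines.

%% Time a zero-arity fun with timer:tc/1 (which reports microseconds)
%% and return the average number of microseconds spent per line.
measure(Fun, Lines) when is_function(Fun, 0), Lines > 0 ->
    {Micros, _Result} = timer:tc(Fun),
    Micros / Lines.
```

   For instance, measure(fun() -> csv:parse(CSV, Fun, Acc0) end, 300000)
   would time one sequential parse, with CSV, Fun and Acc0 bound as in the
   Interface section.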