CSV-file parser
***************

Copyright (C) 2012, Dmitry Kolesnikov

This file is free documentation; unlimited permission is given to copy,
distribute and modify the documentation.

This library is free software; you can redistribute it and/or modify it
under the terms of the 3-clause BSD License (the "License"), as published
by http://www.opensource.org/licenses/BSD-3-Clause.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                                                   !!!
!!!                              WARNING                              !!!
!!!                   The library is not supported.                   !!!
!!!        Use CSV feature of https://github.com/fogfish/feta         !!!
!!!                                                                   !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduction
============

A simple CSV-file parser based on an event model. The parser generates an
event/callback for each parsed CSV line. It supports both sequential and
parallel parsing. The major goal is the performance of the intake
procedure, with a parsing target of 3 - 4 microseconds per line on the
reference hardware.

                         Acc
                     +--------+
                     |        |
                     V        |
                +---------+   |
  ----Input---->| Parser  |--------> AccN
        +       +---------+
       Acc0          |
                     V
                 Event Line

The parser takes as input a binary stream, an event handler function and
an initial state/accumulator. The event function is evaluated against the
current accumulator and the parsed line of the CSV file.

Note: the accumulator allows application-specific state to be carried
through the event functions.

Compile and build
=================

The library source code is available at the git repository

   git clone https://github.com/fogfish/pts.git

Briefly, the shell command `./configure; make; make install' should
configure, build, and assemble the distribution package. The following
instructions are specific to this package; see the `INSTALL' file for
instructions specific to GNU build tools.

The `configure' shell script attempts to guess the dependencies and system
configuration required to build the library. The following build-time
dependencies exist:

   --with-erlang={prefix_to_otp} supplied to `./configure' binds the
   library to the chosen Erlang runtime, if multiple Erlang environments
   are available on the build machine

The high-performance version of the library shall be built with native
targets

   make BUILD=native

Interface
=========

Briefly, the sequence of operations for data parse/intake is the
following; see the src/csv.erl file for the detailed interface
specification and/or the example parser at priv/csv_example.erl

   %% Define an event function that takes two arguments: the line value
   %% and the accumulator. The function shall return a new accumulator
   %% state. The structure of the accumulator is application specific;
   %% it might vary from an integer to a complex record.
   Fun = fun({line, L}, #my_record{count = C} = Acc0) ->
      do_my_intake_to_somewhere(lists:reverse(L)),
      Acc0#my_record{count = C + 1}
   end

   %%
   %% A sequential parse: parses the whole data stream in the client
   %% process.
   csv:parse(CSV, Fun, #my_record{})

   %%
   %% A parallel parse: splits the CSV into multiple chunks and
   %% spawns multiple processes (one process per chunk);
   %% the results are aggregated in the client process.
   csv:parse(CSV, 20, Fun, #my_record{})

Performance
===========

Reference platform:
   * MacMini, Lion Server
   * 1x Intel Core i7 (2 GHz), 4x cores
   * L2 Cache 256KB per core
   * L3 Cache 6MB
   * Memory 4GB 1333 MHz DDR3
   * Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
   * erlang R15B + native build of the library

The data set has the following pattern: key, date, time, float numbers and
a zz suffix
   * key{1..300 000},2012-03-25,23:26:15.543,166.280,...,zz

The number of keys is 300,000, and the number of float fields is 8, 24 or
40 in the reference data.

The reference data set is generated by the command

   make example

or

   perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt

version 0.0.1

E/Parse         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds       23.41      91.722       350.000        1.16
300K, 24 flds       50.42     489.303       697.739        2.33
300K, 40 flds       77.43     780.296       946.003        3.15

ET/hash         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds       23.41      91.722       384.598        1.28
300K, 24 flds       50.42     489.303       761.414        2.54
300K, 40 flds       77.43     780.296      1047.329        3.49

ET/tuple        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds       23.41      91.722       228.306        0.76
300K, 24 flds       50.42     489.303       601.025        2.00
300K, 40 flds       77.43     780.296       984.676        3.28

ETL/ets         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds       23.41      91.722      1489.543        4.50
300K, 24 flds       50.42     489.303      2249.689        7.50
300K, 40 flds       77.43     780.296      2519.401        8.39

ETL/pts         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds       23.41      91.722       592.886        1.98
300K, 24 flds       50.42     489.303      1190.745        3.97
300K, 40 flds       77.43     780.296      1734.898        5.78
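
The Per Line column appears to be the Handle time divided by the number
of lines (300,000), converted to microseconds. A minimal sketch of that
arithmetic follows; the module and function names (`per_line',
`us_per_line') are hypothetical and are not part of the library.

```erlang
-module(per_line).
-export([us_per_line/2]).

%% Hypothetical helper, not part of the csv library.
%% Converts a total handling time in milliseconds into
%% microseconds per line for a data set of Lines lines.
us_per_line(HandleMs, Lines) when is_number(HandleMs), Lines > 0 ->
    HandleMs * 1000 / Lines.
```

For example, per_line:us_per_line(946.003, 300000) evaluates to roughly
3.15, which is in line with the 40-field E/Parse row above.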