Parsing unstructured text data

Primary language: C++. License: GNU General Public License v3.0 (GPL-3.0).

Panacea

Package name: Portable ANalytical data Aggregation and Coordination for database Entry and Access (panacea)

This package is currently a concept. See also the ISC proposal 1.

Panacea aims to make data streams, for example in analytical laboratory settings, more transparent and easily accessible. This is needed because closed-source, vendor-supplied software for analytical instruments often acts as a black box, inhibiting access to raw data and full disclosure of critical processing steps and transformations.

At its most fundamental level, panacea would constitute a parser for text-based data in poorly structured (e.g. non-tabular) formats. It detects triplets of variable, value, and, optionally, unit.
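To make the idea of a variable, value, and unit triplet concrete, the following is a minimal sketch of how one such triplet could be pulled out of a single line with a regular expression. It is not panacea's actual algorithm: the `Triplet` struct, the `find_triplet` function, and the assumption that a name is followed by `:` or `=` are all illustrative. It requires C++17 for `std::optional`.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical record type; the field names are illustrative, not panacea's.
struct Triplet {
    std::string variable;
    std::string value;
    std::string unit;  // empty when no unit follows the value
};

// Searches one line for "<name> [:=] <number> [unit]".
// Assumption: names are identifiers and values are decimal numbers,
// optionally in scientific notation.
std::optional<Triplet> find_triplet(const std::string& line) {
    static const std::regex pattern(
        R"(([A-Za-z_]\w*)\s*[:=]\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z%]+)?)");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return std::nullopt;
    }
    // sub_match::str() yields an empty string for an unmatched optional group.
    return Triplet{m[1].str(), m[2].str(), m[3].str()};
}
```

Calling `find_triplet("x = -12761 um")` yields the variable `x`, the value `-12761`, and the unit `um`. A sketch like this treats any word after the number as a unit, so real input would need stricter rules.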

Ultimately, panacea could help establish fully integrated laboratories with centralised data management. It thereby contributes to the FAIR [2] guiding principles for data, stimulating innovation and inclusiveness through open science.

Configure and build panacea

After copying the source files, start the build process of panacea as follows:

./configure

JSON for Modern C++ is needed to write results in the JSON format. It can be obtained by cloning the nlohmann/json repository and performing a complete build. Alternatively, copy the header file json.hpp to a location of your choice and pass that location to configure:

./configure CXXFLAGS="-I/path/to/nlohmann/json.hpp"

Note that nlohmann/json.hpp is not a hard build requirement for panacea. Without it, results can still be streamed to the terminal.
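An optional dependency of this kind can be handled at compile time. The sketch below shows one way to do so with the C++17 `__has_include` check. It is an assumption about how such a fallback might look, not a description of panacea's configure script or sources: the macro `SKETCH_HAVE_JSON`, the function `render`, and the include path `<nlohmann/json.hpp>` are hypothetical.

```cpp
#include <map>
#include <string>

// __has_include is standard since C++17; the macro name below is hypothetical.
#if defined(__has_include)
#  if __has_include(<nlohmann/json.hpp>)
#    include <nlohmann/json.hpp>
#    define SKETCH_HAVE_JSON 1
#  endif
#endif

// Serialises results as JSON when the header was found,
// otherwise as plain "name: value" lines suitable for the terminal.
std::string render(const std::map<std::string, std::string>& results) {
#ifdef SKETCH_HAVE_JSON
    nlohmann::json j = results;  // a string map converts to a JSON object
    return j.dump(2);            // pretty-print with two-space indentation
#else
    std::string out;
    for (const auto& [name, value] : results) {
        out += name + ": " + value + "\n";
    }
    return out;
#endif
}
```

The same source then compiles with or without the header, and only the output format differs.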

Installation

To install panacea use the install target of the makefile, like so:

make install

To customise the installation, consult the GNU make manual [3].

Basic usage

Currently, panacea has only four control options, which can be set with the following flags:

  • --data: the input text file (e.g. test.txt).
  • --output: the output JSON file (e.g. test.json).
  • --white: the relative white level (a value between 0 and 1) used to detect tables (defaults to 0.7).
  • --verbose: whether the input filename and output results are streamed to the terminal (defaults to 1).
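The four flags and their documented defaults can be summarised in code. The following is a minimal sketch of a parser for `--flag value` pairs, not panacea's own option handling. The `Options` struct and `parse_options` function are hypothetical, and it assumes that every flag takes a value and that the bounds 0 and 1 for `--white` are inclusive.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the four documented options.
struct Options {
    std::string data;    // --data: input text file
    std::string output;  // --output: output JSON file
    double white = 0.7;  // --white: documented default
    int verbose = 1;     // --verbose: documented default
};

// Parses "--flag value" pairs; throws std::invalid_argument on unknown
// flags, missing values, or a white level outside [0, 1].
Options parse_options(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        const std::string& flag = args[i];
        if (i + 1 >= args.size()) {
            throw std::invalid_argument("missing value for " + flag);
        }
        const std::string& value = args[i + 1];
        if (flag == "--data") {
            opts.data = value;
        } else if (flag == "--output") {
            opts.output = value;
        } else if (flag == "--white") {
            opts.white = std::stod(value);
            if (opts.white < 0.0 || opts.white > 1.0) {
                throw std::invalid_argument("--white must lie between 0 and 1");
            }
        } else if (flag == "--verbose") {
            opts.verbose = std::stoi(value);
        } else {
            throw std::invalid_argument("unknown option " + flag);
        }
    }
    return opts;
}
```

In a `main` function, the arguments could be passed as `std::vector<std::string>(argv + 1, argv + argc)`.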

Example

The source files contain an example text file (extdata/test.txt) to demonstrate the core functionality of panacea.

panacea --data "extdata/test.txt" --output "extdata/test.json"

Input text file:

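Output JSON file (parsed as a table for convenience):

| Variable | Value | Unit | Text field | Line number | Character number |
|---|---|---|---|---|---|
| foo | a,b,c,d | | 2 | 5,6,7,8,9 | 1,1,1,1 |
| bar | 5,6,7,8 | | 2 | 5,6,7,8,9 | 7,7,7,7 |
| baz | x,y,z,z | | 2 | 5,6,7,8,9 | 13,13,13,13 |
| qux | 1,2,3,4 | s | 2 | 5,6,7,8,9 | 19,19,19,19 |
| quz | x,z,, | | 2 | 5,6,7,8,9 | 29,29,32,32 |
| x | 42e-3,42e-3 | | 3 | 12,13,14 | 1,1 |
| y | 4.3e-02,4.3e-02 | | 3 | 12,13,14 | 9,9 |
| z | 4.4e-01,4.4e-01 | | 3 | 12,13,14 | 19,19 |
| numeric | 42 | | 4 | 24 | 1 |
| numeric | 42 | um | 5 | 25 | 1 |
| foo | 42 | | 6 | 26 | 1 |
| bar | -41 | | 7 | 26 | 11 |
| baz | 40 | | 8 | 26 | 23 |
| foo | 42 | | 9 | 27 | 1 |
| bar | -41 | | 10 | 27 | 11 |
| baz | 40 | | 11 | 27 | 22 |
| foo | 42 | | 12 | 28 | 1 |
| bar | -41 | | 13 | 28 | 11 |
| baz | 40 | | 14 | 28 | 23 |
| x | -12761 | um | 15 | 29 | 23 |
| y | -13469 | um | 16 | 29 | 59 |
| z | 3709 | um | 17 | 29 | 73 |
| x | 1 | um | 18 | 30 | 21 |
| y | 2 | um | 19 | 30 | 50 |
| z | 3 | um | 20 | 30 | 59 |
| x | 4 | um | 21 | 30 | 89 |
| x | -12761 | um | 22 | 31 | 42 |
| y | -13469 | um | 23 | 31 | 97 |
| z | 3709 | um | 24 | 31 | 111 |
| numeric | 42 | | 25 | 32 | 14 |
| LoremIpsum | 42 | numeric | 26 | 33 | 1 |
| LoremIpsum | 42 | numeric | 27 | 34 | 1 |
| LoremIpsum | 42 | numeric | 28 | 35 | 1 |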
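The six columns of the output above suggest a simple record layout. The sketch below shows one way such a record could be represented and rendered as a single row. It is illustrative only: the `Record` struct, its field names, the helper functions, and the choice of `|` as a separator are assumptions and do not reflect panacea's internal types or its JSON schema.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record mirroring the six output columns.
struct Record {
    std::string variable;
    std::vector<std::string> values;  // one entry per detected value
    std::string unit;                 // empty when no unit was detected
    int text_field = 0;
    std::vector<int> line_numbers;
    std::vector<int> char_numbers;
};

// Joins items with commas, matching the notation used in the table.
template <typename T>
std::string join(const std::vector<T>& items) {
    std::ostringstream out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0) out << ',';
        out << items[i];
    }
    return out.str();
}

// Renders one record as a row with " | " between the six columns.
std::string to_row(const Record& r) {
    std::ostringstream out;
    out << r.variable << " | " << join(r.values) << " | " << r.unit << " | "
        << r.text_field << " | " << join(r.line_numbers) << " | "
        << join(r.char_numbers);
    return out.str();
}
```

Keeping the values and positions as vectors preserves the one-to-many relation between a variable in a table column and the cells found for it.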

References

[2]: Findable, Accessible, Interoperable, and Reusable
[3]: https://www.gnu.org/software/make/manual