What if everything were, or could be turned into, a table?
Entab is a parsing framework that turns record-based scientific file formats into usable tabular data, and it can be used from several programming languages.
Entab supports reading bioinformatics, chemoinformatics, and other formats:
- Agilent Chemstation CH, FID, MS, MWD, and UV formats
- Agilent Masshunter DAD format [1]
- FASTA and FASTQ sequence formats
- FCS flow cytometry format
- Inficon Hapsite mass spectrometry format
- PNG image format
- SAM and BAM alignment formats
- Thermo continuous flow isotope mass spectrometry formats
- Thermo RAW files
- CSV & TSV files
Entab has a CLI that accepts arbitrary files piped in on standard input and writes TSV output. Install with:
cargo install entab-cli
Example usage to see how many records are in a file:
cat test.fa | entab | sed '1d' | wc -l
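The same count can be reproduced on captured output. The sketch below is not part of Entab; it assumes only what the pipeline above implies, namely that the CLI writes a TSV whose first line is a header row. The function name `count_records` and the sample column names are hypothetical.

```python
def count_records(tsv_text: str) -> int:
    """Count data rows in TSV text whose first line is a header row."""
    lines = tsv_text.splitlines()
    # Drop the header row (the `sed '1d'` step), then count what is left
    # (the `wc -l` step).
    return max(len(lines) - 1, 0)


# Hypothetical sample output; the real column names depend on the input format.
sample = "id\tsequence\nseq1\tACGT\nseq2\tGGCA\n"
print(count_records(sample))  # prints 2
```

Counting rows after the header this way gives one count per record, whatever columns the format produces.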
There are bindings for Python and JavaScript, along with experimental bindings for R, that support reading data streams and converting them into a series of records.
The JavaScript library can be installed with:
npm install entab
The Python library can be installed with:
pip install entab
The R bindings can be installed from inside R (note that you will need Cargo and a Rust toolchain installed locally):
library(devtools)
devtools::install_github("bovee/entab", subdir="entab-r")
- Handling many formats: Support as many record-based, streamable scientific formats as possible. Formats like HDF5, which have complex headers and already have well-supported parsers, are not a priority.
- Correctness: Formats should be parsed with good error messages, consistent failure states, and well-tested code.
- Language bindings: Support using Entab from a decent selection of the programming languages currently used for science, data science, and related fields. Python and JavaScript are currently supported, R is supported experimentally, and Julia may be supported in the future.
- Speed: Entab should be as fast as possible while still prioritizing the goals above. Parsers come in two forms: a fast one that produces a specialized struct, and a slower one that produces a generic record and can be switched to at run time.
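The two parser forms described under Speed can be illustrated with a small sketch. This is not Entab's implementation (which is written in Rust), and every name here (`FastaRecord`, `parse_fasta`, `PARSERS`, `parse_generic`) is hypothetical. It shows only the shape of the trade-off: a format-specific parser yields a typed record, while a generic wrapper picks the parser by name at run time and flattens each record into plain values.

```python
from dataclasses import astuple, dataclass
from typing import Iterator, List, Tuple


@dataclass
class FastaRecord:
    """Specialized record: fixed, typed fields known ahead of time."""
    id: str
    sequence: str


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Specialized path: yields format-specific records."""
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if name is not None:
                yield FastaRecord(name, "".join(chunks))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield FastaRecord(name, "".join(chunks))


# Registry of parsers keyed by a format name chosen at run time.
PARSERS = {"fasta": parse_fasta}


def parse_generic(fmt: str, text: str) -> Iterator[Tuple]:
    """Generic path: looks the parser up at run time and flattens each
    specialized record into a plain tuple of values."""
    for record in PARSERS[fmt](text):
        yield astuple(record)


data = ">seq1\nACGT\nTT\n>seq2\nGGCA\n"
print(list(parse_fasta(data)))
print(list(parse_generic("fasta", data)))
```

In Python the two paths cost about the same. In a compiled language the specialized path avoids the run-time lookup and the conversion to generic values, which is where the speed difference would come from.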
Footnotes
[1] This format uses multiple files, so it's not supported in streaming mode or in, e.g., the JS bindings.