carefree-data

carefree-data implements a data processing module with numpy.

Update 2021.02.04

carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!

Why carefree-data?

carefree-data is a data processing module that can handle 'dirty' and 'messy' datasets.

For tabular datasets, carefree-data can:
  • Elegantly deal with data pre-processing.
    • A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
    • A Converter to convert a column into a numerical format that is easier to work with (["one", "two"] -> [0, 1]).
    • A Processor to further process columns (OneHot, Normalize, MinMax, ...).
    • And all the transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features).
    • And these procedures are all completed AUTOMATICALLY!
  • Handle datasets saved in files (.txt, .csv).
    • For .txt, " " will be the default delimiter.
    • For .csv, "," will be the default delimiter, and the first row will be skipped by default.
    • The delimiter, the label index, and whether to skip the first row can all be set manually.
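
To make the pre-processing steps above more concrete, here is a minimal numpy sketch of the same ideas: recognizing a column type, converting categories to integer codes, processing with MinMax / OneHot, and inverting each step. All function names (`recognize`, `encode`, `decode`, `min_max`, `one_hot`, ...) and the `max_categories` threshold are hypothetical illustrations, not carefree-data's API, and the heuristics are assumptions rather than the rules the library actually uses.

```python
import numpy as np


def recognize(column, max_categories=10):
    """Guess a column type (hypothetical heuristic, not the library's rule)."""
    try:
        np.asarray(column, dtype=np.float64)
    except (ValueError, TypeError):
        # Not parseable as numbers: few distinct values -> categorical.
        distinct = len(set(column))
        return "CATEGORICAL" if distinct <= max_categories else "STRING"
    return "NUMERICAL"


def encode(column):
    """Map each category to an integer code; categories are sorted."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    return codes, categories


def decode(codes, categories):
    """Inverse of encode: look each code up in the category table."""
    return categories[codes]


def min_max(values):
    """Scale values into [0, 1]; assumes max > min."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), (lo, hi)


def inverse_min_max(scaled, stats):
    """Inverse of min_max, using the stored (min, max) statistics."""
    lo, hi = stats
    return scaled * (hi - lo) + lo


def one_hot(codes, num_categories):
    """Turn integer codes into one-hot rows."""
    return np.eye(num_categories)[codes]


def inverse_one_hot(encoded):
    """Inverse of one_hot: the position of the 1 is the code."""
    return encoded.argmax(axis=1)


column = ["one", "two", "one"]
codes, categories = encode(column)
recovered = decode(inverse_one_hot(one_hot(codes, len(categories))), categories)
```

The point of the sketch is that every transform keeps whatever it needs (the category table, the min/max statistics) to be undone later, which is what makes recovering the original labels and features possible.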

Pandas-free

There is one more thing we'd like to mention: carefree-data is 'Pandas-free'. Pandas is an open-source library that provides easy-to-use data structures for structured datasets. Although it is widely used across popular Machine Learning and Deep Learning libraries, we decided to move away from it, for the reasons listed below:

  • carefree-data needs full control over the data, and Pandas is not flexible enough for that.
  • carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
  • Pandas provides many powerful functions, but carefree-data needs only a few of them, which makes Pandas a little 'heavy' for carefree-data.

In short, Pandas is a more general library, which is why we wrote our own code to cover our needs instead of using it directly.

Currently carefree-data only supports tabular datasets.

Installation

carefree-data requires Python 3.8 or higher.

pip install carefree-data

or

git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .

Basic Usage

Get scikit-learn datasets

from cfdata.tabular import TabularDataset

iris = TabularDataset.iris()

Read from array / dataset

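from cfdata.tabular import TabularData, TabularDataset

iris = TabularDataset.iris()
# Unpack the features and labels as arrays
x, y = iris.xy
# Reading from arrays and reading from the dataset object give equal results
assert TabularData().read(x, y) == TabularData.from_dataset(iris)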

Read from file

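from cfdata.tabular import TabularData

file = "/path/to/your/file"
# Read the file and pre-process it
data = TabularData().read(file)
# Transforming the same file again reproduces the processed data
assert data.processed == data.transform(file)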
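
The default file-handling rules described earlier (space-delimited .txt, comma-delimited .csv with the first row skipped, all overridable) can be sketched with the standard library alone. This is only an illustration of those rules: `read_rows`, `split_label`, and their parameters are hypothetical names, not carefree-data's API, and the choices of not skipping the first row of a .txt file and of using the last column as the default label are assumptions.

```python
import csv
import os
import tempfile

# extension -> (default delimiter, skip first row by default)
# The .txt skip default (False) is an assumption.
DEFAULTS = {".txt": (" ", False), ".csv": (",", True)}


def read_rows(path, delimiter=None, skip_first=None):
    """Read a .txt / .csv file, falling back to per-extension defaults."""
    ext = os.path.splitext(path)[1].lower()
    default_delimiter, default_skip = DEFAULTS[ext]  # KeyError for other types
    if delimiter is None:
        delimiter = default_delimiter
    if skip_first is None:
        skip_first = default_skip
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    return rows[1:] if skip_first else rows


def split_label(rows, label_index=-1):
    """Separate the label column (assumed last by default) from the features."""
    features, labels = [], []
    for row in rows:
        idx = label_index % len(row)  # normalize negative indices
        labels.append(row[idx])
        features.append(row[:idx] + row[idx + 1:])
    return features, labels


# Small demo on a temporary .csv file with a header row.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "toy.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("a,b,label\n1,2,x\n3,4,y\n")
    demo_features, demo_labels = split_label(read_rows(csv_path))
```

Passing `delimiter`, `skip_first`, or `label_index` explicitly overrides the defaults, mirroring the manual settings mentioned above.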

License

carefree-data is MIT licensed, as found in the LICENSE file.