dptools: data preprocessing functions for Python


Overview

The dptools Python package provides helper functions to simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, working with missing values and more.

The package currently includes the following functions; a short usage sketch combining several of them follows the list:

  • Feature engineering:
    • add_date_features(): create date and time-based features
    • add_text_features(): create text-based features (including counts and TF-IDF)
    • aggregate_data(): aggregate data and create features based on aggregated statistics
    • encode_factors(): perform label or dummy encoding of categorical features
  • Data processing:
    • split_nested_features(): split features nested in a single column
    • fill_missings(): replace missing values with specified values
    • correct_colnames(): make column names unique and remove non-standard characters
    • print_missings(): print information on features with missing values
    • print_factor_levels(): print levels of categorical features
  • Data cleaning:
    • find_correlated_features(): identify features with a high pairwise correlation
    • find_constant_features(): identify features with a single unique value
  • Import and versioning:
    • read_csv_with_json(): read CSV where some columns are in JSON format
    • save_csv_version(): save CSV with an automatically assigned version to prevent overwriting
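
To give a feel for how these pieces fit together, the sketch below chains several of the helpers. It assumes that df is an existing pandas DataFrame with a categorical gender column; the call signatures are those demonstrated in the Examples section below.

# sketch of a small preprocessing pipeline (df is an existing pandas DataFrame)
from dptools import print_missings, find_correlated_features, aggregate_data, save_csv_version

print_missings(df)                                                      # inspect missing values
corr_feats = find_correlated_features(df, cutoff = 0.9, method = 'spearman')
df = df.drop(columns = corr_feats)                                      # drop one feature from each correlated pair
df_agg = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')
save_csv_version('data.csv', df_agg, index = False)                     # export with automatic versioning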

Installation

The latest stable release of dptools can be installed from PyPI:

pip install dptools

You may also install the development version from GitHub:

pip install git+https://github.com/kozodoi/dptools.git

After installation, you can import the included functions:

from dptools import *
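
Alternatively, you can import only the functions you need to keep the namespace clean:

# explicit imports instead of a wildcard import
from dptools import aggregate_data, print_missings, save_csv_version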

Examples

This section demonstrates some of the dptools functions on different data preprocessing tasks. Please refer to the function docstrings for further examples.

Creating a toy data set

First, let us create a toy data frame to demonstrate the package functionality.

# import dependencies
import pandas as pd
import numpy as np

# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan],
        'height': [170, 168, 173, 177, 165],
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)
   age  height  gender  income
  27.0     170  female  high
   NaN     168  male    medium
  30.0     173  NaN     low
  25.0     177  male    low
   NaN     165  female  no income

Aggregating features

# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')   
gender  age_mean  age_max  height_mean  height_max  income_mode
female      27.0     27.0        167.5         170  'high'
male        25.0     25.0        172.5         177  'low'
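
For intuition, the numeric part of this aggregation corresponds roughly to the plain pandas groupby shown below; this is only a sketch of the idea, not the package's internal implementation.

# equivalent pandas aggregation for the numeric columns only
num_agg = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])
num_agg.columns = ['_'.join(col) for col in num_agg.columns]   # flatten to age_mean, age_max, ...
num_agg = num_agg.reset_index()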

Creating text-based features

# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')
  age  height  gender  income_word_count  income_char_count  income_tfidf_0  ...  income_tfidf_3
 27.0     170  female                  1                  4             1.0  ...             0.0
  NaN     168  male                    1                  6             0.0  ...             1.0
 30.0     173  NaN                     1                  3             0.0  ...             0.0
 25.0     177  male                    1                  3             0.0  ...             0.0
  NaN     165  female                  2                  9             0.0  ...             0.0
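
The count features are simple string statistics and can be reproduced with plain pandas as shown below (for intuition only); the TF-IDF columns require an additional term-weighting step and are easiest to obtain through add_text_features itself.

# reproducing the count features with plain pandas
income_word_count = df['income'].str.split().str.len()   # number of words
income_char_count = df['income'].str.len()                # number of characters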

Working with missing values

# print statistics on missing values
from dptools import print_missings
print_missings(df)
        Total  Percent
age         2      0.4
gender      1      0.2

The Percent column reports the share of rows with missing values (here 0.4 corresponds to 2 out of 5 rows).
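
To actually replace the missing values, the package provides fill_missings(). As a plain-pandas point of reference (not the dptools API), a simple imputation of the toy frame could look like this:

# simple pandas imputation of the toy data frame (for reference only)
df_filled = df.copy()
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].mean())   # mean imputation
df_filled['gender'] = df_filled['gender'].fillna('unknown')           # constant imputation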

Finding correlated features

# returns one feature from each highly correlated pair
from dptools import find_correlated_features
feats = find_correlated_features(df, cutoff = 0.4, method = 'spearman')
feats

Found 1 correlated features.

['age']
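
The returned list can be passed directly to pandas to drop the flagged features:

# drop the flagged features
df_reduced = df.drop(columns = feats)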

Data versioning

# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)

# second call saves df as 'data_v2.csv' as data_v1.csv already exists
save_csv_version('data.csv', df, index = False)
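
The saved versions are ordinary CSV files and can be read back with pandas (using the file names produced above; pandas is imported as pd earlier in this example):

# read a specific version back into a data frame
df_v2 = pd.read_csv('data_v2.csv')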

Dependencies

Installation requires Python 3.7+ and several standard data science packages, including pandas and numpy.

Feedback

If you need help with the included data preprocessing functions or would like to report an issue, please do so on the corresponding GitHub page.