/categorical_features_euclidean_distance

A python package to compute pairwise Euclidean distances on datasets with categorical features in little time

Primary LanguagePythonMIT LicenseMIT

Categorical Features Pairwise Euclidean Distances

forthebadge made-with-python

PyPi Version MIT License

A python package to compute pairwise Euclidean distances on datasets with categorical features in little time

Motivation

In machine learning model development I often ran into datasets with categorical features. Most times dealing with these categorical features was fairly straight forward (I would use the pandas get_dummies() function to convert each feature into a one-hot-encoded representaion).

But when the number of categories embedded in these categorical features became massive, I ran into the problem of extremely slow Euclidean distance computation between each sample and every other sample.

This is where this package comes in. Running my own tests, I concluded that this code runs significantly faster than the SKLearn pairwise Euclidean distances function on one-hot-encoded categorical features.

Prerequisites

See requirements.txt for the full list of prerequisite libraries.

Installation

To start using this package, simply run this command in terminal

pip install cfed

Usage

import pandas as pd
from cfed.pairwise import euclidean_distances
from cfed.pairwise import euclidean_distances_from_split

df1 = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

distances = euclidean_distances(df1, df2, categorical_columns=['col1'])

Or without specifying categorical_columns

import pandas as pd
from cfed.pairwise import euclidean_distances
from cfed.pairwise import euclidean_distances_from_split

df1 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c2', 'c1'],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c3', 'c2'],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

distances = euclidean_distances(df1, df2)

Or

import pandas as pd
from cfed.pairwise import euclidean_distances
from cfed.pairwise import euclidean_distances_from_split

df1_numerical = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2_numerical = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

df1_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c1', 'c2'],
})
df2_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c2', 'c2'],
})

distances = euclidean_distances_from_split(df1_numerical, df1_categorical, df2_numerical, df2_categorical)