Open source library for basic data science tasks.
implements in part:
- sklearn API compatible transformers that act on pandas DataFrames.
- a Visualizer object with some nice visualization functions and api.
- helpers in writing clean and understandable data pipelines.
the dsbasic.frame.preprocessing.impute module implements the fImputer transformer used to impute dataframe at selected columns.
fImputer(strategy='mean', copy=True, na_sentinel=-1, columns=None)
- columns - list of columns names to impute
- strategy - a string from 'mean', 'median', 'most_frequent', 'na_sentinel' where each one specifies which method of imputation is to be used.
- copy - whether the returned frame should be a copy or not.
- na_sentinel - if strategy = 'na_sentinel' fills all columns Na's with the na_sentinel variable.
fImputer accepts both pandas.DataFrame and pandas.Series objects.
Example
from dsbasic.frame.preprocessing.impute import fImputer
from sklearn.pipeline import make_pipeline
numeric = ['n1', 'n2', 'n3']
categorical = ['c1', 'c2', 'c3']
imputer = make_pipeline(
fImputer(strategy='median', columns = numeric, copy=True),
fImputer(strategy='most_frequent', columns = categorical, copy=True)
)
X = pandas.read_csv(...)
Y = pandas.read_csv(...)
X_imputed = imputer.fit_transform(X)
Y_imputed = imputer.transform(Y)
the dsbasic.frame.preprocessing.categorical module implements useful transformers to deal with categorical features. specifically the fOrdinalEncoder, fOneHotEncoder, fLabelEncoder
fLabelEncoder(dtype=np.uint8, nan_handle='soft' )
assigns a natural number to each unique label of the pandas series.
- dtype - dtype of ordinal oncoded columns.
- nan_handle - nan_handle is one of ['soft', 'hard', 'ignore']
soft - nans will be encoded in transform only if nans are present during fit. hard - nans are assigned a label in transform even if not present during fit. ignore - ignores nan's all-together.
note : if nan_handle is set to 'ignore' dtype argument is ignored and is set to float32
fLabelEncoder accepts only a pandas.Series object. to encode several columns see fOrdinalEncoder
Example :
from dsbasic.frame.preprocessing.categorical import fLabelEncoder
from sklearn.pipeline import make_pipeline
labels = pandas.Series(['a', 'b', 'a', 'c', numpy.nan, 'a'])
y1 = fLabelEncoder(nan_handle='ignore').fit_transform(labels)
y2 = fLabelEncoder(nan_handle='soft').fit_transform(labels)
y3 = fLabelEncoder(nan_handle='hard').fit_transform(labels)
print('y1\n{}\n\n{}\n\n{}'.format(y1, y2, y3))
output :
0 0.0
1 1.0
2 0.0
3 2.0
4 NaN
5 0.0
dtype: float32
0 0
1 1
2 0
3 2
4 3
5 0
dtype: uint8
0 0
1 1
2 0
3 2
4 3
5 0
dtype: uint8
fOrdinalEncoder(dtype=np.uint8, nan_handle='soft', columns=None, copy=True)
Label encodes each column in "columns" using fLabelEncoder
- dtype - dtype of ordinal oncoded columns.
- nan_handle - nan_handle is one of ['soft', 'hard', 'ignore']
soft - nans will be encoded in transform only if nans are present during fit. hard - nans are assigned a label in transform even if not present during fit. ignore - ignores nan's all-together.
- columns - list of strings describing the columns to be encoded.
- copy - whether the returned frame should be a copy or not.
fOneHotEncoder(sep='_', dummy_na=False, columns=None)
One hot encodes selected columns of a dataframe and discards the original columns (pandas get_dummies style).
- sep - new one hot encoded column names are set to be column_name + sep + label_name
- dummy_na - whether to one hot encode Na's.
- columns - list of strings describing the columns to be encoded.