awslabs/datawig

about application on categorical and numerical data

phoeller opened this issue · 8 comments

Hi there, I am trying to run the example to apply datawig to both categorical and numerical data. The categorical data has integer values, while the numerical data are positive real numbers. From the documentation, it seems that datawig takes multiple columns as input and imputes one specific column, rather than imputing all missing values across all columns. Am I correct? I have a dataset with 4 columns: A, B, C, and Y. Y is the outcome (label), while A, B, and C are predictors. All columns contain missing values. Here is what I am trying to do with datawig, if I understand it correctly:

  1. take A, B, C as input for imputation onto Y, to get Y_imputed, then replace Y with Y_imputed
  2. take B, C, Y as input for imputation onto A, to get A_imputed, then replace A with A_imputed
  3. take A, C, Y as input for imputation onto B, to get B_imputed, then replace B with B_imputed
  4. take A, B, Y as input for imputation onto C, to get C_imputed, then replace C with C_imputed

This seems quite tedious if there are 1000 columns. How do I run all those iterations?

My second question is about handling categorical data. The quick example section shows how to handle text, but what if the categorical data are integers in a given range while other columns are real numbers? How can I specify the data type of each column? I am trying the following example:


df = pd.DataFrame([[np.nan, 2,     np.nan, 0],
                   [0,      3,     np.nan, 1],
                   [np.nan, np.nan,  np.nan, 1],
                   [1,      4,     3,      0],
                   [3,      1,     0,      np.nan],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [1,      3,     0,      np.nan],
                   [np.nan, np.nan,  0,     np.nan],
                   ],
                  columns = list('ABCY'))

df_train, df_test = datawig.utils.random_split(df)
categorial_encoder_cols = [CategoricalEncoder('A')]
label_encoder_cols = 'Y'
print(df)
imputer = datawig.SimpleImputer(
    label_encoders=label_encoder_cols,
    data_encoders=categorial_encoder_cols,
    output_path = 'imputer_model' # stores model data and metrics
    )
dout = imputer.fit(train_df=df_train)

but it fails with the error "TypeError: __init__() got an unexpected keyword argument 'label_encoders'"

you could try just calling

datawig.SimpleImputer.complete(your_df)

and it will use SimpleImputer with defaults to impute all columns from all respective other columns.

In your case I would treat categorical variables as strings; that's the default in SimpleImputer and it usually yields OK results.

The problem in your code was that SimpleImputer doesn't accept encoders. If you want to explicitly define the encoders for each input column, you could use the Imputer class instead.

But often the defaults set in SimpleImputer work well, so I'd try them first.
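
The per-column loop described in the question (impute each column from all the other columns, then write the result back) can be sketched roughly as below. This is only an illustration of the loop structure, not datawig's implementation: `impute_all_columns` and `most_frequent_value` are hypothetical names, and the stand-in filler simply uses the most frequent value instead of a trained model. A datawig-based callable could be swapped in for it.

```python
import pandas as pd


def impute_all_columns(df, impute_column):
    """Fill every column that has missing values, using all other columns as inputs.

    `impute_column(train, to_fill, inputs, target)` is a hypothetical callable
    that returns one prediction per row of `to_fill`.
    """
    result = df.copy()
    for target in df.columns:
        missing = df[target].isna()
        if not missing.any():
            continue  # nothing to impute in this column
        inputs = [c for c in df.columns if c != target]
        # Train on rows where the target is observed, predict for the rest.
        predictions = impute_column(df[~missing], df[missing], inputs, target)
        result.loc[missing, target] = list(predictions)
    return result


def most_frequent_value(train, to_fill, inputs, target):
    """Stand-in filler (NOT datawig): ignores the inputs and repeats the mode."""
    return [train[target].mode().iloc[0]] * len(to_fill)


demo = pd.DataFrame({"A": ["0", None, "1", "0"], "Y": ["1", "1", None, "0"]})
completed = impute_all_columns(demo, most_frequent_value)
print(completed)
```

Note that this sketch always trains on the original frame, so values imputed for one column are not fed into the models for the next column; that is a design choice of the sketch, not a statement about what datawig does.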

Thanks. I tried SimpleImputer. It does impute all missing data, but it treats all columns as numerical, whereas column A in my example should be an integer (categorical). I am still looking for help implementing an imputer that lets me specify the type of each column. Here is the code I am trying to use; column A is categorical, B and C are numerical, and the outcome Y is also categorical:

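import numpy as np
import pandas as pd
import datawig
from datawig.column_encoders import CategoricalEncoder, NumericalEncoder
from datawig.mxnet_input_symbols import EmbeddingFeaturizer, NumericalFeaturizer

df = pd.DataFrame([[np.nan, 2, np.nan, 0],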
[0, 3, np.nan, 1],
[np.nan, np.nan, np.nan, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[1, 4, 3, 0],
[3, 1, 0, 0],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, 1],
[np.nan, np.nan, 0, 0],
],
columns = list('ABCY'))

df_train, df_test = datawig.utils.random_split(df)

data_encoder_cols = [CategoricalEncoder('A'), NumericalEncoder('B'), NumericalEncoder('C')]
label_encoder_cols = [CategoricalEncoder('Y')]
data_featurizer_cols = [EmbeddingFeaturizer('A'), NumericalFeaturizer('B'), NumericalFeaturizer('C')]

imputer = datawig.Imputer(
label_encoders=label_encoder_cols,
data_encoders=data_encoder_cols,
data_featurizers=data_featurizer_cols,
output_path = 'imputer_model' # stores model data and metrics
)
imputer.fit(train_df=df_train)
dfi = imputer.predict(df_test)

print(dfi)

This code gives the result of
      A    B    C  Y  Y_imputed  Y_imputed_proba
51  0.0  3.0  3.0  1          1         0.549999
49  3.0  1.0  0.0  0          1         0.691801
33  1.0  4.0  3.0  0          1         0.507750
62  1.0  3.0  0.0  1          0         0.500124
54  3.0  1.0  0.0  0          1         0.691801
11  0.0  3.0  3.0  1          1         0.549999
16  0.0  3.0  3.0  1          1         0.549999
36  0.0  3.0  3.0  1          1         0.549999
40  2.0  2.0  1.0  1          1         0.573608
0   NaN  2.0  NaN  0          1         0.536813
8   1.0  4.0  3.0  0          1         0.507750
29  3.0  1.0  0.0  0          1         0.691801
28  1.0  4.0  3.0  0          1         0.507750
64  3.0  1.0  0.0  0          1         0.691801
15  2.0  2.0  1.0  1          1         0.573608

It only imputes column Y (although Y has no missing values), and I still see the NaN values in columns A, B, and C. How can I set it up so that it imputes the NaN values in the input columns only? Thanks.

If you provide a data frame with numerical values they will be treated as numerical values.

If you want to force a column to be treated as non-numerical you could cast them to strings like

df[col] = df[col].astype(str)
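
A small sketch of what this cast changes. The thread only says that numerical values are treated as numerical; the assumption here is that the decision follows the pandas dtype of each column, checked with something like `pandas.api.types.is_numeric_dtype`. The column names and values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({"A": [0.0, 1.0, 3.0], "B": [2.5, 3.1, 4.0]})

# Before the cast both columns have a numeric dtype.
numeric_before = [c for c in df.columns if is_numeric_dtype(df[c])]

# Cast the categorical column to integer first (so 1.0 becomes "1", not "1.0"),
# then to string. This example has no missing values in A.
df["A"] = df["A"].astype(int).astype(str)

# After the cast only B is still numeric; A now holds strings.
numeric_after = [c for c in df.columns if is_numeric_dtype(df[c])]
print(numeric_before, numeric_after)
```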

Thanks. Now I see what you mean. I changed all the categorical columns to str and used CategoricalEncoder with token_to_idx to map the (valid and missing) values, and now it works. The only problem is that I have to loop over each column that contains missing values and apply the Imputer to them one by one. But the code is working well so far. Thank you for your help.
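
The write-back step of such a loop might look like the sketch below. It assumes the predictions frame keeps the index of the original frame and carries a `<column>_imputed` column, as in the `Y_imputed` output printed earlier in this thread; `fill_missing_from_predictions` is a hypothetical helper, not part of datawig.

```python
import numpy as np
import pandas as pd


def fill_missing_from_predictions(df, predictions, column):
    """Return a copy of `df` where only the missing entries of `column`
    are replaced by the values in predictions[column + "_imputed"]."""
    filled = df.copy()
    # fillna with a Series aligns on the index, so observed values are kept.
    filled[column] = df[column].fillna(predictions[column + "_imputed"])
    return filled


original = pd.DataFrame({"A": ["0", np.nan, "3"]})
predicted = pd.DataFrame({"A_imputed": ["1", "2", "0"]})
print(fill_missing_from_predictions(original, predicted, "A"))
```

Observed values are left untouched even where the model disagrees with them, which matches the goal of imputing the NaN entries only.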

... or you simply do the above numerical-to-string cast before calling any datawig functions and then just call

datawig.SimpleImputer.complete(df)

which will loop through the columns and select the right encoders in one line of python.

Thanks for the feedback, we'll try to improve the documentation.

Closing this for now, feel free to reopen.

I did try that at the very beginning, but it failed to take care of the NaN values ... I don't know why, so I am handling the NaN values manually with token_to_idx :(

hm, that's strange - I believe we did have that use case in our tests, where it worked

oh ... let me try again, here is my code

df = pd.DataFrame([[np.nan, 2,     np.nan, 0],
                   [0,      3,     np.nan, 1],
                   [np.nan, np.nan,  np.nan, 1],
                   [1,      4,     3,      0],
                   [3,      1,     0,      0],
                   [2,      2,     1,      1],
                   [0,      3,     3,      1],
                   [1,      3,     0,      1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [1, 4, 3, 0],
                   [3, 1, 0, 0],
                   [2, 2, 1, 1],
                   [0, 3, 3, 1],
                   [1, 3, 0, 1],
                   [np.nan, np.nan,  0,    0],
                   ],
                  columns = list('ABCY'))

df['A'] = df['A'].astype('Int64')  # without converting to integer first, the values are rendered as floats
df['A'] = df['A'].astype('str')  # now convert the column (the one being imputed) to text
print(df.dtypes)

df = datawig.SimpleImputer.complete(df)
print(df)

Here is what I get:

____A B C Y
0 nan 2.000000 1.144811 0.0
1 0 3.000000 2.832242 1.0
2 nan 2.348050 1.049826 1.0
3 1 4.000000 3.000000 0.0
4 3 1.000000 0.000000 0.0
5 2 2.000000 1.000000 1.0
6 0 3.000000 3.000000 1.0
7 1 3.000000 0.000000 1.0
8 1 4.000000 3.000000 0.0
9 3 1.000000 0.000000 0.0
10 2 2.000000 1.000000 1.0
11 0 3.000000 3.000000 1.0
12 1 3.000000 0.000000 1.0
13 1 4.000000 3.000000 0.0
14 3 1.000000 0.000000 0.0
15 2 2.000000 1.000000 1.0
16 0 3.000000 3.000000 1.0
17 1 3.000000 0.000000 1.0
18 1 4.000000 3.000000 0.0
19 3 1.000000 0.000000 0.0
20 2 2.000000 1.000000 1.0
21 0 3.000000 3.000000 1.0
22 1 3.000000 0.000000 1.0
23 1 4.000000 3.000000 0.0
24 3 1.000000 0.000000 0.0
25 2 2.000000 1.000000 1.0
26 0 3.000000 3.000000 1.0
27 1 3.000000 0.000000 1.0
28 1 4.000000 3.000000 0.0
29 3 1.000000 0.000000 0.0
30 2 2.000000 1.000000 1.0
31 0 3.000000 3.000000 1.0
32 1 3.000000 0.000000 1.0
33 1 4.000000 3.000000 0.0
34 3 1.000000 0.000000 0.0
35 2 2.000000 1.000000 1.0
36 0 3.000000 3.000000 1.0
37 1 3.000000 0.000000 1.0
38 1 4.000000 3.000000 0.0
39 3 1.000000 0.000000 0.0
40 2 2.000000 1.000000 1.0
41 0 3.000000 3.000000 1.0
42 1 3.000000 0.000000 1.0
43 1 4.000000 3.000000 0.0
44 3 1.000000 0.000000 0.0
45 2 2.000000 1.000000 1.0
46 0 3.000000 3.000000 1.0
47 1 3.000000 0.000000 1.0
48 1 4.000000 3.000000 0.0
49 3 1.000000 0.000000 0.0
50 2 2.000000 1.000000 1.0
51 0 3.000000 3.000000 1.0
52 1 3.000000 0.000000 1.0
53 1 4.000000 3.000000 0.0
54 3 1.000000 0.000000 0.0
55 2 2.000000 1.000000 1.0
56 0 3.000000 3.000000 1.0
57 1 3.000000 0.000000 1.0
58 1 4.000000 3.000000 0.0
59 3 1.000000 0.000000 0.0
60 2 2.000000 1.000000 1.0
61 0 3.000000 3.000000 1.0
62 1 3.000000 0.000000 1.0
63 1 4.000000 3.000000 0.0
64 3 1.000000 0.000000 0.0
65 2 2.000000 1.000000 1.0
66 0 3.000000 3.000000 1.0
67 1 3.000000 0.000000 1.0
68 1 4.000000 3.000000 0.0
69 3 1.000000 0.000000 0.0
70 2 2.000000 1.000000 1.0
71 0 3.000000 3.000000 1.0
72 1 3.000000 0.000000 1.0
73 1 4.000000 3.000000 0.0
74 3 1.000000 0.000000 0.0
75 2 2.000000 1.000000 1.0
76 0 3.000000 3.000000 1.0
77 1 3.000000 0.000000 1.0
78 nan 1.073437 0.000000 0.0

The nan values in column A stay un-imputed, as shown in the results.
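
One possible explanation, which nobody in this thread confirms: judging by the printed output, the cast to 'str' turned the missing entries of column A into the literal text "nan", which pandas no longer counts as missing, so there would be nothing left for the imputer to fill in that column. A minimal sketch of a cast that keeps missing entries missing follows; `to_category_strings` is a hypothetical helper and the data are made up.

```python
import numpy as np
import pandas as pd


def to_category_strings(series):
    """Cast non-missing integer-like values to str; leave missing entries as NaN."""
    return series.map(lambda v: v if pd.isna(v) else str(int(v)))


df = pd.DataFrame({"A": [np.nan, 0, 1, 3, np.nan],
                   "B": [2.0, 3.0, 4.0, 1.0, np.nan]})
df["A"] = to_category_strings(df["A"])

# Missing entries in A are still reported as missing after the cast.
print(df["A"].isna().sum())
print(df)
```

The frame could then be passed to datawig.SimpleImputer.complete(df) as suggested above; whether that resolves the behaviour reported here has not been verified.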