schuderer/mllaunchpad

Type guessing fails when connecting to (Oracle) datasource

Vegacyrez opened this issue · 5 comments

  • ML Launchpad version: 0.0.7
  • Model Type used: Python
  • DataSource type(s) used: Oracle
  • Python version: 3.6.10
  • Operating System: Windows 10

Description

When connecting to an Oracle datasource, some columns that contain missing values (NULL) have their datatypes misinterpreted. When this data is passed along to a project via pandas, the datatype for columns with missing values becomes "NoneType". This issue does not appear when using identical data (including missing values) from a .csv file.

Steps to reproduce

mllaunchpad -c config_dev.yml -t

Resulted in:
TypeError: '<' not supported between instances of 'NoneType' and 'str'

(and a very large number of traceback lines)

The error occurred inside a RandomForestClassifier in a project (NBQ), which means that this issue should then be fixed either in the source data (replacing missing values) or in the modelmaker phase.
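
For readers who have not seen this message before, here is a minimal sketch of how it can arise. It assumes that some step inside the model code orders or compares the values of a column that mixes strings with Python None; the list below is made-up illustration data, not the NBQ data.

```python
# Made-up values: a string column with one missing entry delivered as None
mixed = ["b", None, "a"]

caught = None
try:
    sorted(mixed)  # sorting needs an ordering comparison between None and str
except TypeError as exc:
    caught = exc

print(type(caught).__name__, caught)
```

Whether the classifier fails in exactly this way depends on the project code, so treat this only as an illustration of the error class.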

That's somewhat surprising. Ideally, the pandas dataframes provided by different DataSources should be pretty much identical no matter the actual source (CSV, Oracle, others). As they use pandas functionality internally, this must have been taken for granted.

@Vegacyrez I don't have (write) access to an Oracle database right now -- could you please try out whether putting the following in your model's code fixes the dataframe successfully?

df.fillna(np.nan, inplace=True)

(you might have to import numpy as np in your relevant model file)

If this works as a workaround, we have some idea what the problem was and can look at how and where to apply a fix in OracleDataSource, so that you don't need the workaround from the next release on.


On testing the workaround, we received the following error:
X_train = data_train.drop(target_col, axis=1)
AttributeError: 'NoneType' object has no attribute 'drop'

It seems that the type of the object (data frame) has been changed by the workaround?

Thanks for testing. However, there seems to be something else going wrong as well. I don’t know where data_train comes from exactly, but it should not be None (which the error says it is).

Hmm, maybe you tried to assign a return value from the df.fillna(...) I suggested? This won’t work when used in the inplace=True variant (see pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).

Try to use my suggestion without an assignment.
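
To make the difference concrete, here is a small sketch of the two call styles. The frame is made-up illustration data. In the pandas versions current at the time of this issue, the inplace=True call returns None, so assigning its result is what would produce the AttributeError above.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column
df = pd.DataFrame({"a": [1, None, 3], "b": ["x", None, "z"]})

# Style 1: modify df itself; do NOT assign the return value
df.fillna(np.nan, inplace=True)

# Style 2: leave out inplace and assign the new DataFrame that is returned
filled = df.fillna(np.nan)

print(type(filled).__name__)
```

Either style works on its own; the problem only appears when the assignment from style 2 is combined with inplace=True from style 1.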

Received verification that the workaround does indeed solve the issue. IMO, the way to fix this would be to add a df.fillna(np.nan, inplace=True) step to the relevant (or all?) DataSources and/or make it a config option like coerce_numpy_nans that is True by default.

PR welcome (ideally including a regression test).

Some code I used when investigating the issue:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3, None, 6], "b": ["a", None, "c", "d", "e"]})
>>> df
     a     b
0  1.0     a
1  2.0  None
2  3.0     c
3  NaN     d
4  6.0     e
>>> df.fillna(np.nan, inplace=True)
>>> df
     a    b
0  1.0    a
1  2.0  NaN
2  3.0    c
3  NaN    d
4  6.0    e

Minimum PR: add the df.fillna(np.nan, inplace=True) line to the get_dataframe method of OracleDataSource (+ tests + doc).
Maximum PR: add it to super get_dataframes (problematic) and make it a config option with default True.
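
A rough sketch of what the optional step could look like, to make the idea easier to discuss. The function name normalize_missing and its keyword argument are hypothetical and not part of mllaunchpad; only the fillna call is taken from the workaround in this thread, and coerce_numpy_nans is the option name floated above.

```python
import numpy as np
import pandas as pd


def normalize_missing(df, coerce_numpy_nans=True):
    """Hypothetical helper: return df with missing values filled with NaN.

    coerce_numpy_nans mirrors the config option proposed in this thread;
    when it is False, the frame is returned unchanged.
    """
    if not coerce_numpy_nans:
        return df
    # Same call as the workaround, but returning a new frame instead of inplace
    return df.fillna(np.nan)


# Same example frame as in the investigation above
raw = pd.DataFrame({"a": [1, 2, 3, None, 6], "b": ["a", None, "c", "d", "e"]})
clean = normalize_missing(raw)
untouched = normalize_missing(raw, coerce_numpy_nans=False)
```

A regression test could build a frame like raw, run it through the DataSource, and check that no None values remain in the result.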

In progress (picked up by @bobplatte )