AI-team-UoA/pyJedAI

Normalization of NaN is not working as intended

mrckzgl opened this issue · 2 comments

The data class tries to normalize na / nan values into empty strings.
This is done here:

self.dataset_1 = self.dataset_1.astype(str)
self.dataset_1.fillna("", inplace=True)
if not self.is_dirty_er:
    self.dataset_2 = self.dataset_2.astype(str)
    self.dataset_2.fillna("", inplace=True)

but it does not work as intended.
Casting the DataFrame to str turns every NaN value into the string "nan", so the subsequent fillna has nothing left to fill.
See:

>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(float))
      0
0  True
>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(str))
       0
0  False
>>> 

I do not know the best way to handle the intended conversion, though. One option would be to simply swap the order: call fillna first and cast to string afterwards. But I don't know what happens when fillna('', inplace=True) is applied to columns whose dtype is not a string type.
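One way to sidestep the dtype question might be to cast to object before filling, so that fillna never has to place a string into a numeric column. A minimal sketch, not the library's code; the helper name normalize_missing and the sample columns are made up for illustration:

```python
import numpy as np
import pandas as pd


def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values as "" and everything else as str."""
    # Cast to object first so that "" is a valid fill value for every column,
    # then fill the missing cells, and only then convert all values to str.
    return df.astype(object).fillna("").astype(str)


# Hypothetical sample data: one numeric column and one text column.
df = pd.DataFrame({"price": [1.5, np.nan], "name": ["a", None]})
clean = normalize_missing(df)
print(clean)
```

Because the helper returns a new frame, it would be assigned back (for example to self.dataset_1) rather than used with inplace=True.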

best

I also wonder whether converting the whole DataFrame to strings is necessary and good practice, since afterwards NA and the empty string can no longer be distinguished ...
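If that distinction matters downstream, the missing-value mask could be recorded before the cast and kept alongside the string data. A rough sketch under that assumption; the variable names and sample data are illustrative only and not part of pyJedAI:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "name" holds a real empty string and a missing value.
df = pd.DataFrame({"name": ["a", "", None], "price": [1.5, 2.0, np.nan]})

# Record which cells were missing before any conversion takes place.
was_missing = df.isna()

# Cast to str, then blank out the cells that were originally missing.
as_text = df.astype(str).mask(was_missing, "")

print(as_text)
print(was_missing)
```

Here as_text contains only strings, while was_missing still tells an originally empty string apart from an originally missing value.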

Hello, and I'm sorry for the late reply.

You're right on both remarks. The NaN handling indeed has no effect in its current form, so I think swapping the order of the two calls will do the trick:

        # Fill NaN values with empty string
        self.dataset_1.fillna("", inplace=True)
        self.dataset_1 = self.dataset_1.astype(str)
        if not self.is_dirty_er:
            self.dataset_2.fillna("", inplace=True)
            self.dataset_2 = self.dataset_2.astype(str)

As for the str conversion, it is necessary to ensure that later steps never have to handle any other types. Mixed types caused issues in many other steps, which is why we decided to handle it this way.

The above fix will be uploaded in the next release.

Thank you for sharing it with us!

Konstantinos