ResidentMario/missingno

Improve the freq argument of matrix

wangsen992 opened this issue · 7 comments

This is probably a very simple bug to fix. If nobody tackles it, I will probably fix it myself when I have time and open a pull request. The problem is that when the freq argument of msno.matrix() is set, the code in missingno.py (shown below) builds a date_range starting from the beginning of the day. When df.index.get_loc(value) cannot find one of those values in the index, the resulting KeyError is caught and re-raised, and the operation halts.

The issue is that time series data often does not begin or end on a full-day boundary, i.e. at 00:00. So simply clipping the generated range to the range of the input df might solve the problem.

if freq:
        ts_list = []

        if type(df.index) == pd.PeriodIndex:
            ts_array = pd.date_range(df.index.to_timestamp().date[0],
                                     df.index.to_timestamp().date[-1],
                                     freq=freq).values

            ts_ticks = pd.date_range(df.index.to_timestamp().date[0],
                                     df.index.to_timestamp().date[-1],
                                     freq=freq).map(lambda t:
                                                    t.strftime('%Y-%m-%d'))

        elif type(df.index) == pd.DatetimeIndex:
            ts_array = pd.date_range(df.index.date[0], df.index.date[-1],
                                     freq=freq).values

            ts_ticks = pd.date_range(df.index.date[0], df.index.date[-1],
                                     freq=freq).map(lambda t:
                                                    t.strftime('%Y-%m-%d'))
        else:
            raise KeyError('Dataframe index must be PeriodIndex or DatetimeIndex.')
        try:
            for value in ts_array:
                ts_list.append(df.index.get_loc(value))
        except KeyError:
            raise KeyError('Could not divide time index into desired frequency.')
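To make the failure concrete, here is a minimal sketch that reproduces the lookup error outside missingno. The sample data is made up, and the clipping step is only one reading of the "cut off with the range of input df" idea, not the fix the project adopted.

```python
import pandas as pd

# Hypothetical sample: 12-hourly rows that start and end at noon, not midnight.
index = pd.to_datetime([
    "2017-01-01 12:00", "2017-01-02 00:00", "2017-01-02 12:00",
    "2017-01-03 00:00", "2017-01-03 12:00",
])
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Same construction as in the snippet above: the grid is built from calendar
# dates, so its first point is midnight of the first day.
grid = pd.date_range(df.index.date[0], df.index.date[-1], freq="D")

try:
    positions = [df.index.get_loc(ts) for ts in grid]
    lookup_failed = False
except KeyError:
    # 2017-01-01 00:00 is not a row in df, so the very first lookup fails.
    lookup_failed = True

# Assumed fix: keep only grid points that fall inside the data's actual span.
clipped = grid[(grid >= df.index[0]) & (grid <= df.index[-1])]
clipped_positions = [df.index.get_loc(ts) for ts in clipped]
```

Clipping only helps when every remaining grid point is an actual row. An index with gaps in the middle would still raise KeyError.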

PS: Hopefully the format of this issue is clear. This is my first time raising an issue, so any suggestions for improving it are welcome.

And great work with this project!

Actually, I think simply moving the try-except clause inside the for loop might just work.

for value in ts_array:
    try:
        ts_list.append(df.index.get_loc(value))
    except KeyError:
        logging.warning('Could not divide time index into desired frequency.')

Something like that, so that a missing value does not break the for loop.
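A self-contained sketch of that idea, runnable outside missingno. The helper name locate_ticks and the sample data are hypothetical. It also returns the ticks that were found, on the assumption that the tick labels would need the same filtering to stay aligned with the positions.

```python
import logging

import pandas as pd


def locate_ticks(index, ticks):
    """Return (positions, kept_ticks) for the ticks that exist in `index`.

    Ticks with no matching row are logged and skipped instead of aborting.
    """
    positions, kept = [], []
    for tick in ticks:
        try:
            positions.append(index.get_loc(tick))
            kept.append(tick)
        except KeyError:
            logging.warning("No row at %s; skipping this tick.", tick)
    return positions, kept


# Hypothetical sample: the data starts at noon, so the first midnight is missing.
index = pd.to_datetime([
    "2017-01-01 12:00", "2017-01-02 00:00", "2017-01-02 12:00",
    "2017-01-03 00:00", "2017-01-03 12:00",
])
ticks = pd.date_range("2017-01-01", "2017-01-03", freq="D")

positions, kept = locate_ticks(index, ticks)
labels = [t.strftime("%Y-%m-%d") for t in kept]
```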

If you go ahead and submit a PR I'm happy to take a look at that. :)

What is the status of this issue? I just installed the package and obviously this bug is still NOT fixed?

This bug probably still exists. I didn't look at freq the last time I did an OSS maintenance day; I'll try to look at it the next time I have time.

heyej commented

If someone has this problem and cannot cut off their timeseries (gaps between days), another solution could be to reindex time series with a complete range of dates (hh:mm:ss as necessary) and fill the value gaps with NaN.
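A minimal sketch of that workaround, assuming 12-hourly sample data with gaps. The column name and timestamps are made up. After reindexing, every point of a daily grid exists as a row, so the get_loc lookups no longer fail.

```python
import pandas as pd

# Hypothetical sample with gaps: it starts at noon and skips some 12-hour slots.
index = pd.to_datetime(["2017-01-01 12:00", "2017-01-02 12:00", "2017-01-03 00:00"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=index)

# Complete 12-hourly range, starting from midnight of the first day.
full_index = pd.date_range(
    df.index[0].normalize(), df.index[-1], freq=pd.Timedelta(hours=12)
)

# Rows that did not exist before are added and filled with NaN.
filled = df.reindex(full_index)

# Every daily grid point is now present in the index.
daily = pd.date_range(filled.index.date[0], filled.index.date[-1], freq="D")
positions = [filled.index.get_loc(ts) for ts in daily]
```

The added NaN rows will show up as missing values in the matrix plot, which may or may not be what you want.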

Try removing the .values from the code.


Hi,

The index of my dataframe has values in "yyyy-mm-dd hh:mi:ss" format, and the rows are at 15-minute intervals. Can you tell me how to use the freq parameter on the matrix plot?