shawnbrown/datatest

NaT issue

Belightar opened this issue · 5 comments

Greetings, @shawnbrown

to be short,

my pd.Series is like:
Date
0 NaT
1 NaT
2 NaT
3 2010-12-31
4 2010-12-31
Name: Date, dtype: datetime64[ns]
the type of NaT is:
<class 'pandas._libs.tslibs.nattype.NaTType'>
when I use the following code:

with accepted(Extra(pd.NaT)):
    validate(data, requirement)

I found that the NaTs are not recognized. I tried many variants of Extra and also tried using a function, but all failed.

I need your help here. Thanks for your work.

Hello--thanks for filing this issue. I'd like to replicate your problem as accurately as I can before I start addressing the issue.

I have some sample code below but I'm not sure what you're using as the requirement:

from datetime import datetime
import pandas as pd
from datatest import validate

data = pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
])

requirement = ???  # <- What is this?
validate(data, requirement)

Can you tell me what your requirement value is?

Thanks for your reply.

from datetime import datetime, timedelta
import pandas as pd
from datatest import validate, accepted, Extra

data = pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
])

Today = datetime.today()
Tomorrow = Today + timedelta(days=1)

def date_requirement(var_datetime):
    return pd.Timestamp(year=2000, month=1, day=1) < var_datetime < \
            pd.Timestamp(year=Tomorrow.year, month=Tomorrow.month, day=Tomorrow.day)

with accepted(Extra(pd.NaT)):
    validate(data, date_requirement)

Here I want to accept the NaT values. I tried pd.NaT, np.datetime64('NaT'), and the NanToken method mentioned in the documentation, and the results are the same:

datatest.ValidationError: does not satisfy date_requirement() (3 differences): [
    Invalid(numpy.datetime64('NaT')),
    Invalid(numpy.datetime64('NaT')),
    Invalid(numpy.datetime64('NaT')),
]

Ah, OK. As a stopgap, you can use the accepted.args() method together with the pd.isna() function:

...

with accepted.args(pd.isna):
    validate(data, date_requirement)

The accepted.args() method accepts differences whose args satisfy a given predicate. And by using pd.isna() as the predicate, you can accept differences that contain NaT, NaN, or other "missing value" objects.
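To illustrate why pd.isna() works as a predicate here, the following is a minimal sketch that uses only pandas and NumPy (no datatest calls). The names `candidates`, `is_missing`, and `in_range` are made up for this example. It shows that pd.isna() recognizes each flavor of missing value, while ordinary comparisons against NaT evaluate to False, which is consistent with date_requirement() rejecting those rows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: several "missing value" objects plus one real date.
candidates = {
    "pd.NaT": pd.NaT,
    "np.datetime64('NaT')": np.datetime64("NaT"),
    "float('nan')": float("nan"),
    "None": None,
    "pd.Timestamp('2010-12-31')": pd.Timestamp("2010-12-31"),
}

# pd.isna() returns True for every missing-value object and False for the date.
is_missing = {name: bool(pd.isna(value)) for name, value in candidates.items()}

# NaT never compares equal to anything, including itself ...
nat_equals_itself = pd.NaT == pd.NaT

# ... and an interval check like the one in date_requirement() is False for NaT.
in_range = pd.Timestamp(2000, 1, 1) < pd.NaT < pd.Timestamp(2100, 1, 1)

for name, flag in is_missing.items():
    print(f"{name}: missing={flag}")
print("NaT == NaT:", nat_equals_itself)
print("NaT within interval:", in_range)
```

Because equality with NaT is always False, a predicate such as pd.isna() is generally a more dependable way to identify these values than comparing against pd.NaT directly.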

For a longer-term solution, I want to bring the handling of these NaT values in line with how datatest handles other NaN values (as documented here). I will follow up when I have addressed this issue more thoroughly.

Thank you so much.
Your code works well in my project.
And yes, I also used pd.isna to check whether a value is pd.NaT or not. (Is this the only way?) I simply dropped those rows and then ran datatest.
I've used Python and programmed for 3 years and hadn't realized there are differences among bool, np.bool_ or pd.NaT, pd.Nan, np.nan, nan before.
I've learned a lot from your work, and thanks again for your patience.
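For reference, the "drop those rows first" approach mentioned above might look roughly like the sketch below. It uses only pandas; the variable name `cleaned` is made up for this example, and the final validate() call is shown as a comment because it depends on datatest and the date_requirement() function from earlier in the thread:

```python
from datetime import datetime
import pandas as pd

data = pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
])

# Remove the missing (NaT) entries before validating.
cleaned = data.dropna()

print(len(data), "rows before,", len(cleaned), "rows after dropping NaT")

# The cleaned series could then be validated as before, for example:
# validate(cleaned, date_requirement)
```

Note that dropping rows means the missing values are never reported, whereas accepted.args(pd.isna) keeps them in the data and explicitly accepts them as known differences. Which one fits better depends on whether the missing dates should stay visible in the test.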

I'm glad you found it helpful. I noticed that your date_requirement() function is checking for an interval. If it suits your needs, you could also use the validate.interval() method:

...

begin_date = pd.Timestamp(year=2000, month=1, day=1)
tomorrow = pd.Timestamp(datetime.today() + timedelta(days=1))

with accepted.args(pd.isna):
    validate.interval(data, begin_date, tomorrow)

One difference with this approach is that time differences trigger Deviation objects that contain a timedelta. There are some how-to documents for date handling that you might find helpful as well: