skrub-data/skrub

DatetimeEncoder can add holiday/weekend binary features


Problem Description

Add weekends/holiday features to the data

Feature Description

The DatetimeEncoder could gain a new add_weekends param that generates a binary dummy variable set to 1 when the day falls on a weekend.
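To make the request concrete, here is a minimal sketch of what such a weekend flag could compute. It is plain pandas, not skrub's API: the helper name `add_weekend_flag` and the `_is_weekend` suffix are hypothetical, and it assumes a Saturday/Sunday weekend, which does not hold in every country.

```python
import pandas as pd


def add_weekend_flag(dates: pd.Series) -> pd.Series:
    """Return 1 for Saturday/Sunday and 0 otherwise (hypothetical helper)."""
    # pandas encodes Monday as 0 and Sunday as 6, so 5 and 6 are the weekend.
    is_weekend = dates.dt.dayofweek >= 5
    return is_weekend.astype(int).rename(f"{dates.name}_is_weekend")


dates = pd.Series(
    pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"]),
    name="date",
)
weekend = add_weekend_flag(dates)  # Friday, Saturday, Sunday, Monday
```

Missing dates (NaT) would end up flagged as 0 here, which may or may not be the behaviour you want in the encoder.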

A holidays: str | None = None param could accept any country supported by the holidays package and generate another binary feature column. It could technically be its own encoder (HolidayEncoder) if you think the DatetimeEncoder is doing too much.

Alternative Solutions

No response

Additional Context

No response

Hey @baggiponte, thanks for the suggestion!

  1. The weekend binary variable would bring value and is straightforward to implement!
  2. The holidays package would add a new runtime dependency, and we prefer avoiding that.

Instead, we could improve one of our existing examples (and preferably increase its classification or regression score) using the holidays package as a documentation dependency. This way, we would demonstrate its usefulness while showcasing its usage with skrub.

Would you be interested in implementing one of those (or both)?

I would love to work on both!

About the improved example: I currently cannot open it from the website. If I click on the link on this page, nothing happens. Is that because the docs are not in sync with the latest commits? EDIT: it's #716

As a reference, I guess the source code for the example you mention is here.

EDIT2: I was thinking about creating a custom function transformer to generate holiday features following the example Olivier wrote here. What do you think Vincent?

  1. Yes indeed, the documentation is being fixed, but you can access the source code file you mentioned.

    This file is actually generated via nbconvert: we sync a notebook to a Python file, so that changes in the notebook are reflected in the file. We then manually process it to make it "sphinx compatible", i.e. we replace #% with ####### for cell breaks.

    Therefore, I suggest you reproduce example 03 in your own notebook, make the appropriate changes, and put these changes into the source file, trying to respect the sphinx syntax. I'll help you debug it if needed.

  2. Seems like a good idea to use a light FunctionTransformer, go ahead!
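For reference, a light FunctionTransformer along these lines might look like the sketch below. The function name `add_holiday_flag`, the `is_holiday` column, and the `holiday_dates` argument are all hypothetical, and the single date in the set is only an illustration. In the real example the set would presumably be built from the holidays package for the relevant country.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_holiday_flag(df, date_col="date", holiday_dates=frozenset()):
    """Append a binary 'is_holiday' column (hypothetical helper)."""
    out = df.copy()
    # Drop the time of day so timestamps compare against whole days.
    days = pd.to_datetime(out[date_col]).dt.normalize()
    out["is_holiday"] = days.isin(pd.to_datetime(list(holiday_dates))).astype(int)
    return out


holiday_encoder = FunctionTransformer(
    add_holiday_flag,
    # Illustrative calendar with one date; the user supplies the real one.
    kw_args={"date_col": "date", "holiday_dates": {"2024-07-14"}},
)

df = pd.DataFrame({"date": pd.to_datetime(["2024-07-14 18:30", "2024-07-15 09:00"])})
result = holiday_encoder.fit_transform(df)
```

Passing the calendar through `kw_args` keeps the transformer stateless and avoids a new runtime dependency in skrub itself.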

Hello! We talked about holidays as features during this week's meeting, and I found this API that might be useful for adding holidays as features: https://api-ninjas.com/api/holidays

Of course this would mean relying on an external API (maybe problematic), and the user would have to choose the proper calendar for the dataset they're working with (e.g., if the dataset is about French data, there is no point in using German holidays).

Another issue is that some holidays like Easter do not fall on fixed dates, which is what an API like the one I linked would be useful for.

In general, I think that even a set of fixed calendars to use for binary features may be useful to have. Then, it would fall on the user to choose which calendar they think is most relevant to the data.
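As a rough illustration of the fixed-calendar idea, the sketch below builds a small calendar for one year and handles a movable holiday by computing Western Easter with the anonymous Gregorian algorithm. All names (`easter_sunday`, `example_calendar`, `is_holiday`) are hypothetical. The calendar is a deliberately incomplete, French-style example, not an authoritative list.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Western Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def example_calendar(year: int) -> set[date]:
    """Illustrative, non-exhaustive calendar: a few fixed dates plus Easter Monday."""
    fixed = [(1, 1), (5, 1), (7, 14), (12, 25)]
    days = {date(year, month, day) for month, day in fixed}
    days.add(easter_sunday(year) + timedelta(days=1))  # Easter Monday moves each year
    return days


def is_holiday(day: date, calendar: set[date]) -> int:
    """Binary feature: 1 if the day is in the chosen calendar, else 0."""
    return int(day in calendar)


calendar_2024 = example_calendar(2024)
flag = is_holiday(date(2024, 4, 1), calendar_2024)  # Easter Monday 2024
```

The user would still pick which calendar matches their data; the code only shows that movable dates do not strictly require an external API.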

Thanks! There is also python-holidays.

Let's try to revive this, with python-holidays :)
Also, it might be interesting to add weather, e.g. with https://openweathermap.org/api as suggested by @ogrisel