Datafable/epu-index

Determine time zone of articles

Closed this issue · 9 comments

Hypothesis:

Every newspaper returns the date time in the time zone in which the article was published. That is, if an article was published in winter, the time zone is UTC+01, while if it was published during the summer, the time zone is UTC+02.

I can test whether the returned time zone is indeed fixed:

Visit an article for each journal and note the published date. Reset your system's locale and visit the articles again. If the time zone of one of the articles changed, then that journal uses the user's time zone to return timestamps. If not, the returned time zone is fixed.

However, if the time zone is indeed fixed, I don't see a way to determine which one it is (UTC+01, UTC+02 or, for some crazy reason, UTC?). @peterdesmet do you have an idea how we could determine the time zone?

Why do we need to know the time zone of the article?

How else should we store the "date_published"? We could make it a time zone unaware timestamp, but our app might include data for other countries in the future too. If we store all articles without time zone information, this could lead to confusing results.

After a chat, we decided to store the date times verbatim (as is, without time zone information).

If journals do not return the time zone with their published articles, no algorithm can always find the correct time zone. For example, on the night the clocks change to winter time, 3 A.M. becomes 2 A.M., so 2:30 A.M. occurs twice that night and the time zone of an article published at that time cannot be deduced.
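To illustrate the 2:30 A.M. case, here is a minimal sketch using Python's standard `zoneinfo` module (Python 3.9+, with time zone data available on the system). It assumes the newspapers publish in `Europe/Brussels`, which this thread does not confirm, and `possible_offsets` is a hypothetical helper, not something from the project.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Assumption: articles are published on Belgian wall-clock time.
BRUSSELS = ZoneInfo("Europe/Brussels")


def possible_offsets(naive, tz=BRUSSELS):
    """Return the UTC offsets a naive wall-clock time could have in tz.

    The `fold` attribute selects the first (0) or second (1) occurrence
    of a wall-clock time around a clock change. More than one offset in
    the result means the time cannot be resolved on its own: it either
    occurs twice (autumn) or falls in the skipped hour (spring).
    """
    return {naive.replace(tzinfo=tz, fold=fold).utcoffset() for fold in (0, 1)}


# 25 October 2015: clocks went back from 3:00 to 2:00 in the EU.
ambiguous = datetime(2015, 10, 25, 2, 30)
ordinary = datetime(2015, 7, 1, 12, 0)

print(possible_offsets(ambiguous))  # two candidate offsets
print(possible_offsets(ordinary))   # a single offset
```

For the October timestamp both UTC+01 and UTC+02 are valid candidates, which is why storing the verbatim value and leaving the interpretation to the researchers is the safer choice.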

If we store the verbatim date time, the researchers can always try to find the correct time zone afterwards.

Should discuss this with the client (hence the "question" label).

Related: in what time zone should the frontend show the dates? In what time zone is the API currently returning dates? I assume a "day" returned by the API should be the same "day" bucket in which the EPU is calculated and aggregated?

Customer agrees with the proposed solution.

@peterdesmet I think the front end should do the same thing: return the datetime exactly as it was stored in the database (so time zone unaware).

Frontend currently shows all dates as UTC:

  • Overview chart: API gives average per month, chart shows months (with day set to 01 and time set to UTC midnight)
  • Detailed chart: API gives average per day (I don't know which time zone the API uses to define a day), chart shows days (with time set to UTC midnight)
  • No date and time are shown for highest ranking article.

Is this OK?

This is not entirely correct. The input datetimes are timezone unaware, so it's basically impossible to show them as UTC. What we are actually doing is assuming they are all in the same time zone (whatever that may be), and the front end should show them in that time zone.

Everything else is ok.

I should verify whether all date times are saved without time zone, because if this is inconsistent, it might lead to unexpected results.
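A check along these lines could be sketched as follows. It assumes the stored values are available as Python `datetime` objects; `is_naive` and `find_aware` are hypothetical names, and how the values are loaded from the database is left out.

```python
from datetime import datetime, timezone


def is_naive(dt):
    """True if dt carries no usable time zone information.

    This mirrors the definition in the Python docs: a datetime is aware
    only if tzinfo is set and utcoffset() returns a value.
    """
    return dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None


def find_aware(values):
    """Return the datetimes that unexpectedly carry a time zone."""
    return [dt for dt in values if not is_naive(dt)]


# Made-up sample values, for illustration only.
sample = [
    datetime(2015, 6, 1, 14, 30),                       # naive, as intended
    datetime(2015, 6, 1, 14, 30, tzinfo=timezone.utc),  # aware, inconsistent
]
print(find_aware(sample))
```

An empty result would mean all date times are stored consistently without time zone.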

Had to update the standaard spider. Now all spiders return date time information without time zone.
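As a rough sketch of what "without time zone" means here: if a source returns an ISO 8601 string with an offset, the offset is dropped and the wall-clock value is kept unchanged. The function name `to_verbatim` and the ISO 8601 input format are assumptions for illustration; the actual spider code may parse a different format.

```python
from datetime import datetime


def to_verbatim(raw):
    """Parse an ISO 8601 string and drop any UTC offset.

    replace(tzinfo=None) keeps the wall-clock value exactly as published;
    it does not convert to another time zone the way astimezone() would.
    """
    parsed = datetime.fromisoformat(raw)
    return parsed.replace(tzinfo=None)


print(to_verbatim("2015-06-01T14:30:00+02:00"))  # offset dropped, 14:30 kept
print(to_verbatim("2015-06-01T14:30:00"))        # already naive, unchanged
```

Using `replace` rather than a conversion keeps the stored value identical to what the newspaper displayed, which matches the decision to store date times verbatim.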