PySport/kloppy

[Opta] Ordering of events

probberechts opened this issue · 14 comments

I noticed that Opta events can sometimes be slightly out of order. The F24 docs specify that the following attributes (in the given order) should be used to order each team's match events chronologically:

[Screenshot of the F24 documentation listing the attributes used for ordering]

Only sorting by timestamp does not always give the same result. For example:

<Event id="1889768843" event_id="358" type_id="1" period_id="1" min="32" sec="3" player_id="59062" team_id="174" outcome="0" x="21.6" y="39.2" timestamp="2018-08-20T21:32:27.98" last_modified="2018-08-20T21:32:28" version="1534797148460"></Event>         
<Event id="1592827425" event_id="228" type_id="1" period_id="1" min="32" sec="4" player_id="80908" team_id="957" outcome="0" x="60.4" y="52.0" timestamp="2018-08-20T21:32:27.635" last_modified="2018-08-21T16:43:18" version="1534866198424"></Event>

Since the Opta deserializer currently only parses the "timestamp" field, it does not seem possible to order events chronologically.
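For illustration, parsing these two timestamps naively and sorting by the result reverses the min/sec order (a minimal Python sketch using only the attributes shown above):

from datetime import datetime

# The two events from the F24 sample above, reduced to the relevant attributes.
events = [
    {"event_id": 358, "min": 32, "sec": 3, "timestamp": "2018-08-20T21:32:27.98"},
    {"event_id": 228, "min": 32, "sec": 4, "timestamp": "2018-08-20T21:32:27.635"},
]

# Naive parse: %f right-pads ".98" to 980000 microseconds (0.98 s).
for event in events:
    event["parsed"] = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%f")

print([e["event_id"] for e in sorted(events, key=lambda e: e["parsed"])])           # [228, 358]
print([e["event_id"] for e in sorted(events, key=lambda e: (e["min"], e["sec"]))])  # [358, 228]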

koenvo commented

Are there any details on how to sort correctly and maintain millisecond precision?

A solution could be to derive the timestamp from the "min" and "sec" attributes, but then we lose the precision.

My documentation doesn't mention the precision of the "timestamp" field. However, my version of the documentation is extremely outdated. Maybe @JanVanHaaren has something more up-to-date.

I find it strange that the "timestamp" field does not align with the "min" and "sec" fields. If the precision of the "timestamp" field were lower than that of the "min" and "sec" fields, I don't see why we would infer an (incorrect) millisecond precision from it.

Looking at a few more timestamps, I now realize that Opta does not add leading zeros to the milliseconds. So, "2018-08-20T21:32:27.98" is actually "2018-08-20T21:32:27.098000".

Python's %f pads zeros to the right, while we should pad zeros to the left to parse the Opta timestamp. We should simply adapt the timestamp parser and then it should work.

%f is an extension to the set of format characters in the C standard (but implemented separately in datetime objects, and therefore always available). When used with the strptime() method, the %f directive accepts from one to six digits and zero pads on the right.
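A minimal sketch of how the parsing could be adapted, assuming the fractional part of the Opta timestamp encodes milliseconds without leading zeros:

from datetime import datetime

def parse_opta_timestamp(raw: str) -> datetime:
    # Left-pad the fractional seconds to three digits so that ".98"
    # is interpreted as 98 ms (0.098 s) instead of 0.98 s.
    if "." in raw:
        base, fraction = raw.split(".")
        return datetime.strptime(f"{base}.{fraction.rjust(3, '0')}", "%Y-%m-%dT%H:%M:%S.%f")
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")

print(parse_opta_timestamp("2018-08-20T21:32:27.98"))
# 2018-08-20 21:32:27.098000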

The min and sec fields on the one hand and the timestamp field on the other provide different pieces of information about an event. The min and sec fields give the game time in minutes and seconds at which the event occurred, whereas the timestamp field gives the date and time (in UK time) at which the event was logged. Hence, the timestamp field can be used as a tie-breaker to order events, but not to derive the time at which the event occurred in the match.

Documentation Opta F24

  • timestamp - "The UK time/date at which this event was initially entered into Opta’s database"
  • min - "Minute of the event"
  • sec - "Second of the event"

Documentation Stats Perform MA3

  • timestamp - "The UK time/date at which this event was initially entered into Opta's database"
  • timeMin - "Game time in minutes"
  • timeSec - "Game time in seconds"

So, to conclude, would it be okay to fill the "timestamp" field in Kloppy with min + sec and order events based on min + sec + timestamp?
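To make that concrete, a rough sketch of such an ordering (attribute names as in the F24 feed; parse_opta_timestamp is the hypothetical parser sketched above):

def order_f24_events(raw_events):
    # Order by period and game time first; use the wall-clock timestamp
    # only as a tie-breaker for events within the same game second.
    return sorted(
        raw_events,
        key=lambda e: (
            int(e["period_id"]),
            int(e["min"]),
            int(e["sec"]),
            parse_opta_timestamp(e["timestamp"]),
        ),
    )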

That suggestion sounds good to me. The Wyscout V3 deserializer also fills the timestamp field based on the minute and second fields, although it would probably be better to use the provided matchTimestamp field. The StatsBomb deserializer uses the provided timestamp.

Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

> Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

I would just make sure that the records in a dataset are chronologically ordered. Storing a sequence number then does not provide any added value since you would be able to infer it from the position in the list of records.
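For example, assuming a kloppy event dataset whose records are in chronological order, a sequence number can be recovered on the fly:

# Hypothetical sketch: `dataset` is an already-loaded, chronologically ordered event dataset.
for sequence_number, event in enumerate(dataset.events):
    ...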

koenvo commented

Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec?
So only when the record is altered does the timestamp lose value, correct?

> Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec? So only when the record is altered does the timestamp lose value, correct?

My understanding is that the timestamp field is never updated. The timestamp field reflects the time when the event was initially entered in the database and the last_modified field reflects the time when the event was last updated in the database.

I suspect that the timestamp field is reasonably accurate for events that are recorded live. However, not all event data is recorded live and events can occasionally be inserted at a later time during the match or even after the match.

However, according to my old documentation, the timestamp field reflects the time at which the event occurred within the match. 😕

[Screenshot of the old documentation describing the timestamp field]

I will contact the Stats Perform support desk. The official documentation is confusing.

Documentation website

  • timestamp - "The UK time/date at which this event was initially entered into Opta's database"
  • timestamp_utc - "The UTC timestamp of when the event occurred, or when the data was entered in Opta DB"
  • last_modified - "The UK time/date at which this event was last modified by Opta"

I haven't heard back yet from Stats Perform, but I think I finally understand how the timestamps work. I suspect the meaning of the timestamp field depends on the coverage level. The event timestamps are detailed to the millisecond for some but not all coverage levels.

For example, the event data for this friendly match between Salzburg and Ried has coverage level 14. The game took place on 12 October 2023, but the timestamp for the kick-off event is 2023-10-15T08:49:39.373Z.

{
	"id": "9130ocq9mdrosrd4mv7a666tw",
	"coverageLevel": "14",
	"date": "2023-10-12Z",
	"time": "12:00:00Z",
	"localDate": "2023-10-12",
	"localTime": "14:00:00",
	"numberOfPeriods": 2,
	"periodLength": 45,
	"overtimeLength": 15,
	"lastUpdated": "2023-11-25T12:46:38Z",
	"description": "Salzburg vs Ried",
	...
},
{
	"id": 2604454267,
	"eventId": 3,
	"typeId": 1,
	"periodId": 1,
	"timeMin": 0,
	"timeSec": 0,
	"contestantId": "do3l4dhs0ooog6se728jxc06z",
	"playerId": "3rmiekqhf431q783nhdc2m12h",
	"playerName": "W. Eza",
	"outcome": 1,
	"x": 49.8,
	"y": 50.0,
	"timeStamp": "2023-10-15T08:49:39.373Z",
	"lastModified": "2023-10-16T00:39:15Z",
	"qualifier": [
		...
	]
},

The question is rather whether the timestamps can be used as a reliable way to measure the relative time that has passed since the "period start" event.

I don't know yet, but my feeling is that it should be possible for the highest coverage levels. I'll investigate a few more matches. Unfortunately, I don't have access to much event data that was collected at lower coverage levels.
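If the millisecond timestamps turn out to be reliable at the highest coverage levels, the relative time since the period start could in principle be derived along these lines (a hypothetical sketch using the MA3 field names shown above):

from datetime import datetime

def parse_ma3_timestamp(raw: str) -> datetime:
    # MA3 event timestamps look like "2023-10-15T08:49:39.373Z".
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")

def seconds_since_period_start(event, period_start_event):
    # Only meaningful if both events were recorded live with millisecond precision.
    delta = parse_ma3_timestamp(event["timeStamp"]) - parse_ma3_timestamp(period_start_event["timeStamp"])
    return delta.total_seconds()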