statsbomb/open-data

Documentation mismatch

Justice4Joffrey opened this issue · 4 comments

Some fields use hyphens instead of underscores for variable names and certain fields (e.g. 'off_camera') aren't described at all.

I'd also really like to know what variables like density, density.incone, AngleDeviation, Shot5, Shot6, etc... mean, and whether variables like DistToGoal and DistToKeeper are given in metres or arbitrary pitch units.

The fields described by @JoGall are generated by the data cleaning functions of https://github.com/statsbomb/StatsBombR.

I've got a decent understanding of them because I'm finishing a port of them all to Python (check out https://github.com/ElSaico/pyStatsBomb in the next few days; I'll owe you all the API functionality because I lack the necessary $resources$ to access it).

Shots

Shot5, Shot6, etc. seem to be artefacts of an earlier import glitch that has already been fixed: statsbomb/StatsBombR@2e38647

All distance variables use the same unit as the positions, i.e. they're scaled to a 120x80 pitch.
DistToGoal is exactly what it implies, but DistToKeeper refers, counter-intuitively, to the distance between keeper and goal (!). The distance between shot and keeper is in DistSGK.
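
To make the three distances concrete, here is a minimal Python sketch. It assumes the attacked goal is centred at (120, 40) on the 120x80 pitch and reads DistSGK as the shooter-to-goalkeeper distance. The function and constant names are made up for illustration and are not part of StatsBombR or pyStatsBomb.

```python
import math

# Assumption: the attacked goal is centred at (120, 40) in pitch units.
GOAL_CENTRE = (120.0, 40.0)


def distance(a, b):
    """Euclidean distance between two (x, y) points, in pitch units."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def shot_distances(shot_xy, keeper_xy):
    """Hypothetical helper returning the three distance variables."""
    return {
        # shooter to the centre of the goal
        "DistToGoal": distance(shot_xy, GOAL_CENTRE),
        # goalkeeper to the centre of the goal (not shooter to keeper)
        "DistToKeeper": distance(keeper_xy, GOAL_CENTRE),
        # shooter to goalkeeper (assumed meaning of DistSGK)
        "DistSGK": distance(shot_xy, keeper_xy),
    }
```

For a shot from (108, 40) with the keeper at (118, 40), this gives 12, 2 and 10 pitch units respectively, which shows why DistToKeeper tends to look surprisingly small.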

All angular variables are in degrees. AngleToGoal and AngleToKeeper are the opening angles formed by DistToGoal and DistToKeeper, respectively, while AngleDeviation is the opening angle between both.
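
A rough sketch of how such angles could be computed, in Python. The convention used here (angle measured at the goal centre, starting from the goal line, so a point straight in front of goal gives 90 degrees) is an assumption on my part rather than something confirmed in this thread, as are the goal-centre coordinates (120, 40) and the function names. The actual StatsBombR code may differ.

```python
import math

# Assumption: the attacked goal is centred at (120, 40) in pitch units.
GOAL_CENTRE = (120.0, 40.0)


def angle_from_goal_line(point_xy):
    """Angle in degrees, at the goal centre, between the goal line and
    the segment from the goal centre to the given point (assumed convention)."""
    dx = GOAL_CENTRE[0] - point_xy[0]  # how far the point is from the goal line
    dy = GOAL_CENTRE[1] - point_xy[1]  # lateral offset from the goal centre
    return math.degrees(math.atan2(dx, dy))


def shot_angles(shot_xy, keeper_xy):
    """Hypothetical helper returning the three angular variables."""
    to_goal = angle_from_goal_line(shot_xy)
    to_keeper = angle_from_goal_line(keeper_xy)
    return {
        "AngleToGoal": to_goal,
        "AngleToKeeper": to_keeper,
        # absolute difference between the two angles
        "AngleDeviation": abs(to_goal - to_keeper),
    }
```

Under this convention, a keeper standing exactly on the line between the shooter and the goal centre gives an AngleDeviation of 0.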

Freeze frames

density and density.incone are both described in the README:

  • Density is calculated as the aggregated inverse distance for each defender behind the ball.
  • Density in the cone is the density filtered for only defenders who are in the cone between the shooter, and each goal post.
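
The two README definitions above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: "behind the ball" is taken to mean a larger x than the shooter (closer to the goal at x = 120), distances are measured from the shooter, the goal posts are placed at (120, 36) and (120, 44), and the goalkeeper is left out of the `defenders` list by the caller. The function names are hypothetical.

```python
import math

# Assumption: goal posts of the attacked goal, in pitch units.
LEFT_POST = (120.0, 36.0)
RIGHT_POST = (120.0, 44.0)


def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_cone(shooter, point):
    """True if point lies inside the triangle shooter / left post / right post."""
    d1 = _cross(shooter, LEFT_POST, point)
    d2 = _cross(LEFT_POST, RIGHT_POST, point)
    d3 = _cross(RIGHT_POST, shooter, point)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # Inside (or on an edge) when the three signs do not disagree.
    return not (has_neg and has_pos)


def density(shooter, defenders, cone_only=False):
    """Sum of 1/distance over defenders behind the ball.

    With cone_only=True, only defenders inside the cone are counted.
    """
    total = 0.0
    for d in defenders:
        if d[0] <= shooter[0]:
            continue  # not behind the ball (assumed meaning)
        if cone_only and not in_cone(shooter, d):
            continue
        total += 1.0 / math.hypot(d[0] - shooter[0], d[1] - shooter[1])
    return total
```

A defender 4 units away contributes 0.25 and one 10 units away contributes 0.1, so close defenders dominate the value.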

The other variables are:

  • DefendersInCone - number of defending players between the shooter and the goal
  • distance.ToD1 - distance between shooter and nearest defending player
  • distance.ToD2 - distance between shooter and second-nearest defending player
  • InCone.GK - whether the goalkeeper is in the path between the shooter and the goal
  • AttackersBehindBall and DefendersBehindBall - self-explanatory
  • DefArea - area of the smallest square that covers all opposite defenders (which means centre-backs and full-backs only)

All variables exclude the defending goalkeeper, except, obviously, for InCone.GK.
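
As an illustration of a few of these variables, here is a short Python sketch. It assumes freeze-frame positions are given as (x, y) tuples, that the goalkeeper has already been removed from `defenders`, and that "behind the ball" means a larger x than the shooter. The function name and input format are hypothetical.

```python
import math


def freeze_frame_summary(shooter, defenders, attackers):
    """Hypothetical helper computing a few freeze-frame variables.

    shooter: (x, y) of the shot
    defenders: list of (x, y) for opposing outfield players (no goalkeeper)
    attackers: list of (x, y) for the shooter's teammates
    """
    # Distances from the shooter to every defender, nearest first.
    dists = sorted(
        math.hypot(d[0] - shooter[0], d[1] - shooter[1]) for d in defenders
    )
    return {
        "distance.ToD1": dists[0] if len(dists) > 0 else None,
        "distance.ToD2": dists[1] if len(dists) > 1 else None,
        # Assumed meaning of "behind the ball": closer to the goal at x = 120.
        "DefendersBehindBall": sum(1 for d in defenders if d[0] > shooter[0]),
        "AttackersBehindBall": sum(1 for a in attackers if a[0] > shooter[0]),
    }
```

Returning None when fewer than two defenders are in the frame is a design choice of this sketch, not something taken from StatsBombR.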

Time

All extra time-related variables are in milliseconds and seem to have pretty descriptive names.

Thanks for taking the time for such a detailed reply @ElSaico!

I thought DistToKeeper was much lower than expected, so I wondered if it was given in an unexpected unit of measurement. That makes more sense! For anyone else reading, DistToKeeper is the distance from the GK to the centre of the goal (not the nearest part of the goal line).

I didn't notice density and density.incone in the documentation when I first looked -- seems they'd be very useful for xG models. I haven't seen several of the other variables (e.g. DistSGK, AttackersBehindBall, DefArea) as I don't think they're available in the free data but good to know.

Good luck with pyStatsBomb and making the data accessible to more people!

At some point we'll tidy up StatsBombR and document the inner workings of @YamStats's brain, but for the most part it's provided as is to give people a bit of a leg up using the data. Happy to see issues raised in the other repo for any other improvements. In the meantime, the docs have been updated today, so there shouldn't be anything in the raw data that's not covered now.