Documentation mismatch
Justice4Joffrey opened this issue · 4 comments
Some fields use hyphens instead of underscores for variable names and certain fields (e.g. 'off_camera') aren't described at all.
I'd also really like to know what variables like density, density.incone, AngleDeviation, Shot5, Shot6, etc. mean, and whether variables like DistToGoal and DistToKeeper are given in metres or arbitrary pitch units.
Those fields described by @JoGall are generated by the data cleaning functions of https://github.com/statsbomb/StatsBombR.
I've got a decent understanding of them because I'm finishing porting them all to Python (check out https://github.com/ElSaico/pyStatsBomb in the next few days - I'll owe you all the API functionality because I lack the necessary
Shots
Shot5, Shot6, etc. seem to be earlier glitches from importing that already got fixed: statsbomb/StatsBombR@2e38647
All distance variables use the same unit as the positions, i.e. they're scaled to a 120x80 pitch.
DistToGoal is exactly what it implies, but DistToKeeper refers, counter-intuitively, to the distance between keeper and goal (!). The distance between shooter and keeper is in DistSGK.
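To make the three distances concrete, here is a minimal Python sketch. It assumes the attacked goal is centred at (120, 40) on the 120x80 pitch and reads DistSGK as the shooter-to-keeper distance; the function and argument names are made up for this example and are not taken from StatsBombR or pyStatsBomb.

```python
import math

# Assumption: centre of the attacked goal on a 120x80 pitch.
GOAL_CENTRE = (120.0, 40.0)

def shot_distances(shot_xy, keeper_xy):
    """Return (DistToGoal, DistToKeeper, DistSGK) in pitch units (hypothetical helper)."""
    dist_to_goal = math.dist(shot_xy, GOAL_CENTRE)      # shooter -> goal centre
    dist_to_keeper = math.dist(keeper_xy, GOAL_CENTRE)  # keeper -> goal centre, despite the name
    dist_sgk = math.dist(shot_xy, keeper_xy)            # shooter -> keeper (assumed meaning)
    return dist_to_goal, dist_to_keeper, dist_sgk

# Shooter 12 units out, keeper 2 units off the line, both on the central axis.
d_goal, d_keeper, d_sgk = shot_distances((108.0, 40.0), (118.0, 40.0))
```

With these inputs DistToKeeper comes out much smaller than DistToGoal, which is the pattern that looks odd if you expect it to be measured from the shooter.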
All angular variables are in degrees. AngleToGoal and AngleToKeeper are the opening angles formed by DistToGoal and DistToKeeper, respectively, while AngleDeviation is the opening angle between both.
Freeze frames
density and density.incone are both described in the README:
- Density is calculated as the aggregated inverse distance for each defender behind the ball.
- Density in the cone is the density filtered for only defenders who are in the cone between the shooter, and each goal post.
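A rough Python sketch of those two definitions, to show how they relate. This is not the StatsBombR code: it assumes goal posts at (120, 36) and (120, 44), treats "behind the ball" as "closer to the goal line than the shooter", and uses made-up helper names.

```python
import math

# Assumption: goal posts of the attacked goal on a 120x80 pitch.
POSTS = ((120.0, 36.0), (120.0, 44.0))

def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_cone(shooter, player):
    """True if player lies inside (or on the edge of) the triangle shooter/post/post."""
    a, b, c = shooter, POSTS[0], POSTS[1]
    signs = (_cross(a, b, player), _cross(b, c, player), _cross(c, a, player))
    # Inside when the point is on the same side of all three edges.
    return not (min(signs) < 0 and max(signs) > 0)

def density(shooter, defenders, cone_only=False):
    """Sum of 1/distance over defenders behind the ball (hypothetical helper)."""
    total = 0.0
    for d in defenders:
        if d[0] <= shooter[0]:
            continue  # assumption: not "behind the ball" from the shooter's view
        if cone_only and not in_cone(shooter, d):
            continue
        total += 1.0 / math.dist(shooter, d)
    return total

shooter = (100.0, 40.0)
defenders = [(110.0, 40.0), (105.0, 60.0), (90.0, 40.0)]  # goalkeeper excluded
all_density = density(shooter, defenders)
cone_density = density(shooter, defenders, cone_only=True)
```

Here the third defender is ignored by both measures, and only the first one sits inside the cone, so the in-cone value is the smaller of the two.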
The other variables are:
- DefendersInCone: number of defending players between the shooter and the goal
- distance.ToD1: distance between shooter and nearest defending player
- distance.ToD2: distance between shooter and second-nearest defending player
- InCone.GK: whether the goalkeeper is in the path between the shooter and the goal
- AttackersBehindBall and DefendersBehindBall: self-explanatory
- DefArea: area of the smallest square that covers all opposite defenders (which means centre-backs and full-backs only)
All variables exclude the defending goalkeeper, except obviously for InCone.GK.
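As an illustration of the nearest-defender distances and the goalkeeper exclusion, a minimal Python sketch follows. The freeze-frame entries are simplified dicts invented for this example (the real data nests more fields), and the function name is hypothetical.

```python
import math

def nearest_defender_distances(shooter_xy, freeze_frame):
    """Return (distance.ToD1, distance.ToD2); None where no such defender exists."""
    dists = sorted(
        math.dist(shooter_xy, p["location"])
        for p in freeze_frame
        # Keep opponents only, and drop the defending goalkeeper.
        if not p["teammate"] and p["position"] != "Goalkeeper"
    )
    dists += [None, None]  # pad so frames with fewer than two defenders still work
    return dists[0], dists[1]

# Simplified, made-up freeze frame for a shot taken from (100, 40).
frame = [
    {"location": (103.0, 44.0), "teammate": False, "position": "Center Back"},
    {"location": (110.0, 40.0), "teammate": False, "position": "Right Back"},
    {"location": (119.0, 40.0), "teammate": False, "position": "Goalkeeper"},
    {"location": (101.0, 40.0), "teammate": True, "position": "Center Forward"},
]
to_d1, to_d2 = nearest_defender_distances((100.0, 40.0), frame)
```

The teammate standing one unit away and the goalkeeper are both skipped, so the two values come from the centre-back and the right-back.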
Time
All extra time-related variables are in milliseconds and seem to have pretty descriptive names.
Thanks for taking the time for such a detailed reply @ElSaico!
I thought DistToKeeper was much lower than expected, so I wondered if it was given in an unexpected unit of measurement. That makes more sense! For anyone else reading, DistToKeeper is the distance from the GK to the centre of the goal (not the nearest part of the goal line).
I didn't notice density and density.incone in the documentation when I first looked -- seems they'd be very useful for xG models. I haven't seen several of the other variables (e.g. DistSGK, AttackersBehindBall, DefArea) as I don't think they're available in the free data, but good to know.
Good luck with pyStatsBomb and making the data accessible to more people!
At some point we'll tidy up StatsBombR and document the inner workings of @YamStats's brain, but for the most part it's provided as is to give people a bit of a leg up using the data. Happy to see issues raised in the other repo for any other improvements. In the meantime, the docs have been updated today, so there shouldn't be anything in the raw data that's not covered now.