Player names in Baseball Reference stats contain mis-encoded non-ASCII characters
The FanGraphs functions pitching_stats() and batting_stats() appear to convert names from that site such as Ronald Acuña Jr. and José Abreu to Ronald Acuna Jr., Jose Abreu, etc., which are recognizable if not entirely correct.
On the other hand, the Baseball Reference functions batting_stats_bref() and pitching_stats_bref() return what seem to be mis-converted HTML encodings of those names, which makes them harder to read, although the names on the site itself appear correct.
For example:
import pybaseball as pb
df_bref_batting = pb.batting_stats_bref(2023).sort_values("mlbID")
df_bref_pitching = pb.pitching_stats_bref(2023).sort_values("mlbID")
for side, df in zip(["batting", "pitching"], [df_bref_batting, df_bref_pitching]):
    print(side)
    print(df[df["Name"].str.contains("x")][["Name", "mlbID"]].head().to_string(index=False) + "\n")
prints:
batting
Name mlbID
Manny Pi\xc3\xb1a 444489
Mart\xc3\xadn Maldonado 455117
Sandy Le\xc3\xb3n 506702
Avisa\xc3\xadl Garc\xc3\xada 541645
Carlos P\xc3\xa9rez 542208
pitching
Name mlbID
Max Scherzer 453286
Mart\xc3\xadn Maldonado 455117
Luis Garc\xc3\xada 472610
Jos\xc3\xa9 Quintana 500779
Alex Cobb 502171
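The escaped sequences in the output above look like UTF-8 bytes written out as literal \xNN text (for example, \xc3\xb1 is the UTF-8 encoding of ñ). Below is a minimal sketch of a stopgap that decodes such runs back into characters. It assumes the Name strings contain a literal backslash followed by "x" and two hex digits, as the printout suggests; the helper name repair_escaped_utf8 is hypothetical and is not part of pybaseball.

```python
import re

# One or more consecutive literal "\xNN" escapes (backslash, "x", two hex digits).
_ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def _decode_run(match):
    """Turn a run such as the 8 characters \\xc3\\xb1 into the text it encodes."""
    hex_digits = match.group(0).replace("\\x", "")
    try:
        return bytes.fromhex(hex_digits).decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the original text untouched.
        return match.group(0)


def repair_escaped_utf8(name):
    """Hypothetical helper: replace literal \\xNN runs with decoded UTF-8 text."""
    return _ESCAPED_BYTES.sub(_decode_run, name)


# Sample values copied from the output above (backslashes are literal).
samples = ["Manny Pi\\xc3\\xb1a", "Avisa\\xc3\\xadl Garc\\xc3\\xada", "Max Scherzer"]
repaired = [repair_escaped_utf8(s) for s in samples]
print(repaired)
```

On a DataFrame, the same function could be applied with df["Name"].map(repair_escaped_utf8). Names without escape sequences pass through unchanged.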
My current workaround is to use playerid_reverse_lookup to bridge to FanGraphs names and use those instead. (I like to use the Baseball Reference batting stats because of how they label players who played for multiple teams/leagues in a given season, providing both team names instead of "---".)
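A minimal sketch of that bridging step, written against plain DataFrames so it runs without network access. It assumes the ID table returned by playerid_reverse_lookup has key_mlbam and key_fangraphs columns and that the FanGraphs stats table has IDfg and Name columns; the function name attach_fangraphs_names and the FanGraphs IDs in the toy data are placeholders for illustration only.

```python
import pandas as pd


def attach_fangraphs_names(bref_df, id_map, fg_df):
    """Hypothetical helper: add a Name_fg column via mlbID -> key_fangraphs -> IDfg."""
    # One FanGraphs name per FanGraphs ID.
    fg_names = (
        fg_df[["IDfg", "Name"]]
        .drop_duplicates("IDfg")
        .rename(columns={"Name": "Name_fg"})
    )
    # Map each MLBAM ID to a FanGraphs name through the ID table.
    bridge = id_map[["key_mlbam", "key_fangraphs"]].merge(
        fg_names, left_on="key_fangraphs", right_on="IDfg", how="left"
    )
    bridge = bridge[["key_mlbam", "Name_fg"]].drop_duplicates("key_mlbam")
    out = bref_df.merge(bridge, left_on="mlbID", right_on="key_mlbam", how="left")
    # Keep the Baseball Reference name when no FanGraphs match exists.
    out["Name_fg"] = out["Name_fg"].fillna(out["Name"])
    return out.drop(columns="key_mlbam")


# Toy stand-ins. With pybaseball these would presumably come from
# pb.batting_stats_bref(2023), pb.playerid_reverse_lookup(ids, key_type="mlbam"),
# and pb.batting_stats(2023); the FanGraphs IDs below are made up.
bref = pd.DataFrame(
    {"Name": ["Jos\\xc3\\xa9 Quintana", "Max Scherzer", "Unmatched Player"],
     "mlbID": [500779, 453286, 1]}
)
id_map = pd.DataFrame({"key_mlbam": [500779, 453286], "key_fangraphs": [101, 102]})
fg = pd.DataFrame({"IDfg": [101, 102], "Name": ["Jose Quintana", "Max Scherzer"]})

result = attach_fangraphs_names(bref, id_map, fg)
print(result[["mlbID", "Name_fg"]].to_string(index=False))
```

The left joins keep every Baseball Reference row, so players missing from either lookup simply retain their original name.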
I love pybaseball... thank you!
Hi, these are "tildes" (accent marks) in Spanish words.
I have a workaround for this in an internal project that may be useful.
I will try to replicate your example and apply my workaround.
Hi @AndrewsOR, this has now been merged into master.
You just need to wait for the next pybaseball release or use the project directly from the GitHub master branch.
Feel free to close the issue.
Thank you @BrayanMnz !