jldbc/pybaseball

Player names in Baseball Reference stats contain mis-encoded non-ASCII characters

Closed this issue · 5 comments

The FanGraphs functions pitching_stats() and batting_stats() appear to convert names from that site such as Ronald Acuña Jr. and José Abreu to Ronald Acuna Jr., Jose Abreu etc., which are recognizable if not entirely correct.

On the other hand, the Baseball Reference functions batting_stats_bref() and pitching_stats_bref() return what seems like mis-converted HTML encodings of those names, resulting in lower readability, although the names on the site itself appear correct.

For example:

import pybaseball as pb
df_bref_batting = pb.batting_stats_bref(2023).sort_values("mlbID")
df_bref_pitching = pb.pitching_stats_bref(2023).sort_values("mlbID")

for side, df in zip(["batting","pitching"],[df_bref_batting,df_bref_pitching]):
    print(side)
    print(df[df["Name"].str.contains("x")][["Name","mlbID"]].head().to_string(index=False)+"\n")

prints:

batting
                        Name  mlbID
           Manny Pi\xc3\xb1a 444489
     Mart\xc3\xadn Maldonado 455117
           Sandy Le\xc3\xb3n 506702
Avisa\xc3\xadl Garc\xc3\xada 541645
         Carlos P\xc3\xa9rez 542208

pitching
                   Name  mlbID
           Max Scherzer 453286
Mart\xc3\xadn Maldonado 455117
     Luis Garc\xc3\xada 472610
   Jos\xc3\xa9 Quintana 500779
              Alex Cobb 502171

My current workaround is to use playerid_reverse_lookup to bridge to the FanGraphs names and use those instead. (I like to use the Baseball Reference batting stats because of how they label players who played for multiple teams/leagues in a given season, listing both team names instead of "---".)

I love pybaseball... thank you!

Hi, these are accent marks ("tildes") in Spanish names.
I have a workaround for this in an internal project that may be useful.

Will try to replicate your example and apply my workaround

The issue is that we are wrongly encoding a bytes object that has been converted to a string.
What we need to do instead is decode the bytes object directly.
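To illustrate the difference, here is a minimal sketch. It assumes the names reach the library as UTF-8 bytes and were turned into text with str(); the actual pybaseball code path is not shown in this thread, so treat the variable names and the str() call as assumptions.

```python
# Assumption: the scraped page content arrives as UTF-8 encoded bytes.
raw = "Manny Piña".encode("utf-8")

# Wrong: str() on a bytes object returns its repr, so every non-ASCII
# byte shows up as a literal backslash escape such as \xc3\xb1.
wrong = str(raw)

# Right: decode the bytes directly with the encoding they were written in.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

The escaped form matches the pattern seen in the Name column above, which is why filtering on the letter "x" picks up the affected rows.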

I'll submit a PR to fix this.

How it looks after my fix:
[image]

Hi @AndrewsOR, this has now been merged into master.
You just need to wait for the next pybaseball release or use the project directly from the GitHub master branch.
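Until a release with the fix is installed, names that already came back in the escaped form can be repaired on the caller's side. The following is only a sketch: repair_escaped_name is a hypothetical helper, not part of pybaseball, and it assumes the affected names are plain ASCII strings containing literal \xNN escapes of UTF-8 bytes, as in the output above.

```python
def repair_escaped_name(name: str) -> str:
    """Turn literal \\xNN escapes of UTF-8 bytes back into real characters.

    Hypothetical helper for illustration; not part of pybaseball.
    """
    # Names without a literal backslash-x sequence are left untouched.
    if "\\x" not in name:
        return name
    try:
        # 1. Interpret the \xNN escapes, giving one character per byte.
        per_byte = name.encode("ascii").decode("unicode_escape")
        # 2. Map those characters back to the raw bytes they stand for.
        raw = per_byte.encode("latin-1")
        # 3. Decode the raw bytes as UTF-8, the assumed original encoding.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # If the assumptions do not hold, return the name unchanged.
        return name


print(repair_escaped_name("Manny Pi\\xc3\\xb1a"))
print(repair_escaped_name("Max Scherzer"))
```

With a DataFrame returned by batting_stats_bref, this could be applied as df["Name"].map(repair_escaped_name).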

Feel free to close the issue.

This issue can be closed since the solution was merged. @schorrm

Thank you @BrayanMnz !