FanMatch Conference Tourneys impacting team name parsing
Closed this issue · 3 comments
With the conference tournaments going on this week, I noticed that some of the FanMatch output includes the conference tournament name.
I ran the FanMatch module and stepped through the logic for 3/3.
I noticed that if the home team is the PredictedLoser and it is a conference tourney game, then the parsed team name includes the tournament abbreviation.
In every other case, the conference tourney abbreviation ends up in the Possessions column instead.
Here is how the PredictedLoser is getting parsed for the first scenario.
It keeps everything after the team's ranking, so the tourney abbreviation is carried along with the team name.
x[0] = " ".join(x[0].split()[1:])
x[1] = " ".join(x[1].split()[1:])
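To make the failure concrete, here is a minimal sketch of that same split/join applied to a single string. The sample value is hypothetical and assumes the tournament label trails the team name, which is what the behavior described above suggests.

```python
# Hypothetical raw value for the away/home slot: "<rank> <team> <label>"
raw = "45 Virginia ACC-T"

# Same logic as the snippet above: drop the first token (the ranking)
# and keep everything else, which includes the tournament label.
team = " ".join(raw.split()[1:])

print(team)  # the label is still attached to the team name
```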
and here is the logic for how possessions are getting parsed:
pos = fm_df.Game.str.split(r" \[").str[1]
fm_df["Game"], fm_df["Possessions"] = fm_df.Game.str.split(r" \[").str[0], pos.astype("str")
fm_df.Possessions = fm_df.Possessions.str.strip(r"\]")
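The same steps can be reproduced on one string with the standard library. This is only a sketch: the sample Game value is made up, and it assumes the label follows the bracketed possessions count. Note that strip is given a set of characters (backslash and closing bracket), not a pattern, so it only trims those characters from the ends.

```python
import re

# Hypothetical Game value with the label after the possessions bracket
game = "12 Duke 75, 45 Virginia 70 [68] ACC-T"

# Equivalent of fm_df.Game.str.split(r" \[") on a single string
head, tail = re.split(r" \[", game, maxsplit=1)

# Equivalent of .str.strip(r"\]"): trims "\" and "]" from both ends only,
# so the "]" in the middle and the label after it are left in place.
possessions = tail.strip("\\]")

print(head)
print(possessions)
```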
This is just a string splitting issue. I was wondering if either of you guys had any ideas on how to go about this.
We could create a list of conference tourneys and remove them from the Game column as soon as we retrieve the FanMatch table. Obviously, we would need to collect all of KenPom's tourney abbreviations for this to work without error.
Thanks, let me know what you think.
Good call out. The ideal solution is to tighten up the pattern matching so that it accounts for the conference tournament label potentially being present around this time of year. That would be much preferable to hard-coding any conference labels, since it avoids the resulting maintenance overhead altogether.
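One possible shape for such a pattern is sketched below. It is not the project's actual fix. It assumes every tournament label is an alphanumeric abbreviation followed by "-T" and appears as its own whitespace-separated token; the function name and sample strings are hypothetical.

```python
import re

# Assumption: labels look like "ACC-T", "B10-T", "A10-T", i.e. letters or
# digits followed by "-T", delimited by whitespace or the end of the string.
CONF_TOURNEY = re.compile(r"\s+[A-Za-z0-9]+-T(?=\s|$)")


def strip_conf_tourney(game: str) -> str:
    """Remove a conference tournament label from a Game string, if present."""
    return CONF_TOURNEY.sub("", game).strip()


# Hypothetical Game values, with and without a label
samples = [
    "12 Duke 75, 45 Virginia 70 [68] ACC-T",
    "12 Duke vs. 45 Virginia ACC-T",
    "3 Gonzaga 80, 60 Saint Mary's 70 [65]",
]
cleaned = [strip_conf_tourney(s) for s in samples]
for c in cleaned:
    print(c)
```

Running this before any other parsing would leave the existing split logic untouched, and no list of labels has to be maintained. The trade-off is that it depends on the "-T" suffix convention holding.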
This should be a somewhat easy fix for someone with the right level of RegExp knowledge. I'll have time to take a peek at this on Monday at the latest.
I haven't looked at the tests directly, but this also smells like a case the existing test suite should be catching. I would strongly encourage any PR for this to include changes to the tests so that this type of issue is flagged in the future.
Yeah I think Regex is probably the best way to alleviate errors in the long run.
I did just work up a quick fix for the time being if anyone needs a solution:
- Pulled all unique conferences from the KenPom summary page and added '-T' to the end of all labels
- Created a variable self.conf_tourneys to hold the labels
And added this before any other parsing begins:
p = re.compile('|'.join(map(re.escape, self.conf_tourneys)))
fm_df['Game'] = [p.sub('', tourney) for tourney in fm_df['Game']]
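For anyone adapting this stopgap, here is a slightly hardened sketch of the same idea on plain strings. The label list is a small hypothetical subset, not the full set pulled from KenPom, and the helper name is made up. It only matches a label as a whole token and also removes the whitespace in front of it, so no stray space is left behind.

```python
import re

# Hypothetical subset of labels; the real list would come from KenPom.
conf_tourneys = ["ACC-T", "SEC-T", "B10-T", "A10-T"]

# Match a label only when it stands alone as a whitespace-delimited token,
# and consume the whitespace in front of it as well.
pattern = re.compile(
    r"(?:^|\s+)(?:" + "|".join(map(re.escape, conf_tourneys)) + r")(?=\s|$)"
)


def remove_labels(games):
    """Return the Game strings with any known tournament label removed."""
    return [pattern.sub("", g).strip() for g in games]


# Hypothetical Game values
result = remove_labels([
    "12 Duke vs. 45 Virginia ACC-T",
    "1 Houston 72, 30 Memphis 64 [66]",
])
print(result)
```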
This is not a permanent solution, since the labels can change over time, as you said. I'll look into some Regex and poke at it some more.
Let me know if you find a better solution!