Using Python to scrape ATP World Tour tennis data

ATP World Tour tennis data

This repository contains Python scripts that scrape tennis data from the ATP World Tour website. The scripts were written against the site as it stood in Dec 2016; if the site layout is subsequently redesigned, they will no longer work.

A. Scraping the match data by player name and by year

A1. The atp_match_data_player.py script

The atp_match_data_player.py script collects all of the tournament and match data for a single player over a given range of years from the ATP World Tour website and exports it to a CSV file.

The script takes its input arguments from the command line. For example, to scrape Roger Federer's matches from 1998 to 2016:

$ python atp_match_data_player.py roger-federer f324 1998 2016
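As a rough illustration of how four positional arguments like these could be handled, here is a minimal sketch. It is not taken from the script itself: the function name `parse_args` is hypothetical, and treating both years as inclusive is an assumption based on the example output.

```python
def parse_args(argv):
    """Split the four positional arguments into typed values.

    Hypothetical helper: the real script may handle its arguments differently.
    """
    if len(argv) != 4:
        raise ValueError("expected: PLAYER_SLUG PLAYER_ID START_YEAR END_YEAR")
    player_slug, player_id, start_year, end_year = argv
    start_year, end_year = int(start_year), int(end_year)
    if start_year > end_year:
        raise ValueError("START_YEAR must not be later than END_YEAR")
    # Assumption: both the start and end year are included in the scrape.
    years = list(range(start_year, end_year + 1))
    return player_slug, player_id, years


# In a script, the list would come from sys.argv[1:].
slug, pid, years = parse_args(["roger-federer", "f324", "1998", "2016"])
print(slug, pid, years[0], years[-1], len(years))  # roger-federer f324 1998 2016 19
```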

Note that the player name slug (roger-federer) and the player id (f324) both come from the URL of the player's activity page for a given year on the ATP website, so you must locate that page first.
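If you would rather not pick the slug and id out of the URL by hand, something like the following sketch could do it. The URL layout used here (a `players` path segment followed by the slug and then the id) is an assumption about how the site was organised, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def player_slug_and_id(activity_url):
    """Return (player_slug, player_id) from a player activity URL.

    Assumption: the path looks like /en/players/<slug>/<id>/player-activity.
    """
    parts = [p for p in urlparse(activity_url).path.split("/") if p]
    # Locate the "players" segment rather than relying on a fixed position.
    i = parts.index("players")
    return parts[i + 1], parts[i + 2]


# Example URL shaped like the assumed pattern.
url = "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity?year=2016"
print(player_slug_and_id(url))  # ('roger-federer', 'f324')
```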

The script scrapes all the match data on this page, then iterates through each match to find its match stats URL and scrapes the match stats from there.

A2. Command line output

In addition to the CSV output, the script prints the following to the command line for debugging purposes. The ATP website is error-prone and its HTML has many inconsistencies, which lead to scraping errors; when one occurs, I have to revise the XPaths and/or the code accordingly. This console output lets me figure out exactly where in the site (i.e. at which match) the scraper breaks down.

$ python atp_match_data_player.py roger-federer f324 1998 2016
1998 | Basel | Round of 32 | Andre Agassi
1998 | Toulouse | Quarter-Finals | Jan Siemerink
1998 | Toulouse | Round of 16 | Richard Fromberg
1998 | Toulouse | Round of 32 | Guillaume Raoux
1998 | Geneva | Round of 32 | Orlin Stanoytchev
1998 | Gstaad | Round of 32 | Lucas Arnold Ker
1999 | Brest | Finals | Max Mirnyi
1999 | Brest | Semi-Finals | Martin Damm
1999 | Brest | Quarter-Finals | Michael Llodra
1999 | Brest | Round of 16 | Rodolphe Gilbert
1999 | Brest | Round of 32 | Lionel Roux
1999 | Lyon | Round of 32 | Lleyton Hewitt
1999 | Lyon | Round of 64 | Daniel Vacek
1999 | Vienna | Semi-Finals | Greg Rusedski
1999 | Vienna | Quarter-Finals | Karol Kucera
⋮
[etc]
⋮
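The idea behind this output can be sketched as follows: print one line per match before scraping it, so that when the scraper fails, the last line on screen identifies the match. This is only an illustration of the pattern. `progress_line`, `scrape_all`, and the `scrape_match` callable are hypothetical names, and whether the real script prints before or after scraping each match is an assumption.

```python
def progress_line(year, tourney, match_round, opponent):
    """Format one console line as 'year | tourney | round | opponent'."""
    return " | ".join([str(year), tourney, match_round, opponent])


def scrape_all(matches, scrape_match):
    """Print a progress line for each match, then scrape it.

    `matches` is an iterable of (year, tourney, round, opponent) tuples and
    `scrape_match` is a stand-in for whatever does the real scraping.
    """
    rows = []
    for match in matches:
        # Printing first means the last line shown is the match that failed.
        print(progress_line(*match))
        rows.append(scrape_match(match))
    return rows


# Usage with a stand-in scraper that only records the opponent name.
demo = [
    (1998, "Basel", "Round of 32", "Andre Agassi"),
    (1998, "Toulouse", "Quarter-Finals", "Jan Siemerink"),
]
rows = scrape_all(demo, scrape_match=lambda m: {"opponent_name": m[3]})
```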

On my end, it takes ~3.22 seconds to scrape the data for each match, so for players with long careers (e.g. Roger Federer), scraping all of a player's data takes over an hour.
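To get a feel for what that per-match figure implies, here is a small back-of-the-envelope calculation. The 3.22 seconds is the figure quoted above; the match count of 1200 is a made-up round number for illustration, not an actual career total.

```python
import math

SECONDS_PER_MATCH = 3.22  # figure quoted above; varies by machine and connection


def estimated_minutes(n_matches, seconds_per_match=SECONDS_PER_MATCH):
    """Estimated total scraping time in minutes for n_matches."""
    return n_matches * seconds_per_match / 60.0


# Smallest number of matches for which the run takes at least one hour.
matches_for_one_hour = math.ceil(3600 / SECONDS_PER_MATCH)
print(matches_for_one_hour)  # 1119

# Hypothetical career of 1200 matches.
print(round(estimated_minutes(1200), 1))  # 64.4
```

So at this rate, any player with roughly 1,100 or more matches on record would take more than an hour to scrape.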

A3. CSV headers

Note that I didn't include the "tournament prize money" and "player prize money" data because of problems with outputting unicode to CSV format in my version of Python (2.7.5). I think more recent versions of Python have fixed this problem, but updating my version of Python is non-trivial and I don't have the time to do it right now. In any case, the unicode problem is caused by the pound sterling (£) and euro (€) characters. The following are the 83 column headers:

tourney_year
tourney_name
tourney_name_slug
tourney_id
tourney_location
tourney_dates
tourney_singles_draw
tourney_doubles_draw
tourney_conditions
tourney_surface
player_name
player_slug
player_id
player_event_points
player_ranking
match_round
opponent_name
opponent_name_slug
opponent_player_id
opponent_rank
match_win_loss
match_score
sets_won
sets_lost
sets_total
games_won
games_lost
games_total
tiebreaks_won
tiebreaks_lost
tiebreaks_total
match_time
match_duration
player_aces
player_double_faults
player_first_serves_in
player_first_serves_total
player_first_serve_percentage
player_first_serve_points_won
player_first_serve_points_total
player_second_serve_points_won
player_second_serve_points_total
player_break_points_saved
player_break_points_serve_total
player_service_points_won
player_service_points_total
player_first_serve_return_won
player_first_serve_return_total
player_second_serve_return_won
player_second_serve_return_total
player_break_points_converted
player_break_points_return_total
player_service_games_played
player_return_games_played
player_return_points_won
player_return_points_total
player_total_points_won
player_total_points_total
opponent_aces
opponent_double_faults
opponent_first_serves_in
opponent_first_serves_total
opponent_first_serve_percentage
opponent_first_serve_points_won
opponent_first_serve_points_total
opponent_second_serve_points_won
opponent_second_serve_points_total
opponent_break_points_saved
opponent_break_points_serve_total
opponent_service_points_won
opponent_service_points_total
opponent_first_serve_return_won
opponent_first_serve_return_total
opponent_second_serve_return_won
opponent_second_serve_return_total
opponent_break_points_converted
opponent_break_points_return_total
opponent_service_games_played
opponent_return_games_played
opponent_return_points_won
opponent_return_points_total
opponent_total_points_won
opponent_total_points_total
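For anyone running Python 3, the unicode problem described at the start of this section can usually be avoided by opening the output file with an explicit encoding. The following is a minimal sketch, not part of the scripts in this repository: the `tourney_prize_money` column name is hypothetical, and the prize money values are dummy placeholders.

```python
import csv
import os
import tempfile

# Two of the headers listed above plus a hypothetical prize money column.
HEADERS = ["tourney_year", "tourney_name", "tourney_prize_money"]

# Dummy rows; the currency amounts are placeholders, not real data.
rows = [
    {"tourney_year": "1998", "tourney_name": "Example Open", "tourney_prize_money": "€123"},
    {"tourney_year": "1998", "tourney_name": "Sample Cup", "tourney_prize_money": "£456"},
]

path = os.path.join(tempfile.mkdtemp(), "example.csv")

# In Python 3, an explicit encoding lets the csv module write non-ASCII text.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=HEADERS)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the currency symbols survived the round trip.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.DictReader(f))

print(read_back[1]["tourney_prize_money"])  # £456
```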

B. Scraping the match data by year

B1. The atp_match_data_year_no_stats.py script

The atp_match_data_year_no_stats.py script collects all of the tournament and match data in a given year from the ATP World Tour website, but not the individual match stats, which are left to a separate script because of runtime issues. It exports the following example CSV file: