Today I collected all the brief competition descriptions on the Kaggle website. I'm interested in how often Data Science for Good competitions are held and under what conditions. I'd also like to know how they compare to the other types of competitions held.
library(tidyverse)
library(lubridate)
library(here)
kaggle <- read_csv(here("data/kaggle_competitions.csv"))
str(kaggle, give.attr = F)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 339 obs. of 9 variables:
## $ title : chr "Predict Future Sales" "iMaterialist (Fashion) 2019 at FGVC6" "iNaturalist 2019 at FGVC6" "Google Landmark Recognition 2019" ...
## $ short_desc : chr "Final project for \"How to win a data science competition\" Coursera course" "Fine-grained segmentation task for fashion and apparel" "Fine-grained classification spanning a thousand species" "Label famous (and not-so-famous) landmarks in images" ...
## $ category : chr "Playground" "Research" "Research" "Research" ...
## $ prize : chr "Kudos" "Kudos" "Kudos" "$25,000" ...
## $ tags : chr NA NA NA NA ...
## $ kernels_comp : chr NA NA NA NA ...
## $ submission_details: chr NA NA NA NA ...
## $ teams_entered : num 3281 203 209 281 144 ...
## $ deadline : Date, format: "2020-01-02" "2019-06-10" ...
summary(kaggle$deadline)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2010-06-06" "2013-06-05" "2015-06-05" "2015-10-22" "2018-06-04"
## Max. NA's
## "2030-06-01" "3"
There are 339 competitions listed today, most of which have likely finished. Let's start with deadlines to get a sense of how many competitions are closing at any time.
kaggle %>%
filter( deadline < ymd(20210101)) %>%
ggplot(aes(deadline)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In recent years, there have been about 35 competitions per year. For the most recent year we can get monthly data, but for earlier years the deadline is only accurate to the year because it was derived from statements like "2 years ago".
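One way to check a per-year figure like this without reading it off the histogram is to count rows by the calendar year of the deadline. The following is a minimal sketch on a toy data frame: `toy` and its rows are made up for illustration and merely stand in for the `kaggle` data frame used above.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the `kaggle` data frame (invented rows, not the scraped data)
toy <- tibble(
  title    = c("A", "B", "C", "D", "E"),
  deadline = as.Date(c("2017-06-05", "2018-06-04", "2018-06-04", "2019-03-15", NA))
)

# Drop competitions with no deadline, then count by the year of the deadline
per_year <- toy %>%
  filter(!is.na(deadline)) %>%
  count(year = year(deadline), name = "competitions")

per_year
```

Applied to the real data, the same `count()` call should give one row per year, which makes it easy to see whether the "about 35" estimate holds.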
We can break out the type of competition held in the most recent year.
kaggle %>%
filter( between(deadline, ymd(20180701), ymd(20190701)) ) %>%
ggplot(aes(deadline, fill = category)) +
geom_histogram() +
facet_wrap(~ category) +
scale_x_date(date_labels = "%b") +
labs(title = "Kaggle competitions held from Jul 2018 to Jul 2019",
x = "",
y = "number of competitions") +
guides(fill = "none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Just to be sure, what type of competition do the Data Science for Good ones get categorized as?
kaggle %>%
filter(str_detect(title, "Good")) %>%
select(title, short_desc)
## # A tibble: 2 x 2
## title short_desc
## <chr> <chr>
## 1 Data Science for Good: Cit~ Help the City of Los Angeles to structure an~
## 2 Data Science for Good: Car~ Match career advice questions with professio~
There have been two other Data Science for Good competitions (Data Science for Good: Center for Policing Equity and Data Science for Good: PASSNYC); however, they've been taken off the listings for some reason.
The Data Science for Good competitions are also unique in that they don't have a leaderboard, so there is no measure of teams entered. One possible way to still estimate participation would be to count the unique users who are active on the competition (through some measure of kernels or discussions).
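As a rough sketch of that idea: if we had an activity log with one row per kernel or discussion post, counting distinct users per competition would be a one-liner. No such table is collected in this post, so the `activity` data frame, its column names and its values below are all hypothetical.

```r
library(dplyr)

# Hypothetical activity log: one row per kernel or discussion post.
# Column names and values are invented for illustration.
activity <- tibble(
  competition = c("competition_a", "competition_a", "competition_a",
                  "competition_b", "competition_b"),
  user_id     = c("u1", "u2", "u1", "u3", "u3"),
  type        = c("kernel", "discussion", "discussion", "kernel", "kernel")
)

# A user who posts several times is counted only once per competition
active_users <- activity %>%
  group_by(competition) %>%
  summarise(active_users = n_distinct(user_id))

active_users
```

This would only be a proxy for teams entered, since a user can comment without really participating and teams can have several members.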
That cuts this comparison short, but we can still look at some broad summaries of the rest of Kaggle's competitions. For instance, how many teams typically enter a competition?
kaggle %>%
ggplot(aes(teams_entered)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (stat_bin).
Most appear to have something in the 10 to 500 range, but there are some serious outliers with over 6,000 teams. Which ones were these?
kaggle %>%
filter(teams_entered > 6000) %>%
select(-kernels_comp, -submission_details, -tags, -category, -deadline)
## # A tibble: 3 x 4
## title short_desc prize teams_entered
## <chr> <chr> <chr> <dbl>
## 1 Titanic: Machine ~ Start here! Predict survival on ~ Knowl~ 11128
## 2 Santander Custome~ Can you identify who will make a~ $65,0~ 8802
## 3 Home Credit Defau~ Can you predict how capable each~ $70,0~ 7198
The ongoing Titanic competition takes the lion's share, a gateway to data science that many people try when they first hear of it (including me!). The others look like they have to do with financial data.
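With outliers this large, the mean is pulled well above the typical competition, and a log-scaled axis can make the bulk of the distribution easier to read. Here is a small sketch of both ideas. The three largest values are the team counts shown above; the remaining values in `toy` are made up for illustration.

```r
library(ggplot2)

# Toy team counts: the three largest come from the table above,
# the rest are invented to mimic a right-skewed distribution
toy <- data.frame(
  teams_entered = c(12, 45, 130, 480, 950, 2100, 7198, 8802, 11128)
)

# The median is less sensitive to a few huge competitions than the mean
typical <- median(toy$teams_entered)
average <- mean(toy$teams_entered)

# Same histogram as before, but with a log10 x-axis
p <- ggplot(toy, aes(teams_entered)) +
  geom_histogram(bins = 10) +
  scale_x_log10() +
  labs(x = "teams entered (log scale)", y = "number of competitions")

p
```

On the real data the same `scale_x_log10()` layer could be added to the `teams_entered` histogram, assuming no competition has zero teams.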
Speaking of which, there's a prize amount for some Kaggle competitions so we should definitely take a look at how much prize money typically gets offered and how it can influence the number of teams entered.
kaggle %>%
filter(prize != "USD") %>%
mutate( prize = if_else(str_detect(prize, "[\\$\\€]"), "Money", prize)) %>%
count(prize) %>%
mutate(prize = reorder(prize, n)) %>%
ggplot(aes(prize, n)) +
geom_col() +
coord_flip() +
labs( title = "Number of Kaggle competitions held by type of prize",
subtitle = "from 2010 to present")
Money is certainly the most popular prize.
# need to clean the prize variable so it's a number
kaggle_prizes <- kaggle %>%
# we'll drop the one prize offered in euros
filter(str_detect(prize, "[\\$]") ) %>%
mutate(prize = as.numeric(str_replace_all(prize, "[\\$,]", "")))
summary(kaggle_prizes$prize)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 5000 20000 43298 30000 1500000
Wait up, some competitions offered how much as prize money? These prizes are often a pool split amongst several top finishers, but that's a lot of zeros! Let's filter out anything over $500,000, as these are truly unique. Which ones were they anyway?
kaggle_prizes %>%
filter(prize > 500000) %>%
select(-kernels_comp, -submission_details, -tags, -category, -deadline)
## # A tibble: 3 x 4
## title short_desc prize teams_entered
## <chr> <chr> <dbl> <dbl>
## 1 Zillow Prize: Zillow~ Can you improve the algorithm~ 1.20e6 3779
## 2 Passenger Screening ~ Improve the accuracy of the D~ 1.50e6 518
## 3 Data Science Bowl 20~ Can you improve lung cancer d~ 1.00e6 1972
Interesting, there's a Public Sector competition on passenger screening from the US Department of Homeland Security. The goal was to "Improve the accuracy of the Department of Homeland Security's threat recognition algorithms". Here's a link. It certainly looks like the number of teams that entered didn't increase in proportion to the size of the prize, which at $1.5 million is the largest Kaggle has ever seen.
kaggle_prizes %>%
filter(prize < 500000) %>%
ggplot(aes(prize, teams_entered)) +
geom_point(alpha = 0.2) +
geom_smooth() +
labs(title = "How does the number of teams entering vary with prize money?")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).
Prize money likely has some explanatory power, but the number of teams entering is a more complicated story, especially beyond $100,000, where it actually begins to decrease. Perhaps this is because of the nature of competitions offering greater prizes: they could be more challenging.
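One way to put a number on "some explanatory power" is a rank correlation, which measures how consistently teams rise with prize money without assuming a straight line. The sketch below uses Spearman's rho on toy data: the `toy` values are invented, with an `NA` to mimic the competitions that have no team count.

```r
# Toy prize/teams pairs, invented for illustration
# (the NA mimics competitions with a missing team count)
toy <- data.frame(
  prize         = c(0, 5000, 10000, 25000, 50000, 100000, 250000),
  teams_entered = c(150, 300, NA, 900, 1400, 1200, 700)
)

# Spearman's rho on the complete pairs only
rho <- cor(toy$prize, toy$teams_entered,
           method = "spearman", use = "complete.obs")

rho
```

A relationship that rises and then falls, like the one the smoother suggests, would tend to produce a moderate rho rather than one near 1, so this statistic is best read alongside the plot rather than instead of it.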
There are few Data Science for Good competitions, but Kaggle seems to be growing more interested in hosting them. It's notable that the largest prize pool ever offered on Kaggle came from the public sector. While it's difficult to compare team participation directly, that isn't the be-all and end-all of the value of these competitions.
Now that I've looked at this data, I'll think about other ways to analyze the potential of Kaggle competitions, Data Science for Good or otherwise, to solve public issues, and to understand their benefits.