Steam is an online platform that most video game publishers use to distribute PC games. It records each user’s purchases and playtime, and this data is publicly available by default. Game recommenders are important tools for game discovery. The Steam Game Recommender uses part of a 170 GB dataset scraped from the Steam API to build two recommendation models. The first is a content-based filtering approach that matches users to unplayed games whose features are similar to those of games they have significant playtime in. The second is an item-item collaborative filtering approach that estimates the user’s playtime on each unplayed game from their playtime on other games, weighted by game similarity. The top-ranked unplayed games are then returned as recommendations.
- Does a user's playtime as implicit ratings properly represent their game preferences when used in a recommender system?
- How can additional game metadata (e.g. publisher, developer, release date) beyond game genre(s) improve content-based filtering recommendations (i.e. rating estimation accuracy)?
- Source: https://steam.internet.byu.edu/
- 109 million users
- 716 million games
One row per user per purchased game.
- steamid: a unique identifier for a user.
- appid: a unique identifier for a game.
- playtime_forever: total time the user has played the game in minutes.
One row per game.
- appid: a unique identifier for a game.
- genre: the name of a genre associated with the game (multiple possible)
- developer: the name of the game’s developer (multiple possible)
- publisher: the name of the game’s publisher (multiple possible)
- release_date: the date when the game was first made available on the Steam storefront.
For each game, we calculate the mean and standard deviation of its playtime. We then use these to define cut points that bucket a user’s playtime into ratings from 1 to 5.
The cut points are scaled on a per-user basis since some users are more casual gamers while others may spend a lot more time gaming. The scaling factor is calculated as follows:
scaling_factor = user_playtime_average / global_playtime_average
- Cut point 1: (mean - std_dev*0.5) * scaling_factor if > 0, else 0
- Cut point 2: mean
- Cut point 3: (mean + std_dev*0.5) * scaling_factor
- Cut point 4: (mean + std_dev) * scaling_factor
- Rating 1: 0 < x < cut point 1
- Rating 2: cut point 1 < x < cut point 2
- Rating 3: cut point 2 < x < cut point 3
- Rating 4: cut point 3 < x < cut point 4
- Rating 5: cut point 4 < x < inf
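The bucketing above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the function names `cut_points` and `playtime_to_rating` are hypothetical, a playtime exactly equal to a cut point is assumed to fall into the lower bucket, and unplayed games (playtime of 0) are assumed to receive no rating. Because cut point 2 is listed without the scaling factor, the cut points are not guaranteed to be in increasing order for every user, so the sketch counts how many cut points the playtime exceeds, which matches the bucket list whenever the cut points are ordered.

```python
def cut_points(game_mean, game_std, scaling_factor):
    """Return the four cut points for one game, scaled for one user."""
    c1 = max((game_mean - 0.5 * game_std) * scaling_factor, 0.0)
    c2 = game_mean  # listed without the scaling factor in the text
    c3 = (game_mean + 0.5 * game_std) * scaling_factor
    c4 = (game_mean + game_std) * scaling_factor
    return [c1, c2, c3, c4]


def playtime_to_rating(playtime, game_mean, game_std, user_avg, global_avg):
    """Map a playtime in minutes to an implicit rating from 1 to 5.

    Returns None for unplayed games (assumption: no rating is assigned).
    """
    if playtime <= 0:
        return None
    scaling_factor = user_avg / global_avg
    cuts = cut_points(game_mean, game_std, scaling_factor)
    # Rating = 1 + number of cut points strictly below the playtime.
    return 1 + sum(playtime > c for c in cuts)


if __name__ == "__main__":
    # Hypothetical game: mean playtime 100 min, standard deviation 40 min.
    # The user plays as much as the global average, so scaling_factor is 1.
    for minutes in (50, 90, 110, 130, 500):
        print(minutes, playtime_to_rating(minutes, 100.0, 40.0, 60.0, 60.0))
```

A user whose average playtime is well below the global average gets smaller cut points 1, 3 and 4, so the same number of minutes maps to a higher rating for that user.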
- Build the item profiles: vector of the game’s genre, developer and publisher from the dataset.
- Build the user profile: weighted average of rated item profiles.
- Prediction heuristic: cosine similarity between an item profile and the user profile.
- Recommend top N games by estimated ratings
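The content-based steps above can be sketched with NumPy as follows. This is an illustrative sketch under several assumptions, not the project's implementation: item profiles are taken to be one-hot vectors over (field, value) pairs, the user profile is the rating-weighted average of the profiles of rated games, and candidates are ranked by cosine similarity to the user profile. The input layout (`games` as a dict of appid to metadata lists, `ratings` as a dict of appid to rating) and all function names are hypothetical.

```python
import numpy as np

FIELDS = ("genre", "developer", "publisher")


def build_item_profiles(games):
    """One-hot encode genre, developer and publisher for every game."""
    features = sorted({(f, v) for meta in games.values()
                       for f in FIELDS for v in meta.get(f, [])})
    index = {feature: col for col, feature in enumerate(features)}
    appids = sorted(games)
    profiles = np.zeros((len(appids), len(features)))
    for row, appid in enumerate(appids):
        for f in FIELDS:
            for v in games[appid].get(f, []):
                profiles[row, index[(f, v)]] = 1.0
    return appids, profiles


def build_user_profile(appids, profiles, ratings):
    """Rating-weighted average of the profiles of the games the user rated."""
    rows = [appids.index(appid) for appid in ratings]
    weights = np.array([ratings[appid] for appid in ratings], dtype=float)
    return weights @ profiles[rows] / weights.sum()


def recommend(games, ratings, n=5):
    """Return the top-n unrated appids by cosine similarity to the user profile."""
    appids, profiles = build_item_profiles(games)
    user = build_user_profile(appids, profiles, ratings)
    norms = np.linalg.norm(profiles, axis=1) * np.linalg.norm(user)
    # Games with an all-zero profile get a score of 0 instead of a division error.
    scores = np.divide(profiles @ user, norms,
                       out=np.zeros(len(appids)), where=norms > 0)
    candidates = [(score, appid) for score, appid in zip(scores, appids)
                  if appid not in ratings]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [appid for _, appid in candidates[:n]]
```

One-hot encoding keeps the sketch short; with many distinct developers and publishers the profile matrix becomes wide and sparse, so a sparse representation would be a natural substitute at the scale of the full dataset.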
User by game rating matrix
- Compute estimated ratings for all games that user x has not played:
- Use cosine similarity to measure how similar each other game is to unplayed game i
- Obtain the k nearest neighbors (KNN) of unplayed game i and estimate its rating as the average of the user’s ratings on those neighbors, weighted by similarity
- Recommend top N games by estimated ratings
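A minimal NumPy sketch of the item-item steps above is shown below. It assumes a dense user-by-game rating matrix `R` in which 0 means "not played", and it restricts the k nearest neighbors of an unplayed game to games the user has actually rated, which is a common convention but an assumption here. The function names are hypothetical, and a dense matrix would not fit the full dataset, so this only illustrates the arithmetic.

```python
import numpy as np


def item_similarities(R):
    """Cosine similarity between the game columns of a user-by-game matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # games nobody rated: avoid division by zero
    normalized = R / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)  # a game is not its own neighbor
    return sim


def estimate_ratings(R, user, k=2):
    """Estimate ratings for every game the given user has not played."""
    sim = item_similarities(R)
    rated = np.flatnonzero(R[user] > 0)
    estimates = {}
    for i in np.flatnonzero(R[user] == 0):
        # The k games most similar to game i among those the user rated.
        neighbors = rated[np.argsort(sim[i, rated])[::-1][:k]]
        weights = sim[i, neighbors]
        if weights.sum() > 0:
            estimates[int(i)] = float(weights @ R[user, neighbors] / weights.sum())
    return estimates


def recommend(R, user, n=5, k=2):
    """Return the indices of the top-n unplayed games by estimated rating."""
    estimates = estimate_ratings(R, user, k)
    return sorted(estimates, key=lambda i: (-estimates[i], i))[:n]
```

Games with no positively similar rated neighbor get no estimate in this sketch and are therefore never recommended; a fallback such as the game's mean rating would be one way to handle them.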