Recommender System Datasets

This repository contains a list of public, compatible datasets and notes other major repositories that host newer and popular real-world datasets, along with references to sample code for the respective recommendation tasks. Most of the datasets presented are for non-commercial use by academics, for example faculty, university researchers and other scientists. The datasets are free; however, some may ask for a citation.

In addition, a few links point to sample code from existing works by their respective authors. Before using these datasets, please review their sites and/or README files for their usage licenses, acknowledgments and other details, as a few datasets have additional citation requests. These requests can be found at the bottom of each dataset's web page.

Contributors

Name: Jamell Dacon
Email: daconjam at msu dot edu (daconjam@msu.edu)

If you publish material based on datasets and/or information obtained from this repository, please note in your acknowledgements the assistance you received from using it by citing our paper as shown below. Feel free to star and/or fork the repository so that academics, i.e. university researchers, faculty and other scientists, may have quicker access to the available datasets. This will help direct others to the same datasets, thus allowing the replication and improvement of experiments.

Additional Information: Correspondence

Personal Page: Portfolio

Lab Page: DSELab@MSU

Citation

Here is a BibTeX citation:

@inbook{10.1145/3442442.3452325,
  author = {Dacon, Jamell and Liu, Haochen},
  title = {Does Gender Matter in the News? Detecting and Examining Gender Bias in News Articles},
  year = {2021},
  isbn = {9781450383134},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3442442.3452325},
  abstract = {To attract unsuspecting readers, news article headlines and abstracts are often written with speculative sentences or clauses. Male dominance in the news is very evident, whereas females are seen as “eye candy” or “inferior”, and are underrepresented and under-examined within the same news categories as their male counterparts. In this paper, we present an initial study on gender bias in news abstracts in two large English news datasets used for news recommendation and news classification. We perform three large-scale, yet effective text-analysis fairness measurements on 296,965 news abstracts. In particular, to our knowledge we construct two of the largest benchmark datasets of possessive (gender-specific and gender-neutral) nouns and attribute (career-related and family-related) words datasets which we will release to foster both bias and fairness research aid in developing fair NLP models to eliminate the paradox of gender bias. Our studies demonstrate that females are immensely marginalized and suffer from socially-constructed biases in the news. This paper individually devises a methodology whereby news content can be analyzed on a large scale utilizing natural language processing (NLP) techniques from machine learning (ML) to discover both implicit and explicit gender biases.},
  booktitle = {Companion Proceedings of the Web Conference 2021},
  pages = {385–392},
  numpages = {8}
}

Major repositories with several datasets

Arizona State University: Social Computing Data Repository

Note: ASU Social Computing Data Repository contains several network datasets

Note: Yahoo Research Ratings and Classification Data Music, Movies, Tags, Clicks, Images & Videos: This set of datasets contains music ratings, movie ratings, popular URLs and tags, a click log dataset, face images of celebrities and 22K videos.

Kaggle Datasets
GroupLens Datasets
Recommender Systems Datasets. Contributor: Julian McAuley

Dataset links via Categories

The following datasets are very popular in recommender systems research; a brief description accompanies each one.

News

The MIND dataset was collected from the Microsoft News website; for more detailed information, refer to the MIND paper (Wu et al., 2020). The authors randomly sampled news over the 6 weeks from October 12 to November 22, 2019, creating two datasets, MIND and MIND-small, totalling 161,013 news articles. Each news article contains a news ID, a category label, a title, and a body (URL); however, not every article contains an abstract, resulting in 96,112 abstracts. We used the training set (the largest set of news articles) since both the validation and test sets are subsets of the training set. MIND was created to serve as a new news recommendation benchmark dataset.
The NCD dataset was collected from HuffPost. The news articles were sampled from news headlines from 2012 to 2018, totalling 202,372 news articles. Each news article contains a category label, headline, authors, link, and date; however, not every article contains a short description (abstract), resulting in 200,853 abstracts. NCD serves as a news classification and recommendation benchmark dataset.
The ANTCD dataset was collected by Zhang et al. from over 2,000 news sources by ComeToMyHead (an online academic news search engine) in under 2 years of activity. They accessed the original AG's News Corpus, which contained 496,835 news articles, and chose the 4 categories with the largest samples (30,000 articles each), thus creating the ANTCD dataset with 120,000 news articles. Each news article contains a category (class index), a title and an abstract. We used the training set (the largest set of news articles) since the test set is a subset that only contains 7,600 testing samples. ANTCD serves as a news classification and recommendation benchmark dataset.
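
The ANTCD construction described above (keep the largest categories, then take an equal number of articles from each) can be sketched roughly as follows. This is a minimal illustration, not the original authors' script: the function name `top_k_balanced_subset`, the `(category, title, abstract)` tuple layout and the random sampling step are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def top_k_balanced_subset(articles, k=4, per_category=30_000, seed=0):
    """Keep the k largest categories and sample the same number from each.

    `articles` is an iterable of (category, title, abstract) tuples. This
    layout is an assumption for illustration, not the corpus file format.
    """
    articles = list(articles)

    # Count articles per category and pick the k most frequent ones.
    counts = Counter(category for category, _, _ in articles)
    top = [category for category, _ in counts.most_common(k)]

    # Group the rows that belong to the selected categories.
    by_category = defaultdict(list)
    for row in articles:
        if row[0] in top:
            by_category[row[0]].append(row)

    # Draw up to `per_category` rows from each selected category.
    rng = random.Random(seed)
    subset = []
    for category in top:
        rows = by_category[category]
        subset.extend(rng.sample(rows, min(per_category, len(rows))))
    return subset
```

With the defaults (`k=4`, `per_category=30_000`) a large enough corpus would yield 120,000 rows, matching the size quoted above.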

E-commerce

Amazon: This Amazon dataset consists of reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs) spanning from May 1996 to July 2014.
Amazon - Ratings (Beauty Products): This dataset contains over 2 million customer reviews and ratings of beauty-related products sold on Amazon's website.
Toy Products on Amazon: This is a pre-crawled dataset, taken as a subset of a bigger dataset (more than 115k products) that was created by extracting data from Amazon.com.
Slashdot: The network contains friend/foe links between the users of Slashdot, obtained in February 2009.
Taobao: This dataset contains anonymized users' shopping logs from the 6 months before and on the "Double 11" day, and label information indicating whether they are repeated buyers. Due to privacy concerns, the data is sampled in a biased way, so statistical results on this dataset will deviate from the actual figures of Tmall.com.
Microsoft Web Data Dataset: This dataset contains a log of anonymous users of www.microsoft.com; the task is to predict areas of the web site a user visited based on data on other areas the user visited.
Retailrocket recommender system dataset: This dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv) and a file that describes the category tree (category_tree.csv). The data has been collected from a real-world ecommerce website.
Wikipedia: Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries.
Airbnb Collection: The data was taken from http://tomslee.net/airbnb-data-collection-get-the-data and represents the city of Barcelona. The data is collected from the public Airbnb web site without logging in, and the code used is available at https://github.com/tomslee/airbnb-data-collection.

Social

Yelp: This Yelp dataset is a subset of businesses, reviews, and user-generated data for personal, educational, and academic purposes. It is available as both JSON and SQL files, which you can use to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps.
Facebook: This resource contains an exploratory data analysis of a Facebook dataset, aimed at identifying users who can be targeted to grow the business. These insights should help Facebook make intelligent decisions to identify its useful users and provide correct recommendations to them.
Twitter: This dataset consists of 'circles' (or 'lists') from Twitter. Twitter data was crawled from public sources. The dataset includes node features (profiles), circles, and ego networks.
Pinterest: This dataset contains scene-product pairs for the fashion and home categories.

Stock

Spanish Stocks Historical Data from 2000 to 2019: This dataset contains historical data retrieved for the companies that make up the Continuous Spanish Stock Market. You may have to refer to investpy from Investing.com.
Stock Exchange: This dataset is the ZZAlpha® machine learning recommendations made for various US traded stock portfolios the morning of each day during the 3-year period Jan 1, 2012 - Dec 31, 2014.

Job

Job Recommendation: This dataset contains a list of recommended jobs for each individual.
Job Recommendation Analysis: A recommendation engine built using NLTK that helps applicants choose their preferred job based on their application. You will learn how lemmatization, stemming and vectorization are used to process the data and produce better output.

Item reviews

Item Learning: A dataset for Learning from Sets of Items in Recommender Systems (2019)
eCommerce Item Dataset: This dataset contains 500 actual SKUs from an outdoor apparel brand's product catalog.
Epinions: Epinions is a website where users can register for free and write subjective reviews about many different types of items.

Book

Good Reads: This dataset was created to meet the need for a good, clean dataset of books.
Book Crossing: The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community.

Map

Open OSM: This data is from OpenStreetMap, which is a collaborative mapping project, sort of like Wikipedia but for maps. For Python reference, a few scripts are available at the [Hermes repo](https://github.com/lab41/hermes).

Dating

Dating Agency: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.

Personality

Personality 2018: This dataset was released for “User personality and user satisfaction with recommender systems”.
DEAPdataset: This is a dataset for emotion analysis using EEG, physiological and video signals.
MyPersonalityDataset: This dataset contains information from a popular Facebook application that allowed users to take real psychometric tests, and allowed their Facebook profiles and psychological responses to be recorded (with consent!). Currently, the database contains more than 6,000,000 test results, together with more than 4,000,000 individual Facebook profiles.

Music

Million Song Dataset: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. For code for the dataset, refer to MSongDB repo.
LastFM (Implicit): This dataset contains social networking, tagging, and music artist listening information from a set of users of the Last.fm online music system, consisting of 92,800 artist listening records from 1,892 users.

Movies

Netflix: This Netflix dataset is the official dataset that was used in the Netflix Prize competition.
MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site consisting of 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
Flixster: Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste.
IMDB: This is a link dataset built with permission from the Internet Movie Database (IMDb).

Trust

CiaoDVD & Epinions: CiaoDVD is a dataset crawled from the entire category of DVDs. The Epinions dataset contains, for each user, their profile, ratings and trust relations. For each rating, it records the product name and its category, the rating score, the time point when the rating was created, and the helpfulness of the rating.

Anime

Anime Recommendations Database: This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.
Anime Data: Japanese animation, which is known as anime, has become internationally widespread nowadays. This dataset provides data on anime taken from Anime News Network.

Food

Restaurant and Consumer: This dataset was obtained from a recommender system prototype, with the task of generating a top-n list of restaurants according to consumer preferences.
Chicago Entree: This is a dataset containing a record of user interactions with the Entree Chicago restaurant recommendation system.

Games

Steam Video Games: This dataset is a list of user behaviors, with columns such as user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. The value indicates the degree to which the behavior was performed - in the case of 'purchase' the value is always 1, and in the case of 'play' the value represents the number of hours the user has played the game.
Steam Reviews Dataset: This dataset contains reviews from Steam's best-selling games as of February 2019.
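
The column layout described for Steam Video Games (user-id, game-title, behavior-name, value) can be turned into simple interaction structures along these lines. This is a hedged sketch: it assumes a header-less CSV with those four columns in that order, and the helper name `build_interactions` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def build_interactions(lines):
    """Split rows into purchases per user and hours played per (user, game)."""
    purchased = defaultdict(set)   # user_id -> set of purchased game titles
    hours = defaultdict(float)     # (user_id, game_title) -> hours played
    for row in csv.reader(lines):
        # Assumed column order: user-id, game-title, behavior-name, value.
        user_id, game_title, behavior, value = row[:4]
        if behavior == "purchase":
            purchased[user_id].add(game_title)   # value is always 1 here
        elif behavior == "play":
            hours[(user_id, game_title)] += float(value)
    return purchased, hours


# Made-up rows, only to show the expected shape of the input.
sample = io.StringIO(
    "u1,Game A,purchase,1\n"
    "u1,Game A,play,12.5\n"
    "u2,Game B,purchase,1\n"
)
purchased, hours = build_interactions(sample)
```

Hours played can then serve as an implicit-feedback signal, while purchases without play time mark games a user owns but has not played.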

Jokes

Jester: This is a Joke dataset containing 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users.

Other

Citation Network: The data set is designed for research purposes only. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.
YAGO: YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
Complete Collection of Kaggle Datasets: (below is more information pertaining to this dataset)

Context: For many data analysts it is often complicated to find the right dataset for a project or simply to practice, so this collection of Kaggle datasets helps them explore the available opportunities that Kaggle offers.

Content: Part of the data was first collected using the Kaggle API to retrieve the full list of datasets; each URL reference was then processed with a Python script to retrieve more detailed information.

A collection of resources for Recommender Systems (RecSys)

Recommendation Algorithms

Recommender Systems Basics
- Wikipedia
Nearest Neighbor Search
Classic Matrix Factorization
- Matrix Factorization: A Simple Tutorial and Implementation in Python
- Matrix Factorization Techniques for Recommender Systems
Singular Value Decomposition (SVD)
- Wikipedia
SVD++
- Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model
Content-based CF / Context-aware CF
- there are so many ...
Advanced Matrix Factorization
Factorization Machine
- Factorization Machines
- Field-aware Factorization Machines for CTR Prediction
Sparse Linear Method (SLIM)
- SLIM: Sparse Linear Methods for Top-N Recommender Systems
- Global and Local SLIM
Learning to Rank
Cold-start
- Deep content-based music recommendation
- DropoutNet: Addressing Cold Start in Recommender Systems
Network Embedding
Sequential-based
- Factorizing Personalized Markov Chains for Next-Basket Recommendation
- Session-based Recommendations with Recurrent Neural Networks
Translation Embedding
- Translation-based Recommendation
- Translation-based Factorization Machines for Sequential Recommendation
Graph-Convolution-based
- GraphSAGE: Inductive Representation Learning on Large Graphs
- PinSage: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
Knowledge-Graph-based
Deep Learning
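
As a rough illustration of the classic matrix factorization entry in the list above, the sketch below learns user and item factors with stochastic gradient descent on observed ratings. It is not taken from any of the linked tutorials or papers; the function names `factorize` and `predict`, the hyperparameters (`k`, `lr`, `reg`, `epochs`) and the toy ratings are assumptions chosen for the example.

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Learn factors P (users) and Q (items) so that P[u] . Q[i] ~ rating.

    `ratings` is a list of (user_index, item_index, rating) triples.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction error on this observed rating.
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating of user u for item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


# Toy data: 3 users, 3 items, 6 observed ratings (made up for illustration).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = factorize(toy_ratings, n_users=3, n_items=3)
```

Unobserved pairs such as `predict(P, Q, 0, 2)` can then be scored and ranked to produce recommendations; real datasets would typically need bias terms and a held-out split to tune the hyperparameters.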

Online Courses

Recommender Systems Specialization, University of Minnesota
Introduction to Recommender Systems: Non-Personalized and Content-Based, University of Minnesota

RecSys-related Competitions

Kaggle - product recommendations, hotel recommendations, job recommendations, etc.
ACM RecSys Challenge
WSDM Cup 2018
Million Song Dataset Challenge
Netflix Prize

Tutorials

RecSys tutorials
- 2014
- 2015
- 2016
- 2017
- 2018
KDD 2014 Tutorial - The Recommender Problem Revisited

Articles

Matrix Factorization: A Simple Tutorial and Implementation in Python

Conferences

RecSys – ACM Recommender Systems

daconjam/Recommender-System-Datasets