🔥 Recommendations for C++ using collaborative filtering
- Supports user-based and item-based recommendations
- Works with explicit and implicit feedback
- Uses high-performance matrix factorization
🎉 Zero dependencies
Add the header to your project and include it (requires C++20 or later)
#include "disco.hpp"
Prep your data in the format user_id, item_id, value
using disco::Dataset;
auto data = Dataset<std::string, std::string>();
data.push("user_a", "item_a", 5.0);
data.push("user_a", "item_b", 3.5);
data.push("user_b", "item_a", 4.0);
IDs can be integers, strings, or any other hashable data type
data.push(1, "item_a", 5.0); // assumes a Dataset<int, std::string>
If users rate items directly, this is known as explicit feedback. Fit the recommender with:
using disco::Recommender;
auto recommender = Recommender<std::string, std::string>::fit_explicit(data);
If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Use 1.0 or a value like the number of purchases or page views for the dataset, and fit the recommender with:
auto recommender = Recommender<std::string, std::string>::fit_implicit(data);
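Implicit feedback often arrives as raw events, one row per purchase or page view. A minimal sketch of collapsing those events into one value per user and item before pushing them into the dataset. `Event` and `count_events` are hypothetical helpers for illustration, not part of Disco, and string IDs are assumed:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical raw event: one purchase line with a quantity.
struct Event {
    std::string user_id;
    std::string item_id;
    int quantity;
};

// Sum quantities into one value per (user, item) pair.
// std::map keeps the pairs sorted, so iteration order is deterministic.
std::map<std::pair<std::string, std::string>, float> count_events(const std::vector<Event>& events) {
    std::map<std::pair<std::string, std::string>, float> counts;
    for (const auto& e : events) {
        assert(e.quantity > 0);  // a non-positive quantity is not a purchase
        counts[std::make_pair(e.user_id, e.item_id)] += static_cast<float>(e.quantity);
    }
    return counts;
}
```

Each entry of the result could then be passed to `data.push(user_id, item_id, value)`.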
Get user-based recommendations - “users like you also liked”
recommender.user_recs(user_id, 5);
Get item-based recommendations - “users who liked this item also liked”
recommender.item_recs(item_id, 5);
Get predicted ratings for a specific user and item
recommender.predict(user_id, item_id);
Get similar users
recommender.similar_users(user_id, 5);
Download the MovieLens 100K dataset.
And use:
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include "disco.hpp"
using disco::Dataset;
using disco::Recommender;
Dataset<int, std::string> load_movielens(const std::string& path) {
    std::string line;

    // read movies
    std::unordered_map<std::string, std::string> movies;
    std::ifstream movies_file(path + "/u.item");
    assert(movies_file.is_open());
    while (std::getline(movies_file, line)) {
        std::string::size_type n = line.find('|');
        std::string::size_type n2 = line.find('|', n + 1);
        movies.emplace(std::make_pair(line.substr(0, n), line.substr(n + 1, n2 - n - 1)));
    }

    // read ratings and create dataset
    auto data = Dataset<int, std::string>();
    std::ifstream ratings_file(path + "/u.data");
    assert(ratings_file.is_open());
    while (std::getline(ratings_file, line)) {
        std::string::size_type n = line.find('\t');
        std::string::size_type n2 = line.find('\t', n + 1);
        std::string::size_type n3 = line.find('\t', n2 + 1);
        data.push(
            std::stoi(line.substr(0, n)),
            movies.at(line.substr(n + 1, n2 - n - 1)),
            std::stof(line.substr(n2 + 1, n3 - n2 - 1))
        );
    }

    return data;
}

int main() {
    // https://grouplens.org/datasets/movielens/100k/
    char *movielens_path = std::getenv("MOVIELENS_100K_PATH");
    if (!movielens_path) {
        std::cout << "Set MOVIELENS_100K_PATH" << std::endl;
        return 1;
    }

    auto data = load_movielens(movielens_path);
    auto recommender = Recommender<int, std::string>::fit_explicit(data, { .factors = 20 });

    std::string movie = "Star Wars (1977)";
    std::cout << "Item-based recommendations for " << movie << std::endl;
    for (auto& rec : recommender.item_recs(movie)) {
        std::cout << "- " << rec.first << std::endl;
    }

    int user_id = 123;
    std::cout << std::endl << "User-based recommendations for " << user_id << std::endl;
    for (auto& rec : recommender.user_recs(user_id)) {
        std::cout << "- " << rec.first << std::endl;
    }

    return 0;
}
Save recommendations to your database.
Alternatively, you can store only the factors and use a library like pgvector-cpp.
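If you store only the factors, similar items can be found with a nearest-neighbor search over them, for example using cosine distance. A minimal sketch of the cosine similarity such a search is based on, assuming the factors have been copied into `std::vector<float>`. The function name is hypothetical, and this is an illustration rather than Disco's internal code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two factor vectors of equal length.
// Returns 0 when either vector has zero norm.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float dot = 0.0f;
    float norm_a = 0.0f;
    float norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); i++) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) {
        return 0.0f;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Values close to 1 mean the two factor vectors point in nearly the same direction.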
Disco uses high-performance matrix factorization.
- For explicit feedback, it uses the stochastic gradient method with twin learners
- For implicit feedback, it uses the conjugate gradient method
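To illustrate the general idea behind matrix factorization for explicit feedback, here's a minimal sketch of plain stochastic gradient descent on observed ratings. It's a simplified stand-in, not Disco's implementation: it uses a fixed learning rate instead of the twin-learner schedule, and all names (`Rating`, `Model`, `train_sgd`, `rmse`) and hyperparameters are assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One observed rating, with users and items already mapped to dense indexes.
struct Rating {
    std::size_t user;
    std::size_t item;
    float value;
};

// Factor matrices: one row of `factors` floats per user and per item.
struct Model {
    std::size_t factors;
    std::vector<std::vector<float>> p;  // user factors
    std::vector<std::vector<float>> q;  // item factors

    Model(std::size_t users, std::size_t items, std::size_t factors_)
        : factors(factors_),
          p(users, std::vector<float>(factors_)),
          q(items, std::vector<float>(factors_)) {
        // Small deterministic starting values; real implementations use random ones.
        for (std::size_t u = 0; u < users; u++)
            for (std::size_t f = 0; f < factors; f++)
                p[u][f] = 0.1f * static_cast<float>((u + 2 * f) % 5 + 1);
        for (std::size_t i = 0; i < items; i++)
            for (std::size_t f = 0; f < factors; f++)
                q[i][f] = 0.1f * static_cast<float>((2 * i + f) % 5 + 1);
    }

    // Predicted rating is the dot product of the user and item factors.
    float predict(std::size_t u, std::size_t i) const {
        float sum = 0.0f;
        for (std::size_t f = 0; f < factors; f++) sum += p[u][f] * q[i][f];
        return sum;
    }
};

// Plain SGD with L2 regularization: one update per observed rating.
void train_sgd(Model& m, const std::vector<Rating>& ratings, int iterations, float lr, float reg) {
    for (int it = 0; it < iterations; it++) {
        for (const auto& r : ratings) {
            assert(r.user < m.p.size() && r.item < m.q.size());
            float err = r.value - m.predict(r.user, r.item);
            for (std::size_t f = 0; f < m.factors; f++) {
                float pu = m.p[r.user][f];
                float qi = m.q[r.item][f];
                m.p[r.user][f] += lr * (err * qi - reg * pu);
                m.q[r.item][f] += lr * (err * pu - reg * qi);
            }
        }
    }
}

// Root mean squared error over the given ratings.
float rmse(const Model& m, const std::vector<Rating>& ratings) {
    assert(!ratings.empty());
    float sum = 0.0f;
    for (const auto& r : ratings) {
        float err = r.value - m.predict(r.user, r.item);
        sum += err * err;
    }
    return std::sqrt(sum / static_cast<float>(ratings.size()));
}
```

Comparing `rmse` before and after `train_sgd` on a handful of ratings shows the loss going down as the factors are fitted.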
Specify the number of factors and iterations
auto recommender = Recommender<int, int>::fit_explicit(data, { .factors = 8, .iterations = 20 });
Pass a callback to show progress
auto callback = [](const disco::FitInfo& info) {
    std::cout << info.iteration << ": " << info.train_loss << std::endl;
};
auto recommender = Recommender<int, int>::fit_explicit(data, { .callback = callback });
Note: train_loss is not available for implicit feedback
Collaborative filtering suffers from the cold start problem. It’s unable to make good recommendations without data on a user or item, which is problematic for new users and items.
recommender.user_recs(new_user_id, 5); // returns empty array
There are a number of ways to deal with this, but here are some common ones:
- For user-based recommendations, show new users the most popular items
- For item-based recommendations, make content-based recommendations
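The first option can be sketched as a popularity fallback: count interactions per item and return the most frequent ones. `most_popular` is a hypothetical helper, not part of Disco, and it assumes string IDs and interactions given as (user_id, item_id) pairs:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Return up to `count` item IDs ordered by number of interactions.
// Ties are broken by item ID so the result is deterministic.
std::vector<std::string> most_popular(
    const std::vector<std::pair<std::string, std::string>>& interactions,
    std::size_t count
) {
    assert(count > 0);  // asking for zero items is almost certainly a caller bug

    // Count interactions per item (the second element of each pair).
    std::unordered_map<std::string, std::size_t> totals;
    for (const auto& interaction : interactions) {
        totals[interaction.second]++;
    }

    // Sort by count descending, then by item ID ascending.
    std::vector<std::pair<std::string, std::size_t>> ranked(totals.begin(), totals.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });

    std::vector<std::string> items;
    for (std::size_t i = 0; i < ranked.size() && i < count; i++) {
        items.push_back(ranked[i].first);
    }
    return items;
}
```

The result could be shown whenever `recommender.user_recs` returns nothing for a new user.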
Get ids
recommender.user_ids();
recommender.item_ids();
Get the global mean
recommender.global_mean();
Get factors
recommender.user_factors(user_id);
recommender.item_factors(item_id);
- A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization
- Faster Implicit Matrix Factorization
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/disco-cpp.git
cd disco-cpp
g++ -std=c++20 -Wall -Wextra -Werror -o test/main test/main.cpp
test/main