Primary language: C++. License: MIT.

Disco C++

🔥 Recommendations for C++ using collaborative filtering

  • Supports user-based and item-based recommendations
  • Works with explicit and implicit feedback
  • Uses high-performance matrix factorization

🎉 Zero dependencies

Installation

Add the header to your project and include it (requires C++20 or later)

#include "disco.hpp"

Getting Started

Prep your data in the format user_id, item_id, value

using disco::Dataset;

auto data = Dataset<std::string, std::string>();
data.push("user_a", "item_a", 5.0);
data.push("user_a", "item_b", 3.5);
data.push("user_b", "item_a", 4.0);

IDs can be integers, strings, or any other hashable data type, as long as they match the dataset's template parameters (for example, Dataset<int, std::string>)

data.push(1, "item_a", 5.0);
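
For an ID type that is not hashable out of the box, the usual standard-library approach is a `std::hash` specialization plus `operator==`. The sketch below is only an illustration: `SkuId` and `count_ids` are hypothetical names, and it is an assumption that Disco relies on `std::hash` for its IDs. The example checks the type against `std::unordered_map` only.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical composite ID: a vendor name plus a numeric SKU.
struct SkuId {
    std::string vendor;
    int sku;

    bool operator==(const SkuId& other) const {
        return vendor == other.vendor && sku == other.sku;
    }
};

namespace std {
// Specialize std::hash so SkuId can be a key in hash-based containers.
template <>
struct hash<SkuId> {
    size_t operator()(const SkuId& id) const noexcept {
        size_t h1 = hash<string>{}(id.vendor);
        size_t h2 = hash<int>{}(id.sku);
        // Combine the two hashes so that both fields influence the result.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};
}  // namespace std

// Counts occurrences per ID; this only compiles because SkuId is hashable.
std::unordered_map<SkuId, int> count_ids(const std::vector<SkuId>& ids) {
    std::unordered_map<SkuId, int> counts;
    for (const auto& id : ids) {
        ++counts[id];
    }
    return counts;
}
```

If the assumption holds, a type like this could be used as either template parameter of the dataset.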

If users rate items directly, this is known as explicit feedback. Fit the recommender with:

using disco::Recommender;

auto recommender = Recommender<std::string, std::string>::fit_explicit(data);

If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Use 1.0 or a value like number of purchases or page views for the dataset, and fit the recommender with:

auto recommender = Recommender<std::string, std::string>::fit_implicit(data);
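
As a rough sketch of preparing implicit feedback, raw events can be collapsed into one count per user and item before they go into the dataset. The `Event` struct and `count_events` helper are hypothetical and not part of Disco.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical raw event: one purchase or page view.
struct Event {
    std::string user_id;
    std::string item_id;
};

// Collapses repeated events into a count per (user, item) pair.
std::map<std::pair<std::string, std::string>, float> count_events(
    const std::vector<Event>& events) {
    std::map<std::pair<std::string, std::string>, float> counts;
    for (const auto& event : events) {
        counts[std::make_pair(event.user_id, event.item_id)] += 1.0f;
    }
    return counts;
}
```

Each entry then corresponds to one `data.push(user_id, item_id, count)` call.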

Get user-based recommendations - “users like you also liked”

recommender.user_recs(user_id, 5);

Get item-based recommendations - “users who liked this item also liked”

recommender.item_recs(item_id, 5);

Get predicted ratings for a specific user and item

recommender.predict(user_id, item_id);

Get similar users

recommender.similar_users(user_id, 5);

Examples

MovieLens

Download the MovieLens 100K dataset.

And use:

#include <cassert>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

#include "disco.hpp"

using disco::Dataset;
using disco::Recommender;

Dataset<int, std::string> load_movielens(const std::string& path) {
    std::string line;

    // read movies
    std::unordered_map<std::string, std::string> movies;
    std::ifstream movies_file(path + "/u.item");
    assert(movies_file.is_open());
    while (std::getline(movies_file, line)) {
        std::string::size_type n = line.find('|');
        std::string::size_type n2 = line.find('|', n + 1);
        movies.emplace(std::make_pair(line.substr(0, n), line.substr(n + 1, n2 - n - 1)));
    }

    // read ratings and create dataset
    auto data = Dataset<int, std::string>();
    std::ifstream ratings_file(path + "/u.data");
    assert(ratings_file.is_open());
    while (std::getline(ratings_file, line)) {
        std::string::size_type n = line.find('\t');
        std::string::size_type n2 = line.find('\t', n + 1);
        std::string::size_type n3 = line.find('\t', n2 + 1);
        data.push(
            std::stoi(line.substr(0, n)),
            movies.at(line.substr(n + 1, n2 - n - 1)),
            std::stof(line.substr(n2 + 1, n3 - n2 - 1))
        );
    }

    return data;
}

int main() {
    // https://grouplens.org/datasets/movielens/100k/
    char *movielens_path = std::getenv("MOVIELENS_100K_PATH");
    if (!movielens_path) {
        std::cout << "Set MOVIELENS_100K_PATH" << std::endl;
        return 1;
    }

    auto data = load_movielens(movielens_path);
    auto recommender = Recommender<int, std::string>::fit_explicit(data, { .factors = 20 });

    std::string movie = "Star Wars (1977)";
    std::cout << "Item-based recommendations for " << movie << std::endl;
    for (auto& rec : recommender.item_recs(movie)) {
        std::cout << "- " << rec.first << std::endl;
    }

    int user_id = 123;
    std::cout << std::endl << "User-based recommendations for " << user_id << std::endl;
    for (auto& rec : recommender.user_recs(user_id)) {
        std::cout << "- " << rec.first << std::endl;
    }

    return 0;
}

Storing Recommendations

Save recommendations to your database.

Alternatively, you can store only the factors and use a library like pgvector-cpp.
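
If only the factors are stored, similar items can be found later by comparing factor vectors. The sketch below uses cosine similarity over plain `std::vector<float>` values. That storage layout, the `cosine_similarity` and `most_similar` helpers, and the choice of cosine as the metric are assumptions for illustration, not the API of Disco or pgvector-cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two factor vectors of equal length.
// Returns 0 when either vector has zero norm.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float dot = 0.0f;
    float norm_a = 0.0f;
    float norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) {
        return 0.0f;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Index of the stored vector most similar to the query, or -1 if none are stored.
int most_similar(const std::vector<float>& query,
                 const std::vector<std::vector<float>>& stored) {
    int best = -1;
    float best_score = -2.0f;  // below the minimum cosine value of -1
    for (std::size_t i = 0; i < stored.size(); ++i) {
        float score = cosine_similarity(query, stored[i]);
        if (score > best_score) {
            best_score = score;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A linear scan like this is fine for small catalogs. A vector index in the database does the same comparison at scale.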

Algorithms

Disco uses high-performance matrix factorization.

Specify the number of factors and iterations

auto recommender = Recommender<int, int>::fit_explicit(data, { .factors = 8, .iterations = 20 });
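
To illustrate what the two options control, here is a toy matrix factorization trained with stochastic gradient descent. It is a simplified sketch, not Disco's implementation: the `Rating` and `Model` structs, the `train` and `rmse` functions, the learning rate, and the initialization are all assumptions. In this sketch, `factors` sets the length of each user and item vector, and `iterations` sets the number of passes over the ratings.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One observed rating, with users and items already mapped to indices.
struct Rating {
    std::size_t user;
    std::size_t item;
    float value;
};

struct Model {
    std::vector<std::vector<float>> user_factors;
    std::vector<std::vector<float>> item_factors;

    // In this sketch, a predicted rating is the dot product of the two vectors.
    float predict(std::size_t user, std::size_t item) const {
        float sum = 0.0f;
        for (std::size_t f = 0; f < user_factors[user].size(); ++f) {
            sum += user_factors[user][f] * item_factors[item][f];
        }
        return sum;
    }
};

// Root mean squared error over the given ratings.
float rmse(const Model& model, const std::vector<Rating>& ratings) {
    float total = 0.0f;
    for (const auto& r : ratings) {
        float err = r.value - model.predict(r.user, r.item);
        total += err * err;
    }
    return std::sqrt(total / static_cast<float>(ratings.size()));
}

Model train(const std::vector<Rating>& ratings, std::size_t users, std::size_t items,
            std::size_t factors, int iterations, float learning_rate = 0.01f) {
    Model model;
    model.user_factors.assign(users, std::vector<float>(factors));
    model.item_factors.assign(items, std::vector<float>(factors));
    // Deterministic, slightly varied starting values keep the example reproducible.
    for (std::size_t u = 0; u < users; ++u) {
        for (std::size_t f = 0; f < factors; ++f) {
            model.user_factors[u][f] = 0.1f + 0.01f * static_cast<float>((u + f) % 7);
        }
    }
    for (std::size_t i = 0; i < items; ++i) {
        for (std::size_t f = 0; f < factors; ++f) {
            model.item_factors[i][f] = 0.1f + 0.01f * static_cast<float>((i + 2 * f + 3) % 7);
        }
    }
    // Each iteration is one pass over all ratings.
    for (int it = 0; it < iterations; ++it) {
        for (const auto& r : ratings) {
            float err = r.value - model.predict(r.user, r.item);
            for (std::size_t f = 0; f < factors; ++f) {
                float uf = model.user_factors[r.user][f];
                float vf = model.item_factors[r.item][f];
                model.user_factors[r.user][f] += learning_rate * err * vf;
                model.item_factors[r.item][f] += learning_rate * err * uf;
            }
        }
    }
    return model;
}
```

More factors let the model capture finer distinctions at the cost of more computation and a higher risk of overfitting. More iterations give the optimizer more passes to reduce the training error.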

Progress

Pass a callback to show progress

auto callback = [](const disco::FitInfo& info) {
    std::cout << info.iteration << ": " << info.train_loss << std::endl;
};
auto recommender = Recommender<int, int>::fit_explicit(data, { .callback = callback });

Note: train_loss is not available for implicit feedback

Cold Start

Collaborative filtering suffers from the cold start problem. It’s unable to make good recommendations without data on a user or item, which is problematic for new users and items.

recommender.user_recs(new_user_id, 5); // returns empty array

There are a number of ways to deal with this, but here are some common ones:

  • For user-based recommendations, show new users the most popular items
  • For item-based recommendations, make content-based recommendations
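
The first strategy can be sketched as a popularity fallback. `most_popular` is a hypothetical helper that ranks items by how often they appear in the interaction data. It assumes the caller keeps the raw item IDs alongside the recommender and uses the result when `user_recs` comes back empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns up to `count` item IDs ordered by interaction count, most popular first.
// Ties are broken by item ID so the result is deterministic.
std::vector<std::string> most_popular(const std::vector<std::string>& item_ids,
                                      std::size_t count) {
    std::unordered_map<std::string, std::size_t> tally;
    for (const auto& id : item_ids) {
        ++tally[id];
    }

    std::vector<std::pair<std::string, std::size_t>> ranked(tally.begin(), tally.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) {
            return a.second > b.second;
        }
        return a.first < b.first;
    });

    std::vector<std::string> result;
    for (std::size_t i = 0; i < ranked.size() && i < count; ++i) {
        result.push_back(ranked[i].first);
    }
    return result;
}
```

For implicit feedback, the raw count is a reasonable popularity signal. For explicit ratings, an average rating with a minimum number of votes may rank items more sensibly.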

Reference

Get ids

recommender.user_ids();
recommender.item_ids();

Get the global mean

recommender.global_mean();

Get factors

recommender.user_factors(user_id);
recommender.item_factors(item_id);

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/disco-cpp.git
cd disco-cpp
g++ -std=c++20 -Wall -Wextra -Werror -o test/main test/main.cpp
test/main