Anime picture recommendation bot
source: danbooru2019
preprocessing: scripts/export_metadata_to_csv
This is an implicit-feedback dataset: there are no explicit per-image ratings, only the set of images each user liked.
- Jaccard index on user likes (collaborative)
- Jaccard index on tags (content-based)
- Logistic regression over the two scores above
- (?) Some kind of SVD / ALS, some matrix magic?
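As an illustration of the content-based item, here is a minimal sketch of a tag-based Jaccard score. The function names, the sample tags, and the choice to average over the user's liked images are assumptions made for the example, not something the project prescribes.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def content_score(candidate_tags, liked_images_tags):
    """Mean tag-set Jaccard between a candidate image and the liked images.

    Averaging is an assumption of this sketch; a max or a sum would also work.
    """
    if not liked_images_tags:
        return 0.0
    total = sum(jaccard(candidate_tags, tags) for tags in liked_images_tags)
    return total / len(liked_images_tags)


# Hypothetical sample data: tag sets of two liked images and one candidate.
liked = [{"1girl", "long_hair", "smile"}, {"1girl", "short_hair"}]
candidate = {"1girl", "long_hair", "hat"}
print(content_score(candidate, liked))
```

Unlike the collaborative score, this needs only the candidate's tags, so it can also rank images that nobody has liked yet.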
For the collaborative score, calculate the following for each (user, image) pair:
# given target_user, image
sum( len(target_user.likes & user.likes) / len(target_user.likes | user.likes) for user in image.likes)
The image with the greatest value is the recommended one.
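The scoring rule can be sketched as runnable Python. This is a minimal version that assumes likes fit in memory as a dict mapping a user id to a set of image ids (a hypothetical layout, not the project's actual storage). It loops over users instead of over `image.likes`, which produces the same sums while visiting each neighbour once.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Optional


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target_user: str, likes: dict[str, set[str]]) -> Optional[str]:
    """Return the not-yet-liked image with the highest summed similarity."""
    target_likes = likes[target_user]
    scores: dict[str, float] = defaultdict(float)
    for user, user_likes in likes.items():
        if user == target_user:
            continue
        similarity = jaccard(target_likes, user_likes)
        if similarity == 0.0:
            continue
        # Every image this user liked gets credit equal to the similarity.
        for image in user_likes - target_likes:
            scores[image] += similarity
    return max(scores, key=scores.get) if scores else None


# Hypothetical sample data.
likes = {
    "alice": {"img1", "img2"},
    "bob": {"img1", "img2", "img3"},
    "carol": {"img2", "img4"},
}
print(recommend("alice", likes))  # img3
```

Here bob shares two of three images with alice (similarity 2/3) and carol one of three (1/3), so bob's `img3` outranks carol's `img4`.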
In practice the code above is slow on a large dataset, but we can ignore users whose coefficient is small, and that should not significantly lower the quality.
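A minimal sketch of that pruning, assuming the same in-memory dict of likes and a hypothetical `min_similarity` cutoff (the value 0.2 below is arbitrary). Note that this version still computes the coefficient for every user and only skips the inner loop over their images.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_images(target_user, likes, min_similarity=0.0):
    """Sum neighbour similarities per candidate image, skipping weak neighbours."""
    target_likes = likes[target_user]
    scores = defaultdict(float)
    for user, user_likes in likes.items():
        if user == target_user:
            continue
        similarity = jaccard(target_likes, user_likes)
        # Users at or below the cutoff contribute nothing and are skipped.
        if similarity <= min_similarity:
            continue
        for image in user_likes - target_likes:
            scores[image] += similarity
    return dict(scores)


# Hypothetical sample data.
likes = {
    "alice": {"img1", "img2", "img3"},
    "bob": {"img1", "img2", "img3", "img4"},    # similarity 3/4
    "carol": {"img3", "img5", "img6", "img7"},  # similarity 1/6
}
full = score_images("alice", likes)
pruned = score_images("alice", likes, min_similarity=0.2)
print(max(full, key=full.get), max(pruned, key=pruned.get))
```

In this toy case dropping carol removes three low-scored candidates and leaves the top recommendation unchanged.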
Simple code, average performance, good quality.
- If a user belongs to two or more independent clusters (groups of users with similar interests), the algorithm recommends only images from the most heavily weighted cluster and ignores the others, so diversity suffers.
- If a given image is not in our dataset, we can do nothing with it. This is, however, a drawback of collaborative filtering systems in general.