MovieLens in Neo4j

Loads the MovieLens dataset into Neo4j and provides an API to retrieve the data.

Requirements
Project structure
Data
Graph structure
Ingestion
API
Docker
Recommender Engine

Requirements

Python 3.7
py2neo
neo4j
flask
swagger
connexion

Project structure

├── api/
│   │── swagger/
│   │   └── swagger.yml
│   └── movielens-app.py
│      
├── docker/
│   └── ...
│
├── ingestion/
│   │── data/
│   │   │── links.csv
│   │   │── movies.csv
│   │   │── ratings.csv
│   │   └── tags.csv
│   │── test/
│   │   └── ingestion_tests.py
│   └── ingestion.py
│
└── README.md

Data

MovieLens 20M Dataset

20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.

Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

Tags Data File Structure (tags.csv)

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

Movies Data File Structure (movies.csv)

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Links Data File Structure (links.csv)

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId
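
As a rough illustration of how these files can be read, the sketch below parses a few rows in the movies.csv layout with Python's csv module. The helper name parse_movies and the in-memory sample are hypothetical, and splitting genres on "|" assumes the pipe-separated convention used in the MovieLens files.

```python
import csv
import io

# Illustrative sample in the movies.csv layout (movieId,title,genres).
SAMPLE_MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
110,Braveheart (1995),Action|Drama|War
"""


def parse_movies(handle):
    """Yield one dict per movie row, with genres split into a list."""
    for row in csv.DictReader(handle):
        yield {
            "movieId": int(row["movieId"]),
            "title": row["title"],
            # Assumption: genres are pipe-separated, as in the MovieLens files.
            "genres": row["genres"].split("|"),
        }


movies = list(parse_movies(io.StringIO(SAMPLE_MOVIES_CSV)))
```

The same pattern applies to ratings.csv, tags.csv and links.csv by swapping the column names; a real file would be opened with open(path, newline="") instead of io.StringIO.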

Graph structure

The graph structure consists of nodes with 3 distinct labels (Genre, Movie, User), and 3 relationships (RATED, TAGGED, IS_GENRE_OF). Links are added as additional properties to movie nodes.

Ingestion

A Python script (ingestion.py) that loads the MovieLens dataset into Neo4j as a graph.

Steps

Create Genre nodes
Load movies.csv
- Create Movie nodes
- Create Movie-Genre relationships
Load ratings.csv
- Create User nodes
- Create User-Movie rating relationships
Load tags.csv
- Create User-Movie tag relationships
Load links.csv
- Update Movie nodes properties with links
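
The movie step above can be sketched as a function that turns one parsed movie into parameterised Cypher statements. This is not the code in ingestion.py: the function name, the id property and the use of MERGE are assumptions made for illustration. Only the Movie and Genre labels, the title and name properties, and the Genre-to-Movie direction of IS_GENRE_OF come from the rest of this README.

```python
# Hypothetical statement templates; $name-style parameters are filled in by the driver.
MOVIE_QUERY = "MERGE (m:Movie {id: $id}) SET m.title = $title"
MOVIE_GENRE_QUERY = (
    "MATCH (m:Movie {id: $id}), (g:Genre {name: $genre}) "
    "MERGE (g)-[:IS_GENRE_OF]->(m)"
)


def movie_statements(movie):
    """Return (query, params) pairs that create one Movie node and its genre links.

    Assumes the Genre nodes already exist, matching the step order listed above.
    """
    statements = [(MOVIE_QUERY, {"id": movie["id"], "title": movie["title"]})]
    for genre in movie["genres"]:
        statements.append((MOVIE_GENRE_QUERY, {"id": movie["id"], "genre": genre}))
    return statements


statements = movie_statements(
    {"id": 110, "title": "Braveheart", "genres": ["Action", "Drama", "War"]}
)
```

Each pair can then be handed to whichever Neo4j client is in use, as long as it accepts a query string and a parameter dict. Keeping the values in parameters rather than formatting them into the query string avoids quoting problems with titles that contain apostrophes or commas.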

API

Description

API documentation is generated using Swagger and Connexion.

One example:

/api/movie/ratings/[TITLE]

Returns the ratings submitted for a given movie.

http://localhost:5000/api/movie/ratings/Braveheart

will return:

[
  {
    "rating": 4.0, 
    "user": "User 1"
  }, 
  {
    "rating": 4.0, 
    "user": "User 5"
  }, 
  {
    "rating": 5.0, 
    "user": "User 6"
  }
]
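
A minimal client sketch for this endpoint, using only the standard library, might look as follows. The function names are hypothetical, and percent-encoding the title is an assumption about how titles with spaces or commas (for example "Great Escape, The") should be passed in the path.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5000/api"  # host and port as used in the example above


def ratings_url(title):
    """Build the ratings endpoint URL, percent-encoding the title."""
    return f"{BASE_URL}/movie/ratings/{quote(title, safe='')}"


def fetch_ratings(title):
    """Call the endpoint and decode the JSON list (requires the API to be running)."""
    with urlopen(ratings_url(title)) as response:
        return json.load(response)


print(ratings_url("Braveheart"))
# prints http://localhost:5000/api/movie/ratings/Braveheart
```

With the containers running, fetch_ratings("Braveheart") should return a list of dicts with the "rating" and "user" keys shown above.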

Documentation

Once docker-compose up has finished, go to http://localhost:5000/api/ui to see the full documentation.

Docker

Instructions

cd into the docker folder
run docker-compose up
wait for ingestion to finish
open Neo4j UI at http://localhost:7474
open API documentation at http://localhost:5000/api/ui

For the Docker solution the MovieLens version with 100K ratings was used.

If you want to use the 20M dataset:

download dataset from http://files.grouplens.org/datasets/movielens/ml-20m.zip
move unzipped data into docker/ingestion/data
then follow instructions above

By default, ingestion.py loads only 1000 movies, 1000 links, 1000 ratings, and 1000 tags.

To load more, raise the following limits in ingestion.py:

N_MOVIES = 1000
N_RATINGS = 1000
N_TAGS = 1000
N_LINKS = 1000

If only a subset is used, some relationships might not be created due to missing nodes.
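
The effect of these limits can be illustrated with a small sketch. It assumes the limits are applied by taking the first N rows of each file, which this README does not confirm, and it uses made-up rows and a tiny limit so the result is easy to follow.

```python
from itertools import islice

N_MOVIES = 2  # tiny value for illustration only; the default above is 1000

# Made-up rows standing in for movies.csv and ratings.csv.
movie_rows = [{"movieId": 1}, {"movieId": 2}, {"movieId": 3}]
rating_rows = [
    {"userId": 1, "movieId": 1, "rating": 4.0},
    {"userId": 1, "movieId": 3, "rating": 5.0},
]

# Assumption: only the first N_MOVIES rows become Movie nodes.
loaded_movie_ids = {row["movieId"] for row in islice(movie_rows, N_MOVIES)}

# Ratings that point at a movie outside the loaded subset have no node to attach to.
skipped = [row for row in rating_rows if row["movieId"] not in loaded_movie_ids]
```

Here the rating for movie 3 ends up in skipped, because only movies 1 and 2 were loaded. This is the situation in which a relationship cannot be created.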

Structure

docker/
├── api/
│   │── swagger/
│   │   └── swagger.yml
│   │── Dockerfile
│   │── movielens-app.py
│   └── requirements.txt
│      
├── ingestion/
│   │── data/
│   │   │── links.csv
│   │   │── movies.csv
│   │   │── ratings.csv
│   │   └── tags.csv
│   │── Dockerfile
│   │── ingestion.py
│   └── requirements.txt
│
└── docker-compose.yml

Recommender Engine

Based on: http://guides.neo4j.com/sandbox/recommendations

Content-based

Recommend top N movies for a given movie, based on common genres.

/api/rec_engine/content/[TITLE]/[N]

Example:

Top 3 movies similar to Braveheart.

http://localhost:5000/api/rec_engine/content/Braveheart/3

Returns:

[
  {
    "genres": [
      "Action", 
      "Drama", 
      "War"
    ], 
    "numberOfSharedGenres": 3, 
    "title": "Courage Under Fire"
  }, 
  {
    "genres": [
      "Action", 
      "Drama", 
      "War"
    ], 
    "numberOfSharedGenres": 3, 
    "title": "Great Escape, The"
  }, 
  {
    "genres": [
      "Action", 
      "Drama", 
      "War"
    ], 
    "numberOfSharedGenres": 3, 
    "title": "Henry V"
  }
]

Cypher query in Neo4j:

MATCH (m:Movie)<-[:IS_GENRE_OF]-(g:Genre)-[:IS_GENRE_OF]->(rec:Movie)
WHERE m.title = [TITLE]
WITH rec, COLLECT(g.name) AS genres, COUNT(*) AS sharedGenres
RETURN rec.title AS title, genres, sharedGenres
ORDER BY sharedGenres DESC LIMIT [N];
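
To make the logic of the query easier to follow, here is an in-memory Python sketch of the same idea: count the genres each other movie shares with the given one and keep the top N. The function name and the sample data are made up. Breaking ties by title is an assumption, since the Cypher query leaves the order of ties unspecified.

```python
def recommend_by_genre(movies, title, n):
    """Rank other movies by the number of genres shared with `title`.

    `movies` maps a title to its set of genres.
    """
    target_genres = movies[title]
    scored = []
    for other, genres in movies.items():
        if other == title:
            continue
        shared = sorted(target_genres & genres)
        if shared:  # movies with no shared genre never match the graph pattern
            scored.append(
                {"title": other, "genres": shared, "numberOfSharedGenres": len(shared)}
            )
    # Most shared genres first; title as tie-breaker (assumption).
    scored.sort(key=lambda rec: (-rec["numberOfSharedGenres"], rec["title"]))
    return scored[:n]


# Made-up genre sets for illustration.
sample = {
    "Braveheart": {"Action", "Drama", "War"},
    "Courage Under Fire": {"Action", "Drama", "War"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Toy Story": {"Adventure", "Animation", "Children"},
}
top = recommend_by_genre(sample, "Braveheart", 2)
```

With this sample, top contains "Courage Under Fire" with three shared genres, followed by "Heat" with one. "Toy Story" shares no genre with "Braveheart" and is left out.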

Collaborative Filtering

Recommend top N movies for a given user, based on collaborative filtering. For this to work properly, many more than 1000 ratings should be loaded.

/api/rec_engine/collab/[USER_ID]/[N]
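
This README does not show the query behind this endpoint. As a hedged illustration of the general idea only, the sketch below implements a simple neighbour-based variant in plain Python: find users who liked some of the same movies, then suggest other movies they liked. The function name, the 4.0 "liked" threshold and the scoring rule are all assumptions, not the project's method.

```python
from collections import Counter


def recommend_collab(ratings, user_id, n, min_rating=4.0):
    """Suggest up to `n` unseen titles liked by users with overlapping taste.

    `ratings` maps a user id to a dict of {title: rating}.
    """
    mine = ratings[user_id]
    liked = {title for title, rating in mine.items() if rating >= min_rating}
    scores = Counter()
    for other, theirs in ratings.items():
        if other == user_id:
            continue
        their_liked = {t for t, r in theirs.items() if r >= min_rating}
        overlap = liked & their_liked
        if not overlap:
            continue  # no shared liked movie, so this user is not a neighbour
        for title in their_liked:
            if title not in mine:
                # Weight each suggestion by how much taste the neighbour shares.
                scores[title] += len(overlap)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:n]]


# Made-up ratings for illustration.
sample_ratings = {
    1: {"Braveheart": 5.0, "Heat": 4.0},
    2: {"Braveheart": 4.5, "Heat": 4.0, "Henry V": 5.0},
    3: {"Braveheart": 4.0, "Toy Story": 4.0},
    4: {"Toy Story": 5.0, "Jumanji": 4.0},
}
suggestions = recommend_collab(sample_ratings, 1, 2)
```

User 4 shares no liked movie with user 1 and contributes nothing. With few ratings loaded, most users end up in that position, which is why the endpoint needs far more than the default 1000 ratings to give useful results.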