The EOSC Marketplace Recommender System uses deep reinforcement learning to suggest relevant scientific services to the appropriate researchers on the EOSC Marketplace portal.
The recommender system works as a microservice and exposes an API to the Marketplace.
The inner structure can be described as two elements (sketched below):
- a web service part based on Celery and Flask, with an API created, documented and validated with Flask-RESTX and Swagger
- a deep reinforcement learning part based on PyTorch and other ML libraries
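The following minimal sketch shows how these two parts could fit together. It is an illustration only: the names, module layout and endpoint are assumptions, not the project's actual code.

```python
from flask import Flask
from flask_restx import Api, Resource
from celery import Celery

app = Flask(__name__)
api = Api(app)  # Flask-RESTX auto-generates Swagger docs for registered resources

celery = Celery(app.name, broker="redis://localhost:6379")  # default Redis broker

@celery.task
def train_models():
    """Background task: the deep reinforcement learning part (PyTorch) runs here."""

@api.route("/recommendations")
class Recommendations(Resource):
    def post(self):
        # Web service part: validate the request and return recommended services.
        return {"recommendations": []}
```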
All required project packages are listed in the `Pipfile`. For their installation, see the setup instructions below.
If you want to use a GPU with PyTorch, you need a CUDA-capable device.
- Install `git`, `python` and `pipenv`
- Clone this repository and go to its root directory:
  ```shell
  git clone https://github.com/cyfronet-fid/recommender-system.git
  ```
- Install all required project packages by executing:
  ```shell
  pipenv install --dev
  ```
- To open the project virtual environment shell, type:
  ```shell
  pipenv shell
  ```
Launch the EOSC Marketplace Recommender server by executing, in the project root directory:

```shell
export FLASK_ENV=development
export FLASK_APP=app.py
pipenv run flask run
```

NOTE: You can customize the Flask host and port by using the `FLASK_RUN_HOST` and `FLASK_RUN_PORT` env variables accordingly.
To run background tasks you also need a Celery worker running alongside your server. To run the worker:

```shell
export FLASK_ENV=development
pipenv run celery -A worker:app worker
```

NOTE: Celery needs a running Redis broker server in the background.

NOTE: It is recommended that developers use docker-compose to run all the background servers (see the Docker section below).
The recommender system runs Celery to execute background tasks in a queue. As a backend, we are using Redis. By default, Redis runs on `redis://localhost:6379`.

NOTE: You can customize your Redis host URL using the `REDIS_HOST` env variable.

NOTE: It is recommended that developers use docker-compose to run all the background servers (see the Docker section below).
Install and start the MongoDB server following the Mongo installation instructions. It should be running on the default URL `mongodb://localhost:27017`.

NOTE: You can customize your MongoDB host path using the `MONGODB_HOST` env variable.
You can interact with the recommender system microservice using the API available (by default) at http://localhost:5000/
To run all background servers needed for development (Redis, MongoDB) it is recommended that you use Docker:

```shell
docker-compose up
```

Mongo will be exposed and available on your host at `127.0.0.1:27017`, and Redis at `127.0.0.1:6379`, although you can change them using the `MONGODB_HOST` and `REDIS_HOST` env variables accordingly.
NOTE: You still need to set up the Flask server and Celery worker as shown above. This is advantageous over the next option because you can run Pytest directly from your IDE, debug the application easily, restart the Flask server quickly, and avoid rebuilding your Docker image when dependencies change.
For a full-stack local development deployment use:

```shell
docker-compose -f docker-compose.yml -f development.yml up
```

This will build the application images and run the base Flask development server on `127.0.0.1:5000` (you can customize the Flask port and host using env variables). This command will also run the Celery worker, Mongo and Redis. You can immediately change the server code without restarting the containers.
To run the Jupyter notebook server along with the application stack, run:

```shell
docker-compose -f docker-compose.yml -f jupyter.yml up
```

NOTE: The URL of the Jupyter server will be displayed in the docker-compose output (default: http://127.0.0.1:8888/?token=SOME_JUPYTER_TOKEN). You can customize the Jupyter port and host using env variables.
The recommender system can use one of two implemented recommendation engines:
- `NCF` - based on the Neural Collaborative Filtering paper
- `RL` - based on the Deep Deterministic Policy Gradient paper

To specify which engine recommendations are requested from, provide the optional `engine_version` parameter inside the body of a `/recommendations` request: `NCF` denotes the NCF engine, while `RL` indicates the RL engine.
It is possible to define which algorithm should be used by default, in the absence of the `engine_version` parameter, by modifying the `DEFAULT_RECOMMENDATION_ALG` parameter in the .env file (see the ENV variables section). An example request is sketched below.
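For illustration, such a request could look like the following sketch. Only the endpoint and the `engine_version` field are documented in this section; the remaining body fields are hypothetical placeholders, so consult the Swagger docs for the actual request schema.

```python
import requests

body = {
    "engine_version": "RL",  # or "NCF"; omit to fall back to DEFAULT_RECOMMENDATION_ALG
    # ...further context fields required by the endpoint (see the Swagger docs)...
}

response = requests.post("http://localhost:5000/recommendations", json=body)
print(response.json())
```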
The simplest way to train a chosen agent is to use the `./bin/rails recommender:update` task on the Marketplace side. It automatically sends the most recent training data to the `/update` endpoint of the Recommender System, which preprocesses it and uses it to train the needed models.
If you want to have more fine-grained control, you can split this process into two parts (sketched below):
- sending the most recent data from the MP to the Recommender System's `/database_dumps` endpoint (using the `./bin/rails recommender:serialize_db` task on the MP side)
- triggering training by sending a request to the Recommender System's `/training` endpoint (after the process described above has finished)
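Conceptually, the two steps boil down to the following sketch. The base URL is the assumed default, and the dump payload is a placeholder (it is really produced by the MP task):

```python
import requests

RS = "http://localhost:5000"  # recommender system base URL (assumed default)

# Step 1 is normally performed on the Marketplace side by the
# `./bin/rails recommender:serialize_db` task, which POSTs a database dump:
dump_payload = {}  # placeholder - the real payload is produced by the MP task
requests.post(f"{RS}/database_dumps", json=dump_payload)

# Step 2: once the dump has been processed, trigger training:
requests.post(f"{RS}/training")
```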
GPU support can be enabled using the `TRAINING_DEVICE` environment variable (see the ENV variables section), but for now it doesn't work in the dev/test/prod environments because Celery uses the `fork` rather than the `spawn` multiprocessing start method, which is incompatible with CUDA. A fix will be available soon.
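For background: CUDA cannot be safely re-initialized in a `fork`ed child process, which is why PyTorch requires the `spawn` start method for GPU work in subprocesses. A standalone sketch (not part of the recommender code) of the `spawn`-based approach:

```python
import torch
import torch.multiprocessing as mp

def train_on_gpu(rank):
    # CUDA is initialized inside the freshly spawned child process.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.ones(2, 2, device=device)
    print(f"worker {rank}: {x.sum().item()}")

if __name__ == "__main__":
    # "spawn" starts a clean interpreter, so CUDA can initialize safely;
    # a "fork"ed child would inherit the parent's CUDA state and fail.
    mp.spawn(train_on_gpu, nprocs=1)  # spawn is the default start method here
```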
After training is finished, the system is immediately ready to serve recommendations (no manual reloading is needed).
To run all the tests in our app, run:

```shell
export FLASK_ENV=testing
pipenv run pytest ./tests
```

...or you can run them using Docker:

```shell
docker-compose -f docker-compose.testing.yml up && docker-compose -f docker-compose.testing.yml down
```
We are using MongoDB as our database, which is a NoSQL, schema-less, document-based DB. However, we are also using `mongoengine` - an ODM (Object-Document Mapper) which defines a "schema" for each document (like specifying field names or required values). This means that we need a minimalistic migration system to apply the defined "schema" changes, like changing a field name or dropping a collection, if we want to maintain Application <=> DB integrity.
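For illustration, a mongoengine document might look like this minimal sketch; the class, fields and DB name are hypothetical, not the actual recommender models:

```python
from mongoengine import Document, StringField, IntField, connect

connect(host="mongodb://localhost:27017/recommender")  # DB name is illustrative

class Service(Document):
    # mongoengine enforces this application-level "schema",
    # even though MongoDB itself is schema-less.
    name = StringField(required=True)
    category = StringField()
    popularity = IntField(default=0)

Service(name="B2DROP", category="storage").save()
```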
Migration Flask CLI commands (first set the `FLASK_ENV` variable to either `development` or `production`):
- `flask migrate apply` - applies migrations that have not been applied yet
- `flask migrate rollback` - reverts the previously applied migration
- `flask migrate list` - lists all migrations along with their application status
- `flask migrate check` - checks the integrity of the migrations, i.e. whether the migration files match the DB migration cache
- `flask migrate repopulate` - deletes the migration cache and repopulates it with all the migrations defined in the `/recommender/migrate` dir
To create a new migration:
- In the `/recommender/migrations` dir, create a python module with a name of the form `YYYYMMDDHHMMSS_migration_name` (e.g. `20211126112815_remove_unused_collections`)
- In this module, create a migration class (with an arbitrary name) which inherits from `BaseMigration`
- Implement the `up` (application) and `down` (teardown) methods using `self.pymongo_db` (pymongo, a low-level adapter for MongoDB, connected to the proper recommender DB instance, dependent on the `FLASK_ENV` variable)

(See the existing files in the `/recommender/migrate` dir for more detailed examples; a sketch follows below.)
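A hedged sketch of such a migration; the collection name is illustrative and the `BaseMigration` import path is an assumption:

```python
# /recommender/migrations/20211126112815_remove_unused_collections.py
from recommender.migrate.base_migration import BaseMigration  # import path is an assumption

class RemoveUnusedCollections(BaseMigration):
    def up(self):
        # Apply the schema change via the low-level pymongo handle.
        self.pymongo_db.drop_collection("unused_collection")

    def down(self):
        # Best-effort teardown: recreate the (now empty) collection.
        self.pymongo_db.create_collection("unused_collection")
```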
DO NOT DELETE EXISTING MIGRATION FILES. DO NOT CHANGE EXISTING MIGRATION FILE NAMES. DO NOT MODIFY THE CODE OF EXISTING MIGRATION FILES.
(If you performed any of those actions, run `flask migrate check` to determine what went wrong.)
We are using .env to store instance-specific constants or secrets. This file is not tracked by git and needs to be present in the project root directory. Details:
- `MONGODB_HOST` - URL and port of your running MongoDB server (example: `127.0.0.1:27018`), or the desired URL and port of your MongoDB server when it is run using docker-compose (recommended)
- `REDIS_HOST` - URL and port of your running Redis server (example: `127.0.0.1:6380`), or the desired URL and port of your Redis server when it is run using docker-compose (recommended)
- `FLASK_RUN_HOST` - desired URL of your application server (example: `127.0.0.1`)
- `FLASK_RUN_PORT` - desired port of your application server (example: `5001`)
- `JUPYTER_RUN_PORT` - desired port of your Jupyter server when run using Docker (example: `8889`)
- `JUPYTER_RUN_HOST` - desired host of your Jupyter server when run using Docker (example: `127.0.0.1`)
- `CELERY_LOG_LEVEL` - log level of your Celery worker when run using Docker (one of: `CRITICAL`, `ERROR`, `WARN`, `INFO` or `DEBUG`)
- `SENTRY_DSN` - the DSN tells Sentry where to send events (example: `https://16f35998712a415f9354a9d6c7d096e6@o556478.ingest.sentry.io/7284791`). If this variable does not exist, Sentry will just not send any events.
- `SENTRY_ENVIRONMENT` - environment name - optional, free-form string. If not specified and using Docker, it is set to `development`/`testing`/`production` according to the Docker environment.
- `SENTRY_RELEASE` - human-readable release name - optional, free-form string. If not specified, Sentry automatically sets it based on the commit revision number.
- `TRAINING_DEVICE` - the device used for training the neural networks: `cuda` for GPU support, or `cpu` (note: `cuda` support is experimental and works only in the `neural_cf` Jupyter notebook - not in the recommender dev/prod/test environments)
- `DEFAULT_RECOMMENDATION_ALG` - the version of the recommender engine (one of `NCF`, `RL`). Whenever request handling or a Celery task needs this variable, it is dynamically loaded from the .env file, so you can change it during Flask server runtime.
- `JMS_HOST` - the address of your JMS provider (optional)
- `JMS_PORT` - the port of your JMS provider (optional)
- `JMS_LOGIN` - your login to the JMS provider (optional)
- `JMS_PASSWORD` - your password to the JMS provider (optional)
NOTE: All the above variables have reasonable defaults, so you can simply leave your .env file empty if you want.
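For reference, a minimal .env that overrides a few defaults might look like this (values taken from the examples above):

```
MONGODB_HOST=127.0.0.1:27018
REDIS_HOST=127.0.0.1:6380
FLASK_RUN_PORT=5001
DEFAULT_RECOMMENDATION_ALG=RL
```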
To activate pre-commit, run:

```shell
pipenv run pre-commit install
```
Install the EnvFile plugin. Go to the run configuration of your choice, switch to the `EnvFile` tab, check `Enable EnvFile`, click the `+` button below, select the `.env` file and click `Apply` (details on the plugin's page).
In PyCharm, go to `Settings` -> `Tools` -> `Python Integrated Tools` -> `Testing` and choose `pytest`.
Remember to put the `FLASK_ENV=testing` env variable in the configuration.
While committing using the PyCharm Git GUI, pre-commit doesn't use the project environment and can't find the modules used in hooks.
To fix this, go to the `.git/hooks/pre-commit` script generated by the above command in the project directory and replace:

```python
# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'
```

with:

```python
# start templated
INSTALL_PYTHON = 'PATH/TO/YOUR/ENV/EXECUTABLE'
os.environ['PATH'] = f'{os.path.dirname(INSTALL_PYTHON)}{os.pathsep}{os.environ["PATH"]}'
```
Sentry is integrated with the Flask server and the Celery task queue manager, so all unhandled exceptions from these entities will be tracked and sent to Sentry.
Customization of the Sentry integration can be done via environment variables (see the ENV variables section) - you can read more about them here.
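A minimal sketch of what such an integration typically looks like with the `sentry-sdk` package; this illustrates the standard Sentry setup, not the project's actual initialization code:

```python
import os

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.flask import FlaskIntegration

# If SENTRY_DSN is unset, the SDK is effectively disabled and sends nothing,
# matching the behavior described in the ENV variables section.
sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN"),
    environment=os.environ.get("SENTRY_ENVIRONMENT"),
    release=os.environ.get("SENTRY_RELEASE"),
    integrations=[FlaskIntegration(), CeleryIntegration()],
)
```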