FLUIDOS Energy Demand Predictor

Overview

This project predicts the energy demand of a FLUIDOS node. It uses a neural network that, given a certain machine, takes as input

  • the past workload of the machine (at the moment, a week's worth of data)
  • a power profile of the machine (more information below)

and outputs the predicted energy demand for the following day.


As of April 22nd, 2024, this program is a proof of concept: it is still in development and not ready for production use. Features may be added or removed at any time, and accuracy may not be perfect. The program is provided as-is, with no guarantees of any kind.


Installation

Due to the machine-specific nature of the project, no pre-trained models are provided. To use it, you will need to train your own model.

Prerequisites

First, set the environment variables in the .env file, which is purposely not included in the repository for security reasons. The .env file must be placed in the root folder of the project and must contain the following variables:

export DATA_TRAINING_FOLDER=/path/to/data/folder
export MODELS_TRAINING_FOLDER=/path/to/models/folder
export OUT_TRAINING_FOLDER=/path/to/output/folder
export TELEMETRY_NODE_HOST=127.0.0.1
export TELEMETRY_NODE_PORT=5000

Each variable represents the following:

  • DATA_TRAINING_FOLDER points to the folder used for storing the initial training data (read below for more information on the folder structure);
  • MODELS_TRAINING_FOLDER points to the folder used for storing the trained models;
  • OUT_TRAINING_FOLDER points to the folder used for storing the predictions and test results.
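
As a minimal sketch (assuming the variables have been exported into the environment, for example by sourcing the .env file before launching the program), the folders could then be read from Python like so:

import os

# Folders configured in .env; the names match the variables listed above.
data_folder = os.environ["DATA_TRAINING_FOLDER"]
models_folder = os.environ["MODELS_TRAINING_FOLDER"]
out_folder = os.environ["OUT_TRAINING_FOLDER"]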

You may then choose between the following installation methods:

Classic

Create a Python 3 virtual environment using the tool of your choice (venv, conda, or anything else) with Python 3.11.4. Then, install the dependencies using pip install -r requirements.txt.

On macOS platforms running an Apple Silicon processor, you may want to install tensorflow-metal to accelerate GPU processing. Apple's tensorflow-metal documentation provides precise information on how to correctly install the TensorFlow plugin on these platforms.

Docker

A Dockerfile is provided within the project. If you want to build the image yourself, run docker build -t fluidos-energy-demand-predictor . from the root of the project. You may want to use docker compose to run the image with the following docker-compose.yml file:

name: fluidos-energy-predictor
services:
    fluidos-energy-predictor:
        stdin_open: true
        tty: true
        container_name: predictor
        volumes:
            - ${DATA_TRAINING_FOLDER}:/app/data
            - ${MODELS_TRAINING_FOLDER}:/app/models
            - ${OUT_TRAINING_FOLDER}:/app/out
        environment:
            - TZ=Europe/Rome
        restart: unless-stopped
        image: ghcr.io/risingfbk/fluidos-energy-predictor:github
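
With this file in place, the container can be started with docker compose up -d and stopped with docker compose down.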

Pre-built images are available on the GitHub Container Registry and on Docker Hub, both for x86_64 and arm64 platforms.

Data folder structure

For the training, the data folder must have the following structure:

├── gcd
├── spec2008_agg
└── spec2008

Samples of this data are provided in the releases section of this repository.

The gcd folder contains the workload data from the Google Cluster Data version 3 (repository, document), while the spec2008 folder contains power profiles from the SPEC2008 benchmark suite (link). The spec2008_agg folder contains the aggregated power profiles from the same suite.

For your convenience, scripts for fetching spec2008 data and generating both gcd and spec2008 data are provided in the src/datasets/ folder.

Scripts for pulling data from the gcd dataset are not included, as doing so requires a Google Cloud Platform account, a project with billing enabled, and a lot of patience. Our data was retrieved with both bq and manual downloads, both of which are described in the documentation linked above.

Usage

Run python3 src/main.py --help for a list of available commands. The program reads the flags passed to it and, if required information is missing, prompts for it interactively.

options:
  -h, --help            show this help message and exit
  --model MODEL, -m MODEL
                        Model name (if unspecified, will be prompted)
  --curve CURVE, -c CURVE
                        Power curve file (if unspecified, will be chosen randomly)
  --epochs EPOCHS, -e EPOCHS
                        Number of epochs (if unspecified, will be prompted)
  --action ACTION, -a ACTION
                        Action to perform (train, search hyperparameters, test)
  --machine MACHINE, -M MACHINE
                        GCD machine files to use (if unspecified, will be chosen randomly)

The program requires at least a model name (--model), a power curve for the machine (--curve), and a machine to use for training (--machine). If unspecified, the model name will be prompted for, while the power curve and the machine will be chosen randomly from the available ones.

Then, depending on the action (--action), the program will either search for hyperparameters, train, or test the given model, optionally taking the number of epochs to train for (--epochs).
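
For example, a training run for a hypothetical model named model1 could look like python3 src/main.py --model model1 --action train --epochs 1000, letting the program pick a power curve and a machine at random; adding --curve and --machine pins them explicitly.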

If the program is set to train a model, it will automatically save the model and the logs in the models and out folders, respectively. If the program is set to test a model, it will automatically load the model from the models folder (as a convenience, the contents of the models folder are printed when the program is run) and save the predictions and the test results in the out folder.

Finally, the program will automatically generate a number of plots in the out folder, including the training and validation loss and the predictions. Predictions are additionally saved in the pred folder as .csv and .npy files.
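
The saved predictions can be reloaded for further analysis; a minimal sketch, with an illustrative file name:

import numpy as np

# Load a saved prediction; the actual file name depends on the model and run.
predictions = np.load("pred/example_prediction.npy")
print(predictions.shape)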

Given how the model is currently implemented, it is suggested to use a healthy number of epochs (at least 1000) for training. The program will automatically stop the training if the validation loss does not improve for a certain number of epochs (see below for more information).

Specifying variables

The src/parameters.py file contains a number of constants that can be used to specify the parameters of the program.

Training parameters

TEST_FILE_AMOUNT = 24
TRAIN_FILE_AMOUNT = 24

The program uses the TRAIN_FILE_AMOUNT and TEST_FILE_AMOUNT constants to specify the number of files to use for training and testing, respectively. These files are pulled from the gcd and spec2008 folders, specifically from the subfolder the user specifies when running the program, or from one chosen at random if none is specified.

SPLIT = 0.25

The SPLIT constant specifies the fraction of the training data to use for validation.

PATIENCE = 150
LEARNING_RATE = 0.02

The PATIENCE constant specifies the number of epochs to wait before stopping the training if the validation loss does not improve. The LEARNING_RATE constant specifies the learning rate to use for the training. Note that the code will adjust the learning rate automatically if the validation loss does not improve.
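
The exact training loop lives in the source code; purely as a sketch of how these constants commonly map onto Keras (the callback choice, loss, and reduction factor below are assumptions, not the project's actual settings):

import tensorflow as tf

SPLIT = 0.25          # fraction of the training data used for validation
PATIENCE = 150        # epochs to wait before early stopping
LEARNING_RATE = 0.02  # initial learning rate


def fit_model(model, x_train, y_train, epochs):
    # Stop when the validation loss has not improved for PATIENCE epochs and
    # lower the learning rate on plateaus (illustrative factor and patience).
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=PATIENCE,
                                         restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                             patience=PATIENCE // 3),
    ]
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
                  loss="mse")
    return model.fit(x_train, y_train, epochs=epochs,
                     validation_split=SPLIT, callbacks=callbacks)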

Model parameters

N_FEATURES = 2                              # Number of features (CPU, memory)
STEPS_OUT = 1                               # Number of output steps from the model (a single value)
STEPS_IN = WEEK_IN_MINUTES // GRANULARITY   # Number of input steps to the model (see below)

These parameters should not be changed, as they are hardcoded in the model. They specify the number of input features and the numbers of input and output steps of the model. The WEEK_IN_MINUTES constant specifies the number of minutes in a week, and can be found in the support/dt.py file along with similar constants.

GRANULARITY = 15         # Granularity of the data in minutes

The GRANULARITY constant specifies the granularity of the data in minutes, which depends on how the data was generated. With GRANULARITY = 15, the model thus has 10080 // 15 = 672 input steps for a week of data.
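
In code form, the relationship between these constants is simply:

WEEK_IN_MINUTES = 7 * 24 * 60              # 10080, defined in support/dt.py
GRANULARITY = 15                           # one sample every 15 minutes
STEPS_IN = WEEK_IN_MINUTES // GRANULARITY  # 672 input steps per week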

FILTERS = 144
KSIZE = 3

The FILTERS and KSIZE constants specify the number of filters and the kernel size of the convolutional layers.
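
The full architecture is defined in the source code; purely to illustrate where these constants plug in, a single 1-D convolutional layer over the weekly input window might look as follows (the layer choice and activation are assumptions):

import tensorflow as tf

N_FEATURES = 2   # CPU, memory
STEPS_IN = 672   # one week of data at 15-minute granularity
FILTERS = 144
KSIZE = 3

# Hypothetical: one Conv1D layer consuming the (STEPS_IN, N_FEATURES) window.
inputs = tf.keras.Input(shape=(STEPS_IN, N_FEATURES))
features = tf.keras.layers.Conv1D(filters=FILTERS, kernel_size=KSIZE,
                                  activation="relu")(inputs)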

OVERHEAD = 1

The OVERHEAD constant specifies by how much the predicted energy consumption should be increased to account for the overhead introduced by the machine.
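For instance, assuming the constant is applied as a multiplicative factor, OVERHEAD = 1.1 would inflate the predicted consumption by 10%, while the default value of 1 leaves it unchanged.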

Folder structure and other parameters

LOG_FOLDER = "out"
DEFAULT_MODEL = "model1"
MODEL_FOLDER = "models"
GCD_FOLDER = "data/gcd"
SPEC_FOLDER = "data/spec2008_agg"
CACHE_FOLDER = "data/cache"
BANLIST_FILE = "banlist"

These constants specify the folder structure of the program. If you change them, make sure the actual folder structure reflects the changes, especially the volume mounts if you are using Docker.

Using the banlist

The banlist is a list of files that shall not be used for training. It is useful when downloading and generating large batches of data: some files may be corrupt, badly formatted, or otherwise unusable. In most cases, the program automatically skips such a file and continues with the next one; if a bad file is not skipped, however, it may cause a crash (although this is unlikely). The banlist prevents the program from using those files at all.

To use the banlist, create a file named banlist in the root folder of the project (or in the /app folder, if you are using Docker). The file must contain a list of file names, one per line. The program will automatically skip the files listed in the banlist. File paths must be specified from the root of the data/gcd folder. At the moment, skipping power curves from the spec2008 dataset is not supported.
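
A banlist might look like this (the entries below are purely illustrative; actual names depend on how your gcd data is organised):

machine-0001.csv
machine-0042.csv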

Real-time predictions

The program can be used to make real-time predictions. To do so, run python3 src/realtime.py. The program connects to the FLUIDOS Telemetry service, uses the retrieved data to make predictions, and then starts a Flask server that serves the predictions as a JSON object.

Arguments

  • --model, -m: Specifies the model name. If unspecified, the user will be prompted to enter it. Default is None.
  • --telemetry, -t: Specifies the FLUIDOS Telemetry endpoint (i.e. http://localhost:46405/metrics). This argument is mutually exclusive with --debug.
  • --debug, -d: Enables debug mode, which uses CPU and memory data from files. This argument is mutually exclusive with --telemetry.
    • --cpufile: Specifies the CPU data file. This argument is required in debug mode. Default is None.
    • --memfile: Specifies the memory data file. This argument is required in debug mode. Default is None.
  • --output_port, -p: Specifies the output port for the Flask server. Default is 5000.
  • --truncate: If this flag is set, the data will be truncated to the required length if it is longer. If the data is shorter, keys will be aggressively deleted. This flag is used to automate the process of data parsing in the case of misaligned data.
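
For example, python3 src/realtime.py -m model1 -t http://localhost:46405/metrics would use the given telemetry endpoint and serve predictions on the default port 5000, while python3 src/realtime.py -m model1 -d --cpufile cpu.prom --memfile mem.prom would run in debug mode against local files (the model and file names here are illustrative).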

Predictions

The program works similarly to the training program. It automatically loads the model from the models folder, but this time, it refuses to run if the model is not found. It then connects to the FLUIDOS Telemetry service (or, if in debug mode, expects CPU and memory data files in Prometheus format) and uses the data to make predictions. The same power curve used for training is reused, and is expected to be found in the model folder.
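
A minimal sketch of a consumer, assuming the Flask server runs on the default port 5000 (the exact route served by realtime.py is not documented here and may differ):

import requests

# Hypothetical endpoint: adjust the path to whatever route the server exposes.
response = requests.get("http://localhost:5000/")
response.raise_for_status()
print(response.json())  # the predictions, served as a JSON object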

License

This project is licensed under the Apache 2.0 License - see the LICENCE file for details.

Acknowledgments

This project was developed as part of the FLUIDOS project, funded by the European Union's Horizon Europe research and innovation programme under grant agreement 101070473 - HORIZON-CL4-2021-DATA-01. It is an integral part of Work Package 6 of the project, which aims to define an energy- and carbon-aware computation model that can shift loads both in time and geography; devise cost-effective infrastructure optimisations for industrial environments; and use Artificial Intelligence and Machine Learning methods for performance prediction and enhancement.