Run a Kedro project in a Docker environment


  • Docker
  • Kedro 0.16.6
  • Kedro-Docker 0.2.1
  • scikit-learn 0.23.0
  • pickle 0.0.11


  • Read data from CSV and Excel files, pre-process it, and save the results as CSV files
  • Split the data and save the splits as pickle files
  • Read the pickle files, train a regression model, and save the model in pickle format
  • Load the pickled regression model and run predictions with it
  • Set the AWS credentials in conf/*/credentials.yml:
            aws_access_key_id: token
            aws_secret_access_key: key
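The split, train, save and predict steps listed above can be sketched in plain Python. This is only an illustration using the standard library: the project itself uses Kedro nodes and scikit-learn, and every function name, file name and the one-variable least-squares model here are assumptions made for the example.

```python
import os
import pickle
import random
import tempfile


def split_data(rows, test_ratio=0.2, seed=0):
    """Shuffle (x, y) rows and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train_model(train_rows):
    """Fit y = slope * x + intercept by ordinary least squares.

    Stand-in for the scikit-learn regression used in the project.
    """
    xs = [x for x, _ in train_rows]
    ys = [y for _, y in train_rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train_rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, xs):
    """Apply the fitted line to new x values."""
    return [model["slope"] * x + model["intercept"] for x in xs]


def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run_pipeline(rows, workdir):
    """Run split -> pickle -> train -> pickle -> load -> predict."""
    train_path = os.path.join(workdir, "train.pkl")  # hypothetical file names
    test_path = os.path.join(workdir, "test.pkl")
    model_path = os.path.join(workdir, "regressor.pkl")

    train_rows, test_rows = split_data(rows)
    save_pickle(train_rows, train_path)
    save_pickle(test_rows, test_path)

    save_pickle(train_model(load_pickle(train_path)), model_path)

    model = load_pickle(model_path)
    test_rows = load_pickle(test_path)
    return predict(model, [x for x, _ in test_rows]), test_rows
```

Calling `run_pipeline(rows, tempfile.mkdtemp())` with a list of `(x, y)` tuples returns the predictions together with the held-out rows, so the two can be compared.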


  • Set up the Kedro environment

  • Install Kedro Docker

    pip install kedro-docker==0.2.1
  • Generate a Dockerfile

    example$ kedro docker init
  • Build a Docker image

    example$ kedro docker build

    This creates a Docker image named example:latest.

  • Run a Kedro project in a Docker Environment

    example$ kedro docker run
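If the three `kedro docker` commands above need to be scripted, for example in a CI job, a small Python wrapper is one option. This is a sketch only: the helper name `run_steps` and the injectable `runner` parameter are assumptions made for the example, and the commands are exactly the ones listed above.

```python
import subprocess

# The kedro-docker commands from the steps above, in order.
KEDRO_DOCKER_STEPS = [
    ["kedro", "docker", "init"],   # generate the Dockerfile
    ["kedro", "docker", "build"],  # build the image
    ["kedro", "docker", "run"],    # run the project inside the container
]


def run_steps(steps, cwd, runner=subprocess.run):
    """Run each command in `cwd`, stopping at the first failure.

    `runner` defaults to subprocess.run; passing another callable makes
    the helper testable without Docker installed.
    """
    for cmd in steps:
        runner(cmd, cwd=cwd, check=True)
```

From the directory that contains the project, `run_steps(KEDRO_DOCKER_STEPS, cwd="example")` would execute the steps in sequence and raise `subprocess.CalledProcessError` if any of them exits with a non-zero status.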

Usage (in Docker) with code from GitHub (no local Kedro environment required)

  • Download the code from GitHub

  • Build a Docker image

    $ cd example
    $ docker build --tag=kedro-docker .
  • Run the Docker image

    $ docker run -it kedro-docker bash
  • Run the Kedro project inside the container

    $ kedro run


kedro@b792338c69b3:~$ kedro run
INFO - Model has a coefficient R^2 of 0.456.
2020-12-28 07:28:11,411 - kedro.runner.sequential_runner -
INFO - Completed 6 out of 6 tasks


  • Could not load Excel Data Set

    kedro.io.core.DataSetError: Failed while loading data from data set ExcelDataSet
    load_args={'engine': xlrd}, protocol=file, save_args={'index': False},
    writer_args={'engine': xlsxwriter}).
    Excel xlsx file; not supported

    Fixed: pin xlrd==1.2.0 (xlrd 2.0 and later no longer read .xlsx files).
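The reason the pin works can be sketched as a small engine-selection rule. This is illustrative only: the function `pick_excel_engine` is hypothetical, and falling back to `openpyxl` (an engine pandas also accepts for .xlsx files) is an alternative that this project does not use.

```python
def pick_excel_engine(filename, xlrd_version):
    """Return a pandas Excel engine name that can read `filename`.

    xlrd 2.0+ only reads legacy .xls files, so .xlsx needs either an
    older xlrd (the fix used here) or a different engine (assumption).
    """
    major = int(xlrd_version.split(".")[0])
    is_xlsx = filename.lower().endswith(".xlsx")
    if is_xlsx and major >= 2:
        return "openpyxl"
    return "xlrd"
```

With xlrd pinned to 1.2.0 the function keeps returning `"xlrd"` for .xlsx files, which matches the `load_args={'engine': xlrd}` setting shown in the error above.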

  • Could not load data from AWS S3

    server_1  |     ds_name, ds_config, load_versions.get(ds_name), save_version
    webserver_1  |   File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py",
    line 185, in from_config
    webserver_1  |     ) from err
    webserver_1  | kedro.io.core.DataSetError:
    webserver_1  | get_session() got an unexpected keyword argument 'aws_access_key_id'.
    webserver_1  | DataSet 'companies' must only contain arguments valid for the co
    File "/usr/local/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall
        res = hook_impl.function(*args)
    File "/home/kedro/src/example/hooks.py", line 78, in register_catalog
        catalog, credentials, load_versions, save_version, journal
    File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 328, in from_config
        ds_name, ds_config, load_versions.get(ds_name), save_version
    File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", line 185, in from_config
        ) from err
    create_client() got multiple values for keyword argument 'aws_access_key_id'.
    DataSet 'companies' must only contain arguments valid for
    the constructor of `kedro.extras.datasets.pandas.csv_dataset.CSVDataSet`.

    Fixed: install s3fs==0.4.0 and update credentials.yml as follows (see https://discourse.kedro.community/t/how-do-i-pass-s3-credentials-to-my-datasets/156):

            aws_access_key_id: access_key
            aws_secret_access_key: secret_key
