The purpose of this project is to capture the ACloud Guru modelling lab.
To use these scripts, we will need to set the following environment variables:
s3_bucket
: Name of the S3 bucket to put contents in

s3_region
: Region of the S3 bucket to put results in

sm_role
: Name of the role for SageMaker to use
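For example, a script might read these at startup (a minimal sketch; the names match the list above):

```python
import os

# Required configuration, as documented above; a KeyError here means
# the corresponding environment variable has not been set.
S3_BUCKET = os.environ["s3_bucket"]
S3_REGION = os.environ["s3_region"]
SM_ROLE = os.environ["sm_role"]
```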
Within this lab, we wish to:
- Explore and examine the dataset, to get it ready for analysis
- Apply transformations and upload the data to S3, where it is available for modelling
- Apply a suitable algorithm via SageMaker
- Interpret the results
Initial exploration of the dataset is to take place in a JupyterLab environment.
We are interested in determining:
- What columns are relevant to our analysis?
- What filtering and cleaning will need to take place?
- What is the distribution of the key variables?
- What does the distribution of points around the globe look like?
The end goal of this process is to define a "spec" for a data-cleaning tool.
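As a rough sketch of this exploration in pandas (the file path and the `value`, `lat`, and `lon` column names are placeholders, not the actual schema):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder path for the raw dataset.
df = pd.read_csv("data/raw.csv")

# Which columns are relevant, and how complete are they?
df.info()
print(df.describe())

# Distribution of a key variable ("value" is a placeholder column name).
df["value"].hist(bins=50)
plt.show()

# Distribution of points around the globe: a lat/lon scatter as a
# crude map ("lat" and "lon" are assumed column names).
df.plot.scatter(x="lon", y="lat", s=1, alpha=0.3)
plt.show()
```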
In this section, we will develop a pipeline to:
- Apply transformations (as determined in the previous step)
- Create an output dataset that can be uploaded to S3
  - Including a unique name for traceability
- Upload it to S3
This could be executed either in a notebook or as independent command-line scripts. I tend towards the latter.
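A minimal sketch of such a script (the input path and the transformation itself are placeholders; the real cleaning steps come from the spec defined above):

```python
import os
import uuid

import boto3
import pandas as pd

# Placeholder input path; the real cleaning follows the exploration spec.
INPUT_PATH = "data/raw.csv"


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the agreed transformations (placeholder: drop missing rows)."""
    return df.dropna()


def main() -> None:
    df = transform(pd.read_csv(INPUT_PATH))

    # A unique object key gives each run traceability.
    key = f"clean/dataset-{uuid.uuid4().hex}.csv"
    local_path = "/tmp/clean.csv"
    # Headerless numeric CSV keeps downstream loading simple.
    df.to_csv(local_path, index=False, header=False)

    s3 = boto3.client("s3", region_name=os.environ["s3_region"])
    s3.upload_file(local_path, os.environ["s3_bucket"], key)
    print(f"uploaded s3://{os.environ['s3_bucket']}/{key}")


if __name__ == "__main__":
    main()
```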
Here, we will:
- Take the input dataset we have generated in the previous step
- Apply a k-means algorithm to it
- Retrieve the results
We will take the results back and plot them.
This can take place in the Jupyter environment.
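A sketch of this step with the SageMaker Python SDK's built-in `KMeans` estimator (the instance types, cluster count, and local file name are placeholders):

```python
import os

import numpy as np
from sagemaker import KMeans

# Placeholder: the cleaned dataset from the previous step, as a
# float32 array of shape (num_points, num_features).
train_data = np.loadtxt("clean.csv", delimiter=",", dtype="float32")

kmeans = KMeans(
    role=os.environ["sm_role"],
    instance_count=1,
    instance_type="ml.m5.large",  # placeholder instance type
    output_path=f"s3://{os.environ['s3_bucket']}/model-output",
    k=10,  # placeholder cluster count
)

# record_set wraps the array in the RecordIO-protobuf format that
# the built-in algorithm consumes.
kmeans.fit(kmeans.record_set(train_data))

# Deploy an endpoint and fetch cluster assignments for plotting.
predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m5.large")
results = predictor.predict(train_data)
labels = [r.label["closest_cluster"].float32_tensor.values[0] for r in results]
```

The returned labels can then be joined back onto the original dataframe, for example to colour the lat/lon scatter from the exploration step by cluster.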
Documentation for the K-means algorithm is available in the SageMaker developer guide.
Some basic requirements for the model are:
- Tabular data
- Of continuous variables
- Where the *n* features correspond to an *n*-dimensional space in which to group the points
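As an illustration, selecting the continuous columns of a dataframe yields exactly such an *n*-dimensional point set (the path is a placeholder):

```python
import pandas as pd

df = pd.read_csv("clean.csv")  # placeholder path

# Keep only continuous (numeric) columns; each of the n columns is
# one dimension of the space the points are clustered in.
points = df.select_dtypes(include="number").astype("float32").to_numpy()
print(points.shape)  # (num_points, n)
```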