This code pattern is part of a series on the Red Hat Marketplace operator FP Predict Plus. Please refer to the Prerequisites section for getting started.
Machine learning is the science of applying algorithms so that a system learns patterns within data in order to predict values. Most commonly, this means synthesizing useful concepts from historical data. Machine learning encompasses different types of learning, ranging from whole fields of study to specific techniques.
Classification in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it and makes new observations or classifications. Different types of classification include binary classification, multiclass classification, and multilabel classification. In this tutorial, we focus on binary classification, but you can extend the methodology for other types of classification.
Examples of binary classification problems include:
- Classifying an email as spam or not.
- Classifying recent user behavior as churn or not.
- Classifying recent transactions as fraudulent or not.
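Conceptually, a binary classifier learns a decision rule from labeled historical data and applies it to new records. The following pure-Python sketch illustrates the idea with the simplest possible model, a single learned threshold on transaction amount; the data and the rule are entirely made up for illustration and have nothing to do with how FP Predict Plus builds its models internally.

```python
def fit_threshold(amounts, labels):
    """Pick the amount threshold that maximizes training accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(amounts)):
        acc = sum((a >= t) == bool(y) for a, y in zip(amounts, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Made-up historical transactions: 1 = fraudulent, 0 = legitimate
train_amounts = [20, 35, 50, 900, 1200, 1500]
train_labels  = [0, 0, 0, 1, 1, 1]

t = fit_threshold(train_amounts, train_labels)
print([int(a >= t) for a in [25, 1000]])  # classify two new transactions → [0, 1]
```

Real classifiers learn far richer rules over many variables, but the shape is the same: fit on labeled history, then score unseen records.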
Specifically, this code pattern focuses on predicting fraudulent transactions using historical data and demonstrates the automated process of building models using the FP Predict Plus operator from Red Hat Marketplace.
After completing this code pattern, you will understand how to:
- Quickly set up the instance on an OpenShift cluster for model building.
- Ingest the data and initiate the FP Predict Plus process.
- Build different models using FP Predict Plus and evaluate the performance of those models.
- Choose the best model and complete the deployment.
- Generate new predictions using the deployed model.
- User logs into the FP Predict Plus platform using an instance of the FP Predict Plus operator.
- User uploads the data file in the CSV format to the Kubernetes storage on the Red Hat OpenShift platform.
- User initiates the model-building process using the FP Predict Plus operator on an OpenShift cluster and creates pipelines.
- User evaluates different pipelines from FP Predict Plus and selects the best model for deployment.
- User generates accurate predictions by using the deployed model.
In this code pattern, we show you how to use the FP Predict Plus operator from Red Hat Marketplace in our steps. Please refer to the content under the Included components section to learn more about the FP Predict Plus operator and Red Hat Marketplace.
You need to install and set up the FP Predict Plus operator on an OpenShift cluster following the instructions in this tutorial: Get started using Findability Platform Predict Plus on Red Hat Marketplace.
- Red Hat Marketplace: A simpler way to buy and manage enterprise software, with automated deployment to any cloud
- FP Predict Plus: An automated, self-learning, multi-modeling AI tool that handles discrete target variables, continuous target variables, and time series data with no need for coding
- Red Hat OpenShift Container Platform: A hybrid cloud, enterprise container platform that empowers developers to innovate and ship faster
- Artificial intelligence: Any system that can mimic cognitive functions that humans associate with the human mind, such as learning and problem solving.
- Data science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
- Analytics: Analytics delivers the value of data for the enterprise.
- Machine learning: Machine learning is a form of AI that enables a system to learn from data rather than through explicit programming.
Follow these steps to set up and run this code pattern using FP Predict Plus.
- Add the data
- Create a job
- Review the job details
- Analyze results
- Download the model file
- Prediction using new data
- Create predict job
- Check job summary
- Analyze results of predict job
- Download predicted results
Launch the FP Predict Plus platform and sign in using the default credentials. Let's begin by adding the data sets. Clone this repo and navigate to the data folder to download the data sets onto your local file system.
Click Dataset Management, the third option on the left navigation pane, and select Datasets at the top.
Click on Browse and select the three CSV files for upload. It takes about 1 minute to upload the data sets to the platform, depending on the size of the data sets.
Note: Only the .csv format is supported, and the data set needs to have a column with unique values. In these CSV files, we added a Count column to serve as the unique identifier. The data sets need to be split into training, testing, and holdout (validation) data sets beforehand. Citation is needed to use these data sets for other projects.
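If your own data is not yet split, a small script can do the shuffling, add the unique Count column, and produce the three partitions. The following pure-Python sketch is hypothetical; the 80/10/10 ratio, the seed, and the field names are assumptions for illustration, not platform requirements beyond the unique-identifier column.

```python
import random

def split_with_count(rows, train=0.8, test=0.1, seed=42):
    """Add a unique Count column, shuffle, and split the rows into
    train/test/holdout partitions (holdout gets the remainder)."""
    rows = [dict(r, Count=i + 1) for i, r in enumerate(rows)]  # unique IDs
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

# Made-up records standing in for the transaction data
data = [{"Amount": a, "Fraud_Risk": int(a > 500)} for a in range(100, 700, 10)]
train_set, test_set, holdout_set = split_with_count(data)
print(len(train_set), len(test_set), len(holdout_set))  # → 48 6 6
```

Each partition can then be written out as its own CSV file before uploading.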
Now, let's create a new job in the platform. Click on Dashboard, which is the first option on the left navigation pane, and click Start on the top right hand side.
Select No to indicate that your data does not contain dates or timestamps; the platform will then create a Predict job. If the data set has a date or timestamp, the platform will create a Forecast job.
Now, let's go ahead and create a new job by filling in the details. Update the name, and select the Model + Predict task since you will do both. (If you select Model, the platform only builds a model; select Predict if you already have a model file ready for predictions.) Set the data set location to Cloud, and select the train and test data sets using the Browse function for upload. Set the Target Variable to Fraud_Risk and the Unique Identifier to Count. Under the Advanced Settings tab, set the Operation Mode to Automated. Select Run to start the job.
The job will take a couple of minutes to complete. You can observe how many models are created and the scenarios evaluated in the process.
The models will try different scenarios, for instance using all variables or only a few, to predict the outcome. You should see the job status as shown below.
Let's review the job details. Click the Dashboard tab and select the job named detect-fraud. The number that precedes the job name indicates how many jobs have run so far; you can ignore it.
In our example, the probability distribution of the model on the testing data covers 50 records, 43 of which have a probability between 90% and 100%. The model has performed with good accuracy on a relatively small data set.
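The probability distribution view simply buckets each scored record by its predicted probability. As a rough illustration of that grouping (with made-up scores, not the platform's internal logic):

```python
from collections import Counter

probs = [0.97, 0.93, 0.42, 0.88, 0.99, 0.91]  # hypothetical model scores
# Bucket into 10% bands; cap at 9 so a probability of 1.0 lands in the top band
bands = Counter(min(int(p * 10), 9) for p in probs)
for b in sorted(bands, reverse=True):
    print(f"{b * 10}%-{(b + 1) * 10}%: {bands[b]} records")
# → 90%-100%: 4 records, 80%-90%: 1 records, 40%-50%: 1 records
```

A heavy concentration in the top band, as in our example, indicates the model is confident about most of its predictions.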
You can observe the complete job details, including the description, modeling, and prediction. The system built 18 models using 505 records in 10 seconds from the training data and ran predictions on 50 records from the testing data.
To review the model performance in detail, select the Predicted vs Actual option to see the Confusion matrix. The model has achieved 86% "Overall Accuracy" which is a high percentage, especially considering the fact that the training data had only 505 records and the testing data had approximately 50 records. You can expect the accuracy to be higher for larger data sets. The average Precision rate is about 84%, and the average Recall rate is about 80%.
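These figures come straight from the confusion matrix. To make the relationship concrete, here is a pure-Python sketch that computes accuracy, precision, and recall for a small set of made-up labels (not the actual platform output):

```python
def confusion(actual, predicted):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth (1 = fraud)
predicted = [1, 0, 1, 0, 0, 1, 1, 0]  # made-up model output

tp, tn, fp, fn = confusion(actual, predicted)
accuracy  = (tp + tn) / len(actual)  # 0.75
precision = tp / (tp + fp)           # 0.75
recall    = tp / (tp + fn)           # 0.75
```

Precision answers "of the records flagged as fraud, how many really were?", while recall answers "of the actual frauds, how many did we catch?" — both matter for a fraud use case.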
Click Models to see how many predictions each model makes. Notice that model M-2 scored 29 records from the testing data.
Select Variables to understand the significance of each variable in predicting the outcome. Notice that five models used "Credit_History_Available" to predict the outcome, which makes it the most significant variable to impact the outcome.
Select Variables of Models to understand the different scenarios explored by the models. Observe that models M-1 and M-2 used four variables, whereas M-3 used three variables and M-7 used two.
To download all the model details for further analysis, click Download Files > Download Results > Download Model File. Save the files to your local system. The Results and Model File for this experiment are also available in this repo under the reports and model folders. You can upload the Model file to a cloud using the Dataset Management option as described earlier.
In this section, you learn how to make predictions on a new data set using the model. We use the saved model from the previous step to predict new records from the holdout data. In the holdout data file, the target variable column Fraud_Risk (without any values) needs to be present so that the headers match the training data and the platform can predict the outcome for the transaction data.
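One way to prepare such a file is to keep the Fraud_Risk column but blank out its values, so the holdout headers stay identical to the training headers. This is a hypothetical standard-library sketch, not a step the platform requires you to script:

```python
import csv
import io

def blank_target(csv_text, target="Fraud_Risk"):
    """Keep the target column but drop its values so that the
    holdout file's headers match the training data."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[target] = ""  # column stays, values go
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys(), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

sample = "Count,Amount,Fraud_Risk\n1,250,0\n2,900,1\n"
print(blank_target(sample))
```

The same function works on the real holdout file by reading it from disk and writing the result back out.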
To create a new job for prediction, select Dashboard in the left navigation pane and click Start. Update the job name and description. Under Tasks, select Predict since you already built the model in the previous steps. Upload the model file and holdout data from your cloud or local system (whichever is more convenient) and set the Unique Identifier to Count.
The predict job starts as shown below. You should get a message stating Job Completed Successfully within a minute or two.
To view the job summary, click Dashboard and select the predict-fraud job. Three of the five records have a 90%-100% probability for the predicted outcome.
We can get more details in the next step, where we observe that 18 models were built in 10 seconds and predictions were made on five records from the holdout data set.
Note: The Predicted vs Actual option is not clickable because there are no actual values to compare against; we generated predicted values from a set of input parameters. You can still review models, variables, and variables of models as part of the model evaluation.
To view comprehensive data about model performance, select Download Files > Download Results, as the following screenshot shows.
You can see the predicted results for the new data on the second tab, Prediction Result, under the Predicted Value attribute in the downloaded Excel file. The Results file named "predict-fraud-Report" is also available under the reports folder for reference.
With this, we have come to the end of this code pattern. We learned how to use the FP Predict Plus platform to build AI models using a classification technique and explored how to generate predictions on a new data set. This platform helps developers and data scientists quickly build AI solutions across different domains.
The data set referenced in this tutorial was prepared by R.K. Sharath Kumar, Data Scientist, IBM India Software Labs.