This repository will not be updated. The repository will be kept available in read-only mode.
In this Code Pattern, we will build an app that classifies email, labeling each message as "Phishing", "Spam", or, if it does not appear suspicious, "Ham". We'll use IBM Watson Natural Language Classifier (NLC) to train a model on email examples from an EDRM Enron email dataset. Please note that this data is free for non-commercial use only; explicit permission must be obtained for any other use. The custom NLC model can be built quickly in the Web UI, deployed into our Node.js app using the Watson Developer Cloud Node.js SDK, and then run from a browser.
When the reader has completed this Code Pattern, they will understand how to:
- Build a Watson Natural Language Classifier model using the Web UI.
- Create a Node.js app that uses the NLC model to classify emails as Phishing, Spam, or Ham.
- Use the Watson Developer Cloud SDK for Node.js.
- User interacts with Natural Language Classifier (NLC) GUI to train the model.
- EDRM data is loaded to the NLC service to provide sample emails for training.
- User sends email text to the application to have it classified.
- App uses Watson Natural Language Classifier to determine if text is phishing, spam, or ham.
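The last step of the flow, turning the classifier's answer into a phishing, spam, or ham label, can be sketched as follows. This is a minimal illustration rather than the app's actual code: `summarizeClassification` and the 0.5 threshold are hypothetical, the sample confidences are made up, and the class names are assumed to match the labels used in the training data. The `classes`, `class_name`, and `confidence` fields follow the shape of an NLC classify response.

```javascript
// Hypothetical helper: reduce an NLC classify response to a single label.
// The default threshold is an illustrative assumption, not a value from the app.
function summarizeClassification(response, threshold = 0.5) {
  const classes = Array.isArray(response.classes) ? response.classes : [];
  if (classes.length === 0) {
    return { label: null, confidence: 0, confident: false };
  }
  // Pick the highest-confidence class instead of relying on array order.
  const best = classes.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return {
    label: best.class_name,
    confidence: best.confidence,
    confident: best.confidence >= threshold,
  };
}

// Made-up response in the shape returned by the classify call.
const sampleResponse = {
  classifier_id: '<add_ModelID>',
  text: 'Can you please send your password?',
  top_class: 'Phishing',
  classes: [
    { class_name: 'Phishing', confidence: 0.81 },
    { class_name: 'Spam', confidence: 0.12 },
    { class_name: 'Ham', confidence: 0.07 },
  ],
};

console.log(summarizeClassification(sampleResponse));
```

Exposing a `confident` flag lets the UI show low-confidence results differently instead of presenting every prediction as certain.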
- Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
- Watson Natural Language Classifier: An IBM Cloud service to interpret and classify natural language with confidence.
- Node.js: An open-source JavaScript run-time environment for executing server-side JavaScript code.
1. Clone the repo
2. Create IBM Cloud services
3. Create a Watson Studio project
4. Train the NLC model
5. Run the application
Clone the nlc-email-phishing repo locally. In a terminal, run:

    git clone https://github.com/IBM/nlc-email-phishing.git
Create the following service:
- Log into IBM's Watson Studio. Once in, you'll land on the dashboard.
- Create a new project by clicking + New project and choosing Data Science.
- Enter a name for the project and click Create.
- NOTE: By creating a project in Watson Studio, a free-tier Object Storage service and a Watson Machine Learning service will be created in your IBM Cloud account. Select the Free storage type to avoid fees.
- Upon successful project creation, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
The data used in this example comes from an EDRM Enron email dataset. A cleaned version is available in the repo under data/Email-trainingdata-20k.csv. We'll now train an NLC model using this data.
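Before uploading, it can help to check how the training examples are spread across classes. The sketch below assumes the CSV follows the usual NLC layout, with the email text in the first column and the class label in the last. That layout, the `summarizeTrainingData` helper, and the 1024-character text limit are assumptions to verify against the actual file and the NLC documentation. The sample rows are made up.

```javascript
// Minimal CSV parser that handles quoted fields, embedded commas,
// doubled quotes, and both \n and \r\n line endings.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++;
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
  }
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Hypothetical helper: count examples per class and flag overly long texts.
// Assumes text in the first column and the label in the last column.
function summarizeTrainingData(csvText, maxTextLength = 1024) {
  const counts = {};
  let tooLong = 0;
  for (const row of parseCsv(csvText)) {
    if (row.length < 2) continue; // skip blank or unlabeled rows
    const label = row[row.length - 1].trim();
    counts[label] = (counts[label] || 0) + 1;
    if (row[0].length > maxTextLength) tooLong++;
  }
  return { counts, tooLong };
}

// Made-up rows for illustration only.
const sampleCsv = [
  '"Can you please send your password?",Phishing',
  '"Lunch at noon, as usual?",Ham',
  '"You have won a ""free"" cruise",Spam',
  'Please verify your account now,Phishing',
].join('\n');

console.log(summarizeTrainingData(sampleCsv));
```

To run it against the real data, read the file with `fs.readFileSync('data/Email-trainingdata-20k.csv', 'utf8')` and pass the result to `summarizeTrainingData`.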
- From the new project Overview panel, click + Add to project on the top right and choose the Natural Language Classifier asset type.
- A new instance of the NLC tool will launch.
- Add the data to your project by clicking the Browse button in the right-hand Upload to project section and browsing to the cloned repo. Choose data/Email-trainingdata-20k.csv.
- Drag and drop the Email-trainingdata-20k.csv file you uploaded to the Create a Class box.
- Click the Train model button to begin training. The model will take around an hour to train.
- To check the status of the model, and to access it after it trains, go to the Models section of the Assets tab in your project. The model will show up when it is ready. Double-click it to see the Overview tab.
- The first line of the Overview tab contains the Model ID. Note this value; we'll need it in the next step.
- Click the Test tab and enter a phrase from an email to test the classifier. For example, "Can you please send your password?" is classified with 0.81 confidence as Phishing.
- Click the Implementation tab to see how to use the classifier with Curl, Java, Node, or Python.
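As a rough idea of what such a call looks like from Node, the sketch below builds a request for the NLC v1 classify endpoint and sends it with the `fetch` built into Node 18 and later. It deliberately bypasses the Watson SDK to keep the example dependency-free. The helper names `buildClassifyRequest` and `classifyEmail` are hypothetical, and the sketch assumes API-key credentials sent as basic auth with the user name `apikey`. The Implementation tab remains the authoritative reference for your service instance.

```javascript
// Hypothetical helper: build the URL and fetch options for a classify call.
// Assumes IAM API-key credentials, sent as basic auth with user "apikey".
function buildClassifyRequest({ url, apikey, classifierId, text }) {
  const base = url.replace(/\/+$/, ''); // drop trailing slashes
  const endpoint = `${base}/v1/classifiers/${encodeURIComponent(classifierId)}/classify`;
  const auth = Buffer.from(`apikey:${apikey}`).toString('base64');
  return {
    endpoint,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${auth}`,
      },
      body: JSON.stringify({ text }),
    },
  };
}

// Sends the request and returns the parsed JSON response.
// Requires Node 18+ for the global fetch function.
async function classifyEmail(config, text) {
  const { endpoint, options } = buildClassifyRequest({ ...config, text });
  const res = await fetch(endpoint, options);
  if (!res.ok) {
    throw new Error(`NLC request failed with status ${res.status}`);
  }
  return res.json();
}
```

A call such as `classifyEmail({ url, apikey, classifierId }, 'Can you please send your password?')` would then resolve to the classification response, with `classifierId` set to the Model ID noted earlier.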
Follow the steps below for deploying the application:
- Press the Deploy to IBM Cloud button below.
- From the IBM Cloud deployment page, click the Deploy button.
- From the Toolchains menu, click the Delivery Pipeline to watch while the app is deployed. Once deployed, the app can be viewed by clicking View app.
- The app and service can be viewed in the IBM Cloud dashboard. The app will be named nlc-email-phishing, with a unique suffix.
- We now need to add a few environment variables to the application's runtime so the right classifier service and model are used. Click on the application from the dashboard to view its settings.
- Once viewing the application, click the Runtime option on the menu and navigate to the Environment Variables section.
- Update the CLASSIFIER_ID, NATURAL_LANGUAGE_CLASSIFIER_USERNAME, and NATURAL_LANGUAGE_CLASSIFIER_PASSWORD variables with your Model ID from Step 4 and NLC service credentials from Step 2. Click Save.
- After saving the environment variables, the app will restart. After the app restarts, you can access it by clicking the Visit App URL button.
- In the root of the project, create a file named .env. A sample is provided and a snippet is shown below.

      # Replace the credentials here with your own.
      CLASSIFIER_ID=<add_ModelID>
      NATURAL_LANGUAGE_CLASSIFIER_APIKEY=<add_API_key>
      NATURAL_LANGUAGE_CLASSIFIER_URL=<add_NLC_url>

- Update the CLASSIFIER_ID, NATURAL_LANGUAGE_CLASSIFIER_APIKEY, and NATURAL_LANGUAGE_CLASSIFIER_URL variables with your Model ID from Step 4 and NLC service credentials from Step 2.
- Ensure Node.js is installed.
- Install the app dependencies by running:

      npm install

- Start the app by running:

      npm start

- Open a browser and point to localhost:3000.
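A common reason the local app fails to classify anything is a .env file that still contains the sample placeholders. The sketch below shows one way to catch that at startup. It assumes the .env values have already been loaded into `process.env` (for example by a package such as dotenv; how the app itself does this is not shown here), and `findMissingVars` is a hypothetical helper, not part of the repo.

```javascript
// Variables named in the .env snippet above.
const REQUIRED_VARS = [
  'CLASSIFIER_ID',
  'NATURAL_LANGUAGE_CLASSIFIER_APIKEY',
  'NATURAL_LANGUAGE_CLASSIFIER_URL',
];

// Hypothetical helper: return the names of variables that are unset,
// empty, or still hold an untouched "<add_...>" placeholder.
function findMissingVars(env, required = REQUIRED_VARS) {
  return required.filter((name) => {
    const value = (env[name] || '').trim();
    return value === '' || /^<.*>$/.test(value);
  });
}

const missing = findMissingVars(process.env);
if (missing.length > 0) {
  console.warn(`Missing or placeholder values: ${missing.join(', ')}`);
}
```

Running a check like this before `npm start` serves requests gives a clearer message than a failed authentication call later on.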
- Artificial Intelligence Code Patterns: Enjoyed this Code Pattern? Check out our other AI Code Patterns.
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns.
- AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos.
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.