#Building a Machine Learning Application with AWS Lambda
This example builds a machine learning application using AWS Lambda, an Amazon service that automatically manages compute resources for request-driven code. Lambda simplifies scaling microservices and eliminates the need to provision or manage servers. The front-end of the application is a web browser; the back-end is a Lambda function whose components include a function handler, Jython code for feature munging, and an H2O model POJO. The front-end and back-end communicate via a REST endpoint.
The application classifies domain names as legitimate or malicious. Malicious domains earn their label by engaging in malicious activity, such as botnets, phishing, and malware hosting. To evade security systems, attackers use domain names that are generated by algorithms. To detect domains that may be malicious, the app builds a model based on linguistic features that distinguish regular domains from algorithmically generated ones.
Legitimate domains | Malicious domains |
---|---|
h2o | zyxgifnjobqhzptuodmzov |
zen-cart | c3p4j7zdxexg1f2tuzk117wyzn |
fedoraforum | batdtrbtrikw |
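
The text does not spell out how the linguistic features are computed. The field names in the example response later in this document (`length`, `entropy`, `proVowels`, `numWords`) and the `words.txt` resource suggest four features, so the following is a minimal Python sketch under that assumption. The function names, the substring-matching rule, and the `min_length` cutoff are hypothetical and are not taken from the repo's `pymodule.py`.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def proportion_vowels(text):
    """Fraction of characters in text that are vowels."""
    if not text:
        return 0.0
    return sum(ch in VOWELS for ch in text) / len(text)


def count_words(text, vocabulary, min_length=3):
    """Count vocabulary words (assumed rule: substring match) found in text."""
    return sum(1 for word in vocabulary
               if len(word) >= min_length and word in text)


def domain_features(domain, vocabulary):
    """Return the four assumed features for one domain name."""
    name = domain.lower()
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "proVowels": proportion_vowels(name),
        "numWords": count_words(name, vocabulary),
    }
```

Algorithmically generated names such as `zyxgifnjobqhzptuodmzov` tend to be long, to have a flat character distribution, and to contain few dictionary words, which is the kind of signal these features are meant to capture.
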
The ["Make Data Products" presentation][] given at the Silicon Valley Big Data Science meetup on March 17, 2016 references this repo.

["Make Data Products" presentation]: https://github.com/h2oai/h2o-tutorials/tree/master/tutorials/aws-lambda-app
Data | Offline | Front-end | Back-end |
---|---|---|---|
legit-dga_domains.csv | build.gradle | src/main/webapp/index.html | lib/h2o-genmodel.jar (downloaded) |
src/main/resources/words.txt | h2o-model.py | src/main/webapp/app.js | lib/aws-lambda-java-core-1.0.0.jar |
| | | | lib/jython-standalone-2.7.0.jar |
| | | | src/main/java/Classify.java |
| | | | src/main/java/MaliciousDomainModel.java (generated) |
| | | | src/main/resources/pymodule.py |
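Create the Gradle wrapper:
$ gradle wrapper
Install the H2O Python package if it is not already installed:
http://www.h2o.ai/download/h2o/python
Build the project:
$ ./gradlew build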
4.2 Click "Get Started Now", or if you have created functions already, click "Create a Lambda function".
Click the Upload button and select app-malicious-domains/build/distributions/app-malicious-domains.zip in the file selector.
In the Role field, select "*Basic execution role". In the new tab click "Allow" on the bottom right.
Click "Create function" on the bottom right. If this step fails, upload app-malicious-domains.zip to S3, click "Previous", and then provide the S3 link URL at "Upload a .ZIP from Amazon S3".
Enter the domain name to be classified in JSON format, for example {"domain":"plzdonthackmekthxbye"}, and click "Save and test". The execution results near the bottom of the page should display "succeeded" and show a JSON response. If an error message reports that the task timed out, click "Advanced settings" and increase the Timeout field.
Write down the API endpoint URL that now appears in the API endpoint tab. It will be needed for step 6.1.
$ ./gradlew jettyRunWar -x generateModel
(If you don't include the -x generateModel above, you will build the models and deployment package again, which is time consuming.)
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.493541945983:

Act/Pred | 0 | 1 | Error | Rate |
---|---|---|---|---|
0 | 15889 | 315 | 0.0194 | (315.0/16204.0) |
1 | 346 | 10043 | 0.0333 | (346.0/10389.0) |
Total | 16235 | 10358 | 0.0249 | (661.0/26593.0) |
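
As a worked check, the error rates in the confusion matrix can be recomputed from its four counts, along with the precision, recall, and F1 score at this threshold. This is a small sketch; the variable names are illustrative, and class 1 (malicious) is treated as the positive class.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 15889, 315    # actual 0: predicted 0, predicted 1
fn, tp = 346, 10043    # actual 1: predicted 0, predicted 1

# Per-class and overall error rates, as reported in the Error column.
false_positive_rate = fp / (tn + fp)               # 315 / 16204
false_negative_rate = fn / (fn + tp)               # 346 / 10389
overall_error = (fp + fn) / (tn + fp + fn + tp)    # 661 / 26593

# Precision, recall, and F1 for the malicious class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"error rates: {false_positive_rate:.4f} "
      f"{false_negative_rate:.4f} {overall_error:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The three error rates round to 0.0194, 0.0333, and 0.0249, matching the table.
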
$ curl -X POST -d "{\"domain\":\"plzdonthackmekthxbye\"}" <api_endpoint_url>
{
"label": 1,
"class0Prob": 0.002564083122440164,
"class1Prob": 0.9974359168775598,
"intercept": -14.94132841574946,
"length": 29.841565204329598,
"entropy": 11.178560649883826,
"proVowels": -1.7679609134401084,
"numWords": -18.347249579636706
}
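
The same request can also be sent from Python using only the standard library. This is a sketch, not part of the repo: `build_request` and `classify_domain` are hypothetical helper names, the endpoint URL must be the one noted during deployment, and the explicit `Content-Type: application/json` header is an assumption (the curl command above does not set it).

```python
import json
import urllib.request


def build_request(api_endpoint_url, domain):
    """Build a POST request carrying the same JSON body as the curl command."""
    body = json.dumps({"domain": domain}).encode("utf-8")
    return urllib.request.Request(
        api_endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def classify_domain(api_endpoint_url, domain, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(api_endpoint_url, domain)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a deployed endpoint; replace the placeholder URL):
# result = classify_domain("<api_endpoint_url>", "plzdonthackmekthxbye")
# print(result["label"], result["class1Prob"])
```
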
- A label of 1 means the domain is predicted malicious.
- A label of 0 means the domain is predicted legitimate (not malicious).
- class1Prob is 0.997. This is the probability a domain is malicious.
- The threshold, approximately 0.5, is chosen to maximize the F1 score.
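
The remaining fields (`intercept`, `length`, `entropy`, `proVowels`, `numWords`) are not explained in the text. Summing them and applying the logistic function reproduces `class1Prob`, which suggests they are per-term contributions to a logistic model's linear predictor. That reading is an inference from the numbers, not something the source states. The sketch below performs the check; comparing with `>=` against the threshold is also an assumption.

```python
import math

# Response fields copied from the example above.
response = {
    "label": 1,
    "class0Prob": 0.002564083122440164,
    "class1Prob": 0.9974359168775598,
    "intercept": -14.94132841574946,
    "length": 29.841565204329598,
    "entropy": 11.178560649883826,
    "proVowels": -1.7679609134401084,
    "numWords": -18.347249579636706,
}

CONTRIBUTION_KEYS = ("intercept", "length", "entropy", "proVowels", "numWords")
THRESHOLD = 0.493541945983  # max-F1 threshold reported with the confusion matrix

# Sum the assumed contributions and map the result through the logistic function.
linear_predictor = sum(response[key] for key in CONTRIBUTION_KEYS)
class1_prob = 1.0 / (1.0 + math.exp(-linear_predictor))
label = 1 if class1_prob >= THRESHOLD else 0

print(f"linear predictor = {linear_predictor:.4f}")
print(f"recomputed class1Prob = {class1_prob:.6f}, label = {label}")
```

Under this reading, the positive `length` and `entropy` terms push this domain toward the malicious class, while the negative `numWords` and `proVowels` terms pull it back toward legitimate.
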
Check whether the function already exists and, if not, try again. For slower internet connections, try uploading the .zip file with an S3 link in the Code tab.
In the AWS Lambda console, click the Configuration tab. Click Advanced settings and increase the timeout field.
This is due to Lambda's cold start. Keep submitting domain names; within a minute, the web app should become responsive.
Performance was tested with JMeter on a MacBook Pro with a 2.5 GHz Intel Core i7, on a wireless internet connection over the office WAN. Before testing, a warm-up cycle of 100 loops was run. Times are in milliseconds. The body data of the POST request was {"domain":"plzdonthackmekthxbye"}.
Memory (MB) | Threads | Loops | Samples | Average | Median | 90% | 95% | 99% | Min | Max | Error % | Throughput (calls/sec) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
512 | 1 | 10000 | 10000 | 113 | 102 | 118 | 138 | 426 | 85 | 2137 | 0 | 8.4 |
512 | 10 | 1000 | 10000 | 170 | 102 | 148 | 182 | 334 | 85 | 30330 | 0.18 | 44 |
512 | 100 | 100 | 10000 | 392 | 149 | 643 | 943 | 1738 | 85 | 30307 | 0.43 | 168 |
###Gradle
The gradle distribution shows how to do basic war and jetty plugin operations.
- https://services.gradle.org/distributions/gradle-2.7-all.zip
- unzip gradle-2.7-all
- cd gradle-2.7/samples/webApplication/customized
###AWS Lambda
http://docs.aws.amazon.com/lambda/latest/dg/create-deployment-pkg-zip-java.html
###Data Sources
- legit-dga_domains.csv (Available at http://datadrivensecurity.info/blog/data/2014/10/legit-dga_domains.csv.zip)
- src/main/resources/words.txt (Available at https://raw.githubusercontent.com/dwyl/english-words/master/words.txt)