GPU Open Analytics Initiative

Accelerating the Scalable Data Science Environment with GPU-enabled Python

KDD'18 Hands-On Tutorial

Tuesday 8:30 am


Software / Hardware Requirements

The tutorial will use cloud resources that provide a common environment for all students.

Requirements:

  • Laptop with WiFi

    • We will be using the conference WiFi; please ensure that you can connect prior to the tutorial.
  • Web browser: the latest version of any browser will work, though Firefox or Chrome is preferred.

Tutorial Agenda

Introductions

  • Who we are

Getting Connected

  • Connect to Qwiklabs
  • Run the introduction notebook to validate your environment

Introduction and Background

  • Big Data Ecosystem
  • Challenges in Big Data today
  • Apache Arrow
  • GPUs for compute
  • The GPU Open Analytics Initiative
  • The GPU Data Frame (GDF)
  • Python library for GDF (PyGDF)

Hands-on: Data Loading and Manipulation

  • Lab 1: Data Loading and Manipulation

    • Traditional interface through Pandas
    • Pandas to/from PyGDF
    • Column Function and Basic Transforms
    • Filtering
  • Student Assignment
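
As a rough illustration of the Lab 1 topics, the sketch below shows a column transform and a filter using plain Pandas. The column names and values are hypothetical and are not taken from the tutorial dataset. The Pandas to/from PyGDF round trip is left out because it needs the GPU environment provided in the lab.

```python
import pandas as pd

# Hypothetical connection records; not the tutorial's actual data.
df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "bytes_sent": [1200, 300, 4000, 80],
    "duration_s": [2.0, 0.5, 8.0, 0.25],
})

# Column function / basic transform: derive throughput from two columns.
df["bytes_per_s"] = df["bytes_sent"] / df["duration_s"]

# Filtering: keep only rows whose throughput exceeds a threshold,
# using a boolean mask.
busy = df[df["bytes_per_s"] > 550]

print(busy)
```

The same pattern of deriving a column and then masking rows is what the lab repeats on the GPU data frame, so it may help to be comfortable with it in Pandas first.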

Break

Hands-on: Data Science and Machine Learning

  • Lab 3: Classification using XGBoost
    • Familiarization with IoT cyber network data
    • Data ingest and feature extraction
    • Time binning and preparation for classification
    • Building XGBoost model
    • Evaluating the model via ROC curves and AUC
    • Student Assignment:
      • Investigation into other time binnings, aggregations, and XGBoost parameters
      • Using additional features (quantitative and categorical) in the data to build better models
      • Moving beyond connection logs to other log types (e.g., DNS) and building models
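
To make two of the Lab 3 steps concrete, here is a minimal sketch of time binning and of computing AUC, written with the Python standard library only. It does not train an XGBoost model: the events, labels, and scores are hypothetical stand-ins, and the helper names `bin_events` and `roc_auc` are illustrative rather than part of the tutorial notebooks.

```python
from collections import defaultdict

# Hypothetical connection-log events: (timestamp in seconds, host, bytes).
events = [
    (3, "cam-1", 100), (42, "cam-1", 250), (61, "cam-1", 90),
    (5, "plug-7", 4000), (70, "plug-7", 5200), (118, "plug-7", 6100),
]


def bin_events(events, width_s=60):
    """Aggregate events into fixed-width time bins per host.

    Returns {(host, bin_index): {"count": n, "bytes": total}}.
    """
    bins = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, host, nbytes in events:
        key = (host, ts // width_s)
        bins[key]["count"] += 1
        bins[key]["bytes"] += nbytes
    return dict(bins)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


features = bin_events(events)

# Hypothetical ground-truth labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(features)
print(roc_auc(labels, scores))
```

Changing `width_s` is one simple way to explore the "other time binnings" mentioned in the student assignment; the per-bin counts and byte totals would then serve as candidate features for a classifier.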

Break

Wrap-up and Conclusion

  • Roadmap
  • Scaling out to multi-GPU and multi-node
  • Partner Activities
  • Conclusion