XIEQ/workshops

All materials for workshops - HackOn(Data) - Toronto

Apache-2.0

workshops

Workshops are based on the Databricks edX lectures on Apache Spark.

Goal: Learn how to apply data science and data engineering techniques using parallel programming in Apache Spark.

Participants are expected to review the material before each session and complete the weekly challenges. Participants will be awarded points each week based on their participation, lab completion, and other tasks (more below). Questions during the session are encouraged, but priority will be given to questions sent before the session and to those that benefit the majority of the group.

Points system:

Please see the "Competition" tab of the FAQ.

How to communicate?

Please see the "Communication" tab of the FAQ.

Late Submissions

Submission time	Points subtracted
1 week < submission < 2 weeks	1
2 weeks < submission < 3 weeks	2
3 weeks < submission < 4 weeks	3
4 weeks < submission < 5 weeks	4
5 weeks < submission < 9 weeks	5

Schedule:

The sessions will be delivered over zoom.us. Click on a session title to see the link for that session. All sessions will be recorded and published after the live session.

Sessions run every Tuesday from 6:30pm to 8:30pm, starting on July 4, 2017 and continuing until the day of the hackathon.

Sessions:

Remote sessions: Use zoom.us

Join the live session here: https://zoom.us/j/558311905?pwd=7KDJdpU_dNA

Jul 4 - In-person - Intro

Notebook usage
Intro to Spark and the PySpark API
Using RDDs
Lambda functions
RDD actions, transformations, and caching
Debugging and lazy evaluation
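
As a rough illustration of lazy evaluation: in Spark, transformations such as map and filter are lazy and only run when an action such as collect or count is called. The sketch below is an analogy in plain Python, not Spark code. It relies on the built-in map and filter being lazy iterators, and the helper name traced_square is hypothetical.

```python
# Analogy only: plain Python iterators, not the PySpark API.
calls = []  # records every element the "transformation" actually touches

def traced_square(x):
    """Square x and record that the function was really invoked."""
    calls.append(x)
    return x * x

data = range(1, 6)

# "Transformations": these only build a pipeline, nothing is computed yet.
squared = map(traced_square, data)
evens = filter(lambda v: v % 2 == 0, squared)

calls_before_action = len(calls)  # still zero at this point

# "Action": consuming the iterator forces the whole pipeline to run.
result = list(evens)

print(calls_before_action, result, len(calls))
```

This also hints at why debugging can be confusing: an error inside traced_square would surface at the line that consumes the pipeline, not at the line that defines it.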

Recording is Available Here

*Please subscribe to the HackOn(Data) channel to get notified when we upload a new video!

Jul 11 - Virtual session - RDDs

Create an RDD and a pair RDD
Counting words
Finding unique words and mean value
Regular expressions reference: https://regex101.com/#python
Apply word count to a file
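
The word-count steps above can be sketched in plain Python, with no Spark installation required. In Spark the same steps are typically expressed with flatMap, map, and reduceByKey on an RDD. Here a Counter plays the role of the (word, count) pair RDD. The function names and the sample lines are hypothetical and are not taken from the lab notebooks.

```python
import re
from collections import Counter

def remove_punctuation(text):
    """Lowercase the text, drop everything except letters, digits, and
    whitespace, and trim leading/trailing whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).strip()

def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # split() with no arguments also discards empty strings
        counts.update(remove_punctuation(line).split())
    return counts

# Hypothetical sample input standing in for the lines of a file.
lines = ["The cat, the hat.", "A cat!"]

counts = word_count(lines)
unique_words = len(counts)                         # number of distinct words
mean_count = sum(counts.values()) / unique_words   # mean occurrences per word

print(counts.most_common(2), unique_words, mean_count)
```

To apply this to a file, pass an open file object as lines, since iterating over a file yields one line at a time.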

Jul 18 - Virtual session - Data Exploration

Server log analysis statistics
Finding problematic endpoints, unique hosts
Visualizing data analysis results
Data exploration

Jul 25 - Virtual Session - Text Analysis

Text similarity for entity resolution
Weighted bag-of-words
Cosine similarity
Scalable Entity Resolution
Analysis
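
A minimal sketch of the weighted bag-of-words and cosine similarity topics above, in plain Python. It assumes TF-IDF weighting with IDF defined as (number of documents) / (number of documents containing the term). That definition, the function names, and the toy corpus are assumptions made for illustration and may differ from what the session uses.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each token within one document."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}

def idf_weights(corpus):
    """Assumed IDF: number of documents / documents containing the term."""
    n_docs = len(corpus)
    doc_freq = Counter(t for tokens in corpus for t in set(tokens))
    return {t: n_docs / df for t, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Weighted bag-of-words: term frequency scaled by IDF."""
    return {t: tf * idf.get(t, 0.0) for t, tf in term_frequencies(tokens).items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Hypothetical, already-tokenized documents.
corpus = [
    ["spark", "rdd", "spark"],
    ["spark", "dataframe"],
    ["pandas", "dataframe"],
]
idf = idf_weights(corpus)
vectors = [tfidf(doc, idf) for doc in corpus]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # share "spark"
sim_02 = cosine_similarity(vectors[0], vectors[2])  # no shared terms
print(sim_01, sim_02)
```

For entity resolution, two records would be treated as candidate matches when their similarity exceeds a chosen threshold.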

Aug 1 - Virtual Session: Review

Math review
NumPy and Spark
Lambda functions

Aug 8 - Virtual session - Read, parse, and visualize dataset

Baseline model
Train linear regression
Hyperparameter tuning
Feature interactions

Aug 15 - Virtual session - Feature Hashing

One-Hot Encoding (OHE)
OHE Dictionary
Prediction and log loss evaluation
Feature reduction
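
The OHE dictionary and log loss topics above can be sketched in plain Python as follows. Each sample is assumed to be a list of (feature_id, value) pairs. That representation, the function names, the sample rows, and the eps clipping constant are all assumptions made for illustration.

```python
import math

def build_ohe_dict(rows):
    """Map every distinct (feature_id, value) pair to a unique integer index."""
    distinct = sorted({pair for row in rows for pair in row})
    return {pair: idx for idx, pair in enumerate(distinct)}

def one_hot_encode(row, ohe_dict):
    """Return the sorted indices of the active features in one row.
    Pairs that are not in the dictionary (unseen values) are skipped."""
    return sorted(ohe_dict[pair] for pair in row if pair in ohe_dict)

def log_loss(p, label, eps=1e-11):
    """Log loss for a single prediction p of a binary label (0 or 1).
    p is clipped to [eps, 1 - eps] so that log(0) never occurs."""
    p = min(max(p, eps), 1 - eps)
    return -math.log(p) if label == 1 else -math.log(1 - p)

# Hypothetical categorical samples.
rows = [
    [(0, "mouse"), (1, "black")],
    [(0, "cat"), (1, "tabby"), (2, "mouse")],
    [(0, "bear"), (1, "black"), (2, "salmon")],
]
ohe_dict = build_ohe_dict(rows)
encoded_first = one_hot_encode(rows[0], ohe_dict)
encoded_unseen = one_hot_encode([(0, "dog"), (1, "black")], ohe_dict)

print(len(ohe_dict), encoded_first, encoded_unseen, log_loss(0.5, 1))
```

Storing only the active indices keeps each encoded sample sparse, which matters when the dictionary grows to many thousands of entries.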

Aug 22 - In-person - Principal Component Analysis

PCA on a sample dataset
PCA calculation and evaluation
Data preprocessing for PCA
Feature-based aggregation

Sep 9 - Hackathon day

Additional information:

Welcome to Databricks: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#00%20Welcome%20to%20Databricks.html
Databricks File System (DBFS): https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html
Mount an S3 bucket to Databricks: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#03%20Data%20Sources/2%20AWS%20S3%20-%20py.html