Workshops are heavily based on the Databricks EdX lecture series on Apache Spark.
Goal: Learn how to apply data science and data engineering techniques using parallel programming in Apache Spark.
Participants are expected to review the material before each session and complete the weekly challenges. Participants will be awarded points each week based on their participation, lab completion, and other tasks (more below). Questions during the session are encouraged, but priority will be given to questions sent before the session and to those that benefit the majority of the group.
Please view "Competition" tab on http://hackondata.com/2017/index.html#faq
The deadline for submission is the end of day the Monday before the session.
Submission time | Points subtracted
---|---
1–2 weeks late | 1
2–3 weeks late | 2
3–4 weeks late | 3
4–5 weeks late | 4
5–9 weeks late | 5
** How to communicate?
Please view "Communication" tab on Please view "Competition" tab on http://hackondata.com/2017/index.html#faq
** If you aren't a member, send an email to mehrdad@tranquant.com with "Slack Invitation Request" as the subject
The sessions will be delivered via zoom.us. Click on a session title to see the link for that session. All sessions will be recorded and published after the live session.
Every Tuesday from 7:00pm to 8:30pm, starting July 4, 2017 and running until the day of the hackathon.
Join the live session here: https://zoom.us/j/558311905?pwd=7KDJdpU_dNA
- Notebook usage
- Intro to Spark and the PySpark API
- Using RDDs
- Lambda functions
- RDD actions, transformations, and caching
- Debugging and lazy evaluation
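
Transformations such as `map` and `filter` are lazy: they only record lineage, and nothing executes until an action (`collect`, `count`, `sum`, …) forces evaluation. A rough pure-Python analogy of the same idea, using generators rather than the actual Spark API:

```python
# Pure-Python analogy of Spark's lazy evaluation (generators, not RDDs).
data = range(1, 6)

# "Transformations": nothing is computed when these two lines run.
squared = map(lambda x: x * x, data)
evens = filter(lambda x: x % 2 == 0, squared)

# "Action": forces the whole pipeline to evaluate.
result = sum(evens)   # 4 + 16 = 20
```

In PySpark the equivalent chain would look like `sc.parallelize(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).sum()`, and likewise nothing is computed until the final `sum()` action.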
- Creating an RDD and a pair RDD
- Counting words
- Finding unique words and mean value
- Reference for regular expressions: https://regex101.com/#python
- Apply word count to a file
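
At its core, word count is: lower-case, tokenize with a regular expression, count. A minimal single-machine sketch of that logic (the sample sentence and the tokenizer regex are illustrative assumptions; in Spark the same steps would run distributed, e.g. with `map` and `reduceByKey` over an RDD):

```python
import re
from collections import Counter

def word_count(text):
    # Lower-case, then tokenize on runs of letters/apostrophes — a
    # simplifying assumption, not the lab's exact tokenization rules.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = word_count("The quick brown fox jumps over the lazy dog. The dog sleeps.")
top = counts.most_common(1)   # [('the', 3)]
```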
- Server log analysis statistics
- Finding problematic endpoints, unique hosts
- Visualizing data analysis results
- Data exploration
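
The parsing step behind log analysis: match each line against a common Apache log pattern, then aggregate, e.g. by status code to find problematic endpoints. The sample lines and the exact regex below are illustrative assumptions, not the lab's code:

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "method endpoint proto" status size
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

lines = [
    '127.0.0.1 - - [01/Jul/1995:00:00:01 -0400] "GET /history/ HTTP/1.0" 200 6245',
    '10.0.0.5 - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/missions.html HTTP/1.0" 404 -',
]

def parse(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None          # malformed lines are dropped in this sketch
    host, method, endpoint, status, size = m.groups()
    return host, endpoint, int(status)

records = [r for r in (parse(l) for l in lines) if r]
status_counts = Counter(status for _, _, status in records)
not_found = [ep for _, ep, s in records if s == 404]   # "problematic" endpoints
```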
- Text similarity for Entity Resolution
- Weighted bag-of-words
- Cosine similarity
- Scalable Entity Resolution
- Analysis
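
Cosine similarity compares two weighted bag-of-words vectors by the angle between them: the dot product divided by the product of the norms. The sketch below uses raw term counts as the weights for brevity; the entity-resolution lab weights tokens with TF-IDF, but the similarity formula is the same:

```python
import math
from collections import Counter

def bag_of_words(text):
    # Raw term counts as weights (simplification; TF-IDF would go here).
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # a, b: dicts mapping token -> weight.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sim = cosine_similarity(bag_of_words("apple iphone 6"),
                        bag_of_words("apple iphone 6s case"))
```

Identical documents score 1.0, disjoint documents 0.0, which makes the measure convenient for thresholding candidate record pairs.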
- Math review
- Numpy and Spark
- Lambda functions
- Baseline model
- Train linear regression
- Hyperparameter tuning
- Feature interactions
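
A baseline linear model can be trained with plain batch gradient descent; the learning rate and epoch count are exactly the kind of hyperparameters the tuning step searches over. A self-contained single-feature sketch with synthetic data (all parameter values are illustrative):

```python
def train_linear_regression(points, lr=0.01, epochs=1000):
    # points: list of (x, y); fit y ~ w*x + b with batch gradient descent.
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        # Gradient of mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in points) / n
        grad_b = sum((w * x + b - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data on the exact line y = 2x + 1.
pts = [(x, 2 * x + 1) for x in range(10)]
w, b = train_linear_regression(pts, lr=0.02, epochs=5000)
```

In Spark, the per-point gradient terms can be computed with `map` and combined with `reduce`; grid-searching `lr` against a held-out validation set is a simple form of hyperparameter tuning.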
- One-Hot Encoding (OHE)
- OHE Dictionary
- Prediction and log loss evaluation
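
One-hot encoding maps each categorical (field, value) pair to a position in a binary vector via a dictionary built from the training data; log loss then scores the model's predicted probabilities. A sketch with made-up feature tuples (the `eps` clamp guards against `log(0)`):

```python
import math

def build_ohe_dict(raw_features):
    # Map each distinct (field, value) pair to a unique index.
    return {feat: i for i, feat in enumerate(sorted(set(raw_features)))}

def one_hot_encode(features, ohe_dict):
    vec = [0] * len(ohe_dict)
    for f in features:
        if f in ohe_dict:            # unseen categories are dropped (assumption)
            vec[ohe_dict[f]] = 1
    return vec

def log_loss(y_true, p, eps=1e-15):
    p = min(max(p, eps), 1 - eps)    # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

all_feats = [("color", "red"), ("color", "blue"), ("size", "L")]
ohe = build_ohe_dict(all_feats)
vec = one_hot_encode([("color", "red"), ("size", "L")], ohe)
```

Confident correct predictions drive log loss toward 0, while confident wrong ones are penalized heavily, which is why it is the standard evaluation metric for click-through-rate-style probability models.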
- Feature reduction
- PCA on a sample dataset
- PCA calculation and evaluation
- Data preprocessing for PCA
- Feature-based aggregation
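
PCA in miniature: mean-center the data (the preprocessing step), form the covariance matrix, and extract the top eigenvector, i.e. the direction of maximum variance. The sketch below works on a 2-D toy dataset and uses power iteration only to stay dependency-free; a full implementation would use an eigendecomposition or SVD:

```python
import math

def pca_top_component(data, iters=200):
    # data: list of (x, y); returns a unit vector along the direction
    # of maximum variance.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]   # preprocessing: mean-center

    # 2x2 covariance matrix of the centered data.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n

    # Power iteration: repeatedly apply the matrix and renormalize.
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# Points scattered around the line y = x: the top component should be
# approximately (1/sqrt(2), 1/sqrt(2)).
component = pca_top_component([(i, i + 0.1 * ((-1) ** i)) for i in range(10)])
```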
Additional information:
- Welcome to databricks: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#00%20Welcome%20to%20Databricks.html
- Databricks dbfs https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html
- Mount s3 bucket to databricks: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#03%20Data%20Sources/2%20AWS%20S3%20-%20py.html