🔧 Use dplyr to analyze Big Data 🐘

Big Data with R

rstudio::conf 2020

Interested? See registration information here: RStudio Conference 2020


🗓️ January 27 and 28, 2020
⏰ 09:00 - 17:00
🏨 [ADD ROOM]
✍️ RStudio Conference 2020



Overview

This 2-day workshop covers how to analyze large amounts of data in R. We will focus on scaling up our analyses using the same dplyr verbs that we use in our everyday work. We will use dplyr with data.table, databases, and Spark. We will also cover best practices for visualizing, modeling, and sharing results against these data sources. Where applicable, we will review recommended connection settings, security best practices, and deployment options.
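
As a rough illustration of "the same dplyr verbs" applied to a database, here is a minimal sketch. It is not taken from the workshop materials. It assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed, and it uses an in-memory SQLite database loaded with the built-in mtcars data as a stand-in for a real database server.

```r
library(dplyr)
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference to the table; no rows are pulled into R yet
cars_db <- tbl(con, "mtcars")

# The same verbs used on a local data frame; dbplyr translates them to SQL
summary_query <- cars_db %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE), n = n())

# Print the SQL that would be sent to the database
show_query(summary_query)

# Run the query in the database and bring back only the small summary
result <- collect(summary_query)

dbDisconnect(con)
```

The aggregation runs inside the database and only the summarized rows come back to R, which is what makes this pattern attractive for tables too large to load into memory.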

Learning objectives

In this 2-day workshop, attendees will learn how to connect to and analyze large-scale data.

Is this course for me?

You should take this workshop if you want to learn how to work with big data in R. This data can be in-memory, in databases (like SQL Server), or in a cluster (like Spark).

Prework

Helpful reading

Some attendees have asked for material to review before the class. The following is a compilation of subjects that would be great to be familiar with by the time the class begins, but you are not required to study or review them.

For database background, please review the articles in the following links:

For Spark background, please review the following:

Equipment

We plan to provide a personal server to each student for use during the class. The server will contain all of the applications and materials needed, including R and RStudio. All you will need is a laptop with a web browser. If you need to use a work-provided laptop for the class, please make sure its web browser is not blocked from reaching Amazon Web Services (AWS), which is where the servers will be set up.

Schedule

Time           Activity
09:00 - 10:30  Session 1
10:30 - 11:00  Coffee break
11:00 - 12:30  Session 2
12:30 - 13:30  Lunch break
13:30 - 15:00  Session 3
15:00 - 15:30  Coffee break
15:30 - 17:00  Session 4

Instructors

Edgar Ruiz

Solutions Engineer @ RStudio

Twitter: theotheredgar

LinkedIn: edgararuiz

James Blair

Solutions Engineer @ RStudio

Twitter: Blair09M

LinkedIn: blairjm

Class Outline

The following is a tentative outline of the subjects that will be covered during the class. The content and order are subject to change.

  • Introduction to vroom
    • vroom basics
    • Load multiple files
    • Load and modify multiple files
  • Introduction to dtplyr
    • dtplyr basics
    • Object sizes
    • How dtplyr works
    • Working with dtplyr
    • Pivot data
    • The mutate() verb
  • Introduction to database connections
    • Connecting via DSN
    • Connect with a connection string
    • Secure connection details
  • Introduction to DBI
    • Local database basics
    • Options for writing tables
    • Database operations
    • knitr SQL engine
  • Databases and dplyr
    • Intro to connections
    • Table reference
    • Under the hood
    • Un-translated R commands
    • Using bang-bang
  • Data Visualizations
    • Simple plot
    • Plot in one code segment
    • Create a histogram
    • Raster plot
    • Using the compute functions
  • Modeling with databases
    • Single step sampling
    • Using tidymodels for modeling
    • Score with tidypredict
    • Run predictions in DB
  • Advanced Operations
    • Simple wrapper function
    • Multiple variables
    • Multiple queries
    • Multiple queries with an overlapping range
    • Characters to field names
  • Intro to sparklyr
    • New Spark session
    • Data transfer
    • Spark and dplyr
  • Text mining with sparklyr
    • Data Import
    • Tidying data
    • Transform the data
    • Data Exploration
  • Spark data caching
    • Map data
    • Caching data

This work is licensed under a Creative Commons Attribution 4.0 International License.