Introduction to Big Data
What launched the Big Data era?
- Data Science: the #1 catalyst for economic growth
- Big Data: a continuous stream of data
- Cloud Computing = computing on demand = dynamic and scalable data analysis
Applications: What makes big data valuable
- Big Data -> Better Models -> Higher Precision
- Big Data has enabled personalized marketing
- Big Data has enabled personalized healthcare
Where Does Big Data Come From?
- Machines
- People
- Organizations
- Interactive visualization: http://lod-cloud.net/versions/2017-08-22/lod.svg
Machine-Generated Data: It's Everywhere and There's a Lot!
- Largest source of data
- Internet of Things
- Real time problem detection
- Customer Relations
- Fraud Detection
- System monitoring / control
- SCADA (Supervisory Control and Data Acquisition)
Big Data Generated By People: The Unstructured Challenge
- Unstructured data
- Huge growth in data
| Company | Data Processed Daily |
|---|---|
| eBay | 100 PB |
| | 100 PB |
| | 30 PB |
| | 100 TB |
| Spotify | 64 TB |
Hadoop
- Handles large batches of distributed data
Spark / Apache Storm
- Real-time processing
- Integrates with any database or storage technology
Data Warehouse
- Integrates data from multiple sources into one place
- Data is extracted from different data sources, transformed, and loaded into a central database; this process is called ETL (Extract, Transform, Load); see the sketch below
- Used for structured data
- Doesn't fit well with today's dynamic data world
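A minimal ETL sketch in Python, assuming a hypothetical `sales_2024.csv` export with an `amount` column and a local SQLite file standing in for the warehouse:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source system (hypothetical CSV export).
raw = pd.read_csv("sales_2024.csv")

# Transform: clean and reshape so the data fits the warehouse schema.
raw = raw.dropna(subset=["amount"])
raw["amount"] = raw["amount"].astype(float)

# Load: append the cleaned rows into the central store.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="append", index=False)
```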
Big Data pipeline
Organization-Generated Data: Structured but often siloed
- Highly structured data
- Data silos (no single silo has access to all the information)
- Commercial transactions
  - Detect correlated products
  - Estimate demand
  - Capture fraudulent activity
- Banking / stock records
- Credit cards
- E-commerce
- Medical records
Commercial Transaction + Open Data + Analytics = Better predictions
Walmart
- Twitter data
- local events
- local weather
- in-store purchases
- online clicks
The Key: Integrating Diverse Data
Structured Data + Unstructured Data + Price Optimization = Increased Revenue
Data integration = Reduce data complexity + Increase data availability + Unify your data system (see the sketch below)
Books about Price Optimization:
- Supply Chain Engineering: Useful Methods and Techniques
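A small sketch of integrating two diverse sources in Python, assuming hypothetical `in_store_purchases.csv` and `local_weather.csv` files that share a `date` column:

```python
import pandas as pd

# Two hypothetical sources: structured in-store purchases and local weather.
purchases = pd.read_csv("in_store_purchases.csv", parse_dates=["date"])
weather = pd.read_csv("local_weather.csv", parse_dates=["date"])

# Integrate them into one table keyed on date, so a demand model can use both.
combined = purchases.merge(weather, on="date", how="left")
print(combined.head())
```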
V's for Big Data
Volume
- Vast amounts of data
- Not just storage: the data must also be managed and processed in a timely fashion
- Exponential growth
Variety
- Structural variety: formats and models
- Media variety: the medium in which the data is delivered
- Semantic variety: how to interpret and operate on the data
- Availability variations: real-time vs. intermittent
Velocity
- Speed of generation and the pace at which data moves from A to B
- Real-time processing over batch processing
Veracity
- Truthfulness and trustworthiness of the data
- Quality
Valence
- How connected the data is within itself
- Valence increases over time (e.g., in a social network)
Value
- How will Big Data benefit business
Data Science: Getting Value out of Big Data
- Induce information from observations
- Big Data + Analysis + Question = Insight
- Data Science is not static, models are constantly getting better and new questions arise with new insights and data
What Data Science is made of
Building a Big Data Strategy
- Aim
  - High level goals
  - What data to collect
  - Maybe start with objectives and derive which data to get to reach them
  - Must have support from company and stakeholders
- Policy
  - Privacy and lifetime
  - Curation and quality
  - Interoperability and regulation
- Plan
- Action
P's of Data Science
- People
  - Data science team and project stakeholders
- Purpose
  - Focus on the question, not the technology
- Process
  - A process to iterate on, so people can communicate and work together
  - Acquire -> Prepare -> Analyse -> Report -> Act
- Platforms
  - Hadoop or other
- Programmability
  - Reusable / reproducible / middleware
https://words.sdsc.edu/words-data-science/data-science
Asking the Right Questions
- Define the problem
  - e.g. How can data be used to detect machine failure?
  - e.g. How can we understand customers better for targeted marketing?
- Assess the situation and context
- Define goals / success criteria
Steps in the Data Science Process
- Acquire data
  - Find data
  - Retrieve data
  - Query data
- Prepare
  - Explore data
    - Look at the data
    - Understand the nature of the data
    - Preliminary analysis
  - Pre-process
    - Clean
    - Integrate
    - Package into a suitable format
- Analyse data
  - Choose an analytical approach
  - Build models
- Communicate results
  - Reports
  - Results
- Apply results
All of the above is iterative and may be repeated
Step 1: Acquiring Data
- Leave no stone unturned to find data
- Make use of all data
- Data comes from all places, local and remote
- Data comes in different velocities
- Data can come from databases (SQL and NoSQL)
- Data can come from Text files
- Data can come from websites and REST APIs
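A sketch of these three access paths in Python, assuming a hypothetical local SQLite database with an `orders` table, a CSV export, and an illustrative REST endpoint (`api.example.com`):

```python
import sqlite3
import pandas as pd
import requests

# From a relational database (hypothetical local SQLite file).
with sqlite3.connect("local.db") as conn:
    db_df = pd.read_sql_query("SELECT * FROM orders", conn)

# From a text/CSV file (hypothetical export).
file_df = pd.read_csv("orders_export.csv")

# From a REST API (hypothetical endpoint returning a JSON array).
resp = requests.get("https://api.example.com/v1/orders", timeout=10)
resp.raise_for_status()
api_df = pd.DataFrame(resp.json())
```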
Step 2-A: Exploring Data
- Specific characteristics of data
- Correlation
- Outliers
- Without this step, can't use data effectively
- Correlation graphs
- General trends (e.g. sales prices increase every year)
- Mean, Median, Standard Deviation
- Visualization (heatmap, histograms, boxplots, scatter (correlation))
Data Exploration -> Data Understanding -> Informed Analysis
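A minimal exploration pass with pandas and matplotlib, assuming the hypothetical `orders_export.csv` from the acquire step with numeric `price` and `units_sold` columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("orders_export.csv")

print(df.describe())                 # mean, std, quartiles for numeric columns
print(df.median(numeric_only=True))  # medians
print(df.corr(numeric_only=True))    # pairwise correlations

df.hist(figsize=(10, 6))                    # histograms of every numeric column
df.plot.scatter(x="price", y="units_sold")  # look for correlation between two columns
plt.show()
```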
Step 2-B: Pre-Processing Data
- Clean + transform the data to make it of good quality
- Inconsistent values
- Duplicate records (e.g., two customer records with the same CPF / tax ID)
- Missing values
- Invalid data (e.g., an invalid CEP / postal code)
- Outliers
- Normalization of values
- Feature selection
  - Remove features (highly correlated features can be removed)
  - Combine features
  - Add features (e.g., encode cyclic features with sin/cos)
- Dimensionality reduction (PCA)
- Remove data records with missing values
- Merge duplicate records (resolving conflicting values)
- Invalid values -> best estimate (e.g., a missing employee age can be estimated from length of employment)
- Remove outliers if they are not important for the context or the specific task
All of the above requires domain knowledge to be done correctly; a minimal pre-processing sketch follows.
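A sketch of a few of these operations with pandas, again assuming the hypothetical `orders_export.csv` with `customer_id`, `order_id`, `amount`, and `age` columns:

```python
import pandas as pd

df = pd.read_csv("orders_export.csv")

# Merge/drop duplicate records and drop rows missing a critical value.
df = df.drop_duplicates(subset=["customer_id", "order_id"])
df = df.dropna(subset=["amount"])

# Fill a missing value with a best estimate instead of dropping the row.
df["age"] = df["age"].fillna(df["age"].median())

# Normalize a numeric feature to the [0, 1] range.
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# Feature selection: drop a feature that is highly correlated with another one
# (assumes a redundant "amount_usd" column exists).
df = df.drop(columns=["amount_usd"])
```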
Step 3: Analyzing Data
- Classification
- Regression
- Clustering
- Graph Analytics
- Use graph structure to find connections between entities
- Association Analysis
- A set of rules that captures associations between items or events, e.g., which items tend to occur together
- Market basket analysis (customer behavior)
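As one example of the clustering approach, a minimal customer-segmentation sketch with scikit-learn, assuming a hypothetical `customers.csv` with numeric `annual_spend` and `visits_per_month` columns:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")
features = df[["annual_spend", "visits_per_month"]]

# Group customers into three segments based on spending behavior.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["segment"] = model.fit_predict(features)

# Inspect the average behavior of each segment.
print(df.groupby("segment")[["annual_spend", "visits_per_month"]].mean())
```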
Step 4: Communicating Results
- What are the main results
- What added value do these results provide / what can the model add to the application?
- Results vs. success criteria defined at the beginning of the project
- Aggregate facts to back the above
- Results could be puzzling or divergent from expectation
- Visualizations, as in the previous steps
- Tables with results from more in-depth analysis
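One way to present results against the success criteria, as an illustration only; the metric names and values below are made-up assumptions, not real results:

```python
import matplotlib.pyplot as plt

metrics = ["Precision", "Recall", "Coverage"]
target = [0.80, 0.75, 0.90]    # success criteria agreed at project start (assumed)
achieved = [0.84, 0.71, 0.93]  # model results (assumed)

x = range(len(metrics))
plt.bar([i - 0.2 for i in x], target, width=0.4, label="Success criterion")
plt.bar([i + 0.2 for i in x], achieved, width=0.4, label="Achieved")
plt.xticks(list(x), metrics)
plt.ylabel("Score")
plt.title("Results vs. success criteria")
plt.legend()
plt.show()
```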
Step 5: Turning Insights into Action
- Should something in the business process change to remove a bottleneck?
- Should data be added to the application to make it more accurate?
- Should the population be segmented into more groups for more effective targeted marketing?
- Favorable results? Revisit the data? Further opportunities? Real-time action?
Basic Scalable Computing Concepts
- Parallel Computer - several nodes with specialized capabilities
- Commodity Cluster - several nodes, less specialized than parallel computers
- Data Parallelism - jobs that share nothing can work on different parts of the dataset (see the sketch below)
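A minimal shared-nothing data-parallelism sketch using Python's multiprocessing; the partitioning scheme and worker function are illustrative:

```python
from multiprocessing import Pool

def process_partition(partition):
    # Each worker operates only on its own partition and shares nothing
    # with the other workers.
    return sum(x * x for x in partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk = len(data) // n_workers
    partitions = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partial_results = pool.map(process_partition, partitions)

    print(sum(partial_results))  # combine the per-partition results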
Programming Models for Big Data
- Split large volumes of data
- Access to data should be fast
- Distribute computations to nodes
- Replicate data partitions
- Enable adding new resources
- Optimized for specific data types
- Document
- Table
- Key-value
- Graph
- Multimedia
- Stream
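A toy sketch of the first two requirements above (splitting data into partitions and replicating each partition across nodes); node names, partition counts, and the replication factor are all illustrative:

```python
# Split a dataset into partitions and place each partition on more than one
# "node", so losing a single node does not lose data.
records = [f"record-{i}" for i in range(12)]
nodes = {"node-a": [], "node-b": [], "node-c": []}
node_names = list(nodes)
replication_factor = 2

n_partitions = 4
size = len(records) // n_partitions
partitions = [records[i * size:(i + 1) * size] for i in range(n_partitions)]

for idx, partition in enumerate(partitions):
    for r in range(replication_factor):
        target_node = node_names[(idx + r) % len(node_names)]
        nodes[target_node].append((idx, partition))

for name, stored in nodes.items():
    print(name, "holds partitions", [pid for pid, _ in stored])
```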
Hadoop
- Scalability
- Fault Tolerance
- Optimized for different data types
- Facilitate a shared environment
Map Reduce
- Map = apply()
- Reduce = Summarize()
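The canonical word-count example, sketched in plain Python to show the map (apply) and reduce (summarize) phases rather than the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: apply a function to each element, emitting (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: summarize all values that share the same key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data about data"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(all_pairs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```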
Pig
- Dataflow scripting
Hive
- SQL-like queries
Giraph
- Process graphs
Storm, Spark, Flink
- In memory, realtime processing
Key-value stores / sparse tables
- HBase
- Cassandra
- MongoDB
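As one concrete example from the list above, a minimal MongoDB access sketch with pymongo; it assumes a MongoDB instance running on localhost and an illustrative `demo_db` database:

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumed to exist).
client = MongoClient("mongodb://localhost:27017/")
db = client["demo_db"]

# Store and retrieve a document keyed by sensor_id.
db.sensors.insert_one({"sensor_id": "s1", "temperature": 21.5})
doc = db.sensors.find_one({"sensor_id": "s1"})
print(doc)
```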
Zookeeper
- Configuration and management
When to Use Hadoop
- Future anticipated data growth
- Long term availability of data
When not to use Hadoop
- Small dataset
- Advanced Algorithms
- Not a replacement for database
- Random Data Access
- Task Level Parallelism
Cloud
Big Data Modeling and Management Systems
Data Management
- How do we ingest the data?
- Where and how do we store it?
- How can we ensure data quality?
- What operations do we perform on the data?
- How can these operations be efficient?
- How do we scale up data volume, variety, velocity and access?
- How do we keep the data secure?