Introduction to Big Data
What launched the Big Data era?
- Data Science: #1 catalyst for economic growth
- Big Data is a stream of data
- Cloud Computing = Computing on demand = dynamic and scalable data analysis
Applications: What makes big data valuable
- Big Data -> Better Models -> Higher Precision
- Big Data has enabled personalized marketing
- Big Data has enabled personalized healthcare
Where Does Big Data Come From?
- Machines
- People
- Organizations
- Interactive visualization:
Machine-Generated Data: It's Everywhere and There's a Lot!
- Largest source of data
- Internet of Things
- Real time problem detection
- Customer Relations
- Fraud Detection
- System monitoring / control
- SCADA (Supervisory Control and Data Acquisition)
Big Data Generated By People: The Unstructured Challenge
- Unstructured data
- Huge growth in data
Company | Data Processed Daily |
eBay | 100PB |
(unnamed) | 100PB |
(unnamed) | 30PB |
(unnamed) | 100TB |
Spotify | 64TB |
- Handle big batches of distributed information
Spark / Apache Storm
- Real time processing
- Integrate with any database or store technology
Data Warehouse
- Datasource integration to one place
- Data is extracted from different data sources, transformed, and loaded into a central database. This is called ETL (Extract, Transform, Load)
- Used for structured data
- Doesn't fit well with today's dynamic data world
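The extract, transform, load sequence described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the two sources (`ONLINE_CSV`, `STORE_ROWS`), their field names, and the `sales` table are all hypothetical, and an in-memory SQLite database stands in for the central warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export from an online shop.
ONLINE_CSV = "order_id,customer,amount_usd\n1,alice,10.50\n2,bob,4.00\n"
# Hypothetical source 2: in-store records with amounts stored in cents.
STORE_ROWS = [{"id": 7, "name": "ALICE", "cents": 250}]


def extract():
    """Extract: read raw records from both sources."""
    online = list(csv.DictReader(io.StringIO(ONLINE_CSV)))
    return online, STORE_ROWS


def transform(online, store):
    """Transform: map both sources onto one schema (source, customer, amount_usd)."""
    rows = [("online", r["customer"].lower(), float(r["amount_usd"])) for r in online]
    rows += [("store", r["name"].lower(), r["cents"] / 100) for r in store]
    return rows


def load(rows, conn):
    """Load: write the unified rows into the central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (source TEXT, customer TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the central database
load(transform(*extract()), conn)

# Once the data sits in one place, a single query spans both sources.
totals = dict(conn.execute("SELECT customer, SUM(amount_usd) FROM sales GROUP BY customer"))
print(totals)
```

The fixed target schema is what makes the final query simple, and it is also why this approach suits structured data better than fast-changing, loosely structured sources.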
Big Data pipeline
Organization-Generated Data: Structured but often siloed
Highly structured data
Data silos (no single silo has access to all information)
Commercial Transaction
- Detect correlated products
- Estimate demand
- Capture Fraudulent Activity
Banking / Stock records
Credit Cards
Medical Records
Commercial Transaction + Open Data + Analytics = Better predictions
- Twitter data
- local events
- local weather
- in-store purchases
- online clicks
The Key: Integrating Diverse Data
Structured Data + Unstructured Data + Price Optimization = Increased Revenue
Data integration = Reduce data complexity + Increase data availability + Unify your data system
Books about Price Optimization:
Supply Chain Engineering: Useful Methods and Techniques
V's for Big Data
Volume
- Vast amounts of data
- Storage alone is not enough; the data must also be managed and processed in a timely fashion
- Exponential growth
Variety
- Structural variety: formats and models
- Media variety: the medium in which data is delivered
- Semantic variety: how to interpret and operate on data
- Availability variations: real time / intermittent
Velocity
- Speed of generation and pace of movement from A to B
- Real-time over batch processing
Veracity
- Truthfulness and trustworthiness of data
- Quality
Valence
- How connected the data is among itself
- Valence increases over time (e.g. in a social network)
Value
- How will Big Data benefit the business?
Data Science: Getting Value out of Big Data
- Induce information from observations
- Big Data + Analysis + Question = Insight
- Data Science is not static: models keep improving, and new insights and data raise new questions
What Data Science is made of
Building a Big Data Strategy
- Aim
- High level goals
- What data to collect
- Maybe start with objectives and derive which data to get to reach them
- Must have support from company and stakeholders
- Policy
- Privacy and lifetime
- Curation and quality
- Interoperability and regulation
P's of Data Science
- People
- Data Science team and project stakeholders
- Purpose
- Focus on the question, not the technology
- Process
- An iterative process that lets people communicate and work together
- Acquire -> Prepare -> Analyse -> Report -> Act
- Platforms
- Hadoop or other
- Programmability
- Reusable / Reproducible / Middleware
Asking the Right Questions
- Define the problem
- e.g. How can data be used to detect machine failure
- e.g. How can we understand customers better to target marketing?
Assess the situation and context
Define goal / success criteria
Steps in the Data Science Process
- Acquire data
- Finding data
- Retrieve data
- Query data
- Prepare
- Explore data
- Pre-process
- Explore data
- Look at the data
- Understand nature of data
- Preliminary analysis
- Pre-process
- Clean
- Integrate
- Package into format
- Analyse Data
- Choose analytical approach
- Build models
- Communicate Results
- Reports
- Results
- Apply results
All of the above is iterative and may be repeated
Step 1: Acquiring Data
- Leave no stone unturned to find data
- Make use of all data
- Data comes from all places, local and remote
- Data comes in different velocities
- Data can come from databases (SQL and NoSQL)
- Data can come from Text files
- Data can come from websites and REST APIs
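As a rough sketch of acquiring data from the three kinds of sources listed above, the snippet below reads from a SQL database, a text file, and a JSON body of the sort a REST API might return. Everything here is assumed for illustration: the `readings` table, the CSV layout, and the `items` key in the JSON payload are hypothetical, and the JSON string replaces a real HTTP call so the example runs offline.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def from_database(conn):
    """Query rows from a SQL database."""
    cur = conn.execute("SELECT sensor, reading FROM readings")
    return [{"sensor": s, "reading": r} for s, r in cur]


def from_text_file(path):
    """Parse a CSV text file into records."""
    with open(path, newline="") as f:
        return [
            {"sensor": row["sensor"], "reading": float(row["reading"])}
            for row in csv.DictReader(f)
        ]


def from_json_payload(payload):
    """Parse a JSON body such as a REST API might return (shape is assumed)."""
    return json.loads(payload)["items"]


# Set up the hypothetical sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, reading REAL)")
conn.execute("INSERT INTO readings VALUES ('a', 1.5)")

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "readings.csv"
    csv_path.write_text("sensor,reading\nb,2.5\n")
    payload = '{"items": [{"sensor": "c", "reading": 3.5}]}'
    records = from_database(conn) + from_text_file(csv_path) + from_json_payload(payload)

print(records)
```

Each reader returns the same record shape, so later steps do not need to know where a record came from.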
Step 2-A: Exploring Data
- Specific characteristics of data
- Correlation
- Outliers
- Without this step, the data can't be used effectively
- Correlation graphs
- General trends (e.g. sales prices increase every year)
- Mean, Median, Standard Deviation
- Visualization (heatmap, histograms, boxplots, scatter (correlation))
Data Exploration -> Data Understanding -> Informed Analysis
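The summary statistics, correlation, and outlier checks above can be tried out with the standard library alone. This is a minimal sketch on made-up numbers: the `years` and `prices` lists are hypothetical, and the 1.5 * IQR rule used for outliers is one common convention (the one boxplots draw), not the only option.

```python
import math
import statistics

# Hypothetical data: yearly sales prices with one suspicious entry.
years = [2015, 2016, 2017, 2018, 2019, 2020]
prices = [100.0, 104.0, 109.0, 113.0, 118.0, 400.0]

mean = statistics.mean(prices)
median = statistics.median(prices)
stdev = statistics.stdev(prices)  # sample standard deviation


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def iqr_outliers(values):
    """Values more than 1.5 * IQR beyond the quartiles (the boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]


print("mean:", mean, "median:", median, "stdev:", stdev)
print("correlation(year, price):", pearson(years, prices))
print("outliers:", iqr_outliers(prices))
```

The gap between the mean and the median is itself a hint that something is off; the outlier check then points at the specific value, which also drags the correlation down compared with the first five years alone.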
Step 2-B: Pre-Processing Data
- Clean + transform the data to raise its quality
- Inconsistent values
- Duplicate records (e.g. customers with the same CPF, the Brazilian taxpayer ID)
- Missing values
- Invalid data (e.g. an invalid CEP, the Brazilian postal code)
- Outliers
- Normalization of values
- Feature selection
- Remove feature (correlated features can be removed)
- Combine features
- Add Feature (cyclic features (sin/cos))
Dimensionality Reduction (PCA)
Remove data records with missing values
Merge duplicate records (conflicting values)
Invalid values -> Best estimate (e.g. estimate a missing employee age from length of employment)
Remove outliers if they are not important for the context or the specific task
All of the above requires domain knowledge to be done correctly
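A few of the cleaning and transformation steps above can be sketched as small functions. The records and field names (`cpf`, `age`, `hour`) are hypothetical. Note two simplifications: duplicates are resolved by keeping the first record rather than merging conflicting values, and records with missing values are dropped rather than estimated.

```python
import math

# Hypothetical raw records; "cpf" stands in for any unique customer ID.
raw = [
    {"cpf": "111", "age": 30, "hour": 0},
    {"cpf": "111", "age": 30, "hour": 0},    # duplicate record
    {"cpf": "222", "age": None, "hour": 6},  # missing value
    {"cpf": "333", "age": 50, "hour": 12},
    {"cpf": "444", "age": 40, "hour": 18},
]


def drop_duplicates(records, key):
    """Keep the first record seen for each key value."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out


def drop_missing(records):
    """Remove records in which any field is None."""
    return [r for r in records if all(v is not None for v in r.values())]


def min_max(values):
    """Rescale values linearly to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def cyclic(hour, period=24):
    """Encode a cyclic feature as (sin, cos) so that 23h lands next to 0h."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


clean = drop_missing(drop_duplicates(raw, "cpf"))
ages = min_max([r["age"] for r in clean])
for r, a in zip(clean, ages):
    r["age_scaled"] = a
    r["hour_sin"], r["hour_cos"] = cyclic(r["hour"])

print(clean)
```

Which of these choices is appropriate (drop, merge, or estimate) depends on the dataset, which is where the domain knowledge comes in.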
Step 3: Analyzing Data
- Classification
- Regression
- Clustering
- Graph Analytics
- Use graph structure to find connections between entities
- Association Analysis
- A set of rules that captures associations between items or events, e.g. items that tend to occur together
- Market basket analysis (customer behavior)
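To make association analysis concrete, here is a small market basket sketch that scores rules of the form "A -> C" by support and confidence. The baskets and the two thresholds (0.5 and 0.6) are made up for illustration, and the brute-force search over item pairs only works for tiny datasets; it is not how dedicated algorithms such as Apriori scale.

```python
from itertools import combinations

# Hypothetical transactions, each a set of purchased items.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(*items):
    """Fraction of baskets that contain all the given items."""
    return sum(1 for b in baskets if set(items) <= b) / len(baskets)


def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent, consequent) / support(antecedent)


# Every pair of items that appears together in at least one basket.
pairs = {pair for b in baskets for pair in combinations(sorted(b), 2)}

MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.6  # assumed thresholds
rules = [
    (a, c, support(a, c), confidence(a, c))
    for pair in sorted(pairs)
    for a, c in (pair, pair[::-1])  # test the rule in both directions
    if support(a, c) >= MIN_SUPPORT and confidence(a, c) >= MIN_CONFIDENCE
]

for a, c, s, conf in rules:
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}")
```

Support filters out pairs that are too rare to matter, while confidence is directional: in this data everyone who buys butter also buys bread, but not the other way round.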
Step 4: Communicating Results
- What are the main results
- What added value do these results provide / what can the model add to the application
- Results vs success criteria set at the beginning of the project
- Aggregate facts to back the above
- Results could be puzzling or divergent from expectation
- Visualizations, as in the previous steps
- Tables with results from more in depth analysis
Step 5: Turning Insights into Action
- Should something in the business process change to remove a bottleneck?
- Should data be added to the application to make it more accurate?
- Should the population be segmented into more groups for more effective targeted marketing?
- Favorable results? Revisit data? Further opportunities? Real-time action?
Basic Scalable Computing Concepts
- Parallel Computer - Several nodes with specialized capabilities
- Commodity Cluster - Several nodes less specialized than parallel computers
- Data Parallelism - Jobs that share nothing can work on different parts of the dataset
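Data parallelism can be illustrated on a single machine: split the dataset, let each worker process its own part without shared state, then combine the partial results. In this sketch the worker threads merely stand in for cluster nodes, and the function names are made up. In CPython, threads do not speed up CPU-bound work like this, so the point is the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(data, n):
    """Split data into at most n contiguous, non-overlapping parts."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(part):
    """Work done on one part only; shares no state with the other parts."""
    return sum(x * x for x in part)


data = list(range(1, 1001))

# Each worker receives a different part of the dataset.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, chunks(data, 4)))

# Combining the independent partial results gives the overall answer.
total = sum(partials)
print(partials, total)
```

Because the parts are independent, the same job could be spread over more workers simply by cutting the data into more chunks.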
Programming Models for Big Data
- Split large volumes of data
- Access to data should be fast
- Distribute computations to nodes
- Replicate data partitions
- Enable adding new resources
- Optimized for specific data types
- Document
- Table
- Key-value
- Graph
- Multimedia
- Stream
- Scalability
- Fault Tolerance
- Optimized for different data types
- Facilitate a shared environment
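Two of the requirements above, splitting data into partitions and replicating those partitions for fault tolerance, can be sketched as follows. The node names, the replication factor of 2, and the placement rule (copies on consecutive nodes) are assumptions made for this example and do not describe any particular system.

```python
import zlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
REPLICATION = 2  # assumed number of copies per partition


def partition_of(key, n_partitions):
    """Deterministically map a record key to a partition number."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def replicas_of(partition, nodes=NODES, replication=REPLICATION):
    """Place the copies of a partition on consecutive nodes."""
    return [nodes[(partition + i) % len(nodes)] for i in range(replication)]


def readable_from(partition, failed):
    """Nodes that can still serve the partition after some nodes fail."""
    return [n for n in replicas_of(partition) if n not in failed]


p = partition_of("user-42", len(NODES))
print("partition:", p)
print("stored on:", replicas_of(p))
print("after node-1 fails:", readable_from(p, {"node-1"}))
```

With two copies per partition, any single node failure leaves every partition readable. A plain modulo scheme like this one reshuffles most keys when the number of partitions changes, which is one reason real systems tend to use more elaborate placement strategies.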
Map Reduce
- Map = apply()
- Reduce = Summarize()
- Dataflow scripting
- SQL-like queries
- Process graphs
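The Map and Reduce roles listed under Map Reduce above (map applies a function to every record, reduce summarizes) can be seen in the classic word count. This is a single-process sketch in plain Python, not Hadoop's API: the function names are made up, and the `shuffle` step imitates the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: apply a function to one input record, emitting (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize the values collected for one key."""
    return key, sum(values)


documents = ["big data big models", "data beats models"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only, which is what allows a framework to run both phases in parallel across nodes.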
Storm, Spark, Flink
- In memory, realtime processing
Key-values / Sparse tables
- HBase
- Cassandra
- MongoDB
- Configuration and management
When to Use Hadoop
- Future anticipated data growth
- Long term availability of data
When not to use Hadoop
- Small dataset
- Advanced Algorithms
- Not a replacement for database
- Random Data Access
- Task Level Parallelism
Big Data Modeling and Management Systems
Data Management
How do we ingest the data?
Where and how do we store it?
How can we ensure data quality?
What operations do we perform on the data?
How can these operations be efficient?
How do we scale up data volume, variety, velocity and access?
How do we keep the data secure?