Website: www.aisciences.io
- Streaming Data
- Machine Learning
- Batch Data
- ETL Pipelines
- Full Load and Replication Ongoing
- Student Data Analysis
- Employee Data Analysis
- Collaborative Filtering
- Spark Streaming
- ETL Pipeline
- Full Load and Replication Ongoing
- Speed
- Distributed
- Advanced Analytics
- Real Time
- Powerful Caching
- Fault Tolerant
- Deployment
- RDD is Spark's core abstraction and stands for Resilient Distributed Dataset
- An RDD is an immutable distributed collection of objects
- Internally, Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelization
- A transformation creates a new RDD from an existing one
- An action returns a value to the driver program after running a computation on the RDD
- All transformations in Spark are lazy
- Spark only triggers the data flow when there is an action (see the sketch below)
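- A minimal runnable sketch of lazy transformations vs. actions; the local[*] master and the sample list are assumptions for illustration:

    from pyspark import SparkContext

    # Local SparkContext for illustration; on a cluster this would point at the cluster master
    sc = SparkContext("local[*]", "rdd_basics")

    # parallelize() builds an RDD from a Python collection
    rdd = sc.parallelize([1, 2, 3, 4, 5])

    # map() is a transformation: it is lazy, so nothing executes yet
    doubled = rdd.map(lambda x: x * 2)

    # collect() is an action: it triggers the computation and returns the result to the driver
    print(doubled.collect())  # [2, 4, 6, 8, 10]

    sc.stop()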
- map() is used as a mapper of data from one state to another (example below)
- It will create a new RDD
- rdd.map(lambda x: x.split())
- Quiz
- Quiz solution: https://github.com/sawant98d/PySpark/blob/master/RDD%20Map.py
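- A minimal runnable sketch of map(); the local SparkContext and the sample sentences are assumptions for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_map")

    # Two sample sentences (assumed data)
    rdd = sc.parallelize(["hello spark", "hello python"])

    # map() applies the function to every element, producing exactly one output element per input
    mapped = rdd.map(lambda x: x.split())

    print(mapped.collect())  # [['hello', 'spark'], ['hello', 'python']]

    sc.stop()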
- flatMap() is used as a mapper of data and explodes the data before the final output (example below)
- It will create a new RDD
- rdd.flatMap(lambda x: x.split())
- https://github.com/sawant98d/PySpark/blob/master/flat_map.py
- https://github.com/sawant98d/PySpark/blob/master/flat_map.ipynb
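- A minimal runnable sketch of flatMap() on the same assumed sample sentences, showing how the words are flattened into one list rather than kept as one list per input element:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_flatmap")

    rdd = sc.parallelize(["hello spark", "hello python"])

    # flatMap() applies the function to every element and then flattens ("explodes") the results
    flat = rdd.flatMap(lambda x: x.split())

    print(flat.collect())  # ['hello', 'spark', 'hello', 'python']

    sc.stop()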
- filter() is used to remove elements from the RDD, keeping only those that satisfy a predicate (example below)
- It will create a new RDD
- rdd.filter(lambda x: x != 12)
- Quiz
- Quiz solution: https://github.com/sawant98d/PySpark/blob/master/filter_quiz.py
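- A minimal runnable sketch of filter(); the sample numbers are an assumption for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_filter")

    rdd = sc.parallelize([10, 11, 12, 13, 12])

    # filter() keeps only the elements for which the predicate is True,
    # so every 12 is dropped here
    filtered = rdd.filter(lambda x: x != 12)

    print(filtered.collect())  # [10, 11, 13]

    sc.stop()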
- distinct() is used to get the distinct elements in an RDD (example below)
- It will create a new RDD
- rdd.distinct()
- https://github.com/sawant98d/PySpark/blob/master/rdd_distinct.ipynb
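- A minimal runnable sketch of distinct() on an assumed sample list with duplicates:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_distinct")

    rdd = sc.parallelize([1, 2, 2, 3, 3, 3])

    # distinct() drops duplicate elements (the order of the result is not guaranteed)
    print(rdd.distinct().collect())  # e.g. [1, 2, 3]

    sc.stop()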
- groupByKey() is used to create groups based on keys in an RDD (example below)
- For groupByKey() to work properly, the data must be in the format (k, v), (k, v), (k2, v), (k2, v)
- Example: ("Apple", 1), ("Ball", 1), ("Apple", 1)
- It will create a new RDD
- rdd.groupByKey()
- mapValues(list) is usually used to view the grouped data
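- A minimal runnable sketch of groupByKey() with mapValues(list), using the ("Apple", 1) style pairs from the example above:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_groupbykey")

    # (key, value) pairs, as described above
    rdd = sc.parallelize([("Apple", 1), ("Ball", 1), ("Apple", 1)])

    # groupByKey() gathers all values that share a key;
    # mapValues(list) converts the grouped iterable into a plain list for display
    grouped = rdd.groupByKey().mapValues(list)

    print(grouped.collect())  # [('Apple', [1, 1]), ('Ball', [1])] (order may vary)

    sc.stop()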
- reduceByKey() is used to combine data based on keys in an RDD (example below)
- For reduceByKey() to work properly, the data must be in the format (k, v), (k, v), (k2, v), (k2, v)
- Example: ("Apple", 1), ("Ball", 1), ("Apple", 1)
- It will create a new RDD
- rdd.reduceByKey(lambda x, y: x + y)
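- A minimal runnable sketch of reduceByKey() on the same style of (key, value) pairs, summing the counts per key:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd_reducebykey")

    rdd = sc.parallelize([("Apple", 1), ("Ball", 1), ("Apple", 1)])

    # reduceByKey() merges the values of each key with the given function,
    # here summing the counts per key
    counts = rdd.reduceByKey(lambda x, y: x + y)

    print(counts.collect())  # [('Apple', 2), ('Ball', 1)] (order may vary)

    sc.stop()

- For aggregations like this, reduceByKey() is generally preferred over groupByKey() followed by a manual sum, because values are combined on each partition before the shuffle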