Analyzing Aadhaar dataset using MapReduce and Spark
- IDE
- Apache Maven 3.x
- JVM 6 or 7
- Count the number of identities (Aadhaar) generated in each state
- Count the number of identities (Aadhaar) generated by each Enrollment Agency
- Find the top 10 districts with the most identities generated for both Male and Female
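
The third question above can be read as a group-and-rank step: total the identities per district for one gender, then keep the districts with the largest totals. The following is a minimal plain-Java sketch of that reading, not the repository's code. The class `TopDistrictsSketch`, the `Row` type, and the sample values in `main` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopDistrictsSketch {
    // Hypothetical, already-parsed view of one dataset row.
    static final class Row {
        final String district;
        final String gender;
        final int generated;

        Row(String district, String gender, int generated) {
            this.district = district;
            this.gender = gender;
            this.generated = generated;
        }
    }

    // Returns up to n district names with the highest totals for the given gender.
    static List<String> topDistricts(List<Row> rows, String gender, int n) {
        // Sum identities per district, keeping only rows of the requested gender.
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Row r : rows) {
            if (!r.gender.equals(gender)) {
                continue;
            }
            Integer prev = totals.get(r.district);
            totals.put(r.district, prev == null ? r.generated : prev + r.generated);
        }

        // Rank by total, descending; break ties by district name for a stable result.
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(totals.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byTotal = b.getValue().compareTo(a.getValue());
                return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
            }
        });

        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset.
        List<Row> rows = Arrays.asList(
                new Row("Pune", "M", 10),
                new Row("Thane", "M", 7),
                new Row("Pune", "F", 3),
                new Row("Thane", "F", 9));
        System.out.println("Top male districts: " + topDistricts(rows, "M", 10));
        System.out.println("Top female districts: " + topDistricts(rows, "F", 10));
    }
}
```

Calling `topDistricts` once per gender keeps the two rankings separate. If the intended question is a single ranking over both genders combined, the gender filter would be dropped instead.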
The repository contains two projects: MRAadhaarAnalysis (MapReduce) and SparkAadhaarAnalysis (Spark).
- com/stdatalabs/SparkAadhaarAnalysis
  - UIDStats.scala -- Spark code to analyze the Aadhaar dataset
- com/stdatalabs/MRAadhaarAnalysis
  - NumUIDMapper.java -- Filters out the header and writes (State, Aadhaar_generated) to the mapper output
  - NumUIDReducer.java -- Aggregates the values for each State key received from the mapper and outputs the number of identities generated per state
  - SortMapper.java -- Receives the output of the previous MR job and swaps each (K, V) pair
  - SortComparator.java -- Sorts the mapper output in descending order before passing it to the reducer
  - SortReducer.java -- Swaps each (K, V) pair back into (State, count) and writes it to the output file
  - Driver.java -- Driver program for the MapReduce jobs
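
The two-job flow described by the MapReduce files above can be sketched in plain Java, without Hadoop, to show how the data moves: drop the header, emit (State, Aadhaar_generated), sum per state, swap to (count, State), sort in descending order, and swap back. This is a minimal illustration, not the repository's code. The class name `StateCountSketch`, the column positions (State at index 2, Aadhaar_generated at index 8), the header check, and the sample rows are all assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StateCountSketch {
    // Assumed column positions; adjust them to the real CSV layout.
    static final int STATE_COL = 2;
    static final int GENERATED_COL = 8;

    // Job 1: stands in for NumUIDMapper + NumUIDReducer.
    static Map<String, Integer> countByState(List<String> lines) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (String line : lines) {
            // Naive split: does not handle quoted fields that contain commas.
            String[] fields = line.split(",");
            if (fields.length <= GENERATED_COL) {
                continue; // skip malformed rows
            }
            int generated;
            try {
                generated = Integer.parseInt(fields[GENERATED_COL].trim());
            } catch (NumberFormatException e) {
                continue; // assumed header check: the header has text in this column
            }
            String state = fields[STATE_COL].trim();
            Integer prev = totals.get(state);
            totals.put(state, prev == null ? generated : prev + generated);
        }
        return totals;
    }

    // Job 2: stands in for SortMapper + SortComparator + SortReducer.
    static List<Map.Entry<String, Integer>> sortByCountDesc(Map<String, Integer> totals) {
        // SortMapper step: swap (State, count) into (count, State).
        List<Map.Entry<Integer, String>> swapped = new ArrayList<Map.Entry<Integer, String>>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            swapped.add(new SimpleEntry<Integer, String>(e.getValue(), e.getKey()));
        }
        // SortComparator step: order by count, descending (ties broken by state name).
        Collections.sort(swapped, new Comparator<Map.Entry<Integer, String>>() {
            public int compare(Map.Entry<Integer, String> a, Map.Entry<Integer, String> b) {
                int byCount = b.getKey().compareTo(a.getKey());
                return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
            }
        });
        // SortReducer step: swap back into (State, count).
        List<Map.Entry<String, Integer>> result = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<Integer, String> e : swapped) {
            result.add(new SimpleEntry<String, Integer>(e.getValue(), e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset.
        List<String> lines = Arrays.asList(
                "Registrar,Enrolment Agency,State,District,Sub District,Pin Code,Gender,Age,Aadhaar generated,Enrolment Rejected",
                "R1,A1,Kerala,D1,S1,682001,M,30,5,0",
                "R1,A2,Goa,D2,S2,403001,F,41,3,1",
                "R2,A1,Kerala,D3,S3,682002,F,25,2,0");
        for (Map.Entry<String, Integer> e : sortByCountDesc(countByState(lines))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The swap exists because MapReduce sorts on keys only: making the count the key lets the framework's shuffle do the ordering, and the final swap restores the (State, count) shape for the output.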
- A comparison of the MapReduce and Apache Spark DataFrames code for analyzing the Aadhaar dataset is discussed in the blog post -- MapReduce VS Spark - Aadhaar dataset analysis