/aadhaar-dataset-analysis

An analysis of the Aadhaar dataset using MapReduce and Spark

Primary language: Java

MapReduce vs. Spark - Aadhaar dataset analysis

Analyzing the Aadhaar dataset using MapReduce and Spark

Requirements

  • IDE
  • Apache Maven 3.x
  • JVM 6 or 7

Objectives

  • Count the number of identities (Aadhaar) generated in each state
  • Count the number of identities (Aadhaar) generated by each enrollment agency
  • Find the top 10 districts with the most identities generated, for both male and female residents
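
The three objectives all reduce to "sum the generated-identities column per key, then optionally rank". The following is a minimal in-memory sketch in plain Java, not the repository's MapReduce or Spark code. The class name `AadhaarObjectives`, the column positions, and the use of a naive comma split (which would break on quoted fields containing commas) are all assumptions made for illustration; check them against the header of your copy of the dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AadhaarObjectives {
    // ASSUMED column positions (0-based); verify against the real CSV header.
    static final int AGENCY = 1, STATE = 2, DISTRICT = 3, GENDER = 6, GENERATED = 8;

    /**
     * Sums the "Aadhaar generated" column for each distinct value in keyCol.
     * If gender is non-null, only rows with that gender value are counted.
     */
    static Map<String, Long> sumBy(List<String> rows, int keyCol, String gender) {
        Map<String, Long> totals = new TreeMap<String, Long>();
        for (String row : rows) {
            String[] f = row.split(",");
            if (f.length <= GENERATED) continue;            // skip short or malformed rows
            if (gender != null && !gender.equals(f[GENDER].trim())) continue;
            long n;
            try {
                n = Long.parseLong(f[GENERATED].trim());
            } catch (NumberFormatException e) {
                continue;                                   // non-numeric count, e.g. the header row
            }
            String key = f[keyCol].trim();
            Long prev = totals.get(key);
            totals.put(key, prev == null ? n : prev + n);
        }
        return totals;
    }

    /** Returns the n entries with the largest totals, largest first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<Map.Entry<String, Long>>(totals.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                return b.getValue().compareTo(a.getValue()); // descending by total
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Made-up sample rows in the assumed layout.
        List<String> rows = Arrays.asList(
                "R1,AgencyA,Kerala,Kochi,S,682001,M,30,2",
                "R1,AgencyB,Kerala,Kochi,S,682001,F,25,3");
        System.out.println(sumBy(rows, STATE, null));            // objective 1
        System.out.println(sumBy(rows, AGENCY, null));           // objective 2
        System.out.println(topN(sumBy(rows, DISTRICT, "F"), 10)); // objective 3, female only
    }
}
```

The third objective is read here as one top-10 list per gender, so `topN` would be called once with `"M"` and once with `"F"`; if a combined ranking is intended, pass `null` as the gender instead.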

General Info

The repository contains two projects: MRAadhaarAnalysis (MapReduce) and SparkAadhaarAnalysis (Spark).

  • com/stdatalabs/SparkAadhaarAnalysis
    • UIDStats.scala -- Spark code to analyze Aadhaar dataset
  • com/stdatalabs/MRAadhaarAnalysis
    • NumUIDMapper.java -- Filters out the header row and writes (State, Aadhaar_generated) pairs to the mapper output
    • NumUIDReducer.java -- Sums the values received for each State key from the mapper and outputs the state-wise count of identities generated
    • SortMapper.java -- Reads the output of the previous MR job and swaps each (K, V) pair
    • SortComparator.java -- Sorts the mapper output in descending order before it is passed to the reducer
    • SortReducer.java -- Swaps each (K, V) pair back into (State, count) and writes it to the output file
    • Driver.java -- Driver program for MapReduce jobs
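
The two chained jobs described above can be sketched in plain Java without any Hadoop dependency. This is an illustrative in-memory imitation of the data flow, not the repository's code: the class `TwoJobPipelineSketch`, its method names, and the column indices passed in are hypothetical, and the input is assumed to be well-formed comma-separated rows with the header on the first line. The swap exists because MapReduce orders records by key during the shuffle, so sorting by count requires making the count the key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoJobPipelineSketch {

    /** Job 1: plays the roles of NumUIDMapper (emit) and NumUIDReducer (sum). */
    static Map<String, Long> countJob(List<String> lines, int stateCol, int generatedCol) {
        Map<String, Long> totals = new HashMap<String, Long>();
        for (int i = 1; i < lines.size(); i++) {            // index 0 is the header: filtered out
            String[] f = lines.get(i).split(",");
            String state = f[stateCol].trim();               // mapper output key
            long n = Long.parseLong(f[generatedCol].trim()); // mapper output value
            Long prev = totals.get(state);
            totals.put(state, prev == null ? n : prev + n);  // reducer: sum per state
        }
        return totals;
    }

    /** Job 2: plays the roles of SortMapper, SortComparator and SortReducer. */
    static LinkedHashMap<String, Long> sortJob(Map<String, Long> totals) {
        // SortMapper role: swap (State, count) into (count, State).
        List<Map.Entry<Long, String>> swapped = new ArrayList<Map.Entry<Long, String>>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            swapped.add(new AbstractMap.SimpleEntry<Long, String>(e.getValue(), e.getKey()));
        }
        // SortComparator role: order keys (counts) from largest to smallest.
        Collections.sort(swapped, new Comparator<Map.Entry<Long, String>>() {
            public int compare(Map.Entry<Long, String> a, Map.Entry<Long, String> b) {
                return b.getKey().compareTo(a.getKey());
            }
        });
        // SortReducer role: swap back into (State, count), keeping the sorted order.
        LinkedHashMap<String, Long> out = new LinkedHashMap<String, Long>();
        for (Map.Entry<Long, String> e : swapped) {
            out.put(e.getValue(), e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows; column 0 is the state and column 1 the count in this toy layout.
        List<String> lines = Arrays.asList(
                "State,Aadhaar generated",
                "Goa,1",
                "Kerala,2",
                "Kerala,3");
        System.out.println(sortJob(countJob(lines, 0, 1)));
    }
}
```

In a real Hadoop job the descending order would typically be supplied through a custom comparator registered on the job, and the driver would run the second job on the first job's output directory; the sketch only mirrors the order of operations.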

Description

More articles on the Hadoop technology stack are available at stdatalabs.