sparkNLP-elasticsearch

Twitter sentiment analysis using Spark and Stanford CoreNLP, with visualization using Elasticsearch and Kibana.


SparkTwitterPopularHashTags

A Spark Streaming project that analyzes popular hashtags in live Twitter data streams. Data is ingested from several input sources (the Twitter source, Flume, and Kafka) and processed downstream using Spark Streaming.
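As a rough illustration of the per-batch logic behind "popular hashtags", here is a minimal sketch in plain Scala, with no Spark dependency. The object and method names (`HashtagCountSketch`, `extractHashtags`, `topHashtags`) are hypothetical and are not taken from this repository. In a Spark Streaming job the same steps would typically be expressed on a DStream with `flatMap`, `filter`, and `reduceByKeyAndWindow`; the classes listed below may do it differently.

```scala
// Hypothetical sketch: names are illustrative, not from this repository.
object HashtagCountSketch {

  // Split a tweet on whitespace and keep only tokens that look like hashtags.
  // A bare "#" with nothing after it is ignored.
  def extractHashtags(text: String): Seq[String] =
    text.split("\\s+").filter(w => w.startsWith("#") && w.length > 1).toSeq

  // Count every hashtag across a batch of tweets and return the n most
  // frequent ones, highest count first (ties broken alphabetically).
  def topHashtags(tweets: Seq[String], n: Int): Seq[(String, Int)] =
    tweets
      .flatMap(extractHashtags)
      .groupBy(identity)
      .map { case (tag, occurrences) => (tag, occurrences.size) }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }
      .take(n)
}
```

Calling `HashtagCountSketch.topHashtags(tweets, 10)` on a batch of tweet texts returns up to ten `(hashtag, count)` pairs. Matching is case-sensitive in this sketch, so `#Spark` and `#spark` are counted separately.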

Requirements

  • IDE
  • Apache Maven 3.x
  • JVM 6 or 7

General Info

The source folder is organized into two packages: Kafka and Streaming. Each class in the Streaming package explores a different approach to consuming data from the Twitter source. The classes are listed below:

  • com/stdatalabs/Kafka
    • KafkaTwitterProducer.java -- A Kafka producer that publishes Twitter data to a Kafka broker
  • com/stdatalabs/Streaming
    • SparkPopularHashTags.scala -- Receives data from Twitter datasource
    • FlumeSparkPopularHashTags.scala -- Receives data from Flume Twitter producer
    • KafkaSparkPopularHashTags.scala -- Receives data from Kafka Producer
    • RecoverableKafkaPopularHashTags.scala -- Spark-Kafka receiver-based approach. Ensures at-least-once semantics
    • KafkaDirectPopularHashTags.scala -- Spark-Kafka Direct approach. Ensures exactly-once semantics
  • TwitterAvroSource.conf -- Flume conf for running Twitter avro source

Description

More articles on the Hadoop technology stack at stdatalabs