- Install Python and Pip
- Install dependencies: `pip install tweepy kafka-python pyspark`
- Make sure you have Apache Spark installed.
- Get your API keys from [Twitter Developers](https://dev.twitter.com/) and put them in `twitter_config.py` (see the sketch after this list).
- Set `docker` in your `/etc/hosts` to point to your machine.
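A `twitter_config.py` holding the four Twitter credentials might look like the sketch below; the variable names are assumptions, so match them to whatever `twitter_stream_producer.py` actually imports:

```python
# twitter_config.py -- a minimal sketch; the variable names here are
# assumptions, not necessarily the repo's actual identifiers.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"
```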
First, run the Kafka server with the following command:

```
docker-compose up -d
```
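If you want to verify the broker is up before streaming, here is a quick check with kafka-python, assuming the broker is exposed at `docker:9092` (per the hosts entry above):

```python
# Quick broker connectivity check. Assumes the Kafka broker from
# docker-compose is reachable at docker:9092 (the /etc/hosts alias).
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="docker:9092")
print(consumer.topics())  # lists the broker's topics; raises if unreachable
consumer.close()
```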
Then, fire up the stream source:

```
python twitter_stream_producer.py
```
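In outline, the producer connects tweepy's streaming API to a Kafka topic. The sketch below is an approximation of that idea, not the repo's exact code; the topic name `tweets`, the tracked keywords, and the `twitter_config` variable names are assumptions:

```python
# Rough sketch of a tweepy 3.x -> Kafka producer; the topic name, tracked
# keywords, and twitter_config variable names are assumptions.
from kafka import KafkaProducer
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

import twitter_config

producer = KafkaProducer(bootstrap_servers="docker:9092")

class KafkaListener(StreamListener):
    def on_data(self, raw_data):
        # Forward the raw tweet JSON to Kafka as-is.
        producer.send("tweets", raw_data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        return False  # stop on errors such as rate limiting (420)

auth = OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET)
auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET)
Stream(auth, KafkaListener()).filter(track=["#spark", "#programming"])
```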
Now submit `twitter_stream_consumer.py` to `spark-submit`:
```
$SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 twitter_stream_consumer.py
```
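The consumer is a Structured Streaming job: read the Kafka topic, split tweet text into words, keep the hashtags, and count them. Here is a sketch of that pipeline, with the topic name and broker address as assumptions:

```python
# Sketch of a Structured Streaming hashtag counter; the topic name
# ("tweets") and broker address (docker:9092) are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("TwitterHashtagCount").getOrCreate()

tweets = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "docker:9092")
          .option("subscribe", "tweets")
          .load())

# Kafka values arrive as bytes: cast to string, split on whitespace,
# and keep only tokens that look like hashtags.
words = tweets.select(explode(split(col("value").cast("string"), " ")).alias("hashtag"))
counts = words.filter(col("hashtag").startswith("#")).groupBy("hashtag").count()

# "complete" output mode re-prints the full running aggregate each trigger.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```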
You will start to see the most frequently used hashtags in the tweets from your open stream, like this:
```
+------------------+-----+
|           hashtag|count|
+------------------+-----+
|     #programming…|    1|
|          #Spark…"|    1|
|     #RunWayv_CHEN|    1|
|          #Jupyter|    1|
|            #Learn|    1|
|       #opensource|    1|
|           #Apache|    1|
+------------------+-----+
```