- Java 8
- Maven (built with 3.0.5)
- Hadoop (built with 2.4.0)
The dataset contains tweets related to the 2012 presidential election. It is available at this DataHub repository.
The next step is to store the JSON file in HDFS. Let's create a folder named tweets/
and put our JSON file inside it.
hadoop fs -mkdir tweets
hadoop fs -put <path-to-JSON-file> tweets
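The same upload can also be done programmatically through the HDFS Java API (FileSystem). The sketch below is illustrative only; the class name and the use of a command-line argument for the local file path are assumptions, not part of the project.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper (not part of the project): uploads a local JSON file to tweets/ on HDFS.
public class UploadTweets {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -mkdir tweets && hadoop fs -put <path-to-JSON-file> tweets
        fs.mkdirs(new Path("tweets"));
        fs.copyFromLocalFile(new Path(args[0]), new Path("tweets"));
        fs.close();
    }
}
```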
The package build command should generate a self-contained jar that includes all library dependencies and a manifest.
mvn clean
mvn package
From the project root, just run:
hadoop jar target/twitter_analysis-1.0-SNAPSHOT-jar-with-dependencies.jar tweets/cache-1000000.json tweets/out
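For reference, a driver for this kind of invocation typically wires the two arguments to the job's input and output paths roughly as sketched below. The class name, job name, and reducer count are illustrative assumptions, not the project's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver sketch (not the project's actual class).
public class TwitterAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "twitter analysis");
        job.setJarByClass(TwitterAnalysisDriver.class);

        // args[0] = tweets/cache-1000000.json, args[1] = tweets/out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // job.setMapperClass(...) / job.setReducerClass(...) would point at the
        // project's own analysis classes here.

        // Each reducer writes one part-r-* file under tweets/out.
        job.setNumReduceTasks(4); // illustrative value

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```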
Once the execution is complete, the folder tweets/out
should have been created, containing the output files.
hadoop fs -ls tweets/out
The number of resulting chunks should match the number of reducers configured in the application. The next command merges all the parts into a single CSV file.
hadoop fs -getmerge -nl tweets/out stats.csv