Project For Big Data

This project is simply a MapReduce homework assignment for university.

About Dataset

Context

This dataset is related to red variants of Spanish wines. The dataset describes several popularity and description metrics and their effect on the wine's quality. The dataset can be used for classification or regression tasks. The classes are ordered and not balanced (i.e. the quality goes from almost 5 to 4 points). The task is to predict either the quality of the wine or its price using the given data.

Content

The dataset contains 7500 different types of red wines from Spain with 11 features that describe their price, rating, and even some flavor description. The data was collected by me using web scraping from different sources (from specialized wine pages to supermarkets). Please acknowledge the hard work to obtain and create this dataset; you can upvote it if you find it useful for your projects :)

If the dataset becomes popular I will probably try to create a bigger version with wines from other countries and a wider spectrum of ratings.

Attribute Information

winery:: Winery name
wine:: Name of the wine
year:: Year in which the grapes were harvested
rating:: Average rating given to the wine by the users [from 1-5]
num_reviews:: Number of users that reviewed the wine
country:: Country of origin [Spain]
region:: Region of the wine
price:: Price in euros [€]
type:: Wine variety
body:: Body score, defined as the richness and weight of the wine in your mouth [from 1-5]
acidity:: Acidity score, defined as wine's “pucker” or tartness; it's what makes a wine refreshing and your tongue salivate and want another sip [from 1-5]
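
To make the attribute list concrete, here is a minimal sketch of parsing one record into typed fields. It assumes each line is comma-separated with the columns in the order listed above, and that missing numeric values show up as empty strings or NA. The names WineRecord and parse_line, and the sample line, are hypothetical and not part of this project.

```python
import csv
from dataclasses import dataclass
from typing import Optional

# Column order is an assumption: it follows the attribute list above.
FIELDS = ["winery", "wine", "year", "rating", "num_reviews", "country",
          "region", "price", "type", "body", "acidity"]


def _to_float(text: str) -> Optional[float]:
    """Convert a numeric column to float; return None for missing or bad values."""
    text = text.strip()
    if text in ("", "NA"):
        return None
    try:
        return float(text)
    except ValueError:
        return None


@dataclass
class WineRecord:  # hypothetical helper type
    winery: str
    wine: str
    year: str  # kept as text in case some entries are not plain numbers
    rating: Optional[float]
    num_reviews: Optional[float]
    country: str
    region: str
    price: Optional[float]
    type: str
    body: Optional[float]
    acidity: Optional[float]


def parse_line(line: str) -> Optional[WineRecord]:
    """Parse one CSV line; return None if the column count is not 11."""
    row = next(csv.reader([line]), [])
    if len(row) != len(FIELDS):
        return None
    raw = dict(zip(FIELDS, row))
    return WineRecord(
        winery=raw["winery"], wine=raw["wine"], year=raw["year"],
        rating=_to_float(raw["rating"]),
        num_reviews=_to_float(raw["num_reviews"]),
        country=raw["country"], region=raw["region"],
        price=_to_float(raw["price"]), type=raw["type"],
        body=_to_float(raw["body"]), acidity=_to_float(raw["acidity"]),
    )


if __name__ == "__main__":
    # Made-up sample line, not taken from the dataset.
    print(parse_line("Example Winery,Example Tinto,2015,4.5,120,Spain,Rioja,25.5,Rioja Red,4,3"))
```

Using the csv module rather than a plain split keeps quoted fields that contain commas intact, which may or may not matter for the actual file.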

Usage

As far as I know, you will need to have Hadoop installed and set up correctly to run this.

First, start all Hadoop services using start-all.sh, then switch to the directory containing the mapper and reducer scripts with cd <path>.
Then simply copy and paste these commands (remember to delete or rename any existing local part-00000 file first, because the final command will not overwrite it).

hdfs dfs -rm -r /wineoutput
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar \
-file ./mapper.py -mapper 'python3 mapper.py' \
-file ./reducer.py -reducer 'python3 reducer.py' \
-input  /myinput/wine.txt \
-output /wineoutput
hadoop fs -ls /wineoutput
hdfs dfs -get /wineoutput/part-00000
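
For readers unfamiliar with Hadoop Streaming, the sketch below illustrates the general mapper/reducer pattern the commands above rely on: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer aggregates each run of identical keys. It is not the project's actual mapper.py or reducer.py. The chosen task (average rating per region), the column positions, the plain comma split, and the sample lines are all assumptions made for illustration.

```python
from itertools import groupby

# Assumed column positions, following the attribute list in this README.
RATING_COL = 3
REGION_COL = 6
NUM_COLS = 11


def map_lines(lines):
    """Mapper step: emit 'region<TAB>rating' for each parsable input line."""
    for line in lines:
        cols = line.rstrip("\n").split(",")  # assumes no quoted commas
        if len(cols) != NUM_COLS:
            continue
        try:
            rating = float(cols[RATING_COL])
        except ValueError:
            continue  # skips a header row or a missing rating
        yield f"{cols[REGION_COL]}\t{rating}"


def reduce_lines(sorted_lines):
    """Reducer step: input must be sorted by key; emit 'region<TAB>average'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for region, group in groupby(pairs, key=lambda kv: kv[0]):
        ratings = [float(value) for _, value in group]
        yield f"{region}\t{sum(ratings) / len(ratings):.2f}"


if __name__ == "__main__":
    # Made-up sample lines; sorted() stands in for Hadoop's shuffle and sort.
    sample = [
        "W1,A,2015,4.5,100,Spain,Rioja,20.0,Rioja Red,4,3",
        "W2,B,2016,4.0,50,Spain,Rioja,15.0,Rioja Red,4,3",
        "W3,C,2014,4.8,30,Spain,Toro,90.0,Toro Red,5,3",
    ]
    for out in reduce_lines(sorted(map_lines(sample))):
        print(out)
```

In a real streaming job the two functions would live in separate scripts that read sys.stdin and print to stdout. The same idea can be checked locally without Hadoop by piping the input through the mapper, sort, and the reducer in a shell.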