This project is a MapReduce homework assignment for university.
This dataset covers red variants of Spanish wines. It describes several popularity and description metrics and their effect on the wine's quality. The dataset can be used for classification or regression tasks. The classes are ordered and not balanced (i.e. the quality ratings range only from 4 to almost 5 points). The task is to predict either the quality of a wine or its price using the given data.
The dataset contains 7500 different types of red wines from Spain with 11 features that describe their price, rating, and even some flavor description. The data was collected by me using web scraping from different sources (from specialized wine pages to supermarkets). Please acknowledge the hard work that went into obtaining and creating this dataset; you can upvote it if you find it useful for your projects :)
If the dataset becomes popular I will probably try to create a bigger version with wines from other countries and a wider spectrum of ratings.
- winery:
  - Winery name
- wine:
  - Name of the wine
- year:
  - Year in which the grapes were harvested
- rating:
  - Average rating given to the wine by the users [from 1-5]
- num_reviews:
  - Number of users that reviewed the wine
- country:
  - Country of origin [Spain]
- region:
  - Region of the wine
- price:
  - Price in euros [€]
- type:
  - Wine variety
- body:
  - Body score, defined as the richness and weight of the wine in your mouth [from 1-5]
- acidity:
  - Acidity score, defined as wine's “pucker” or tartness; it's what makes a wine refreshing and your tongue salivate and want another sip [from 1-5]
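As a rough illustration of how one record with these 11 features could be read in Python, here is a minimal sketch. It assumes each line of the input file is comma-separated with the columns in the order listed above; the real file layout may differ, and `parse_record`, `to_float`, and the sample line are made up for this example.

```python
import csv
from typing import Optional

# Assumed column order (taken from the feature list above; the real file may differ).
COLUMNS = ["winery", "wine", "year", "rating", "num_reviews", "country",
           "region", "price", "type", "body", "acidity"]

NUMERIC = ("rating", "num_reviews", "price", "body", "acidity")


def to_float(text: str) -> Optional[float]:
    """Return the value as a float, or None if it is empty or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_record(line: str) -> Optional[dict]:
    """Parse one comma-separated line into a dict keyed by column name."""
    fields = next(csv.reader([line]), None)
    if fields is None or len(fields) != len(COLUMNS):
        return None  # blank or malformed line
    record = dict(zip(COLUMNS, (f.strip() for f in fields)))
    for name in NUMERIC:
        record[name] = to_float(record[name])
    return record


# Made-up sample line, for illustration only.
sample = "Example Winery,Example Wine,2015,4.5,120,Espana,Rioja,35.0,Rioja Red,4,3"
print(parse_record(sample))
```

The year is left as text on purpose, so that any non-numeric entry in that column is preserved rather than dropped.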
As far as I know, you will need to have Hadoop installed and set up correctly to run this.
- First, start all Hadoop services with `start-all.sh`, then switch to the directory containing the mapper and reducer scripts with `cd <path>`.
- Then copy and paste the commands below. Before running them, delete or rename any existing local `part-00000` file, because the last command downloads a new copy into the current directory:
hdfs dfs -rm -r /wineoutput
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar \
-file ./mapper.py -mapper 'python3 mapper.py' \
-file ./reducer.py -reducer 'python3 reducer.py' \
-input /myinput/wine.txt \
-output /wineoutput
hadoop fs -ls /wineoutput
hdfs dfs -get /wineoutput/part-00000
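The contents of `mapper.py` and `reducer.py` are not shown in this README, so the following is only a sketch of what a Hadoop Streaming job of this shape might look like. It assumes comma-separated input with `region` and `price` in the 7th and 8th columns, and it computes the average price per region. The functions `map_lines` and `reduce_lines`, the column positions, and the sample rows are all hypothetical.

```python
import csv
from itertools import groupby

# Assumed 0-based column positions; adjust to the real file layout.
REGION_COL, PRICE_COL = 6, 7


def map_lines(lines):
    """Mapper step: emit 'region<TAB>price' for every parseable row."""
    for fields in csv.reader(lines):
        if len(fields) <= PRICE_COL:
            continue  # blank or short line
        try:
            price = float(fields[PRICE_COL])
        except ValueError:
            continue  # header row or missing price
        yield f"{fields[REGION_COL].strip()}\t{price}"


def reduce_lines(sorted_lines):
    """Reducer step: average the prices of each run of identical keys.

    Hadoop Streaming delivers mapper output sorted by key, so equal
    regions arrive next to each other and groupby can collect them.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for region, group in groupby(pairs, key=lambda kv: kv[0]):
        prices = [float(value) for _, value in group]
        yield f"{region}\t{sum(prices) / len(prices):.2f}"


# Made-up rows to simulate the map -> sort -> reduce pipeline locally.
rows = [
    "W1,A,2015,4.5,100,Espana,Rioja,10.0,Red,4,3",
    "W2,B,2016,4.3,50,Espana,Priorato,30.0,Red,5,3",
    "W3,C,2014,4.4,70,Espana,Rioja,20.0,Red,4,3",
]
results = list(reduce_lines(sorted(map_lines(rows))))
for line in results:
    print(line)
```

In a real streaming job the two steps would live in separate scripts, each reading `sys.stdin` and printing its output lines. The same logic can be checked without Hadoop by piping the input file through the mapper, `sort`, and the reducer on the command line.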
Any feedback is appreciated.