Wine Quality Analysis with Python, AWS and Tableau

Wine quality is an important factor for both producers and consumers. The goal of this analysis is to explore which factors have the most impact on wine quality for a given dataset. The dataset includes a variety of features that may influence wine quality, such as pH, alcohol content, and the presence of certain chemicals. By analyzing the data, we aim to identify the key factors that contribute to high wine quality, as well as any potential correlations or trends. This analysis will be useful for wine producers who want to improve the quality of their wines, as well as for consumers who want to make informed purchasing decisions.

Check out this link for the final deliverable:

web link

Data Description

The data for this analysis was sourced from the UCI Machine Learning Repository. It includes 12 variables and 6495 rows. The variables, such as pH, alcohol content, and the presence of certain chemicals, may influence the target, wine quality (scored 0-10).

Project Structure

  • Extract data from UCI ML Repo and store data to an S3 bucket on AWS
  • Use Python to identify the top 3 variables that have the strongest relationship with wine quality
  • Transform data using AWS Glue and Athena with SQL
  • Load cleaned data from Athena to Tableau to perform analysis and make dashboards
  • Host a website on S3 to showcase analysis results and dashboards

1. Data Extraction and storage

In this section, the data for red wine is stored in the "red" bucket and the data for white wine in the "white" bucket.
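
The upload step could be sketched roughly as follows, assuming boto3 and the UCI file names `winequality-red.csv` and `winequality-white.csv`. The bucket names are placeholders (real S3 bucket names must be globally unique, so "red" and "white" stand in for whatever names were actually used), and `plan_uploads` and `upload_wine_files` are hypothetical helpers, not the project's actual code.

```python
from pathlib import Path

# Assumed mapping from local UCI file name to (bucket, key).
# The bucket names are placeholders, not the project's real buckets.
UPLOAD_PLAN = {
    "winequality-red.csv": ("red", "winequality-red.csv"),
    "winequality-white.csv": ("white", "winequality-white.csv"),
}


def plan_uploads(data_dir):
    """Return (local_path, bucket, key) tuples for every planned file."""
    base = Path(data_dir)
    return [
        (str(base / name), bucket, key)
        for name, (bucket, key) in UPLOAD_PLAN.items()
    ]


def upload_wine_files(data_dir, s3_client=None):
    """Upload each CSV to its bucket; needs AWS credentials when it runs."""
    if s3_client is None:
        import boto3  # imported lazily so the planning step works without boto3
        s3_client = boto3.client("s3")
    for local_path, bucket, key in plan_uploads(data_dir):
        s3_client.upload_file(local_path, bucket, key)
```

Calling `upload_wine_files("data")` would then push both files, provided the buckets already exist and credentials are configured.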

2. Create Data Schema using AWS Glue
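
The schema can be created through the Glue console or a crawler. As an alternative illustration, the sketch below builds the `TableInput` argument for boto3's Glue `create_table` call. Everything specific here is an assumption: the snake_case column names (adapted from the UCI headers), the database and table names passed in, and the semicolon delimiter, which matches the raw UCI files but would need changing if the data was converted to comma-separated CSV first.

```python
# Column names follow the UCI wine quality files, rewritten in snake_case
# (an assumption; the project's actual schema may differ).
WINE_COLUMNS = [
    ("fixed_acidity", "double"),
    ("volatile_acidity", "double"),
    ("citric_acid", "double"),
    ("residual_sugar", "double"),
    ("chlorides", "double"),
    ("free_sulfur_dioxide", "double"),
    ("total_sulfur_dioxide", "double"),
    ("density", "double"),
    ("ph", "double"),
    ("sulphates", "double"),
    ("alcohol", "double"),
    ("quality", "int"),
]


def build_table_input(table_name, s3_location):
    """Build the TableInput argument for Glue's create_table call."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        # Skip the header row when Athena reads the file.
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in WINE_COLUMNS],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ";"},  # assumed delimiter
            },
        },
    }


def create_wine_table(database, table_name, s3_location, glue_client=None):
    """Register the table in the Glue Data Catalog; needs AWS credentials."""
    if glue_client is None:
        import boto3  # imported lazily so the dict can be built without boto3
        glue_client = boto3.client("glue")
    glue_client.create_table(
        DatabaseName=database,
        TableInput=build_table_input(table_name, s3_location),
    )
```

For example, `create_wine_table("wine_db", "red_wine", "s3://red/")` would register a table over the red wine bucket, where `wine_db` and `red_wine` are illustrative names.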

3. Transform Data using AWS Athena with SQL

  • We will use Glue to create a schema, then use Athena and SQL to ask analytical questions and generate data for analysis. The resulting data will be saved in an S3 bucket for future review. Here are our analytical questions:
    • Which three features have the highest correlation with wine quality, and how do they affect the level of wine quality?
    • What is the relationship between alcohol level and wine quality?
    • What is the frequency distribution of the quality levels in both datasets?
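
The last two questions map onto simple aggregate queries. The sketch below holds two such queries as strings and runs them against an in-memory SQLite table so they can be tried without an AWS account. SQLite is only a local stand-in for Athena here, the table name `red_wine` and the column names are assumptions, and the rows are made up for illustration rather than taken from the dataset. For the first question, Athena's SQL engine also offers a `corr` aggregate function, which SQLite lacks.

```python
import sqlite3

# Frequency distribution of quality levels (table name is an assumption).
QUALITY_DISTRIBUTION_SQL = """
SELECT quality, COUNT(*) AS n_wines
FROM red_wine
GROUP BY quality
ORDER BY quality
"""

# Average alcohol level for each quality score.
ALCOHOL_BY_QUALITY_SQL = """
SELECT quality, AVG(alcohol) AS avg_alcohol
FROM red_wine
GROUP BY quality
ORDER BY quality
"""


def run_locally(sql, rows):
    """Run a query against an in-memory SQLite copy of (alcohol, quality) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE red_wine (alcohol REAL, quality INTEGER)")
    conn.executemany("INSERT INTO red_wine VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Made-up rows for illustration only, not real measurements.
TOY_ROWS = [(9.0, 5), (10.0, 5), (11.0, 6), (12.0, 7), (13.0, 7)]

if __name__ == "__main__":
    print(run_locally(QUALITY_DISTRIBUTION_SQL, TOY_ROWS))
    print(run_locally(ALCOHOL_BY_QUALITY_SQL, TOY_ROWS))
```

The same query text can be pasted into the Athena query editor once the table names match the Glue schema; the white wine table would be queried the same way.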

3. Build heat map to analyze the correlation between features and wine quality

  • According to the heat map, the three features most strongly correlated with red wine quality are alcohol, volatile acidity, and sulphates. For white wine, they are alcohol, density, and chlorides.
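
The ranking behind a heat map like this can be sketched in plain Python: compute the Pearson correlation of each feature with quality, then sort by absolute value so that strong negative relationships rank alongside strong positive ones. This is an illustrative sketch, not the project's notebook code; `pearson` and `top_correlated` are hypothetical helper names. In practice, pandas `DataFrame.corr()` together with seaborn's `heatmap` would produce the full matrix and the plot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_correlated(columns, target="quality", k=3):
    """Rank features by absolute correlation with the target column.

    `columns` maps a column name to its list of values. The signed
    coefficient is returned so the direction of the relationship is kept.
    """
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]
```

Keeping the sign matters for interpretation: a feature can rank in the top three because it is negatively correlated with quality.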

4. Connect AWS Athena with Tableau and build dashboards

5. Save dashboards into notebook and host website on S3 to showcase analysis results

web link
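
One way to script the hosting step with boto3 is sketched below. It assumes the notebook has already been exported to a single HTML file, that the bucket exists, and that `index.html` and `error.html` are the document names; `build_website_config` and `enable_static_site` are hypothetical helpers. The bucket would also need a policy that allows public reads, which is not shown here.

```python
def build_website_config(index_doc="index.html", error_doc="error.html"):
    """Build the WebsiteConfiguration argument for put_bucket_website."""
    return {
        "IndexDocument": {"Suffix": index_doc},
        "ErrorDocument": {"Key": error_doc},
    }


def enable_static_site(bucket, html_path, s3_client=None):
    """Upload the exported HTML page and turn on static website hosting."""
    if s3_client is None:
        import boto3  # imported lazily so the config can be built without boto3
        s3_client = boto3.client("s3")
    # Set the content type so browsers render the page instead of downloading it.
    s3_client.upload_file(
        html_path, bucket, "index.html", ExtraArgs={"ContentType": "text/html"}
    )
    s3_client.put_bucket_website(
        Bucket=bucket, WebsiteConfiguration=build_website_config()
    )
```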

6. Conclusion

  1. The top three features that impact red wine quality are:
    • Alcohol content
    • Sulphates
    • Volatile acidity
  2. The top three features that impact white wine quality are:
    • Alcohol content
    • Chlorides
    • Density

According to the analysis, for red wine, higher levels of alcohol and sulphates may be linked to better quality, as may lower levels of volatile acidity. For white wine, higher levels of alcohol may indicate better quality, while lower levels of chlorides and lower density may also be associated with improved quality.

However, it is important to note that the relationship between these factors and wine quality is complex and can depend on various other factors such as the type of grape, winemaking techniques, and terroir. Therefore, it is essential to take into account all of the factors that may influence wine quality when making an assessment.