The GSSurvey asks US adults questions related to current demographics, opinions, religious/spiritual beliefs, political and life view points, social class opinions, as well even recording peoples zodiac signs.
This project is an interest project, taking the GSS participant responses and comparing their zodiac responses to their other survey answers to see if a Classification Model can accurately predict their Zodiac sign/birth month.
The goals of this project are to build a Classification Model that can beat the Baseline model of 10% accuracy prediction in predicting a person's zodiac sign when answering the GSSurvey questions.
This project goes through the full Data Science pipeline in acquiring the 2021 GSS survey responses, cleaning and preparing this dataset, exploring and conducting statistical chi^2 tests for 6 different hypothesis questions, and then conducting 5 different Classification Models on the found main drivers.
ABOUT THE GSS: Since 1972, the General Social Survey (GSS) has been conducted annually by the Univeristy of Chicago to measure the responses of contemporary American Society--through the lens of adult survey participants. This project uses the GSS 2021 dataset.
As the GSS 2021 is a general survey of each participant's response on their beliefs, opinions, demographics, and relationships, this model will be using the Zodiac responses as a general way to see if one's sign can actually be predicted by certain responses, and if a model could actually be built on this.
The columns selected pulled from the GSS data in this project where done so by targeting key points of interest that are often covered in astrology:
- Career
- Relationships
- Opinions
- Health (physical and mental)
- Political stance/view points
- Religion/Spirituality
- Interests and Overall outlook of others and life
ABOUT THE ZODIAC SIGNS: As many viewpoints on Astrology overall seem to be changing, oftentimes in terms of self-identity and decision factors, this project is taking a look at zodiac signs (using participants Sun Sign) to determine if the responses of the US adult partipants of the GSS actually correlates to the astrological sun signs they provided and align with.
Some of the columns and datasets may not fully makes sense to those first meeting this data. Below is a table that helps define the features and terminology used in this project.
Target | Datatype | Definition |
---|---|---|
GSS 2021 | 4032 non-null: Category | General Social Survey of 2021 |
Feature | Datatype | Definition |
---|---|---|
zodiac | 4032 non-null: Category | participant's zodiac birth sign |
sexornt | 4032 non-null: Category | participant's sexual orientation |
marital | 4032 non-null: Category | participant's marital status |
res16 | 4032 non-null: Category | participant's region of living |
degree | 4032 non-null: Category | participant's highest earned education degree |
class | 4032 non-null: Category | participant's self rated social class |
relig | 4032 non-null: Category | participant's religion |
postlifev | 4032 non-null: Category | participant's view of life after death |
sprtprsn | 4032 non-null: Category | participant's self rating of being a spiritual person |
fairv | 4032 non-null: Category | participant's view of others being fair |
trustv | 4032 non-null: Category | participant's view of others being trustworthy |
conmedic | 4032 non-null: Category | participant's trust of medical system |
happy | 4032 non-null: Category | participant's overall happiness with life |
workhard | 4032 non-null: Category | participant's 1-5 rating of most important life lesson of working hard |
occ10 | 4032 non-null: Category | participant's occupation trade |
hlthmntl | 4032 non-null: Category | participant's self rating of their mental health |
socrel | 4032 non-null: Category | participant's rating of how often they spend time with relatives/family |
decevidc | 4032 non-null: Category | participant's rating of how much they agree of needing sufficient evidence for decisions |
polviews | 4032 non-null: Category | participant's rating of their political viewpoints |
For the key features in comparison as categorical features of zodiac predictors.
- H_0: Marital status is independant of zodiac sign.
- H_a: Marital status is dependant of zodiac sign.
- H_0: workhard opinion is independant of zodiac sign.
- H_a: workhard opinion is dependant of zodiac sign.
- H_0: Political viewpoints is independant of zodiac sign.
- H_a: Political viewpoints is dependant of zodiac sign.
Within the wrangle.py file there are functions that will help to:
- Pull in the 2017, Single Family Property residents with no duplicates
- Read the SQL acquire query to a csv file
- Assign the data to a variable
Within the wrangle.py file there are functions for the prepare stage that:
- Clean up the nulls by dropping all from zodiac column and creating unanswered columns as 'unknown' to keep data that would otherwise be lost.
- Drop any unneccessary and duplicated columns
- Resets dtypes to Objects and int
- Splits data into train, validate and test
Within the explore.ipynb file, there is a step-by-step breakdown of my Exploration methods. Main features found and mentioned are:
- workhard: how participants rate (1-5) 'working hard' as a key life lesson they would teach a child.
- hlthmntl: each participant's self rating of their mental health
- socrel: how often participants spend time with relatives/family.
- decevidc: how participants rate needing sufficient evidence in making decisions
The model.ipnyb notebook holds each method of model used in this project. Models include steps to produce: Decision Tree Classifier| Random Forest| and 5 different KNN Models (with varying n_number of nearest neighbors)
The function for the final chosen model can be found in model.py and Final Report.
Using the best Classification model of Random Forest, the results did not beat the baseline prediction of 10%.
Moving forward, I would want to try creating cluster features and utilize feature engineering like SelectKBest to find better predictors for this model.
I would also want to add past GSS responses to add in more data and see if past years might help with adding to predictions.