The dataset we have chosen is 2001 Census data from India. It contains a large amount of data for each district and each state in India, and we think it will let us make some pretty cool observations.
You can access the original dataset here.
The cleaned dataset can be seen here.
- Growth rate - what impacts this?
- Religious population / percentage
- Drinking water vs paved roads
- Religion vs amenities - is there some sort of discrimination?
- Permanent vs temporary housing vs literacy
- Literacy vs amenities
- Housing vs education
Jake Clifton - Growth Rate - How does the number of people per household and sex distribution impact the growth rate in India?
Laura Bullard - Which religion is most common in each state? How do the religions differ between high- and low-population states?
Brandon Damore - Literacy vs amenities in each district. Do more literate areas have more access to amenities than less literate ones?
Matt O'Connor - Does the level of education decrease the more rural an area is?
Kevin Eugene - How does level of education relate to the quality of housing in each district?
Since each question will require different columns, both existing ones and newly created ones, each of us will decide which ones we need. A single Jupyter notebook will be created to reflect that. The columns will be renamed to a uniform format, any strange values will be converted to NaN and then dropped, and all data types will be converted so that calculations are possible. Columns that are mostly empty will be dropped completely.
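As a rough sketch of the cleaning steps above, assuming the notebook uses pandas: the function name `clean_census`, the list of placeholder strings, and the 50% cutoff for a "mostly empty" column are our own illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd


def clean_census(raw, numeric_cols, placeholders=("", "-", "NA", "N/A"), max_missing=0.5):
    """Hypothetical helper that applies the cleaning plan to a raw census table.

    numeric_cols uses the cleaned (renamed) column names.
    """
    df = raw.copy()

    # 1. Rename columns to a uniform format: lowercase words joined by underscores.
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )

    # 2. Turn placeholder strings into NaN (the placeholder list is an assumption).
    df = df.mask(df.isin(list(placeholders)))

    # 3. Convert the named columns to numbers; anything unparseable becomes NaN.
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # 4. Drop columns that are mostly empty.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # 5. Drop the remaining rows that still contain NaN.
    return df.dropna().reset_index(drop=True)
```

In this sketch the mostly empty columns are dropped before the rows with NaN. Doing it the other way around would throw away almost every row, since nearly all rows are missing a value in a mostly empty column. A call could look like `clean_census(raw, numeric_cols=["total_population", "literacy_rate"])`, where both column names are made up for illustration.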
Once the data has been cleaned, it will be exported to its own CSV file. A second Jupyter notebook will import that data so that everyone can work with a clean dataset.
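The hand-off between the two notebooks could be as small as the sketch below, again assuming pandas. The helper names `export_clean` and `load_clean` and any file name passed to them are hypothetical.

```python
import pandas as pd


def export_clean(df, path):
    """Write the cleaned table to CSV (last cell of the cleaning notebook)."""
    # index=False keeps the row index out of the file, so the second
    # notebook does not get an extra "Unnamed: 0" column on import.
    df.to_csv(path, index=False)


def load_clean(path):
    """Read the cleaned table back in (first cell of the analysis notebook)."""
    return pd.read_csv(path)
```

One thing to watch for: `read_csv` infers data types again on import, so it is worth checking `df.dtypes` at the top of the second notebook.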
This second notebook will be divided by name and question, and each person will analyze the data through different graphs, summaries, and formulas. From there, each person will move to answer their question.
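To give an idea of what one person's part of the second notebook might contain, here is a minimal sketch for a "literacy vs amenities" style question. The column names (`state`, `literacy_rate`, `pct_electricity`) and both function names are placeholders we made up; the real names depend on the cleaned dataset.

```python
import pandas as pd


def literacy_vs_amenity(df, literacy_col="literacy_rate", amenity_col="pct_electricity"):
    """Pearson correlation between literacy and one amenity column."""
    return df[literacy_col].corr(df[amenity_col])


def state_means(df, state_col="state", value_cols=("literacy_rate", "pct_electricity")):
    """Average the chosen district-level columns within each state."""
    return df.groupby(state_col)[list(value_cols)].mean()
```

A correlation only summarizes a linear relationship, so it would normally sit next to a plot of the same two columns, for example `df.plot.scatter(x="literacy_rate", y="pct_electricity")`, which needs matplotlib installed.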