Pick an inbuilt data set (you can view a list of inbuilt datasets by typing data() in the R console) and perform Exploratory Data Analysis. Make sure the dataset has at least two numeric variables and at least one categorical variable. If your dataset does not have a categorical variable, you can define one based on a continuous variable (hint: one way to achieve that). The objective of the analysis is to understand the data and communicate initial findings about the data in a written format. The analysis should meet the following criteria:
- Perform checks to determine quality of the data (missing values, outliers, etc.)
- Description of the data:
- how big is it (number of observations, variables)
- how many numeric variables
- how many categorical variables
- description of the variables, if available
- Are there any missing values?
- Any duplicate rows?
- Compute summary statistics (mean, median, mode, standard deviation, variance, range).
- Select one categorical variable and compute these statistics on a numeric variable, grouped by that categorical variable
- Visualize and transform to answer the questions asked. Visualizations to illustrate:
- Relationship between variables
- Trend
- Distribution of the variable(s)
- Comparison of summary statistics across categories
- Please pick at least 3 visualizations from the list above for illustration.
- Summarize your insights from the analysis
- Please use an RMarkdown document. Please submit a knitted .docx, .html, or .pdf file generated from the RMarkdown document. Here is a good reference for ideas on questions and EDA in general: https://r4ds.had.co.nz/exploratory-data-analysis.html#questions
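The description, quality-check, and summary-statistic items above can be sketched in base R roughly as follows. This is only an illustration: the choice of `iris`, the variables `Sepal.Length` and `Species`, and the helper names `stat_mode` and `describe_var` are assumptions, not part of the assignment.

```r
# Sketch of the checklist using the built-in iris data set
# (iris, Sepal.Length and Species are illustrative choices).
data(iris)

# Size and variable types
dim(iris)                                   # observations, variables
num_vars <- names(iris)[sapply(iris, is.numeric)]
cat_vars <- names(iris)[sapply(iris, is.factor)]
length(num_vars)
length(cat_vars)

# Data quality: missing values per column and duplicate rows
colSums(is.na(iris))
sum(duplicated(iris))

# Base R's mode() reports the storage mode, not the statistical mode,
# so define a small helper (hypothetical name; ties return the first
# most frequent value).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Summary statistics for one numeric variable (hypothetical helper)
describe_var <- function(x) {
  c(mean = mean(x), median = median(x), mode = stat_mode(x),
    sd = sd(x), variance = var(x), min = min(x), max = max(x))
}
describe_var(iris$Sepal.Length)

# The same statistics grouped by a categorical variable
grouped <- sapply(split(iris$Sepal.Length, iris$Species), describe_var)
grouped

# If a data set has no categorical variable, cut() can bin a numeric one
iris$Petal.Size <- cut(iris$Petal.Length, breaks = 3,
                       labels = c("short", "medium", "long"))
table(iris$Petal.Size)
```

Placed in an RMarkdown code chunk, each printed result appears in the knitted file alongside the written interpretation.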
Please complete this Rmarkdown notebook and submit a knitted file (either *.html, *.pdf or *.docx)
Please pick a data set from one of the resources mentioned on this page: Open Data Sources
Make sure to pick one that is manageable and has at minimum 50 rows and 6-20 variables. The dataset's variables should include at least 1 categorical variable and at least 2 continuous numerical variables. Please do not pick Ames Housing Data.
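One way to confirm that a candidate data set meets these size and type requirements is a quick check like the one below. It is a minimal sketch: `check_requirements` is a hypothetical helper, and it assumes categorical variables are stored as factor or character columns.

```r
# check_requirements is a hypothetical helper, not part of the assignment.
# Factor and character columns are counted as categorical; numeric columns
# are counted as numeric (an integer-coded category would be miscounted).
check_requirements <- function(df) {
  n_cat <- sum(sapply(df, function(col) is.factor(col) || is.character(col)))
  n_num <- sum(sapply(df, is.numeric))
  c(rows_ok        = nrow(df) >= 50,
    variables_ok   = ncol(df) >= 6 && ncol(df) <= 20,
    categorical_ok = n_cat >= 1,
    numeric_ok     = n_num >= 2)
}

# Demonstrated on a built-in data set only because it is always available;
# the project data itself should come from the listed open data sources.
res <- check_requirements(iris)
res
```

Any `FALSE` entry points to the requirement that the data set does not meet.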
The proposal should meet the following criteria:
- Introduction: Why did you pick this dataset, what are you curious to know from the data (possibly framed as questions you want to answer using the data)? Alternatively you can also imagine yourself as a stakeholder who wants to make decisions from the data - what kind of decisions, and what do you need to know from the data to make decisions.
- Data: Include context about the data covering:
- Data source: Include the citation for your data, and provide a link to the source.
- Data collection: Provide context on how the data was collected.
- Cases: What are the cases (units of observation or experiment)? What do the rows represent in your dataset?
- Variables: What are the variables you will be studying?
- Type of study: was it an observational study or an experiment?
- Data Quality: Check for data quality issues, missing data, duplicates, format issues etc. and perform the necessary quality improvements.
- References: If you know of any other related work done by others, please include a brief description.
Extra credit (5 points added to final project submission) for data chosen from: CA Open data OR San Jose Data
Please answer the questions in the Rmarkdown file (prob_hw.zip) and submit a knitted document.
This is one of the intermediate project reports. The report should include:
What are you curious to know from the data you have selected? Include a sentence or two explaining why you are interested in studying this.
Data source - citation or link to the source
Data collection: How and when was the data collected? Is it from an observational study or an experiment?
Units of observation: What is the unit of observation? In most cases, this is what each row represents.
Variables: What are the variables that you are planning to study? All or a subset of variables?
Data cleanup (optional): If you had to do any data cleanup, please include the code and a brief description of the steps, e.g. handling missing observations, transforming variables, filtering rows, removing outliers.
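The cleanup steps named above might look like this in base R. This is a sketch under stated assumptions: the built-in `airquality` data stands in for the project data, and the 1.5 * IQR rule is one common outlier convention rather than a requirement.

```r
# airquality is a built-in data set with missing Ozone and Solar.R values;
# it stands in for the project data here (an assumption for illustration).
data(airquality)
aq <- airquality

# 1. Handle missing observations: count them, then drop incomplete rows
colSums(is.na(aq))
aq <- aq[complete.cases(aq), ]

# 2. Transform variables: treat Month as a labelled categorical variable
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# 3. Filter rows: keep only the summer months
aq_summer <- aq[aq$Month %in% c("Jun", "Jul", "Aug"), ]

# 4. Remove outliers: drop Ozone values beyond 1.5 * IQR from the quartiles
#    (one common convention, not a requirement of the report)
q <- quantile(aq$Ozone, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
aq_clean <- aq[aq$Ozone >= q[1] - fence & aq$Ozone <= q[2] + fence, ]

# Row counts before and after each step, worth reporting in the write-up
c(original = nrow(airquality), complete = nrow(aq), no_outliers = nrow(aq_clean))
```

Reporting the row counts makes it clear how much data each cleanup step removed.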
Perform relevant visualization of the data and computation of summary statistics. Include what summaries of the data might be useful in answering your question.
Please do not perform an exhaustive visualization. If you need to include plots that are not immediately relevant, please place them in the appendix.
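As an illustration of keeping plots tied to a single question, the sketch below uses base R graphics on the built-in `iris` data. The data set and the question (does petal length differ by species?) are assumptions chosen for the example.

```r
# Built-in iris data stands in for the project data (an assumption).
# Question being illustrated: does petal length differ by species?
data(iris)

# Distribution of one numeric variable
hist(iris$Petal.Length, breaks = 20,
     main = "Distribution of petal length", xlab = "Petal length (cm)")

# Relationship between two numeric variables, coloured by category
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     main = "Petal width vs. petal length",
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# Comparison across categories, with the matching summary statistic
boxplot(Petal.Length ~ Species, data = iris,
        main = "Petal length by species", ylab = "Petal length (cm)")
petal_by_species <- aggregate(Petal.Length ~ Species, data = iris, FUN = median)
petal_by_species
```

Each plot is paired with the question it answers, which makes it easier to decide what belongs in the main report and what can move to the appendix.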
Develop 2-3 questions that you would like to answer using hypothesis testing.
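For a sense of how such questions map onto tests, here is a minimal sketch using built-in data sets. The data sets, the questions, and the choice of tests are illustrative assumptions; the appropriate test depends on the variables in your own data.

```r
# Built-in data sets stand in for the project data (an assumption).

# Question 1: does mean fuel economy differ between automatic and manual cars?
# Welch two-sample t-test (t.test does not assume equal variances by default).
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))
t_res <- t.test(mpg ~ am, data = mtcars)
t_res$p.value

# Question 2: does mean sepal length differ across the three iris species?
# One-way ANOVA comparing a numeric variable across more than two groups.
data(iris)
aov_res <- summary(aov(Sepal.Length ~ Species, data = iris))
aov_res

# Question 3: is transmission type associated with the number of cylinders?
# Fisher's exact test on a contingency table of two categorical variables.
f_res <- fisher.test(table(mtcars$am, mtcars$cyl))
f_res$p.value
```

Stating the null and alternative hypotheses next to each question helps show which comparison each test is making.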