The case study follows the six-step data analysis process:
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act
Since it was founded in 2013, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women. The company has five focus products: the Bellabeat app, Leaf, Time, Spring, and the Bellabeat membership. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Our team has been asked to analyze smart device data to gain insight into how consumers use their smart devices. The insights we discover will then help guide the company's marketing strategy.
BUSINESS TASK: Analyze Fitbit data to gain insight and help guide marketing strategy for Bellabeat to grow as a global player.
Primary stakeholders: Urška Sršen and Sando Mur, executive team members.
Secondary stakeholders: Bellabeat marketing analytics team.
Data Source: 30 participants FitBit Fitness Tracker Data from Mobius: https://www.kaggle.com/arashnic/fitbit
The dataset has 18 CSV files. We assess the data against the ROCCC criteria:
- Reliability: The data is from 30 FitBit users who consented to the submission of personal tracker data, generated by a distributed survey via Amazon Mechanical Turk.
- Original: The data is from 30 FitBit users who consented to the submission of personal tracker data via Amazon Mechanical Turk.
- Comprehensive: The data includes minute-level output for physical activity, heart rate, and sleep monitoring. While the data tracks many factors in user activity and sleep, the sample size is small and most data is recorded on certain days of the week.
- Current: The data is from March 2016 to May 2016. It is not current, so users' habits may be different now.
- Cited: Unknown.
The dataset has limitations:
- Data is available for only 30 users. The central limit theorem rule of thumb of n ≥ 30 applies, so we can use the t-test for statistical reference. However, a larger sample size would be preferable for the analysis.
- Upon further investigation with n_distinct() to check for unique user Ids, the set has 33 users in daily activity, 24 in sleep, and only 8 in weight. There are 3 extra users, and some users did not record data for tracking daily activity and sleep.
- Of the 8 users with weight data, 5 entered their weight manually and 3 recorded it via a connected Wi-Fi device (e.g., a Wi-Fi scale).
- Most data is recorded from Tuesday to Thursday, which may not be comprehensive enough to form an accurate analysis.
Examine the data, check for NA, and remove duplicates for three main tables: daily_activity, sleep_day and weight:
dim(sleep_day)
sum(is.na(sleep_day))
sum(duplicated(sleep_day))
sleep_day <- sleep_day[!duplicated(sleep_day), ]
Convert ActivityDate into date format and add a column for day of the week:
daily_activity <- daily_activity %>% mutate( Weekday = weekdays(as.Date(ActivityDate, "%m/%d/%Y")))
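One detail worth noting: weekdays() returns plain character strings, so a bar chart would order the days alphabetically, and the names depend on the system locale. The following is a minimal sketch of one way to get an ordered, locale-independent Weekday column. The toy data frame and the day_names vector are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for daily_activity (hypothetical values, not the real data)
daily_activity <- data.frame(
  Id = c(1, 1, 1),
  ActivityDate = c("4/12/2016", "4/16/2016", "4/17/2016")
)

# English day names in calendar order, Monday first (an assumed ordering)
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which does not depend on the locale the way weekdays() does
dates <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
day_index <- as.integer(format(dates, "%u"))

# Store Weekday as a factor so plots keep calendar order
daily_activity$Weekday <- factor(day_names[day_index], levels = day_names)

print(daily_activity)
```

With Weekday stored as a factor, ggplot() draws the bars in the order of the factor levels rather than alphabetically.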
Check whether we have 30 users using n_distinct(). The dataset has 33 users in daily activity, 24 in sleep, and only 8 in weight. If there is a discrepancy, such as in the weight table, check how the data is recorded. The way users record their data may give insight into why data is missing.
weight %>%
filter(IsManualReport == "True") %>%
group_by(Id) %>%
summarise("Manual Weight Report"=n()) %>%
distinct()
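The per-table user counts quoted above come from n_distinct() on the Id column. As a minimal sketch, assuming each table has an Id column, the check could look like this. The three toy data frames are hypothetical stand-ins, so the printed counts will not match the real 33, 24, and 8.

```r
library(dplyr)

# Toy stand-ins for the three tables (hypothetical values, not the real data)
daily_activity <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
sleep_day      <- data.frame(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 420, 390))
weight         <- data.frame(Id = c(1), WeightPounds = c(135))

# Count unique users in each table
user_counts <- c(
  daily_activity = n_distinct(daily_activity$Id),
  sleep_day      = n_distinct(sleep_day$Id),
  weight         = n_distinct(weight$Id)
)

print(user_counts)
```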
Another insight to be aware of is how often users record their data. The ggplot() bar graph shows that the most records come from Tuesday to Thursday. We need to investigate the distribution of data recording: Monday and Friday are also weekdays, so why are there fewer records on those days?
ggplot(data=merged_data, aes(x=Weekday))+
geom_bar(fill="steelblue")
The graph of total minutes asleep by weekday looks almost the same as the graph above. This confirms that most sleep data is also recorded from Tuesday to Thursday, which raises the question: how comprehensive is the data for forming an accurate analysis?
Merge the three tables:
merged_data <- merge(merged_activity_sleep, weight, by = c("Id"), all=TRUE)
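The merge above uses a merged_activity_sleep table whose construction is not shown. One plausible way to build it, sketched here under stated assumptions, is to join activity and sleep on both Id and date. The SleepDay column, its timestamp format, and the helper column Date are assumptions for illustration, and the rows are toy values.

```r
# Toy stand-ins (hypothetical values); the SleepDay format is an assumption
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(8000, 12000, 5000)
)
sleep_day <- data.frame(
  Id = c(1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(420, 380)
)

# Build a common Date key; as.Date() ignores the trailing time of day
daily_activity$Date <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
sleep_day$Date <- as.Date(sleep_day$SleepDay, format = "%m/%d/%Y")

# Keep every activity row; days without a sleep record get NA
merged_activity_sleep <- merge(daily_activity, sleep_day,
                               by = c("Id", "Date"), all.x = TRUE)

print(merged_activity_sleep)
```

Joining on Id alone would pair every activity day with every sleep day of the same user, so including the date in the key keeps one row per user-day.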
Clean the data to prepare for analysis in 4. Analyze!
Check the min, max, mean, median, and any outliers. The average weight is 135 pounds with a BMI of 24, and users burn 2050 calories a day on average. The average step count is 10200, and the max is more than triple that, at about 36000 steps. Users spend on average 12 hours a day sedentary, 4 hours lightly active, and only half an hour fairly or very active. Users also get about 7 hours of sleep.
merged_data %>%
dplyr::select(Weekday,
TotalSteps,
TotalDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories,
TotalMinutesAsleep,
TotalTimeInBed,
WeightPounds,
BMI
) %>%
summary()
Percentage of minutes in the four activity categories: very active, fairly active, lightly active, and sedentary. From the pie chart, we can see that users spent 81.3% of their tracked daily minutes sedentary and only 1.74% very active.
percentage <- data.frame(
level=c("Sedentary", "Lightly", "Fairly", "Very Active"),
minutes=c(sedentary_percentage,lightly_percentage,fairly_percentage,active_percentage)
)
plot_ly(percentage, labels = ~level, values = ~minutes, type = 'pie',textposition = 'outside',textinfo = 'label+percent') %>%
layout(title = 'Activity Level Minutes',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
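The pie chart relies on sedentary_percentage, lightly_percentage, fairly_percentage, and active_percentage, whose calculation is not shown. A minimal sketch of one way to derive them is each category's total minutes divided by the total across all four categories. This is an assumed method, shown on toy values rather than the real data.

```r
# Toy stand-in for daily_activity (hypothetical values)
daily_activity <- data.frame(
  VeryActiveMinutes    = c(20, 10),
  FairlyActiveMinutes  = c(10, 20),
  LightlyActiveMinutes = c(200, 180),
  SedentaryMinutes     = c(770, 790)
)

# Total tracked minutes across all four categories
total_minutes <- sum(daily_activity$VeryActiveMinutes,
                     daily_activity$FairlyActiveMinutes,
                     daily_activity$LightlyActiveMinutes,
                     daily_activity$SedentaryMinutes)

# Share of each category, as a percentage of total tracked minutes
sedentary_percentage <- sum(daily_activity$SedentaryMinutes) / total_minutes * 100
lightly_percentage   <- sum(daily_activity$LightlyActiveMinutes) / total_minutes * 100
fairly_percentage    <- sum(daily_activity$FairlyActiveMinutes) / total_minutes * 100
active_percentage    <- sum(daily_activity$VeryActiveMinutes) / total_minutes * 100

print(c(sedentary_percentage, lightly_percentage,
        fairly_percentage, active_percentage))
```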
The American Heart Association and World Health Organization recommend at least 150 minutes of moderate-intensity activity or 75 minutes of vigorous activity, or a combination of both, each week. Spread evenly over seven days, that means a daily goal of 21.4 minutes of FairlyActiveMinutes or 10.7 minutes of VeryActiveMinutes.
In our dataset, 30 users met the fairly active or very active daily goal on at least one day.
active_users <- daily_activity %>%
filter(FairlyActiveMinutes >= 21.4 | VeryActiveMinutes>=10.7) %>%
group_by(Id) %>%
count(Id)
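The daily thresholds are simply the weekly recommendations divided by seven, and the number of users meeting them is the number of rows left after counting per Id. A minimal sketch on toy data follows. The variable names fairly_goal, very_goal, and DaysMeetingGoal are illustrative, not from the original notebook.

```r
library(dplyr)

# Toy stand-in for daily_activity (hypothetical values)
daily_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  FairlyActiveMinutes = c(25, 5, 0, 10),
  VeryActiveMinutes   = c(0, 12, 3, 0)
)

# Weekly recommendations spread evenly over seven days
fairly_goal <- 150 / 7  # about 21.4 minutes
very_goal   <- 75 / 7   # about 10.7 minutes

# One row per user who met either goal on at least one day,
# with the number of days on which they met it
users_meeting_goal <- daily_activity %>%
  filter(FairlyActiveMinutes >= fairly_goal | VeryActiveMinutes >= very_goal) %>%
  count(Id, name = "DaysMeetingGoal")

print(users_meeting_goal)
print(nrow(users_meeting_goal))  # number of users who met a goal at least once
```

Because the filter works on daily rows, a user counts as soon as a single day qualifies. A stricter reading of the weekly guideline would sum minutes per user per week instead.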
The bar graph shows a jump on Saturday: users spend LESS time sedentary and take MORE steps. Users are out and about on Saturday.
Let's look at how active users are by hour of the day in total steps. Users take the most steps from 5 PM to 7 PM.
ggplot(data=hourly_step, aes(x=Hour, y=StepTotal, fill=Hour))+
geom_bar(stat="identity")+
labs(title="Hourly Steps")
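The hourly_step table used in the plot is not defined in the text. As a hedged sketch, it could be derived from an hourly steps table by extracting the hour from a timestamp and summing steps per hour. The ActivityHour column name and its "month/day/year hour:minute:second AM/PM" format are assumptions, and parsing "%p" assumes an English or C locale.

```r
# Toy stand-in for an hourly steps table (hypothetical values and column names)
hourly_steps_raw <- data.frame(
  Id = c(1, 1, 2),
  ActivityHour = c("4/12/2016 5:00:00 PM", "4/13/2016 5:00:00 PM",
                   "4/12/2016 9:00:00 AM"),
  StepTotal = c(900, 1100, 300)
)

# Parse the timestamp; "%I" with "%p" reads a 12-hour clock with AM/PM
timestamps <- as.POSIXct(hourly_steps_raw$ActivityHour,
                         format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Keep only the hour of day as a two-digit string ("00" to "23")
hourly_steps_raw$Hour <- format(timestamps, "%H")

# Sum steps across all users and days for each hour of the day
hourly_step <- aggregate(StepTotal ~ Hour, data = hourly_steps_raw, FUN = sum)

print(hourly_step)
```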
Next, how active users are by day of the week in total steps. Users take the most steps on Tuesdays and Saturdays.
ggplot(data=merged_data, aes(x=Weekday, y=TotalSteps, fill=Weekday))+
geom_bar(stat="identity")+
ylab("Total Steps")
The more active you are, the more steps you take and the more calories you burn. This is an obvious fact, but we can still look into the data to find anything interesting. Here we see that some users who are sedentary and take minimal steps still burn 1500 to 2500 calories, similar to users who are more active and take more steps.
ggplot(data=daily_activity, aes(x=TotalSteps, y = Calories, color=SedentaryMinutes))+
geom_point()+
stat_smooth(method=lm)+
scale_color_gradient(low="steelblue", high="orange")
Comparing the four activity levels to total steps, we see most data is concentrated on users who take about 5000 to 15000 steps a day. These users spent on average between 8 and 13 hours sedentary, 5 hours lightly active, and 1 to 2 hours fairly and very active.
According to this healthline.com article, a moderately active woman between the ages of 26–50 needs to eat about 2,000 calories per day, and a moderately active man between the ages of 26–45 needs 2,600 calories per day to maintain his weight. Comparing the four activity levels to calories, we see most data is concentrated on users who burn 2000 to 3000 calories a day. These users also spent on average between 8 and 13 hours sedentary, 5 hours lightly active, and 1 to 2 hours fairly and very active. Additionally, we see that the sedentary line levels off toward the end while the fairly + very active line curves back up. This indicates that users who burn more calories spend less time sedentary and more time fairly and very active.
According to the article Fitbit Sleep Study, 55 minutes are spent awake in bed before going to sleep. In our dataset, 13 users spent at least 55 minutes awake in bed on one or more nights.
awake_in_bed <- mutate(sleep_day, AwakeTime = TotalTimeInBed - TotalMinutesAsleep)
awake_in_bed <- awake_in_bed %>%
filter(AwakeTime >= 55) %>%
group_by(Id) %>%
arrange(AwakeTime)
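To turn the filtered rows into a user count and a share of all users who recorded sleep, the distinct Ids can be counted before and after the filter. This is a minimal sketch on toy data, with users_awake_55 and share_awake_55 as illustrative names.

```r
library(dplyr)

# Toy stand-in for sleep_day (hypothetical values)
sleep_day <- data.frame(
  Id = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(400, 420, 380, 450),
  TotalTimeInBed     = c(460, 430, 440, 470)
)

# Minutes in bed that were not spent asleep
awake_in_bed <- sleep_day %>%
  mutate(AwakeTime = TotalTimeInBed - TotalMinutesAsleep)

# Users with at least one night of 55 or more awake minutes
users_awake_55 <- awake_in_bed %>%
  filter(AwakeTime >= 55) %>%
  summarise(users = n_distinct(Id)) %>%
  pull(users)

# Share of all users who recorded any sleep data
share_awake_55 <- users_awake_55 / n_distinct(sleep_day$Id)

print(users_awake_55)
print(share_awake_55)
```

On the real data, the same ratio would be 13 users out of the 24 with sleep records, or about 54%.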
We can use regression analysis to look at the variables and their correlation. For R-squared, 0% indicates that the model explains none of the variability of the response data around its mean; a higher percentage indicates that the model explains more of that variability. A positive slope means the variables increase or decrease together, and a negative slope means one variable goes up as the other goes down. We want to see whether users who spend more time sedentary also spend more time sleeping. We can use lm() to fit a regression on the dependent and independent variables. We find that how many minutes a user is asleep has a very weak correlation with how long they spend sedentary during the day.
sedentary_vs_sleep.mod <- lm(SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)
summary(sedentary_vs_sleep.mod)
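Rather than reading the slope and R-squared off the printed summary, both can be pulled out of the fitted model directly. The sketch below uses toy values, so the numbers it prints say nothing about the real dataset.

```r
# Toy stand-in for merged_data (hypothetical values)
merged_data <- data.frame(
  TotalMinutesAsleep = c(300, 360, 420, 480, 540),
  SedentaryMinutes   = c(900, 820, 860, 700, 720)
)

# Same model form as above: sedentary minutes as the response
sedentary_vs_sleep.mod <- lm(SedentaryMinutes ~ TotalMinutesAsleep,
                             data = merged_data)

# Slope: change in sedentary minutes per extra minute asleep
slope <- coef(sedentary_vs_sleep.mod)[["TotalMinutesAsleep"]]

# R-squared: share of the variance in the response explained by the model
r_squared <- summary(sedentary_vs_sleep.mod)$r.squared

print(slope)
print(r_squared)
```

For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, which is a quick consistency check.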
How about calories vs. sleep? Do people who sleep more burn fewer calories? Plotting the two variables, we can see that there is not much of a correlation.
ggplot(data=merged_data, aes(x=TotalMinutesAsleep, y = Calories, color=TotalMinutesAsleep))+
geom_point()+
labs(title="Total Minutes Asleep vs Calories")+
xlab("Total Minutes Asleep")+
stat_smooth(method=lm)+
scale_color_gradient(low="orange", high="steelblue")
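To put a number on "not much of a correlation", the Pearson coefficient can be computed with cor(). Because the merged table contains NA wherever a user has no sleep or calorie record, the sketch below assumes incomplete rows should be dropped via use = "complete.obs". The data is a toy example.

```r
# Toy stand-in for merged_data, including missing values (hypothetical)
merged_data <- data.frame(
  TotalMinutesAsleep = c(300, 360, NA, 480, 540),
  Calories           = c(2100, 1900, 2500, 2300, NA)
)

# Pearson correlation using only rows where both values are present
sleep_calories_cor <- cor(merged_data$TotalMinutesAsleep,
                          merged_data$Calories,
                          use = "complete.obs")

print(sleep_calories_cor)
```

Values near 0 indicate a weak linear relationship, and values near 1 or -1 a strong one.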
Conclusion based on our analysis:
- Sedentary time makes up a significant portion, 81%, of users' daily tracked minutes. Users spend on average 12 hours a day sedentary, 4 hours lightly active, and only half an hour fairly or very active.
- We see the most change on Saturday: users take more steps, burn more calories, and spend less time sedentary. Sunday is the most "lazy" day for users.
- 54% of the users who recorded their sleep data spent at least 55 minutes awake in bed.
- Users take the most steps from 5 PM to 7 PM.
- Users who are sedentary and take minimal steps burn 1500 to 2500 calories, similar to users who are more active and take more steps.
Marketing recommendations to expand globally: