(DA with R) The connection between people's sleep condition and daily activity
This dataset is from: https://www.kaggle.com/datasets/arashnic/fitbit
In brief: sleep time shows no clear relationship with calorie consumption, but there is a possible trend that the more time people spend sitting still, the less sleep they get on that day.
Brief intro of this dataset:
This dataset comes from Fitbit devices: it contains physical activity data shared by users of Fitbit wearables. All the data is provided as CSV files.
This research aims to find out whether there is any relationship between people's sleep time and their daily activity.
Prepare the data
Since the purpose of this research is to explore the possible relationship between sleep time and daily activity, I use two CSV files:
"dailyActivity_merged.csv" and "sleepDay_merged.csv".
1. All the data stored in this dataset are in CSV files and organized in the wide format.
2. To check the integrity of the datasets, I did the following work:
1). Import
a) Upload the files to be analyzed via RStudio.
b) Make sure all the key packages are installed and loaded (tidyverse, ggplot2, here, skimr, etc.).
c) Load the CSV files and create dataframes,
e.g. daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
2). Check data type
glimpse(daily_activity)
3). Check data range (e.g. a person cannot sleep for more than 24 hours a day, so any sleep time longer than 24 hours is an error and should be corrected).
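A range check of this kind could be sketched as follows. The small data frame is made up for illustration; only the column name TotalMinutesAsleep comes from the sleep file, and the 1440-minute limit is simply 24 hours expressed in minutes.

```r
# Toy stand-in for sleep_day; the values are invented for illustration.
sleep_check <- data.frame(
  Id = c(1, 1, 2),
  TotalMinutesAsleep = c(420, 1500, 380)
)

# A day has 24 * 60 = 1440 minutes, so anything outside 0..1440 is suspect.
minutes_per_day <- 24 * 60
out_of_range <- sleep_check$TotalMinutesAsleep < 0 |
  sleep_check$TotalMinutesAsleep > minutes_per_day

# Show the suspicious rows so they can be inspected and corrected.
sleep_check[out_of_range, ]
```

The same logical vector can be used to drop or fix the flagged rows once they have been reviewed.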
4). Mandatory items
sapply(daily_activity, function(x) sum(is.na(x)))
5). Unique
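Use this code to count fully duplicated rows:
sum(duplicated(daily_activity))
Use this code to drop them. Note that deduplicating on the "Id" column alone would keep only one row per user and discard every other day, so the check is applied to whole rows and the result is assigned back:
daily_activity <- daily_activity[!duplicated(daily_activity), ]
All the duplicated data is removed.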
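The difference between removing exact duplicate rows and deduplicating on a single column can be seen in a small sketch. The data frame below is invented for illustration; only the column names mirror the daily activity file.

```r
# Toy stand-in for daily_activity; values are invented for illustration.
activity_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(1000, 2000, 2000, 3000)
)

# Count rows that are exact copies of an earlier row.
n_dup <- sum(duplicated(activity_toy))

# Drop exact copies only: each user keeps every distinct day.
dedup_rows <- activity_toy[!duplicated(activity_toy), ]

# Deduplicating on Id alone keeps just the first row per user.
dedup_id <- activity_toy[!duplicated(activity_toy$Id), ]

nrow(dedup_rows)
nrow(dedup_id)
```

For a table with one row per user per day, the whole-row version is usually the one that matches the intent.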
6). Expression Patterns
I noticed that the date column in the sleep_day dataframe is stored as text that combines the date and the time, so I split it into two columns:
sleep_day <- separate(sleep_day, SleepDay, into=c("date","time"), sep = " ")
Now the "date" column is formatted the same way as "ActivityDate" in the "daily_activity" dataframe.
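As a rough alternative in base R, the date part can be cut out of the text and, if desired, converted to a real Date. This is only a sketch: the timestamp layout shown (month/day/year followed by a time) is an assumption about how SleepDay is stored, and the rows are invented.

```r
# Toy stand-in for sleep_day; the timestamp format is an assumption.
sleep_toy <- data.frame(
  Id = c(1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# Keep only the part before the first space, i.e. the calendar date.
sleep_toy$date <- sub(" .*$", "", sleep_toy$SleepDay)

# Optionally convert the text to a real Date for sorting and arithmetic.
sleep_toy$date_parsed <- as.Date(sleep_toy$date, format = "%m/%d/%Y")

sleep_toy[, c("Id", "date", "date_parsed")]
```

Converting both tables' date columns to the Date class is one way to make sure they compare equal regardless of how the original text was written.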
7). Accuracy(whether the data conforms to the actual entity being measured)
3. This dataset contains only a small number of participants (fewer than 40), and their physical condition is unknown, so the results may not be representative. More participants and data would be needed to draw solid conclusions. The following analysis is therefore only a demonstration of how I do data analysis with R.
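4. According to its Kaggle page, this dataset was contributed by Fitbit users who consented to share their tracker data, rather than published by Fitbit itself, and it contains their physical activity data.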
Process/analyze
In this research, I choose R for my analysis.
Since the data has been checked for integrity and cleaned, these CSV files are ready to be imported into RStudio directly.
1) to get a general look at the dataframe
head(daily_activity)
view(daily_activity)
2) get general summary statistics for the dataframe (some of the explorations here are not strictly necessary; they are just examples):
eg. How many distinct users are in a dataframe?
n_distinct(daily_activity$Id)
How about the statistical data of the key columns?
daily_activity %>%
select(TotalSteps, TotalDistance,SedentaryMinutes) %>%
summary()
How about different users' average steps in a day?
daily_activity %>%
group_by(Id) %>% drop_na() %>% summarize(average_steps= mean(TotalSteps))
3) transforming data for better analysis
e.g. I created a new column called sport_Minutes, which is the sum of VeryActiveMinutes and FairlyActiveMinutes.
daily_activity <- daily_activity %>%
  mutate(sport_Minutes = VeryActiveMinutes + FairlyActiveMinutes)
4) Merging dataframes into one dataframe
eg.
combine_data <- full_join(daily_activity_key, sleep_day_key, by="Id")
Here comes a problem with the date. "ActivityDate" in the daily activity file and "SleepDay" in the sleep file are formatted inconsistently, and this join matches on "Id" only. As a result, every activity row of a user is paired with every sleep row of the same user, which produces plenty of redundant rows in the merged data.
In other words, the merged dataset should be merged based on two columns: "Id" and Date.
At this point, I ran into problems merging on multiple columns that I could not solve. As a workaround, I created a new column called "IDDate" in the sleep dataframe, which concatenates "Id" and "Date".
sleep_day3 <- unite(sleep_day2, "IDDate", Id, Date, sep=" ")
The same goes for the daily_activity dataframe.
daily_activity <- unite(dailyActivity_merged, IDDate, Id, ActivityDate,sep=" ")
And finally, I could merge two tables into one dataframe:
combine_data = full_join(sleep_day3,daily_activity,by="IDDate")
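For comparison, a merge on two key columns can also be done directly, without building a combined key. The sketch below uses base R's merge() on invented toy tables; it assumes both tables already have a "Date" column in the same text format, which is not guaranteed for the raw files.

```r
# Toy stand-ins for the two tables; values are invented for illustration.
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(1000, 2000, 3000)
)
sleep_toy <- data.frame(
  Id = c(1, 2),
  Date = c("4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(420, 380)
)

# Full outer join on both keys: one row per (Id, Date) pair found in either table.
combined_toy <- merge(activity_toy, sleep_toy, by = c("Id", "Date"), all = TRUE)

# With dplyr loaded, the equivalent call would be:
# combined_toy <- full_join(activity_toy, sleep_toy, by = c("Id", "Date"))
combined_toy
```

Days that appear in only one table are kept, with NA in the columns that come from the other table.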
After this step, I have a dataframe that I think is helpful and that matches what I assumed before starting the analysis. Then I started to explore the dataframe by creating plots with ggplot2.
3. Analyze
1) the relationship between calories consumed and the distance they walked.
ggplot(combine_data2, aes(x=Calories,y=TotalDistance))+ geom_point() + geom_smooth(method="loess")
I assumed that the more calories people consume in a day, the longer the distance they walk during that day. The scatter plot supports this assumption: distance appears to have a strong relationship with calorie consumption.
2) the relationship between sleep time and calorie consumption
ggplot(combine_data2, aes(x=TotalMinutesAsleep,y=Calories))+ geom_point()
I assumed that there might be a relationship between people's calorie consumption and their sleep time, but the result did not turn out as I expected. There seems to be no direct connection between calorie consumption and sleep time.
The same is true for total steps and the distance people walk every day.
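A visual impression of "no direct connection" can be backed up with a correlation coefficient. The sketch below uses invented numbers, not the Fitbit data; with the real merged dataframe, the two vectors would be its TotalMinutesAsleep and Calories columns.

```r
# Toy data, invented for illustration; replace with the merged dataframe's columns.
toy <- data.frame(
  TotalMinutesAsleep = c(300, 360, 420, 480, 540),
  Calories = c(2100, 1900, 2300, 2000, 2200)
)

# Pearson correlation: values near 0 suggest little linear relationship,
# values near -1 or 1 suggest a strong one.
# use = "complete.obs" skips rows where either value is missing.
r <- cor(toy$TotalMinutesAsleep, toy$Calories, use = "complete.obs")
r
```

A coefficient only measures linear association, so it complements the scatter plot rather than replacing it.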
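3) However, the interesting thing is that there is a possible relationship between participants' sedentary time and sleep time:
We may see a trend that the more people sit still in their chairs, the less sleep they get on that day.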
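A possible relationship between sedentary time and sleep time can also be checked numerically by fitting a straight line and looking at the sign of its slope. This is a minimal sketch on invented numbers; the column names SedentaryMinutes and TotalMinutesAsleep are the ones used in the dataframes above, but the values are not from the dataset.

```r
# Toy data, invented for illustration; not taken from the Fitbit files.
toy <- data.frame(
  SedentaryMinutes = c(600, 700, 800, 900, 1000),
  TotalMinutesAsleep = c(480, 450, 430, 400, 370)
)

# Fit a straight line: sleep time as a function of sedentary time.
fit <- lm(TotalMinutesAsleep ~ SedentaryMinutes, data = toy)

# A negative slope means more sedentary minutes go with fewer minutes asleep.
slope <- coef(fit)[["SedentaryMinutes"]]
slope
```

With so few participants, a slope like this would only hint at a trend; it says nothing about cause and effect.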
Share
The next step is sharing, which means using slides or a dashboard to present my findings. I will not include that part in this write-up.
Thank you for reading.