(DA with R) The connection between people's sleep condition and daily activity

This dataset is from: https://www.kaggle.com/datasets/arashnic/fitbit

Brief intro of this dataset:

This dataset contains physical activity data shared by users of Fitbit wearable devices. All the data is provided as CSV files.

This research aims to find out whether there is any relationship between people's sleep time and their daily activity.

Prepare the data

Since the purpose of the research is to explore the possible relationship between sleep time and daily activity, I use two CSV files:

The "dailyActivity_merged.csv":

and the "sleepDay_merged.csv".

1. All the data stored in this dataset are in CSV files and organized in the wide format.

2. To check the integrity of the datasets, I did the following work:

1). Import

a) Upload the files that need to be analyzed via RStudio.

b) Make sure all the key packages are installed (tidyverse, ggplot2, here, skimr, etc.)

c) Load the CSV files and create dataframes.

e.g.

library(tidyverse)

daily_activity <- read.csv("dailyActivity_merged.csv")

sleep_day <- read.csv("sleepDay_merged.csv")

2). Check data type

glimpse(daily_activity)

dim(daily_activity)

The daily_activity table has 940 rows and 15 columns.

3). Check data range (e.g. nobody can sleep for more than 24 hours in a day, so any sleep time larger than 24 hours should be corrected).
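A minimal sketch of such a range check, using a small made-up table in place of the real file. The column name TotalTimeInBed is an assumption about the sleep file, and the values are invented for illustration.

```r
# Toy stand-in for sleep_day; the values are made up for illustration.
sleep_day <- data.frame(
  Id = c(1, 1, 2),
  TotalMinutesAsleep = c(420, 1500, 380),
  TotalTimeInBed = c(450, 1520, 400)  # assumed column name
)

minutes_per_day <- 24 * 60

# Rows whose sleep time is negative or longer than a full day
out_of_range <- subset(
  sleep_day,
  TotalMinutesAsleep < 0 | TotalMinutesAsleep > minutes_per_day
)
print(out_of_range)

# Time asleep should never exceed time in bed; count the rows that violate this
n_inconsistent <- sum(sleep_day$TotalMinutesAsleep > sleep_day$TotalTimeInBed)
print(n_inconsistent)
```

Any rows returned by the first check would need to be inspected and corrected before the analysis continues.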

4). Mandatory items

sapply(daily_activity, function(x) sum(is.na(x)))

sapply(sleep_day, function(x) sum(is.na(x)))

There are no NA cells in these two tables.

5). Unique

Use this code to count duplicated rows:

sum(duplicated(daily_activity))

Use this code to drop duplicated rows, keeping the first copy of each:

daily_activity <- daily_activity[!duplicated(daily_activity), ]

All the duplicated rows are removed.
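One thing to watch when dropping duplicates is which columns are compared. The sketch below uses a tiny made-up table to show the difference between comparing whole rows and comparing only the "Id" column; the values are invented for illustration.

```r
# Toy data: rows 2 and 3 are exact duplicates; row 4 is a legitimate
# second day for the same user.
daily_activity <- data.frame(
  Id = c(1, 2, 2, 2),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(5000, 7000, 7000, 8000)
)

# Comparing whole rows removes only the exact duplicate
deduped <- daily_activity[!duplicated(daily_activity), ]

# Comparing on Id alone keeps just the first row of every user,
# which throws away all of that user's other days
one_per_user <- daily_activity[!duplicated(daily_activity$Id), ]

print(nrow(deduped))
print(nrow(one_per_user))
```

Because each user has many days of records, deduplicating on "Id" alone would discard most of the data, so whole-row comparison is the safer default here.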

6). Expression Patterns

I noticed that the date column in the sleep_day dataframe is stored as text and includes a time portion.

To make this column match the date format used in daily_activity, I used the separate() function to split off the time (extra = "merge" keeps anything after the first space, such as an AM/PM marker, in the "time" column):

sleep_day <- separate(sleep_day, SleepDay, into=c("date","time"), sep = " ", extra = "merge")

Now the "date" column is formatted the same way as "ActivityDate" in the "daily_activity" dataframe.
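An alternative sketch is to parse both columns into real Date values instead of comparing text. The timestamp layout below (month/day/year followed by a time) is an assumption about how the files store dates, and the rows are made up.

```r
# Toy rows; the timestamp layout is an assumption about the real files.
sleep_day <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
)
daily_activity <- data.frame(
  Id = c(1, 1),
  ActivityDate = c("4/12/2016", "4/13/2016")
)

# as.Date() reads only the part of the string described by `format`,
# so the trailing time is ignored.
sleep_day$date <- as.Date(sleep_day$SleepDay, format = "%m/%d/%Y")
daily_activity$date <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")

print(class(sleep_day$date))
print(identical(sleep_day$date, daily_activity$date))
```

With both columns stored as Date, they can be compared, sorted, and joined without worrying about small differences in text formatting.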

7). Accuracy (whether the data conforms to the actual entity being measured)

3. This dataset contains only a limited number of participants (fewer than 40), and their physical condition is unknown, so the results may not be representative. We would need more participants and data to draw solid conclusions. The following analysis is therefore only a demonstration of how I do data analysis with R.

4. This dataset is hosted on Kaggle and contains personal tracker data that Fitbit users consented to share; it is not an official Fitbit release.

Process/analyze

In this research, I chose R for my analysis.

Since the data has been checked for integrity and cleaned, these CSV files are ready to be imported into RStudio directly.

1) Get a general look at the dataframe

head(daily_activity)

view(daily_activity) etc.

2) Get summary statistics of the dataframe (some of the explorations here are unnecessary; they are just examples)

e.g. How many distinct users are in the dataframe?

n_distinct(daily_activity$Id)

What are the summary statistics of the key columns?

daily_activity %>%

select(TotalSteps, TotalDistance,SedentaryMinutes) %>%

summary()

What is each user's average number of steps per day?

daily_activity %>%

group_by(Id) %>% drop_na() %>% summarize(average_steps= mean(TotalSteps))

3) transforming data for better analysis

e.g. I created a new column called sport_Minutes, which is the sum of VeryActiveMinutes and FairlyActiveMinutes.

daily_activity <- daily_activity %>%

  mutate(sport_Minutes = VeryActiveMinutes + FairlyActiveMinutes)

4) Merging dataframes into one dataframe

eg.

combine_data <- full_join(daily_activity_key, sleep_day_key, by="Id")

Here comes a problem with the date. If we join on "Id" alone, every activity row of a user is matched with every sleep row of the same user, regardless of the day recorded in "ActivityDate" in the daily activity file and "SleepDay" in the sleep file. Merging this way therefore produces plenty of redundant rows.

In other words, the merged dataset should be merged based on two columns: "Id" and Date.

At this point, I ran into problems merging on multiple columns that I could not solve. Therefore I created a new column called "IDDate" for the "sleepday" dataframe, which concatenates "Id" and "Date".

sleep_day3 <- unite(sleep_day2, "IDDate", Id, Date, sep=" ")

I did the same for the daily_activity dataframe.

daily_activity <- unite(dailyActivity_merged, IDDate, Id, ActivityDate,sep=" ")

And finally, I could merge two tables into one dataframe:

combine_data = full_join(sleep_day3,daily_activity,by="IDDate")
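For reference, dplyr joins can also match on two columns directly, without building a combined key. The sketch below is a minimal example on made-up tables; it assumes the date text is stored in the same layout in both tables and that the sleep table's column is called "date".

```r
library(dplyr)

# Toy stand-ins; the values are made up for illustration.
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(5000, 7000, 9000)
)
sleep_day <- data.frame(
  Id = c(1, 2),
  date = c("4/12/2016", "4/12/2016"),  # assumed column name
  TotalMinutesAsleep = c(420, 380)
)

# A named element in `by` pairs columns whose names differ between tables:
# Id matches Id, and ActivityDate matches date.
combined <- full_join(
  daily_activity, sleep_day,
  by = c("Id", "ActivityDate" = "date")
)

print(combined)
```

Each activity row is now matched only with the sleep row of the same user on the same day, and days without a sleep record get NA in the sleep columns. Base R's merge() with by.x and by.y would be another way to do the same thing.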

After this step, I have a dataframe that I think is useful and suited to testing the assumptions I made before the analysis. Then I started to explore the dataframe by creating plots with ggplot2.

3. Analyze

1) The relationship between calories burned and distance walked.

ggplot(combine_data2, aes(x=Calories,y=TotalDistance))+ geom_point() + geom_smooth(method="loess")

I assumed that the more calories people burn in a day, the longer the distance they walk that day. The scatter plot supports this: distance appears to be strongly related to calories burned.

2) The relationship between sleep time and calories burned

ggplot(combine_data2, aes(x=TotalMinutesAsleep,y=Calories))+ geom_point()

I assumed that there might be a relationship between calories burned and sleep time, but the result did not turn out as I expected. There seems to be no direct connection between the two.

The same is true for total steps and the distance people walk every day.

3) However, the interesting thing is that there is a possible relationship between participants' sedentary time and sleep time.

We may see a trend that the more time people spend sitting still, the less sleep they get on that day.
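A trend like this can be plotted and given a number with a correlation coefficient. The sketch below uses invented values purely to show the calls; it is not the real data, and the real analysis would use the merged dataframe instead of the toy table.

```r
library(ggplot2)

# Made-up values for illustration only, not taken from the dataset.
toy <- data.frame(
  SedentaryMinutes = c(600, 650, 700, 750, 800, 850),
  TotalMinutesAsleep = c(480, 470, 430, 420, 390, 360)
)

# Scatter plot with a straight-line fit to show the direction of the trend
p <- ggplot(toy, aes(x = SedentaryMinutes, y = TotalMinutesAsleep)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Pearson correlation; a negative value means more sitting goes with less sleep.
# use = "complete.obs" skips rows where either value is missing.
r <- cor(toy$SedentaryMinutes, toy$TotalMinutesAsleep, use = "complete.obs")
print(r)
```

A correlation close to -1 or 1 indicates a strong linear relationship and a value near 0 a weak one. Either way, it describes an association and does not show that sitting causes shorter sleep.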

The next step is sharing, which means presenting my findings in a slide deck or dashboard. I will not include that part in this post.

Thank you for reading.
