(Data analysis with SQL/Tableau) How do people use bike-sharing service in the first quarter of 2022?
This dataset is from: https://divvy-tripdata.s3.amazonaws.com/index.html .
I did the same research via Excel based on the data from one month in another post. In this post, I will do a similar analysis based on the first quarter's data in 2022 via SQL and Tableau.
The sample csv file is like this :
I did the same research via Excel based on the data from one month in another post. In this post, I will do a similar analysis based on the first quarter's data in 2022 via SQL and Tableau.
The sample csv file is like this :
This research is designed to ask the following questions:
1. How do members and casual riders use bike-sharing services differently?
2. Which type of bikes do riders prefer?
I will use Bigquery for data process and analysis. And Tableau for data visualization.
Prepare the data
1. Import data: I directly import data from csv files.
I created a dataset called bikes_use and uploaded the csv files from January to March.
2. Check for data integrity:
1) data type: when I imported the data to Bigquery, it directly showed the datatype of columns:
--We can also use the following code to show the column data type.
--Through this step, we know that these three tables have the same data type for each column.
select
column_name,data_type
From `evident-apricot-368322.bikes_use.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = '202201_tripdata'
--In addition, Since this dataset contains the trip data from January to March, I want to ensure that these three tables have the same column names.
SELECT
column_name,
COUNT(table_name)
FROM
`evident-apricot-368322.bikes_use.INFORMATION_SCHEMA.COLUMNS`
GROUP BY
1;
These three tables contain the same column names. By doing so, I may use the union all function to unite these three tables together.
--since I need to calculate how long people used bikes, therefore I want to ensure that the data type of the columns are in the same data type
2) Data range :
To check the Data range of each table, we can use functions like AVG(), MAX(), MIN(), etc. Here I skip the code writing in this part since it is simple.
--find out if there are null cells in the mandatory columns
select
countif(started_at is null)
From `evident-apricot-368322.bikes_use.202201_tripdata`
4) Unique:
--Find out if there are duplicates :
select
(select count(1)
from (select distinct * from `evident-apricot-368322.bikes_use.202201_tripdata`)) as distinct_rows,
(select count(1) from `evident-apricot-368322.bikes_use.202201_tripdata`) as total_rows
6) Accuracy: I assume that the data provided by company A is credible since I have no way to check if the data is accurate or not.
3. This dataset is from company A directly and is accessible to everyone who has the link.
Process/Analyze:
1. get the data about how members and casual riders use bikes differently by the type of bike.
With ridedata as
(Select
ride_id, rideable_type,member_casual,
ROUND(TIMESTAMP_DIFF(ended_at,started_at,SECOND)/60,2) as ride_time,
Format_date("%A",started_at) as day_of_week,extract(Month from started_at) as Month
from `evident-apricot-368322.bikes_use.202201_tripdata`
UNION ALL
SELECT
ride_id, rideable_type,member_casual,
ROUND(TIMESTAMP_DIFF(ended_at,started_at,SECOND)/60,2) as ride_time,
Format_date("%A",started_at) as day_of_week,extract(Month from started_at) as Month
from `evident-apricot-368322.bikes_use.202202_tripdata`
UNION ALL
SELECT
ride_id, rideable_type,member_casual,
ROUND(TIMESTAMP_DIFF(ended_at,started_at,SECOND)/60,2) as ride_time,
Format_date("%A",started_at) as day_of_week,extract(Month from started_at) as Month
from `evident-apricot-368322.bikes_use.202203_tripdata`)
Select distinct member_casual, rideable_type,Month, count(rideable_type) as ride_times
from ridedata
group by member_casual,rideable_type, Month
2. get the data about how many times people ride a bike during a week
Select member_casual, day_of_week, count(ride_id) as ride_times
from ridedata
group by member_casual, day_of_week
3. get the data about how long people ride a bike each day of the week.
Select distinct member_casual, day_of_week, round(avg(ride_time),2) as average_ride_time
From ridedata
group by member_casual,day_of_week
Data visualization.
1) Members: It seems that members have a pretty fixed environment for riding bikes since the fluctuation of their ride-length data on each day of the week is minor. Besides, members prefer to ride bikes during the weekdays.
2) Casual riders: the times they ride bikes during the week change slightly. However, they ride much longer than members and they prefer to use bikes on the weekend.
3) There is an obvious growth in the ride times in March.
4) It seems that casual riders prefer electric bikes more than classic bikes. On the contrary, members prefer classic bikes more.
Thanks for your reading.
Comments
Post a Comment