(Power BI & Python) Sales performance of different sales methods on the same product


Case background:
Six weeks ago, Company A launched a new line of office stationery. Company A wants to know the sales performance during these six weeks and how the sales methods affected it.


- How many customers were there for each approach? 
- What does the spread of the revenue look like overall? And for each method? 
- Was there any difference in revenue over time for each of the methods?

(The dashboard created in Power BI is shown here; the detailed analysis process follows the dashboard.)

import pandas as pd
import numpy as np
df = pd.read_csv("product_sales.csv")
df.head()
   week  sales_method                           customer_id  nb_sold  revenue  years_as_customer  nb_site_visits      state
0     2         Email  2e72d641-95ac-497b-bbf8-4861764a7097       10      NaN                  0              24    Arizona
1     6  Email + Call  3998a98d-70f5-44f7-942e-789bb8ad2fe7       15   225.47                  1              28     Kansas
2     5          Call  d1de9884-8059-4065-b10f-86eef57e4a44       11    52.55                  6              26  Wisconsin
3     4         Email  78aa75a4-ffeb-4817-b1d0-2f030783c5d7       11      NaN                  3              25    Indiana
4     3         Email  10e6d446-10a5-42e5-8210-1b5438f70922        9    90.49                  0              28   Illinois

5 rows × 8 columns
df.shape
(15000, 8)
df.columns
Index(['week', 'sales_method', 'customer_id', 'nb_sold', 'revenue',
       'years_as_customer', 'nb_site_visits', 'state'],
      dtype='object')
df.dtypes
week                   int64
sales_method          object
customer_id           object
nb_sold                int64
revenue              float64
years_as_customer      int64
nb_site_visits         int64
state                 object
dtype: object

From the output above, we can see that this dataset has 15000 rows and 8 columns. Additionally, we can check whether each column's data type matches the description given in the exam paper.

(Image: the column descriptions from the exam paper.)

It seems that the data types of this dataset match up with the description.
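To make this check reproducible, we can also compare the actual data types against the expected ones in code. Below is a minimal sketch; the expected_dtypes mapping is transcribed from the dtype output above.

# expected dtypes transcribed from the column descriptions
expected_dtypes = {
    'week': 'int64', 'sales_method': 'object', 'customer_id': 'object',
    'nb_sold': 'int64', 'revenue': 'float64',
    'years_as_customer': 'int64', 'nb_site_visits': 'int64', 'state': 'object',
}
# list any columns whose actual dtype differs from the expected one
mismatches = {col: str(dtype) for col, dtype in df.dtypes.items()
              if str(dtype) != expected_dtypes[col]}
print(mismatches or 'All dtypes match the description.')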

df.duplicated().sum()
0
df.isna().sum()
week                    0
sales_method            0
customer_id             0
nb_sold                 0
revenue              1074
years_as_customer       0
nb_site_visits          0
state                   0
dtype: int64

The result shows that there are 1074 missing values in 'revenue'. To decide how to handle them, let's first look at the percentage of missing values in this column.

missing_pct = round(df.isnull().sum()/len(df)*100,1)
print(missing_pct)
week                 0.0
sales_method         0.0
customer_id          0.0
nb_sold              0.0
revenue              7.2
years_as_customer    0.0
nb_site_visits       0.0
state                0.0
dtype: float64

The result shows that 7.2% of the values in 'revenue' are missing. Since this is more than 5% but less than 10%, let's drop the rows that contain NA values.
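Before dropping, it is worth a quick sanity check that the missing revenue values are not concentrated in a single sales method, which would bias the later comparison. A hedged sketch follows; note that at this point the method labels have not yet been cleaned, so raw variants may appear.

# count missing revenue values per (raw, not-yet-cleaned) sales method label
print(df.loc[df['revenue'].isna(), 'sales_method'].value_counts())

If the missing values are spread roughly in proportion to each method's share of the data, dropping the rows should not distort the comparison.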

df=df.dropna()
df.isna().sum()
week                 0
sales_method         0
customer_id          0
nb_sold              0
revenue              0
years_as_customer    0
nb_site_visits       0
state                0
dtype: int64

We can see that all NA values have been dropped.

1) 'week'

Since this column records the week number, it should only contain integers. Moreover, the company wants to investigate sales of the new product from its launch six weeks ago until now, so the column should only contain integers from 1 to 6.

df['week'].unique()
array([6, 5, 3, 4, 1, 2])

According to the result, this column contains only integers from 1 to 6, so it is validated.

2) 'sales_method'

According to the case description, this column should contain only three sales methods: 'Email', 'Call', and 'Email + Call'. Let's check the unique values of this column.

df['sales_method'].unique()
array(['Email + Call', 'Call', 'Email', 'em + call', 'email'],
      dtype=object)

It looks like 'em + call' and 'email' are written in the wrong format. Therefore, we need to replace these two values with 'Email + Call' and 'Email' respectively.

corrections = {'email': 'Email', 'em + call': 'Email + Call'}
df['sales_method'] = df['sales_method'].replace(corrections)
df['sales_method'].unique()
array(['Email + Call', 'Call', 'Email'], dtype=object)

After the replacement, we can see that 'em + call' and 'email' have been corrected.

3) 'customer_id'

This column contains the ID of each customer and is described as a "unique identifier for the customer".

df['customer_id'].describe()
count                                    13926
unique                                   13926
top       3998a98d-70f5-44f7-942e-789bb8ad2fe7
freq                                         1
Name: customer_id, dtype: object

All values in this column are unique, so it is validated.

4) 'nb_sold'

This column contains the number of units sold in each deal, so it should not contain negative numbers.

df['nb_sold'].describe()
count    13926.000000
mean        10.040069
std          1.789690
min          7.000000
25%          9.000000
50%         10.000000
75%         11.000000
max         16.000000
Name: nb_sold, dtype: float64

We can see that all values here are positive integers.

5) 'revenue'

This column should only contain positive numbers, so let's check its statistical summary.

df['revenue'].describe()
count    13926.000000
mean        93.934943
std         47.435312
min         32.540000
25%         52.470000
50%         89.500000
75%        107.327500
max        238.320000
Name: revenue, dtype: float64

All values are positive. Now let's round this column to 2 decimal places.

df['revenue']=df['revenue'].round(2)

6) 'years_as_customer'

Since the company was founded in 1984, this column cannot contain a value larger than 2023 - 1984 = 39 years. Let's first check the statistical summary of this column.

df['years_as_customer'].describe()
count    13926.000000
mean         4.978960
std          5.047828
min          0.000000
25%          1.000000
50%          3.000000
75%          7.000000
max         63.000000
Name: years_as_customer, dtype: float64

Since the max value of this column is 63, it contains values larger than 39. Let's check how many such values there are.

outliers = df['years_as_customer']>39
print(outliers.sum())
2

There are only two values larger than 39, so the outliers make up a negligible share of the data. In this case, I replaced these two values with the mean of the column.

# identify the mean of this column
mean = df['years_as_customer'].mean()
# identify the outliers in the column
outliers = df['years_as_customer'] > 39
# replace the outliers with the mean of the column
df.loc[outliers, 'years_as_customer'] = mean
print(df['years_as_customer'].describe())
count    13926.000000
mean         4.971776
std          5.011183
min          0.000000
25%          1.000000
50%          3.000000
75%          7.000000
max         39.000000
Name: years_as_customer, dtype: float64

Both outliers have been replaced.

7) 'nb_site_visits'

This column records the number of times the customer visited our website in the last six months, so it should only contain positive numbers.

df['nb_site_visits'].describe()
count    13926.000000
mean        24.947006
std          3.491491
min         12.000000
25%         23.000000
50%         25.000000
75%         27.000000
max         37.000000
Name: nb_site_visits, dtype: float64

There are only positive numbers in this column.

8) 'state'

This column contains the customer's location. Normally I would use a spell-checking package to catch misspellings in such a column, but that is difficult in the DataCamp workspace since I would need to install the package first. Therefore, I checked the unique values one by one.

# sort the unique value of 'state'
sorted_value =sorted(df['state'].unique())
print(sorted_value)
['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

There are no misspellings in this column.
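As an alternative to eyeballing the sorted list, the column can be validated against a canonical set of the 50 US state names. This is only a sketch; the us_states set below is a hand-typed reference list, not part of the dataset.

# canonical set of the 50 US state names
us_states = {
    'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
    'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
    'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
    'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
    'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
    'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
    'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
    'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
    'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
    'West Virginia', 'Wisconsin', 'Wyoming',
}
# any value not in the canonical set would be a misspelling
print(set(df['state'].unique()) - us_states)

An empty set confirms that every value is a correctly spelled state name.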

The validation process ends here.
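As a recap, the rules established above can be re-checked in a single cell with assertions. This is a sketch based only on the constraints discussed in this section.

# re-check all validation rules established above
assert df['revenue'].notna().all()                              # no missing revenue
assert df['week'].isin(range(1, 7)).all()                       # weeks 1 to 6 only
assert df['sales_method'].isin(['Call', 'Email', 'Email + Call']).all()
assert df['customer_id'].is_unique                              # one row per customer
assert (df['nb_sold'] > 0).all() and (df['revenue'] > 0).all()  # positive values
assert df['years_as_customer'].between(0, 39).all()             # company founded in 1984
assert (df['nb_site_visits'] > 0).all()                         # positive visit counts
print('All validation checks passed.')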


Since each row of the dataset represents one sale to a unique customer, the groupby function can help answer the first question: how many customers were there for each approach?

df.groupby('sales_method')['customer_id'].count()
sales_method
Call            4781
Email           6922
Email + Call    2223
Name: customer_id, dtype: int64

We can see that 4781 customers were reached by 'Call', 6922 by 'Email', and 2223 by 'Email + Call'.

# import packages
import matplotlib.pyplot as plt 
import seaborn as sns
# set colors and order for sales method
method_order = ['Call', 'Email','Email + Call']
method_colors = {'Call': 'red', 'Email': 'green', 'Email + Call': 'blue'}
# draw a related bar chart 
_=sns.countplot(data=df, x='sales_method',palette = method_colors,order=method_order)
# create title and labels 
_=plt.title('Sales Method Counts')
_=plt.xlabel('Sales Method')
_=plt.ylabel('Count of customers')
# show the plot 
plt.show()

A pie chart can also illustrate this breakdown.

counts = df['sales_method'].value_counts(normalize=True) * 100
_=plt.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
_=plt.title('Share of customers reached by each sales method')
plt.show()

From the plots above, we can see that 16% (2223 customers) were reached by the 'Email + Call' method, around a third (4781) by 'Call', and around half (6922) by 'Email'.

Let's first take a general look at the overall revenue distribution. A boxplot shows the statistical summary of revenue.

# create a boxplot
_=plt.boxplot(df['revenue'])
# set title and label 
_=plt.title('Revenue Boxplot')
_=plt.ylabel('Revenue in each deal')
# show the plot
plt.show()

A violin plot may also be useful for examining the revenue distribution.

# create a violin plot
_=sns.violinplot(data=df, y='revenue')
# set title and label name
_=plt.title('Revenue Violin Plot')
_=plt.ylabel('Revenue in each deal')
# show the plot
plt.show()

The plots above reveal the overall distribution of revenue per deal. Revenue ranges from roughly 30 to 240, with a median of around 90. The interquartile range indicates that the middle 50% of deals fall between roughly 50 and 110. The violin plot also shows that a large number of deals were closed with revenue between about 20 and 130.

A boxplot can also be used to compare the revenue spread for each method.

# set the display order and colors for each sales method 
method_order = ['Call', 'Email','Email + Call']
method_colors = {'Call': 'red', 'Email': 'green', 'Email + Call': 'blue'}

# draw a box plot
_ = sns.boxplot(x='sales_method',y ='revenue',data =df,order= method_order,palette=method_colors)
# Add labels
_=plt.xlabel('sales method')
_=plt.ylabel('revenue')
# show the plot
plt.show()

A violin plot may also show the overall picture:

# draw a violin plot included this three sales method
_=sns.violinplot(data=df, x='sales_method', y='revenue', hue='sales_method',order= method_order,palette=method_colors)

# Add labels to the plot
_=plt.xlabel('Sales Method')
_=plt.ylabel('Revenue')
_=plt.title('Distribution of Revenue by Sales Method')

# Show the plot
plt.show()

The plots above show that deals closed with the 'Email + Call' method generate the highest average revenue, the 'Call' method generates the lowest, and 'Email' falls in between.
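The visual impression can be confirmed numerically with per-method summary statistics (a quick check; running it yourself shows the exact figures).

# mean and median revenue per deal for each sales method
print(df.groupby('sales_method')['revenue'].agg(['mean', 'median']))

Beyond per-deal averages, let's also compare the total revenue generated by each method.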

# Create a dataframe of sum revenue group by "sales_method"
revenue_in_each_method = df.groupby(['sales_method'])['revenue'].sum().reset_index()
# Draw the plot 
_ = sns.barplot(x='sales_method',y ='revenue',data =revenue_in_each_method,order = method_order,palette=method_colors)

# Add labels
_=plt.xlabel('sales method')
_=plt.ylabel('revenue')
# show the plot
plt.show()

A pie chart is also intuitive for showing the total revenue created by each method.

revenue_by_method = df.groupby('sales_method')['revenue'].sum()
method_labels = revenue_by_method.index
method_colors = ['red', 'green', 'blue']

_ = plt.pie(revenue_by_method, labels=method_labels, colors=method_colors, autopct='%1.1f%%')

# Add title
_ = plt.title('Revenue by Sales Method')

# Show the plot
plt.show()

Although 'Email + Call' is the most profitable method per deal, its total revenue is not the highest; 'Email' created the highest total revenue. 'Call' reached more than twice as many customers as 'Email + Call', yet earned less per deal. 'Email' seems to be a reliable way to reach clients, since it produced the highest earnings over the six weeks.

Turning to the last question, let's see whether revenue differed over time for each method, starting with the average revenue per deal each week.

# average revenue per deal for each week and sales method
weekly_average_revenue = df.groupby(['week', 'sales_method'])['revenue'].mean().reset_index()
_=sns.catplot(data=weekly_average_revenue, x='week', y='revenue', kind='point', hue='sales_method', palette=method_colors)
# show the plot
plt.show()

A bar plot may also illustrate the trend.

_= sns.barplot(data = weekly_average_revenue, x='week', y='revenue', hue ='sales_method',palette=method_colors)
plt.show()

According to these plots, the average revenue per deal grew progressively over the past six weeks for all sales methods. One possible reason is the rise in the number of units sold per deal, shown below.


# average units sold per deal for each week and sales method
df1 = df.groupby(['week', 'sales_method'])['nb_sold'].mean().reset_index()
# draw a line plot
_=sns.lineplot(data=df1, x='week', y='nb_sold', hue='sales_method', palette=method_colors)
# show the plot
plt.show()

The plot shows that the number of units sold per deal increased steadily over time. If we assume that the unit price did not vary dramatically, this increase in units sold directly drives the increase in average revenue per deal.
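This assumption can be checked with a rough implied unit price, i.e. revenue divided by units sold. The sketch below uses a hypothetical helper column, revenue_per_unit, created on the fly; if this ratio stays broadly flat across weeks for each method, the growth in revenue per deal is indeed driven by unit volume.

# implied revenue per unit, averaged per week and sales method
print(df.assign(revenue_per_unit=df['revenue'] / df['nb_sold'])
        .groupby(['week', 'sales_method'])['revenue_per_unit']
        .mean().unstack())

Next, let's look at the total weekly revenue for each method.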


# total revenue for each week and sales method
weekly_revenue = df.groupby(['week', 'sales_method'])['revenue'].sum().reset_index()
_=sns.lineplot(data=weekly_revenue, x='week', y='revenue', hue='sales_method', palette=method_colors)
# show the plot
plt.show()

A stacked area plot can also reveal the situation in an intuitive way.

# Pivot data to create a table with weeks as rows, sales methods as columns, and revenue as values
weekly_revenue_pivot = weekly_revenue.pivot(index='week', columns='sales_method', values='revenue')

# Create stacked area plot
_ = weekly_revenue_pivot.plot(kind='area', stacked=True, color=method_colors)

# Add labels
_ = plt.xlabel('Week')
_ = plt.ylabel('Revenue')
_ = plt.title('Revenue by Week and Sales Method')

# Show the plot
plt.show()

From the chart above, we can see that total weekly revenue shows a general downward trend. The main driver is a sharp drop in revenue from 'Email', which the other two methods did not grow enough to offset: 'Call' revenue stayed relatively flat, and 'Email + Call' is the only method showing clear growth.

Combining this with what we saw for the average revenue per deal, revenue per deal rose while total weekly revenue fell across the six weeks. The likely reason is a decrease in the total number of units sold. Further analysis of units sold and deal counts follows.

# total units sold for each week and sales method
sold_unit = df.groupby(['week', 'sales_method'])['nb_sold'].sum().reset_index()
_=sns.lineplot(data=sold_unit, x='week', y='nb_sold', hue='sales_method', palette=method_colors)
plt.show()

# number of deals for each week and sales method
deals_count = df.groupby(['week', 'sales_method'])['customer_id'].count().reset_index()
_=sns.lineplot(data=deals_count, x='week', y='customer_id', hue='sales_method', palette=method_colors)
plt.show()

The line plots above show that the number of deals and the number of units sold follow a pattern very similar to the total revenue plot. This suggests that the decrease in total revenue results from the reduction in units sold and in deals closed, driven mainly by the sharp drop in 'Email' sales.
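A compact weekly summary puts the three quantities side by side and makes this relationship explicit (a sketch combining the aggregations used above).

# weekly totals: number of deals, units sold, and revenue
weekly_summary = df.groupby('week').agg(
    deals=('customer_id', 'count'),
    units_sold=('nb_sold', 'sum'),
    total_revenue=('revenue', 'sum'),
)
print(weekly_summary)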
