Another practical example for conducting a statistical analysis in Python
My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft that contains data from a fictional company and is widely used for learning SQL. Having already covered Welch’s t-test, this blog covers the independent t-test. The following blogs deal with a one-way ANOVA, a two-way ANOVA and a Chi-Squared Test for Independence.
The complete Jupyter Notebook can be found here.
The question to answer and analyze is:
Do people who order more frequently than average, also generate
more sales? If so, would it make sense for companies to target
customers who, regardless of how much or how little they spend,
make more frequent purchases on the whole?
Here’s what’s covered in this section:
- Obtaining the data
- Data Cleaning & Feature Engineering
- The Hypothesis
- The Analysis
Obtaining the data
This section was already extensively discussed in the first blog regarding statistical analysis.
Data Cleaning & Feature Engineering
First we need to consider which tables to join in order to have all data ready for our analysis. This time it might be suitable to take the
- “OrderDetail”,
- “Order” and
- “Customer”
tables and, when joining them, give the columns meaningful names. We’re using the following query:
df2 = pd.read_sql_query('''
SELECT c.Id customer_id, o.Id order_id, od.Quantity quantity, od.UnitPrice unit_price, od.Discount discount
FROM OrderDetail od
JOIN [Order] o ON od.OrderId = o.Id
JOIN Customer c ON o.CustomerId = c.Id
''', engine)  # engine: the database connection set up in the first blog
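If you want to try the join without the actual Northwind file, here is a minimal, self-contained sketch using an in-memory SQLite database. The table and column names mirror the Northwind schema, but the single row of data is made up for illustration:

```python
import sqlite3
import pandas as pd

# in-memory stand-in for the Northwind database (illustration only)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Customer (Id TEXT)')
cur.execute('CREATE TABLE "Order" (Id INTEGER, CustomerId TEXT)')
cur.execute('CREATE TABLE OrderDetail '
            '(OrderId INTEGER, Quantity INTEGER, UnitPrice REAL, Discount REAL)')
cur.execute("INSERT INTO Customer VALUES ('ALFKI')")
cur.execute("INSERT INTO \"Order\" VALUES (1, 'ALFKI')")
cur.execute('INSERT INTO OrderDetail VALUES (1, 3, 10.0, 0.05)')
conn.commit()

# the same join as above, run against the toy database
df = pd.read_sql_query('''
    SELECT c.Id customer_id, o.Id order_id,
           od.Quantity quantity, od.UnitPrice unit_price, od.Discount discount
    FROM OrderDetail od
    JOIN "Order" o ON od.OrderId = o.Id
    JOIN Customer c ON o.CustomerId = c.Id
''', conn)
print(df.shape)  # (1, 5)
```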
Let’s check the newly created dataframe:
As we’re working with frequencies, let’s calculate the mean and median amount of orders to get a general idea of the data:
mean = df2.groupby('customer_id')['order_id'].count().mean()
median = df2.groupby('customer_id')['order_id'].count().median()
print('Mean: ', mean, ' Median: ', median)
Mean: 24.44705882352941 Median: 20.0
It’s usually better to stick to the median, because unlike the mean, the median is not influenced by outliers.
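To see why, consider a made-up set of order counts in which a single heavy buyer distorts the mean but barely moves the median:

```python
import numpy as np

# hypothetical order counts; the last customer is an extreme outlier
orders = np.array([10, 12, 15, 18, 20, 200])
print(np.mean(orders))    # ≈ 45.83, pulled up by the outlier
print(np.median(orders))  # 16.5, barely affected
```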
Now there are three engineering steps left to get the data we need:
- Let’s create the spending columns, both “gross” and “net of discounts” and a dataframe by spending:
# multiplying quantity and unit price to get a column with gross spending
df2['spend_gross'] = np.multiply(df2['quantity'], df2['unit_price'])
# subtracting discount to get net spending
df2['spend_net'] = np.subtract(df2['spend_gross'], df2['discount'])
# first dataframe by spending
df2_spending = df2.groupby('customer_id')[['spend_gross', 'spend_net']].sum()
- Let’s build a dataframe by sales frequency and label each customer as below or above median sales:
# second dataframe by sales frequency
df2_frequency = pd.DataFrame(df2.groupby('customer_id')['order_id'].count())
df2_frequency.columns = ['order_amount']
# creating a new column in the second dataframe,
# where all customers with a total amount of orders UP TO THE MEDIAN are classified as "Below Median"
# and all customers with a total amount of orders GREATER THAN THE MEDIAN as "Above Median"
df2_frequency['sales_frequency'] = df2_frequency['order_amount'].apply(
    lambda x: 'Below Median' if x <= median else 'Above Median')
df2_frequency['sales_frequency'].value_counts()
Below Median 43
Above Median 42
Name: sales_frequency, dtype: int64
Good, we have two nearly equal-sized groups to work with!
- Finally, let’s concat both dataframes:
df2_full = pd.concat([df2_spending, df2_frequency], axis=1)
4 columns, 85 rows to work with!
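As a quick toy example (with made-up customer ids and values), this is how pd.concat with axis=1 aligns both frames on their shared index:

```python
import pandas as pd

# toy frames indexed by hypothetical customer ids
spending = pd.DataFrame({'spend_gross': [105.0, 52.0],
                         'spend_net': [100.0, 50.0]}, index=['A', 'B'])
frequency = pd.DataFrame({'order_amount': [25, 5],
                          'sales_frequency': ['Above Median', 'Below Median']},
                         index=['A', 'B'])

# axis=1 places the frames side by side, matching rows by index
full = pd.concat([spending, frequency], axis=1)
print(full.shape)  # (2, 4)
```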
Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:
Ho: People who order more frequently (= above the median) do not spend more money on average than people who order less frequently:
𝜇1 = 𝜇2
H1: People who order more frequently also spend more money on average than people who order less frequently:
𝜇1 > 𝜇2
The significance level is set to 5%:
𝛼 = 0.05
Since the direction of the difference is specified, it’s a one-tailed test.
To determine whether the mean difference between two unrelated groups is statistically significant, the independent t-test is typically used. Several assumptions must be met, such as:
- The independent variable is categorical with at least two levels (groups).
- The dependent variable is continuous, i.e. measured on an interval or ratio scale.
- The dependent variable should not contain any significant outliers.
- The variances between the two groups are equal.
- The dependent variable should be normally distributed in both groups.
The expenditures of frequent and non-frequent buyers are not normally distributed, but the independent t-test remains quite robust even to violations of normality. Nevertheless, we should keep in mind that its reliability decreases for skewed distributions.
All goodness-of-fit tests are left out here, but can be looked up in detail in my notebook.
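As a minimal sketch of what such a check could look like, here is a Shapiro-Wilk normality test run on a synthetic, strongly right-skewed sample (not the actual spending data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# exponential data are strongly right-skewed, loosely resembling spending data
sample = rng.exponential(scale=1000, size=100)

# Shapiro-Wilk: small p-value -> reject the hypothesis of normality
stat, p = stats.shapiro(sample)
print(p < 0.05)  # True: normality is rejected
```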
To conduct the t-test, we first have to prepare both of our samples,
then plug them into the built-in function:
from scipy import stats

# create criteria for each sample
crit_above = df2_full['sales_frequency'] == 'Above Median'
crit_below = df2_full['sales_frequency'] == 'Below Median'

# slice spendings for each criterion into separate series
series_above = df2_full['spend_net'].where(crit_above).dropna()
series_below = df2_full['spend_net'].where(crit_below).dropna()

# calculating t-value and two-tailed p-value
t, p_two = stats.ttest_ind(series_above, series_below)
Lastly, I calculate the one-tailed p-value, i.e. the probability mass in the rejection region. Given our directional alternative hypothesis:
𝜇1 > 𝜇2, or stated differently: 𝜇1 – 𝜇2 > 0,
the rejection region lies in the right tail because of the ” > 0 “.
Or in other words: I’m trying to get the probability of observing a test statistic at least this extreme GIVEN that the null hypothesis is true. If this probability is very small, I will reject the null hypothesis and instead work with the alternative hypothesis.
# sample sizes determine the degrees of freedom
n1, n2 = len(series_above), len(series_below)
# using the cumulative distribution function (CDF) with the t-value and df
p = 1 - stats.t.cdf(t, n1 + n2 - 2)
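Since scipy.stats.ttest_ind returns a two-tailed p-value, an equivalent shortcut when the t-value is positive is to simply halve it. A sketch with synthetic samples (stand-ins for the Northwind data) shows that both routes agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 2, 50)  # stand-in for the "above median" group
b = rng.normal(8, 2, 50)   # stand-in for the "below median" group

t, p_two = stats.ttest_ind(a, b)
n1, n2 = len(a), len(b)

# right-tailed p from the CDF, as in the post
p_one = 1 - stats.t.cdf(t, n1 + n2 - 2)
# for t > 0 this equals half the two-tailed p-value
print(np.isclose(p_one, p_two / 2))  # True
```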
The independent t-test results are significant! Therefore, we can reject the null hypothesis in support of the alternative.
While it’s good to know if there is a statistically significant effect of some intervention on the outcome, it’s as important to know the size of the effect the intervention has on the outcome. Statistical significance does not always translate into a large effect in the real world. This is important to consider because everything costs money and time. To see this, we can calculate the effect size.
To calculate Cohen’s d as a measure of effect size, this function takes in the two samples and, using their means and standard deviations, returns Cohen’s d:
from statistics import mean, stdev
from math import sqrt

def cohens_d(x, y):
    """ Function takes in two samples as arguments and
    calculates Cohen's d as a measure of effect size. """
    return (mean(x) - mean(y)) / sqrt((stdev(x)**2 +
                                       stdev(y)**2) / 2)
Now let’s put this function into action:
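A self-contained illustration with two small synthetic samples (not the real spending data) shows the kind of value the function returns:

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(x, y):
    """Cohen's d as a measure of effect size."""
    return (mean(x) - mean(y)) / sqrt((stdev(x)**2 + stdev(y)**2) / 2)

# made-up samples for illustration only
d = cohens_d([2, 4, 6, 8], [1, 2, 3, 4])
print(round(d, 2))  # 1.22
```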
Cohen’s d may also take on values above 1 (it is defined from -∞ to +∞). In our case it indicates a remarkably large effect!
A one-tailed t-test with 𝛼 = 0.05 was used to analyze whether customers who order more frequently also spend more money on average than customers who order less frequently:
A customer who buys more often than the median value was classified as a frequent customer, and all others as infrequent customers.
The average spending of a frequent customer is much higher (26917 ± 27175 units) than that of an infrequent customer (4209 ± 3631 units). Note that the first group’s large standard deviation would even allow negative spending values, perhaps due to discounts. These numbers come from series_above.describe() and series_below.describe().
There was a statistically significant difference between both groups’ expenditures with t(42) = 5.4309 and p = 0.00000055.
I can therefore reject the null hypothesis in support of the alternative: Customers who order more frequently also spend more money on average than customers who order less frequently.
Moreover, the calculated effect size is extremely high, pointing to a large practical significance.
Therefore, in order to generate higher revenues, the results could be used by the company to think of strategies targeting customers who, regardless of how much or how little they spend, make more frequent purchases on the whole.
. . . . . . . . . . . . . .
Thank you for reading!
If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.
I hope you enjoyed reading this article, and am always happy to get
critical and friendly feedback, or suggestions for improvement!