The last practical example for conducting a statistical analysis in Python. Really!
My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. After conducting quite a lot of tests, ranging from Welch's t-test and the independent t-test to one-way and two-way ANOVA, in this last part we're going to practice a Chi-Squared Test of Independence.
The complete Jupyter Notebook can be found here.
The question to answer and analyze is:
Is there a relation between the percentage of a given shipping company's number of shares within a certain region vs. globally? In other words, is their percentage of shares locally proportional to their percentage of shares globally? If not, could shipping companies that already hold a large number of total shares within certain regions be provided with more freights/higher shipping volumes (within their regions), in return for decreasing shipping costs?
Here’s what’s covered in this section:
- Obtaining the data
- Data Cleaning & Feature Engineering
- The Hypothesis
- The Analysis
- Conclusions
Obtaining the data
This section was already discussed extensively in the first blog regarding statistical analysis.
Data Cleaning & Feature Engineering
What tables from our SQL database do we need? Well, we only need the following two:
- “Order” and
- “Shipper”
We’re using the pd.read_sql_query() method:
df5 = pd.read_sql_query('''
SELECT *
FROM [Order] o
JOIN Shipper s ON s.Id = o.ShipVia;
''', engine)
Let’s check the newly created dataframe:
df5.head()
Now we have some work to do in order to create a frequency table in the end. To do so, I first build a table with the observed values:
# observed values for each shipper by grouping
df5_observed = df5.groupby(['CompanyName', 'ShipRegion'])['Freight'].sum()
# unstack dataframe for more readability
df5_observed = df5_observed.unstack()
# add row sum
df5_observed['ROW_TOTAL'] = df5_observed.sum(axis=1)
# change index column's name
df5_observed.index.names = ['Sample']
df5_observed.head()
Next comes the table with the expected values:
# adding the "sample" column to the original dataframe df5 to make unstacking for the expected values possible exactly like above
df5['Sample'] = 'COLUMN_TOTAL'
# create expected values:
df5_expected = df5.groupby(['Sample', 'ShipRegion'])['Freight'].sum()
# unstack dataframe for more readability
df5_expected = df5_expected.unstack()
# add row sum
df5_expected['ROW_TOTAL'] = df5_expected.sum(axis=1)
df5_expected
And now let’s combine both tables into a crosstab:
# concat both tables to one frequency table
df5_crosstab = pd.concat([df5_observed, df5_expected], axis=0)
df5_crosstab
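As a side note, the groupby/unstack/concat steps above can be collapsed into a single pd.crosstab call: with margins=True pandas appends the row and column totals automatically. A minimal sketch on a hypothetical toy frame (same column names as in the post):

```python
import pandas as pd

# Hypothetical toy frame standing in for df5 (values are invented)
toy = pd.DataFrame({
    'CompanyName': ['Speedy Express', 'Speedy Express', 'United Package'],
    'ShipRegion':  ['Western Europe', 'South America', 'Western Europe'],
    'Freight':     [32.38, 65.83, 11.61],
})

# one call builds the whole frequency table, totals included
crosstab = pd.crosstab(toy['CompanyName'], toy['ShipRegion'],
                       values=toy['Freight'], aggfunc='sum', margins=True)
print(crosstab)
```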
The Hypothesis
Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:
H0: There is no relationship between each shipping company’s freight volume and supplied regions. Both categorical variables are probabilistically independent and each company has the same share in each region as it has globally:
H0: π1 = π2 = … = πn
H1: There is a relationship between each shipping company’s freight volume and supplied regions:
H1: πi ≠ πj for at least one pair i ≠ j
The significance level α is set to 5%.
The Analysis
To determine whether the proportions of all three shippers in the nine regions are the same, a χ² Test of Independence is appropriate. Because of the χ² distribution’s shape, the significance level is always specified on the right tail (no one-tailed or two-tailed specification is necessary).
There are several assumptions that need to be met for the results of a Chi-Squared Test to be considered accurate, and they are easy to validate:
- When testing the data, the cells should be frequencies, or counts of cases, and not percentages.
- The levels (categories) of the variables being tested are mutually exclusive.
- Each participant contributes to only one cell within the Chi-Squared table.
- The groups being tested must be independent.
- The value of expected cells should be greater than 5.
The assumptions are met, see the crosstab above.
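The “expected cells greater than 5” assumption can also be checked programmatically rather than by eye: scipy exposes the expected frequencies under independence. A quick sketch on a hypothetical toy table (no total row/column):

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# hypothetical observed counts, invented for illustration
observed = np.array([[20.0, 30.0],
                     [40.0, 10.0]])

# expected frequencies under the independence assumption
expected = expected_freq(observed)
print(expected)
print((expected > 5).all())  # True here, so the assumption holds
```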
The first, more tedious way is to calculate the Chi-Squared test statistic and degrees of freedom on our own:
def chi_statistic(dataframe):
    """ Function that takes in the dataframe and calculates the Chi
    Statistic for the Chi-Squared Test of Independence by:
    a) computing the expected frequency:
       E(i,j) = (Row Total * Column Total) / Grand Total
    b) calculating Chi-Squared:
       squaring the difference of the observed (O) and the expected (E)
       frequencies, dividing it by E and summing all values up:
       chi = Σ of ( (O(i,j) - E(i,j)) **2 ) / E(i,j) """
    # grab the grand total
    grand_total = dataframe.iloc[3, 9]
    # define an empty list for ALL expected chi-values
    chi_total = []
    # iterate through all rows (= samples)
    for row_j in range(0, 3):
        # calculate row_total for each row
        row_total = dataframe.iloc[row_j, 9]
        # define an empty list for the expected chi-values per single row
        chi_per_row = []
        # iterate through all columns in that row (= features)
        for column_i in range(0, 10):
            # grab the column total
            col_total = dataframe.iloc[3, column_i]
            # grab the observed value
            obs_value = dataframe.iloc[row_j, column_i]
            # first calculate the expected frequency
            exp_value = (row_total * col_total) / grand_total
            # then calculate the "single" chi value
            single_chi = ((obs_value - exp_value)**2) / exp_value
            # append it to the list
            chi_per_row.append(single_chi)
        # append the sum of row-wise chi-values to the overall list
        chi_total.append(sum(chi_per_row))
    # sum all values up to get the chi-squared statistic
    return sum(chi_total)

# and now let's run it:
chi_statistic(df5_crosstab)
Output:
2757.5246458453066
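For reference, the same statistic falls out of a few NumPy lines via the outer product of the margins. This is my own condensed sketch, assuming a plain observed table without the total row/column:

```python
import numpy as np

def chi_statistic_np(observed):
    """Chi-Squared statistic: sum over all cells of (O - E)^2 / E,
    with E built from the row/column margins under independence."""
    observed = np.asarray(observed, dtype=float)
    # outer product of row sums and column sums, scaled by the grand total
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# tiny made-up table just to show the call
print(chi_statistic_np([[10, 20], [30, 40]]))
```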
# getting the degrees of freedom: df = (n - 1) * (m - 1) where n = samples and m = features
df = (3-1) * (9-1)
df
Output:
16
Now checking the Chi-Squared table: for df = 16 and α = 0.05, the critical value of Chi-Squared is 26.296 – even for α = 0.01, the value is 32.000. Our test statistic is much higher and therefore statistically significant, so we can reject the null hypothesis.
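Instead of looking up a printed table, the critical values can also be pulled straight from scipy’s Chi-Squared distribution:

```python
from scipy.stats import chi2

# critical values of the Chi-Squared distribution for df = 16
print(chi2.ppf(0.95, df=16))  # ~26.296, the alpha = 0.05 cutoff
print(chi2.ppf(0.99, df=16))  # ~32.000, the alpha = 0.01 cutoff
```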
The second, more convenient way is to use the built-in chi2_contingency() function from scipy.stats to get the p-value:
from scipy import stats

chi, p, df, exp_frequencies = stats.chi2_contingency(df5_crosstab)
print(f"Chi Statistic: {chi}")
print(f"p-value (two-tailed): {p:.12f}")
Output:
Chi Statistic: 2757.5246458453066
p-value (two-tailed): 0.000000000000
The Chi-Squared value matches the one we calculated, and the p-value indicates a highly significant result.
The effect size for a Chi-Squared Test can be calculated with Cramér’s V:
import math

# chi-value we calculated with our function
chi = chi_statistic(df5_crosstab)
# observations, i.e. the length of the original dataframe
n = df5.shape[0]
# df = degrees of freedom from above
v = math.sqrt(chi/(n*df))
v
Output:
0.35078318989327995
According to Cohen, this can almost be considered a medium effect size.
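A caveat worth flagging: many textbooks define Cramér’s V with min(r − 1, c − 1) in the denominator rather than the full degrees of freedom, so the two variants give different numbers. A minimal sketch of that textbook version, shown on a made-up table rather than our crosstab:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Textbook Cramér's V: sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    observed = np.asarray(observed, dtype=float)
    # chi-squared statistic without Yates' continuity correction
    chi, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi / (n * min(r - 1, c - 1)))

# made-up 2x2 table just to show the call
print(cramers_v([[20, 30], [40, 10]]))  # ~0.408
```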
Conclusions
A Chi-Squared Test of Independence with α = 0.05 was used to analyze whether the variables “shipping company” and “region” are probabilistically independent, i.e. whether each shipping company has the same relative share in each region as it does globally:
- There was a statistically significant association between the n = 3 shipping companies and the m = 9 regions, with p < .001.
- Therefore I may reject the null hypothesis in support of the alternative: there is indeed a relationship between each shipping company’s freight volume and supplied regions.
- The effect size points to a medium effect, i.e. medium practical significance.
The results could be used to provide shipping companies that already hold a large number of total shares within certain regions with more freights/higher shipping volumes (within their regions), in return for reducing their shipping costs. Regions with skewed share distributions (where one shipping company already holds many shares) could be targeted. Examples: Federal Shipping already holds many shares in North America, Speedy Express in Western Europe, etc.
. . . . . . . . . . . . . .
Thank you for reading!
If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.
I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!