Hypothesis Testing #5: Chi-Squared Test for Independency

The last practical example for conducting a statistical analysis in Python. Really!

My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. After conducting quiet a lot of tests, ranging from Welch’s t-test, independent t-test, one-way ANOVA to two-way ANOVA , in this last part we’re going to practice a Chi-Squared Test for Independency.

The complete Jupyter Notebook can be found here.

The question to answer and analyze is:

Is there a relation between the percentage of a given shipping
company’s number of shares within a certain region vs. globally? In
other words, is their percentage of shares locally proportional to
their percentage of shares globally?

If not, could shipping companies who already hold a large number of total shares within certain regions be provided with more freights/higher shipping volumes(within their regions), in return for decreasing shipping costs?

Here’s what’s covered in this section:

  1. Obtaining the data
  2. Data Cleaning & Feature Engineering
  3. The Hypothesis
  4. The Analysis
  5. Conclusions

Obtaining the data

This section was already discussed extensively in the first blog regarding statistical analysis.

Data Cleaning & Feature Engineering

What tables from our SQL database do we need? Well, we only need the following two:

  • “Order” and
  • “Shipper”

We’re using the pd.read_sql_query() method:

df5 = pd.read_sql_query('''
SELECT *
FROM [Order] o
JOIN Shipper s ON s.Id = o.ShipVia;
''', engine)

Let’s check the newly created dataframe:

df5.head()

Now we have some work to do in order to create a frequency table in the end. To do so, I first build a table with the observed values:

# oberserved values for each shipper by grouping
df5_observed = df5.groupby(['CompanyName', 'ShipRegion'])['Freight'].sum()
# unstack dataframe for more readability
df5_observed = df5_observed.unstack()
# add row sum
df5_observed['ROW_TOTAL'] = df5_observed.sum(axis=1)
# change index column's name
df5_observed.index.names = ['Sample']
df5_observed.head()

Next comes the table with the expected values:

# adding the "sample" column to the original dataframe df5 to make unstacking for the expected values possible exactly like above
df5['Sample'] = 'COLUM_TOTAL'

# create expected values:
df5_expected = df5.groupby(['Sample', 'ShipRegion'])['Freight'].sum()
# unstack dataframe for more readability
df5_expected = df5_expected.unstack()
# add row sum
df5_expected['ROW_TOTAL'] = df5_expected.sum(axis=1)
df5_expected

And now let’s combine both table to have a crosstab:

# concat both tables to one frequency table
df5_crosstab = pd.concat([df5_observed, df5_expected], axis=0)
df5_crosstab

The Hypothesis

Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:

Ho: There is no relationship between each shipping company’s freight volume and supplied regions. Both categorical variables are probabilistically independent and each company has the same share in each region as it has globally:

H0: 𝑋1 = 𝑋2, ... , = π‘‹π‘š

H1: There is a relationship between each shipping company’s freight volume and supplied regions.

H1: 𝑋𝑖 β‰  𝑋𝑗

Significance level is set to the value of 5%.

The Analysis

To determine whether the proportions of all three shippers in the nine regions are the same, a πœ’2 – Test of Independence is appropriate. Because of the πœ’2 – Distribution’s shape, the significance level is always specified on the right (no one-tailed or two-tailed specification is necessary).

There are several assumptions that need to be met for the results of Chi-Squared Test to be considered accurate, which are easy to validate:

  • When testing the data, the cells should be frequencies, or counts of cases, and not percentages.

  • The levels (categories) of the variables being tested are mutually exclusive.

  • Each participant contributes to only one cell within the Chi-squared table.

  • The groups being tested must be independent.

  • The value of expected cells should be greater than 5.
    Assumptions are met, see crosstab above.

The first, more tedious way is to calculate Chi-test statistic and degrees of freedom on our own:

def chi_statistic(dataframe):
""" Function that takes in the dataframe and calculates the Chi
Statistic
for Chi-Squared Test of Independency by:

a) computing the expected frequency:
E(i,j) = (Row Total * Column Total) / Grand Total

b) calculating Chi-Squared:
squaring the product of observed (O) and the expected (E)
frequencies,
dividing it by E and summing all values up:
chi = Ξ£ of ( (O(i,j) * E(i,j)) **2 ) / E(i,j) """

# calculate grand
grand_total = dataframe.iloc[3, 9]

# define an empty list for ALL expected chi-values
chi_total = []

# iterate through all rows (=samples)
for row_j in range(0,3):

# calculate row_total for each row
row_total = dataframe.iloc[row_j, 9]

# define an empty list for the expected chi-values per single row
chi_per_row = []

# iterate through all columns in that row (= features)
for column_i in range(0,10):

# calculate the column total
col_total = dataframe.iloc[3, column_i]
# grab the observed value
obs_value = dataframe.iloc[row_j, column_i]
# calculate first the expected frequency
exp_value = (row_total * col_total) / grand_total
# calculate secondly "single" chi value
single_chi = ((obs_value-exp_value)**2) / exp_value
# append them to the list
chi_per_row.append(single_chi)

# append the sum of row-wise chi-values to the overall list
chi_total.append(sum(chi_per_row))

# sum all values up to get the chi-squared
return sum(chi_total)

# and now let's run it:
chi_statistic(df5_crosstab)
Output:
2757.5246458453066
# getting the degrees of freedom: df = (nβ€‰βˆ’β€‰1) * (mβ€‰βˆ’β€‰1) where n=samples and m=features
df = (3-1) * (9-1)
df
Output:
16

Now checking the chi-table, it shows: For df = 16 and Ξ±=0.05, the critical value of Chi-Squared is 26.296 – even for Ξ±=0.01, the value is 32.000, i.e. our test statistic is much higher, therefore statistically significant and we can reject the null hypothesis.

The second, more convenient way is to use the built-in Chi Contingency function to get the p value:

chi, p, df, exp_frequencies = stats.chi2_contingency(df5_crosstab)
print(f"Chi Statistic: {chi}")
print(f"p-value (two-tailed): {p:.12f}")
Output:
Chi Statistic: 2757.5246458453066
p-value (two-tailed): 0.000000000000

The Chi-Squared value matches the one we calculated, and the p-value indicates a highly significant result.

The effect size for a Chi-Squared Test can be calculated with Cramer’s V:

import math

# chi-value we calculated with our function
chi = chi_statistic(df5_crosstab)

# observations, i.e. the length of original datatframe
n = df5.shape[0]

# df = degrees of freedom we calculated above
df = df

v = math.sqrt(chi/(n*df))
v
Output:
0.35078318989327995

According to Cohen, this can almost be considered a medium effect size.

Conclusions

A Chi-Squared Test of Independence with Ξ± = 0.05 was used to analyze whether the variables “shipping company” and “region” are probabilistically independent, therefore whether each shipping company has the same relative share in each region as it does globally:

  • There was a statistically significant difference between n = 3 shipping companies and m = 9 regions with p = 0.0000000.

  • Therefore I may reject the null hypothesis in support of the alternative: There is indeed a relationship between each shipping company’s freight volume and supplied regions.

  • The effect size points to a median effect so we explored a medium practical significance.

The results could be used to provide shipping companies who already hold a large number of total shares within certain regions be provided with more freights/higher shipping volumes(within their regions), in return for reducing their shipping costs. Regions with skewed share distributions (where one shipping company already holds many shares) could be targeted. Examples: Federal Shipping already holds many shares in North America, Speedy Express in Western Europe, etc.

 

. . . . . . . . . . . . . .

Thank you for your reading!

If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and am always happy to get critical and friendly feedback, or suggestions for improvement!