Hypothesis Testing #4: Two-way ANOVA

Another practical example for conducting a statistical analysis in Python

My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. After already conducting a Welch’s t-test, an independent t-test and a one-way ANOVA, this blog covers a two-way ANOVA. The following and last blog of this series covers a Chi-Squared Test for Independency.

The complete Jupyter Notebook can be found here.

The question to answer and analyze is:

How do gender and product influence the amount of sales? Do they interact in that a specific gender responds well to specific categories?

Here’s what’s covered in this section:

  1. Obtaining the data
  2. Data Cleaning & Feature Engineering
  3. The Hypothesis
  4. The Analysis
  5. Conclusions

Obtaining the data

This section was already discussed extensively in the first blog regarding statistical analysis.

Data Cleaning & Feature Engineering

We will need the following five tables from our SQL database:

  • “Order”
  • “OrderDetail”
  • “Customer”
  • “Product” and
  • “Category”

We’re using the pd.read_sql_query() method:

df4 = pd.read_sql_query('''

SELECT c.Id customer_id, c.contactname contact_name, c.contacttitle title,
o.Id order_id, od.Quantity quantity, od.UnitPrice unit_price, od.Discount discount, cat.Id cat_id, cat. CategoryName cat_name

FROM OrderDetail od

JOIN [Order] o ON od.OrderId = o.Id

JOIN Customer c ON o.CustomerId = c.Id
JOIN Product p ON od.ProductId = p.Id
JOIN Category cat ON p.categoryId = cat.Id
;''', engine)

Let’s check the newly created dataframe:


How many and what product categories are we up against?

['Dairy Products' 'Grains/Cereals' 'Produce' 'Seafood' 'Condiments' 'Confections' 'Beverages' 'Meat/Poultry']

Also: How many names are we up against?

array(['Paul Henriot', 'Renate Messner', (......) 'Dominique Perrier', 'Daniel Tonini'], dtype=object)

Okay, how do we define who’s male and who’s female? Simply by using good, old copy-and-paste, and defining all male names. A google search can help us out …:

# define the gender for male customers:
male_names = ['Paul Henriot','Mario Pontes', 'Pascale Cartrain',
'Yang Wang', 'Michael Holz',
'Carlos Hernández',
'Roland Mendel', 'Francisco Chang', 'Bernardo Batista',

'Frédérique Citeaux', 'Pirkko Koskitalo', 'Peter Franken', 'Manuel Pereira',
'Karl Jablonski', 'Art Braunschweiger', 'Horst Kloss', 'Giovanni Rovelli',
'Miguel Angel Paolino', 'Alexander Feuer', 'Carlos González', 'Maurizio Moroni',
'Pedro Afonso', 'José Pedro Freyre', 'Rene Phillips', 'Fran Wilson', 'Guillermo Fernández',
'Philip Cramer', 'Jose Pavarotti','Martín Sommer', 'Lino Rodriguez', 'Laurence Lebihan',
'Jean Fresnière', 'Georg Pipps', 'Thomas Hardy', 'Hari Kumar', 'Sven Ottlieb',
'Eduardo Saavedra', 'Palle Ibsen', 'Zbyszek Piestrzeniewicz', 'Yoshi Latimer',
'Jonas Bergulfsen', 'Felipe Izquierdo', 'Paolo Accorti', 'André Fonseca', 'Sergio Gutiérrez',
'John Steel', 'Yoshi Tannamuri', 'Simon Crowther','Patricio Simpson', 'Howard Snyder',
'Helvetius Nagy', 'Jaime Yorres', 'Matti Karttunen', 'Dominique Perrier', 'Daniel Tonini']

# create a new column "gender"
# when contact name is found in the list of male names, label the gender 'm', else: 'f'
df4['gender'] = np.where(df4['contact_name'].isin(male_names), 'm', 'f')

# multiplying quantity and unit price to get a column with gross spending
df4['spend_gross'] = np.multiply(df4['quantity'], df4['unit_price'])
# subtracting discount to get net spending
df4['spend_net'] = np.subtract(df4['spend_gross'], df4['discount'])


The Hypothesis

Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level. This time we have to define a set of hypotheses for each factor and for the interaction we expect:

Main effect of factor A (product category):

Ho: The mean sales are equal for all 8 product categories.

H1: The mean sales are different for least one product category.

Main effect of factor B (gender):

Ho: The mean sales are equal for both women and men.

H1: The mean sales between women and men are different.

Factor A x factor B interactions:

Ho: There is no interaction between product and gender.

H1: There is an interaction between product and gender.

Significance level is set to the value of 5%.

The Analysis

The two-way ANOVA is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA not only aims to assess the main effect of each independent variable but also to see if there is any interaction between them.

Because of the F-Distribution’s shape, the significance level is always specified on the right (no one-tailed or two-tailed specification is necessary).

There are the same 3 assumptions for a two-way ANOVA as for the one-way ANOVA:

  • Normality
    The caveat to this is that if group sizes are equal, the F-statistic is robust to violations of normality.

  • Homogeneity of variance
    Same caveat as above; if group sizes are equal, the F-statistic is robust to this violation.

  • Independent observations
    Could be assumed.

All goodness-of-fit tests are left out here, but can be looked up in detail in my notebook.

The syntax for a two-way ANOVA is quite similar to that of a one-way ANOVA:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model with the interaction term
# This will also automatically include the main effects for each factor
model = ols('spend_net ~ C(gender)*C(cat_name)', df4).fit()

# Seeing if the OVERALL model is significant
print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}")
Overall model F( 15, 2062) = 4.103, p = 0.0000

Excellent, the overall model is significant.

Next let’s create the ANOVA table to check each factor and the interaction term:

# Creating the ANOVA table
result = sm.stats.anova_lm(model, typ=2)

The interaction term is not significant (p = 0.918). This indicates that there is no interaction effect between gender and product category on the mean spendings. Hence, the interaction term is to be removed from the model and the model will need to be re-run so we can look at the main effects of each factor independently.

# Fit the model again
model2 = ols('spend_net ~ C(gender) + C(cat_name)', df4).fit()

print(f"Overall model F({model2.df_model: .0f},{model2.df_resid: .0f}) = {model2.fvalue: .3f}, p = {model2.f_pvalue: .4f}")
Overall model F( 8, 2069) = 7.382, p = 0.0000

Excellent, the model is still significant. Let’s create the ANOVA table a second time:

# Create the ANOVA table
result2 = sm.stats.anova_lm(model2, typ=2)

Each factor, gender and product category, independently has a significant effect on the mean net spending.

On can use a few different effect sizes for an ANOVA: eta squared (η2) and omega squared (ω2). Omega squared is considered a better measure of effect size than eta squared because it is unbiased in its calculation.

+++ Something to note: For some reason, R2 is called eta squared within the ANOVA framework, but they are the same thing. R2 is a measure of how much variance is explained by the model, and is calculated by taking the explained variance and dividing it by the total variance. +++

The following code uses the ANOVA table produced by statsmodels and appends the effect size measures of eta-squared (η2) and omega-squared (ω2).

def anova_effect_size(aov): 
""" This function takes the Anova table performed with
statsmodels and appends two effect size measures: eta-squared
and omega-squared. """

# mean squared
aov['mean_sq'] = aov[:]['sum_sq']/aov[:]['df']

# eta-squared (η2)
aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])

# omega-squared (ω2)
num = (aov[:-1]['sum_sq'] - (aov[:-1]['df']*aov['mean_sq'][-1]))
denom = (sum(aov['sum_sq']) + aov['mean_sq'][-1])
aov['omega_sq'] = num / denom

return aov


Each factor, gender and product category, has a small effect on the mean net spendings.

The overall model was significant, so now we want to find the differences between groups of factors – this is called Post-hoc Testing. To do so, Tukey’s HSD is one method that can be used to test for differences within each factor separately.

Let’s start with the first factor “gender”:

import statsmodels.stats.multicomp

mc = statsmodels.stats.multicomp.MultiComparison(df4['spend_net'], df4['gender'])
mc_results = mc.tukeyhsd()
print('\n', mc_results, '\n')

The output’s explanation:

  • The meandiff column is the difference in means of the two groups being calculated as group2 – group1.
  • The lower/upper columns are the lower/upper boundaries of the 95% confidence interval.
  • And the reject column states whether or not the null hypothesis should be rejected.

There is a statistically significant difference in the mean net spending between women and men: men’s spending rates are significantly higher than women’s.

Now the second factor, “product category”. Let’s use the visual form for this factor as it’s much easier to spot differences. The standard of comparison is color coded BLUE, the groups that differ significantly are RED and all insignificant groups are GRAY.

The code for the first comparison is provided:

mc = statsmodels.stats.multicomp.MultiComparison(df4['spend_net'], df4['cat_name'])
mc_results = mc.tukeyhsd()
mc_results.plot_simultaneous(comparison_name='Meat/Poultry', figsize=(6,4));

In sum: meat and poultry revenues were the highest, whilst those for seafood were the lowest.


A two-way ANOVA with 𝛼 = 0.05 was used to analyze the effect of the factors gender and product category on net spending:

  • Each factor, gender and product category, has a statistically significant effect on the mean net spending: a) between gender and spending with p = 9.079449e-03, and b) between product category and spending with p = 5.902098e-09.

  • The interaction term, though, was not significant (p = 0.918), indicating no interaction effect between the factors gender and product category on mean spending.

  • Therefore I may partially reject the null hypothesis in support of the alternative: The mean sales between women and men are different. So are the mean sales between some of the product categories. However, with regards to the interaction effect, I support the null hypothesis.

  • Post-hoc testing revealed interesting details about which groups actually differ: 1. Men’s spending rates are significantly higher than
    women’s. 2. Meat and poultry outperform most of the other product categories.

  • However, the effect size is small pointing, to a low practical significance.

The results may be used to focus more on a male target group or draft marketing strategies to increase women’s spending. Moreover, the shop could diversify its meat and poultry products.

. . . . . . . . . . . . . .

Thank you for your reading!

If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and am always happy to get critical and friendly feedback, or suggestions for improvement!