Hypothesis Testing #3: One-way ANOVA

Yet another practical example for conducting a statistical analysis in Python

My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. After already conducting both a Welch’s t-test and an independent t-test, this blog covers a one-way ANOVA. The following blogs deal with a two-way ANOVA and a Chi-Squared Test for Independency.

The complete Jupyter Notebook can be found here.

The question to answer and analyze is:

Does discount level have a statistically significant effect on the
quantity of product in an order? If so, at what level(s)?

Here’s what’s covered in this section:

  1. Obtaining the data
  2. Data Cleaning & Feature Engineering
  3. The Hypothesis
  4. The Analysis
  5. Conclusions

Obtaining the data

This section was already discussed extensively in the first blog regarding statistical analysis.

Data Cleaning & Feature Engineering

This time it’s very easy, as we only need one table:

  • “OrderDetail”

We’re using the pd.read_sql_query() method:

df3 = pd.read_sql_query('''SELECT * FROM OrderDetail;''', engine)

Let’s check the newly created dataframe:


How many unique discount levels do we have?


0.00 1317 0.05 185 0.10 173 0.20 161 0.15 157 0.25 154 0.03 3 0.02 2 0.01 1 0.04 1 0.06 1

Oh, some discount levels occur only once. Let’s drop these and work only with both the no-discount group, and the 5 groups with substantial discount levels:

crit = (df3.Discount == 0.00) | (df3.Discount == 0.05) | (df3.Discount == 0.10) | (df3.Discount == 0.15) | (df3.Discount == 0.20) | (df3.Discount == 0.25)

df3_test = df3.loc[crit]

0.00 1317 0.05 185 0.10 173 0.20 161 0.15 157 0.25 154

Much better. Let’s also check if any of our discount levels have substantial outliers:

df3_test.boxplot('Quantity', by='Discount', figsize=(10, 5));

Great! There are no outliers in each group with respect to the quantity of products.

The Hypothesis

Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:

Ho: The average quantity of ordered products in an order is equal for all levels of discounts.:

H0: πœ‡1 = πœ‡2 = πœ‡3 = … = πœ‡π‘– … = πœ‡π‘—

H1: The average quantity of ordered products differs between at least two discounts levels:

H1: πœ‡π‘– β‰  πœ‡π‘—

Significance level is set to the value of 5%:

𝛼 = 0.05

The Analysis

The analysis of variance or one-way ANOVA is used to compare the differences between the means of a condition of 2+ groups. It can be thought of as an extension of the t-test that is used to compare the means of a condition between 2 groups.

ANOVA is an omnibus test, meaning it tests the data as a whole; it tests if there is a difference in the mean somewhere in the model, but it does not tell us where the difference is (if there is one). To find out where the difference is between the groups, we have to conduct post-hoc tests.

Although it can be thought of as an extension of the t-test in terms of when to use it, mathematically speaking it’s more of a regression model and is considered a generalized linear model (GLM).

Because of the F-Distribution’s shape, the significance level is always specified on the right (no one-tailed or two-tailed specification is necessary).

There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It’s important to note that two assumptions apply to the residuals and not the variables themselves. The ANOVA assumptions are the same as those for linear regression, and are:

  • Normality
    The caveat to this is that if group sizes are equal, the F-statistic is robust to violations of normality.
  • Homogeneity of variance
    Same caveat as above; if group sizes are equal, the F-statistic is robust to this violation.
  • Independent observations
    The independence of observations can be assumed!

If possible, it is best to have equal-sized groups so corrections to the data do not need to be made. Fortunately, we are dealing with equal-sized groups here.

As two of the assumptions apply to the residuals, they are tested after conducting the ANOVA. Both goodness-of-fit tests are left out here, but can be looked up in detail in my notebook.

The ANOVA itself is coded very simple:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# syntax for defining the ANOVA-model
anova = ols('Quantity ~ Discount', data=df3_test).fit()
table = sm.stats.anova_lm(anova, typ=2)

On the left, we have our discount factor, and on the right, the probability that this factor is indeed influential.

Values < .05 (i.e. our Ξ±) indicate rejection of the null hypothesis. We can see the discount factor appears to be highly influential.

One can use a few different effect sizes for an ANOVA: eta squared (Ξ·2), and omega squared (Ο‰2). Omega squared is considered a better measure of effect size than eta squared because it is unbiased in its calculation.

Something to note: For some reason, R2 is called eta squared within the ANOVA framework, but they are the same thing. R2 is a measure of how much variance is explained by the model, and is calculated by taking the explained variance and dividing it by the total variance. The following code uses the ANOVA table produced by statsmodels and appends the effect size measures of eta-squared (Ξ·2) and omega-squared (Ο‰2).

def anova_effect_size(aov): 
""" This function takes the Anova table performed with
and appends two effect size measures: eta-squared
and omega-squared. """

# mean squared
aov['mean_sq'] = aov[:]['sum_sq']/aov[:]['df']

# eta-squared (Ξ·2)
aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])

# omega-squared (Ο‰2)
num = (aov[:-1]['sum_sq'] - (aov[:-1]['df']*aov['mean_sq'][-1]))
denom = (sum(aov['sum_sq']) + aov['mean_sq'][-1])
aov['omega_sq'] = num / denom

return aov


According to both measures (around 0.016), the discount has only a small effect.

The overall model was significant; now we want to know which groups actually differ. This is called Post-hoc Testing. To do so, Tukey’s HSD is one method that can be used:

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(df3_test['Quantity'], df3_test['Discount'])
mc_results = mc.tukeyhsd()

Instead of printing a table, let’s use visualizations – it’s just so much easier to spot the differences.

The base discount group to compare against is color coded BLUE, all groups that are significantly different from this base are color coded RED, and all insignificant groups GRAY:

# let's compare all levels with discount level 0.0 (blue)
# all groups that are significantly different from 0.0 are color coded red, all insignificant groups gray
mc_results.plot_simultaneous(comparison_name=0.0, figsize=(6,4), ylabel='Discount Level\n', xlabel='\nAverage Number of Products in Order');

  • All discount levels significantly differ from NO DISCOUNT – except the 0.1% discount.

Let’s have a look at all of the other comparisons:

  • Some levels seem to have an unwanted effect though, e.g. 0.2: The number of ordered products becomes even less compared 0.15 or 0.05!


A one-way ANOVA on 𝛼 = 0.05 was used to analyze the effect of discount levels on the quantity of a product in an order:

  • There was a statistically significant difference between the discount levels with p = 2.991487e-09.

  • Therefore I may reject the null hypothesis in support of the alternative: The average quantity of ordered products differs between at least two discounts levels.

  • Post-hoc testing revealed interesting details about which discount-level groups actually differ. Further investigation may be needed, but there is evidence to suggest that a discount of 0.1 has less of an effect than one of 0.15 or 0.25 (double-digits).

  • However, the effect size is small, pointing to a low practical significance.

The results could be used to further investigate – and maybe test – the effect of different discount levels as revenue boosters.

. . . . . . . . . . . . . .

Thank you for your reading!

If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and am always happy to get
critical and friendly feedback, or suggestions for improvement!