Hypothesis Testing #1: Welch’s t-test

A practical example for conducting a statistical analysis in Python

My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. In this first part, let’s practice a Welch’s t-test. The following posts in the series cover an independent t-test, a one-way ANOVA, a two-way ANOVA and a Chi-Squared Test for Independence.

The complete Jupyter Notebook can be found here.

The question to answer and analyze is:

Do customers spend significantly more money on imported products than on domestic ones? If so, might there be a chance to sell more by focusing on imported products?

Here’s what’s covered in this section:

  1. Obtaining the data
  2. Data Cleaning & Feature Engineering
  3. The Hypothesis
  4. The Analysis
  5. Conclusions

Obtaining the data

After importing all the libraries we expect to work with, such as NumPy, Pandas or Matplotlib, we can use SQLAlchemy to connect to the database and inspect it. We do this by making use of the inspect module provided by SQLAlchemy: we create an inspector by calling the inspect function and passing in our engine, and then use various methods of our inspector object to get the names of the tables or information on the tables themselves:

# import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats
from sqlalchemy import create_engine
from sqlalchemy import inspect

# connecting to the database
engine = create_engine("sqlite:///Northwind_small.sqlite")

# create an inspector object
inspector = inspect(engine)

# get names of tables in the database
print(inspector.get_table_names())

Output:
['Category', 'Customer', 'CustomerCustomerDemo', 'CustomerDemographic', 'Employee', 'EmployeeTerritory', 'Order', 'OrderDetail', 'Product', 'Region', 'Shipper', 'Supplier', 'Territory']

To quickly print the columns of a table and each column’s data type, let’s write a function. It might also be of use later on:

def get_columns_info(table_name):
    """ Function takes in a table name and prints the columns'
    names and data types """
    cols_list = inspector.get_columns(table_name)
    print(f"Table Name: {table_name}\n")
    for col in cols_list:
        print(f"Name: {col['name']} \t Type: {col['type']}")

Does it work?

# calling the function to test it
get_columns_info('OrderDetail')
Output:
Table Name: OrderDetail

Name: Id 	 Type: VARCHAR(8000)
Name: OrderId 	 Type: INTEGER
Name: ProductId 	 Type: INTEGER
Name: UnitPrice 	 Type: DECIMAL
Name: Quantity 	 Type: INTEGER
Name: Discount 	 Type: FLOAT

It does. Great! Then let’s work on our first question.

Data Cleaning & Feature Engineering

First we need to consider which tables to join in order to have all the data ready for our analysis. It might be suitable to take the

  • “Supplier”
  • “Product”
  • “OrderDetail”
  • “Order” and
  • “Customer”

tables and to also give them meaningful column names when joining.

The results returned from raw SQL queries are often pretty hard to read since they don’t contain any structure. However, we can fix this problem by creating a pandas DataFrame.

The most convenient way to incorporate pandas is to make use of the pd.read_sql_query() method. When using this method, we pass in two parameters:

  1. The SQL query we want to execute, in the form of a string.
  2. The engine object we created to connect to our database with SQLAlchemy.

This has the benefit of grabbing the appropriate metadata from the table to label the dataframe columns correctly. It also saves us a few extra lines of code.

df1 = pd.read_sql_query('''
SELECT p.Id product_id, p.ProductName product_name,
       s.Id supplier_id, s.Country supplier_country, s.Region supplier_region,
       od.Quantity order_quantity, od.UnitPrice unit_price,
       c.Id customer_id, c.Country customer_country, c.Region customer_region
FROM Supplier s
JOIN Product p ON s.Id = p.SupplierId
JOIN OrderDetail od ON p.Id = od.ProductId
JOIN [Order] o ON od.OrderId = o.Id
JOIN Customer c ON o.CustomerId = c.Id
;''', engine)

Let’s check the newly created dataframe:

df1.head()

Looks good!

Now we engineer the preference feature by comparing the supplier’s and the customer’s respective regions with np.where(). If both regions match, we label the product “domestic”, otherwise “imported”.

df1['region_preference'] = np.where(df1['supplier_region'] == df1['customer_region'], 'domestic', 'imported')

We also want to know exactly how much was spent on domestic and how much on imported products. First we engineer an overall spending column, and then split it into domestic and imported spending using NumPy again:

# constructing a column with the amount spent
df1['spending'] = np.multiply(df1['order_quantity'],
                              df1['unit_price'])

# if the preference is domestic, fill domestic_spending with spending; if not, make it zero
df1['domestic_spending'] = np.where(df1['region_preference'] == 'domestic',
                                    df1['spending'], 0.0)

# and the same vice versa for imported_spending
df1['imported_spending'] = np.where(df1['region_preference'] == 'imported',
                                    df1['spending'], 0.0)

Finally, we cut out only the columns we need and check our final dataframe for hypothesis testing:

# drop all columns we don't need and group by customer
df1_test = df1[['customer_id', 'domestic_spending', 'imported_spending']]
df1_test = df1_test.groupby('customer_id').sum()

df1_test.head()
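
Since the conclusions at the end quote means and standard deviations per spending type, it is also handy to glance at the summary statistics at this point:

# summary statistics (count, mean, std, quartiles) for both spending columns
df1_test.describe()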

The Hypothesis

Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:

H₀: People do not spend more money on imported products, so the difference between average expenditures is zero (or less):

𝜇𝑑 ≤ 0

H₁: People spend more money on imported products:

𝜇𝑑 > 0

The significance level is set to 5%:

𝛼 = 0.05

Since the direction of the difference is specified, it’s a one-tailed test.

The Analysis

To determine whether the mean difference between two unrelated groups is statistically significant, the independent t-test is typically used. Several assumptions must be met, such as

  • The independent variable is categorical with at least two levels (groups).

  • The dependent variable must be continuous – that is, measured on an interval or ratio scale.

  • The dependent variable should not contain any significant outliers.

  • The variances between the two groups are equal.

As our data fails the last criterion, we can’t proceed with the independent t-test. Instead, we can use a viable alternative, Welch’s t-test, also known as the “unequal variances t-test”.

All goodness-of-fit tests are left out here, but can be looked up in detail in my notebook.
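
For illustration only, here is a minimal sketch of how the equal-variance assumption could be checked with Levene’s test from scipy.stats (the detailed assumption checks live in the notebook):

# minimal sketch: Levene's test for equal variances between the two groups
stat, p_levene = stats.levene(df1_test['imported_spending'],
                              df1_test['domestic_spending'])

# a small p-value (e.g. < 0.05) suggests unequal variances,
# which is the reason for switching to Welch's t-test
print(stat, p_levene)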

We now come to the core of the analysis: calculating the t-statistic, the degrees of freedom (df), and finally the p-value.

Welch’s t-test can be conducted using the stats.ttest_ind() method and passing equal_var=False:

# calculating t-value and two-tailed p-value
t, p_two = stats.ttest_ind(df1_test['imported_spending'],
                           df1_test['domestic_spending'],
                           equal_var=False)
t
Output:
4.519665767821693

To get the degrees of freedom, we write a function that may be useful for later applications, too:
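
The formula implemented below is the Welch–Satterthwaite approximation for the effective degrees of freedom, where s₁², s₂² are the sample variances and n₁, n₂ the sample sizes:

df ≈ (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]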


def welch_df(a, b):
    """ Calculates the effective degrees of freedom
    for two samples. """
    s1 = a.var(ddof=1)
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    num = (s1/n1 + s2/n2)**2
    denom = (s1/n1)**2/(n1 - 1) + (s2/n2)**2/(n2 - 1)
    df = num/denom
    return round(df, 0)

df = welch_df(df1_test['imported_spending'],
              df1_test['domestic_spending'])
df
Output:
124.0

Lastly, I calculate the p-value – the area under the t-distribution’s probability density function in the rejection region. Given our directional alternative hypothesis:

𝜇𝑑 > 0

this rejection region lies in the right tail because of the “> 0”.

Or in other words: I’m computing the probability of observing a test statistic at least this extreme GIVEN that the null hypothesis is true. If this probability is very small, I will reject the null hypothesis and instead work with the alternative hypothesis.

# using the cumulative distribution function (CDF) with the t-value and df
p = 1 - stats.t.cdf(t, df)
print(p)
Output:
7.129045843212722e-06
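
As a quick plausibility check (not part of the original notebook excerpt): since the t-statistic is positive, the one-tailed p-value should be roughly half of the two-tailed p-value that stats.ttest_ind already returned.

# plausibility check: for a positive t, the one-tailed p-value is
# approximately p_two / 2 (small differences can come from rounding df)
print(p_two / 2)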

The findings are statistically significant! We can reject the null hypothesis in support of the alternative.

While it’s good to know whether there is a statistically significant effect of some intervention on the outcome, it’s just as important to know how large that effect is. Statistical significance does not always translate into a large effect in the real world, which matters because everything costs money and time. To quantify this, we can calculate the effect size.

To calculate Cohen’s d as a measure of effect size, this function takes in the two samples and, from their means and standard deviations, computes Cohen’s d:

from statistics import mean, stdev
from math import sqrt

def cohens_d(x, y):
    """ Function takes in two samples as arguments and
    calculates Cohen's d as a measure of effect size. """
    cohens_d = (mean(x) - mean(y)) / sqrt((stdev(x)**2 +
                                           stdev(y)**2) / 2)
    print(cohens_d)

Now let’s put this function into action:

cohens_d(df1_test['imported_spending'], df1_test['domestic_spending'])
Output:
0.6932850810422416

Not bad! A value of > 0.5 is considered a medium effect, while a value of > 0.8 is considered a strong effect. With 0.7 we’re close to a strong effect.

Conclusions

A one-tailed Welch’s t-test with 𝛼 = 0.05 was used to analyze customers’ spending on domestic and imported products with the hypothesis that people spend more money on imported products:

  • The average expenditures on imported products are much higher (12134 ± 16084 units) compared to those on domestic products (3296 ± 8142 units). These numbers come from df1_test.describe().

  • There is a statistically highly significant difference between domestic and imported spending with t(124) = 4.5197 and p = 0.000007129.

  • Therefore I may reject the null hypothesis in support of the alternative: People spend more money on imported products.

  • Moreover, the calculated effect size is close to strong, pointing to medium-to-high practical significance.

The results could be used to think about a strategy focusing on imported products for different regions to maximize revenues.

. . . . . . . . . . . . . .

Thank you for reading!

If you want to read more about statistical testing, check out the other posts. The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and am always happy to get critical and friendly feedback, or suggestions for improvement!