# Hypothesis Testing #1: Welch’s t-test

## A practical example for conducting a statistical analysis in Python

My second project at Flatiron School required gathering information from a real-world database and using my knowledge of statistical analysis and hypothesis testing to generate analytical insights. We were working with the famous Northwind database – a free, open-source dataset created by Microsoft containing data from a fictional company and widely used for learning SQL. In this first part, let’s practice a Welch’s t-test. The following posts in this series cover an independent t-test, a one-way ANOVA, a two-way ANOVA, and a Chi-Squared Test for Independence.

The complete Jupyter Notebook can be found here.

The question to answer and analyze is:

Do customers spend significantly more money on imported products than on domestic ones? If so, might there be a chance to sell more by focusing on imported products?

Here’s what’s covered in this section:

1. Obtaining the data
2. Data Cleaning & Feature Engineering
3. The Hypothesis
4. The Analysis
5. Conclusions

### Obtaining the data

After importing all the libraries we expect to work with, such as NumPy, Pandas, or Matplotlib, we can use SQLAlchemy to connect to the database and inspect it. We do this by making use of the inspect module provided by SQLAlchemy. We create an inspector by calling the inspect function and passing in our engine, and then use various methods of our inspector object to get the names of tables or info on the tables themselves:

```python
# import necessary libraries
from sqlalchemy import create_engine
from sqlalchemy import inspect

# connecting to the database
engine = create_engine("sqlite:///Northwind_small.sqlite")

# create an inspector object
inspector = inspect(engine)

# get names of tables in database
print(inspector.get_table_names())
```

Output:

```
['Category', 'Customer', 'CustomerCustomerDemo', 'CustomerDemographic', 'Employee', 'EmployeeTerritory', 'Order', 'OrderDetail', 'Product', 'Region', 'Shipper', 'Supplier', 'Territory']
```

To quickly print the columns of a table and each column’s data type, let’s write a function. It might also be of use later on:

```python
def get_columns_info(table_name):
    """ Function takes in table name and prints columns' names
    and datatypes """
    cols_list = inspector.get_columns(table_name)
    print(f"Table Name: {table_name}\n")
    for col in cols_list:
        print(f"Name: {col['name']} \t Type: {col['type']}")
```

Does it work?

```python
# calling the function to test it
get_columns_info('OrderDetail')
```

Output:

```
Table Name: OrderDetail

Name: Id 	 Type: VARCHAR(8000)
Name: OrderId 	 Type: INTEGER
Name: ProductId 	 Type: INTEGER
Name: UnitPrice 	 Type: DECIMAL
Name: Quantity 	 Type: INTEGER
Name: Discount 	 Type: FLOAT
```


It does. Great! Then let’s work on our first question.

### Data Cleaning & Feature Engineering

First we need to consider which tables to join in order to have all the data ready for our analysis. It might be suitable to take the

• “Supplier”
• “Product”
• “OrderDetail”
• “Order” and
• “Customer”

tables and to also give them meaningful column names when joining.

The results returned from raw SQL queries are often pretty hard to read since they don’t contain any structure. However, we can fix this problem by creating a pandas DataFrame.

The most convenient way to incorporate pandas is to make use of the pd.read_sql_query() method. When using this method, we pass in two parameters:

1. The SQL query we want to execute, in the form of a string.
2. The engine object we created to connect to our database with SQLAlchemy

This has the benefit of grabbing the appropriate metadata from the table to label the dataframe columns correctly. It also saves us a few extra lines of code.

```python
df1 = pd.read_sql_query('''
    SELECT p.Id product_id, p.ProductName product_name,
           s.Id supplier_id, s.Country supplier_country,
           s.Region supplier_region,
           od.Quantity order_quantity, od.UnitPrice unit_price,
           c.Id customer_id, c.Country customer_country,
           c.Region customer_region
    FROM Supplier s
    JOIN Product p ON s.Id = p.SupplierId
    JOIN OrderDetail od ON p.Id = od.ProductId
    JOIN [Order] o ON od.OrderId = o.Id
    JOIN Customer c ON o.CustomerId = c.Id;
    ''', engine)
```

Let’s check the newly created dataframe:

```python
df1.head()
```

Looks good!

Now we’re engineering the preference feature by comparing the products’ and the customers’ respective regions and using np.where(). If both regions match, we label it a “domestic” product, and an “imported” product if not.

```python
df1['region_preference'] = np.where(df1['supplier_region'] == df1['customer_region'],
                                    'domestic', 'imported')
```

Also, we want to know exactly how much was spent on domestic and how much on imported products. First, we need to engineer the overall spending column and afterwards the specific spending using NumPy again:

```python
# constructing a column with the amount spent
df1['spending'] = np.multiply(df1['order_quantity'],
                              df1['unit_price'])

# if preference is domestic, fill domestic_spending with spending; if not, make it zero
df1['domestic_spending'] = np.where(df1['region_preference'] == 'domestic',
                                    df1['spending'], 0.0)

# same vice versa
df1['imported_spending'] = np.where(df1['region_preference'] == 'imported',
                                    df1['spending'], 0.0)
```

Finally, we cut out only the columns we need and check our final dataframe for hypothesis testing:

```python
# drop all columns we don't need and group by customer
df1_test = df1[['customer_id', 'domestic_spending', 'imported_spending']]
df1_test = df1_test.groupby('customer_id').sum()
df1_test.head()
```

### The Hypothesis

Now it’s time to define the hypotheses – the null and the alternative one – and set the alpha level:

H₀: People do not spend more money on imported products, so the difference between average expenditures is zero (or less):

𝜇𝑑 ≤ 0

H₁: People spend more money on imported products:

𝜇𝑑 > 0

The significance level is set to 5%:

𝛼 = 0.05

Since the direction of the difference is specified, it’s a one-tailed test.

### The Analysis

To determine whether the mean difference between two unrelated groups is statistically significant, the independent t-test is typically used. Several assumptions must be met, such as

• The independent variable is categorical with at least two levels (groups).

• The dependent variable must be continuous – this is measured on an interval or ratio scale.

• The dependent variable should not contain any significant outliers.

• The variances between the two groups are equal.

As our data fails the last criterion, we can’t proceed with the independent t-test. Instead, we can turn to a viable alternative: Welch’s t-test, also known as the “unequal variances t-test”.

All goodness-of-fit tests are left out here, but can be looked up in detail in my notebook.
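Just to illustrate how such a check might look, here is a minimal sketch of the equal-variance assumption being tested with Levene’s test from SciPy. Note that the arrays below are synthetic stand-ins for the two spending columns, not the real Northwind data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# synthetic stand-ins for the two spending columns (not the real data)
domestic = rng.normal(loc=3300, scale=8100, size=89)
imported = rng.normal(loc=12100, scale=16100, size=89)

# Levene's test: H0 is that both groups have equal variances
stat, p = stats.levene(imported, domestic)
print(f"Levene statistic = {stat:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("Equal-variance assumption rejected -> use Welch's t-test")
```

A small p-value here means the equal-variance assumption is violated, which is exactly the situation in which Welch’s t-test is the safer choice.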

We now come to the core of the analysis: calculating the t-statistic, the degrees of freedom (df), and finally the p-value.

Welch’s t-test can be conducted with the stats.ttest_ind() method by passing equal_var=False:

```python
# calculating t-value and two-tailed p-value
t, p_two = stats.ttest_ind(df1_test['imported_spending'],
                           df1_test['domestic_spending'],
                           equal_var=False)
t
```

Output:

```
4.519665767821693
```

To get the degrees of freedom, we write a function that may be useful for later applications, too:

```python
def welch_df(a, b):
    """ Calculates the effective degrees of freedom
    for two samples. """
    s1 = a.var(ddof=1)
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    num = (s1/n1 + s2/n2)**2
    denom = (s1/n1)**2/(n1 - 1) + (s2/n2)**2/(n2 - 1)
    df = num/denom
    return round(df, 0)

df = welch_df(df1_test['imported_spending'],
              df1_test['domestic_spending'])
df
```

Output:

```
124.0
```
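As a quick sanity check on the formula (not part of the original notebook): when the two samples have equal sizes and equal sample variances, the Welch degrees of freedom collapse to the ordinary pooled value n₁ + n₂ − 2. The toy arrays below are constructed so that both conditions hold:

```python
import numpy as np

def welch_df(a, b):
    """ Effective degrees of freedom for Welch's t-test. """
    s1 = a.var(ddof=1)
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    num = (s1/n1 + s2/n2)**2
    denom = (s1/n1)**2/(n1 - 1) + (s2/n2)**2/(n2 - 1)
    return num / denom

# equal sizes (n=5) and equal sample variances (2.5 each)
# -> df should collapse to n1 + n2 - 2 = 8
a = np.array([1, 2, 3, 4, 5], dtype=float)
b = np.array([11, 12, 13, 14, 15], dtype=float)
print(welch_df(a, b))  # 8.0
```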

Lastly, I calculate the p-value: the probability of observing a test statistic at least as extreme as ours, given that the null hypothesis is true. Because of our directional alternative hypothesis:

𝜇𝑑 > 0

the rejection region lies in the right tail (hence the ” > 0 “).

If this probability is very small, I will reject the null hypothesis and instead work with the alternative hypothesis.

```python
# using the cumulative distribution function (CDF) with the t-value and df
p = 1 - stats.t.cdf(t, df)
print(p)
```

Output:

```
7.129045843212722e-06
```
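As an aside, the same right-tail area can also be obtained from SciPy’s survival function stats.t.sf, which is numerically more stable than 1 - cdf for very small tail probabilities. The t_stat and dof values below are simply the numbers computed above:

```python
from scipy import stats

# t-statistic and Welch degrees of freedom computed above
t_stat, dof = 4.519665767821693, 124.0

# sf(x) = 1 - cdf(x): the right-tail area, i.e. the one-tailed p-value
p_one = stats.t.sf(t_stat, dof)
print(p_one)
```

Recent SciPy versions also accept an alternative='greater' argument in stats.ttest_ind, which returns this one-tailed p-value directly.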

The findings are statistically significant! We can reject the null hypothesis in support of the alternative.

While it’s good to know if there is a statistically significant effect of some intervention on the outcome, it’s as important to know the size of the effect the intervention has on the outcome. Statistical significance does not always translate into a large effect in the real world. This is important to consider because everything costs money and time. To see this, we can calculate the effect size.

To calculate Cohen’s d as a measure of effect size, this function will take in the two samples and, by calculating the mean and standard deviation, will eventually return Cohen’s d:

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(x, y):
    """ Function takes in two samples as arguments and
    calculates Cohen's d as a measure of effect size. """
    cohens_d = (mean(x) - mean(y)) / sqrt((stdev(x)**2 + stdev(y)**2) / 2)
    print(cohens_d)
```

Now let’s put this function into action:

```python
cohens_d(df1_test['imported_spending'], df1_test['domestic_spending'])
```

Output:

```
0.6932850810422416
```

Not bad! A value of > 0.5 is considered a medium effect, while a value of > 0.8 is considered a strong effect. With 0.7 we’re close to a strong effect.

### Conclusions

A one-tailed Welch’s t-test with 𝛼 = 0.05 was used to analyze customers’ spending on domestic and imported products with the hypothesis that people spend more money on imported products:

• The average expenditures on imported products are much higher (12134 ± 16084 units) compared to those on domestic products (3296 ± 8142 units). To get these numbers, df1_test.describe()  is used.

• There is a statistically highly significant difference between domestic and imported spending with t(124) = 4.5197 and p = 0.000007129.

• Therefore I may reject the null hypothesis in support of the alternative: People spend more money on imported products.

• Moreover, the calculated effect size is almost strong, pointing to a medium to high practical significance.

The results could be used to think about a strategy focusing on imported products for different regions to maximize revenues.
