Machine Learning by the (Cook) Book: Recipes for Preprocessing

To build a routine and avoid forgetting preprocessing steps before running a machine learning algorithm in Python and Pandas, it’s helpful to have some recipes handy, so you can take the ones you need and skip the ones you don’t. Here, I’d like to suggest my secret formulas for handling numerical data, categorical data, and dates and times.

First I’d like to thank Chris Albon, who is certainly not aware of his impact on my work, but who’s always a great help whenever I get stuck. His blog and his book “Machine Learning with Python Cookbook” are my constant companions!

Let’s assume we loaded two dataframes of road traffic data and named them accidents and casualties.

Describing the Data

You may first want to get some descriptive statistics for any numerical columns:

# in wide format
accidents.describe()
# in long format
accidents.describe().T

You may also want to check the distribution of a numerical feature visually using a histogram:

import matplotlib.pyplot as plt
import seaborn as sns

# prepare the plot
fig, ax = plt.subplots(figsize=(10,6))

# plot Age_of_Casualty column
casualties.Age_of_Casualty.hist(bins=20, ax=ax, color='lightsteelblue')
ax.set_title('\nCasualties by Age\n', fontsize=14, fontweight='bold')
ax.set(xlabel='Age', ylabel='Total Count of Casualties')

# remove all spines
sns.despine(top=True, right=True, left=True, bottom=True);

Handling Duplicates

If you find duplicates in your dataframe, you will want to drop them:

# check for duplicates
accidents.duplicated().sum()

# drop duplicates
accidents = accidents.drop_duplicates()

Handling Missing Values

First we’ll want to find any missing values:

# check NaN's for each column
accidents.isna().sum()
# check NaN's for complete dataframe
accidents.isna().sum().sum()

Next you may decide to drop certain columns with lots of NaN’s:

accidents = accidents.drop(columns=['Speed_limit', 'Urban_or_Rural_Area'])

Alternatively, you may decide to drop records with missing values:

# drop all records with NaN's
accidents = accidents.dropna()

# drop records with NaN's in certain columns
accidents = accidents.dropna(subset=['Road_Type', 'Number_of_Casualties'])

We could also replace missing values with specific values of our choice:

# fill with string values
accidents['Weather_Conditions'] = accidents['Weather_Conditions'].fillna(value='not known')

# or fill with an integer instead
accidents['Weather_Conditions'] = accidents['Weather_Conditions'].fillna(value=0)

Handling Numerical Data

  • Handling Outliers

To detect outliers, a boxplot is extremely useful:

num_cols = ['Age_of_Vehicle', 'Engine_Capacity_(CC)']

# plotting boxplots
fig, axes = plt.subplots(2,1, figsize=(10,4))

for ax, col in zip(axes, num_cols):
    accidents.boxplot(column=col, grid=False, vert=False, ax=ax)

To handle the detected outliers, you may choose one of several strategies. You can mark them and include them as a feature, or you can transform the feature to dampen the effect of the outlier. Finally, you can also drop them:

# phrasing conditions
cond_1 = (accidents['Age_of_Vehicle'] < 30) & (accidents['Age_of_Vehicle'] > 0)
cond_2 = (accidents['Engine_Capacity_(CC)'] < 20000) & (accidents['Engine_Capacity_(CC)'] > 0)

# combining all conditions
conditions = cond_1 & cond_2

# keep only records that meet our conditions and don't fall within extreme outliers
accidents = accidents[conditions]

Which strategy you choose depends on your goals for machine learning, as well as on your domain expertise.
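The first two strategies, marking and transforming, can be sketched as follows. This is only a minimal example on a toy dataframe: the 1.5 * IQR rule, the new column names Outlier and Log_Engine_Capacity, and the sample values are assumptions for illustration and not part of the recipes above.

```python
import numpy as np
import pandas as pd

# toy data standing in for the accidents dataframe (values are made up)
col = 'Engine_Capacity_(CC)'
toy = pd.DataFrame({col: [998, 1200, 1400, 1600, 1800, 2000, 16000]})

# strategy 1: mark outliers with the common 1.5 * IQR rule
# and keep the flag as an additional feature
q1, q3 = toy[col].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy[col] < q1 - 1.5 * iqr) | (toy[col] > q3 + 1.5 * iqr)
toy['Outlier'] = is_outlier.astype(int)

# strategy 2: log-transform the feature to dampen the effect of extreme values
toy['Log_Engine_Capacity'] = np.log1p(toy[col])

print(toy)
```

Here only the last record is flagged, and after the log transform it sits much closer to the rest of the values.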

  • Standardizing a Feature

If you use a distance-based algorithm and the ranges of your numerical features vary widely, the algorithm won’t work properly unless the features are brought onto a common scale. (Other models, such as tree-based models, are not distance-based and can handle varying ranges of features, so scaling is not required.)

While you may have good reasons for using min-max scaling, we will use standardization here:

from sklearn import preprocessing

# create scaler
scaler = preprocessing.StandardScaler()

# standardize numerical columns
accidents[num_cols] = scaler.fit_transform(accidents[num_cols])

An important note: If you have outliers (and did not drop them), standardization might not be appropriate because outliers can strongly influence the mean and variance.
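If outliers remain in the data, one common alternative is to scale with the median and the interquartile range instead. The sketch below assumes scikit-learn's RobustScaler for this and shows MinMaxScaler for comparison; the toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# toy data with one extreme value (made up for illustration)
toy = pd.DataFrame({'Age_of_Vehicle': [1.0, 2.0, 3.0, 4.0, 50.0]})

# robust scaling: subtract the median, divide by the interquartile range
toy['robust'] = RobustScaler().fit_transform(toy[['Age_of_Vehicle']]).ravel()

# min-max scaling: squeeze values into [0, 1]; the outlier compresses the rest
toy['minmax'] = MinMaxScaler().fit_transform(toy[['Age_of_Vehicle']]).ravel()

print(toy)
```

With robust scaling the four ordinary values stay evenly spread around zero, while min-max scaling pushes them all close to 0 because the single extreme value defines the upper end of the range.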

  • Imputing Missing Values

If you are up against missing values in your numerical features, you may want to fill them in with their mean, median, or most frequent value:

# load library
from sklearn.impute import SimpleImputer

# instantiate imputer
imputer = SimpleImputer(strategy='mean')

# impute values
num_cols = ['Age_of_Vehicle', 'Engine_Capacity_(CC)']
num_features_imputed = imputer.fit_transform(accidents[num_cols])

Alternatively, we can predict missing values using k-nearest neighbors (KNN). 
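A minimal sketch of the idea, assuming scikit-learn's KNNImputer (one possible tool for this, not necessarily the one the finished recipe will use) and a made-up toy dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy data with one missing engine capacity (values are made up)
toy = pd.DataFrame({
    'Age_of_Vehicle': [1.0, 2.0, 3.0, 10.0],
    'Engine_Capacity_(CC)': [1000.0, 1200.0, np.nan, 3000.0],
})

# fill each gap with the mean of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

The missing value is filled from the two vehicles closest in age. Because the method is distance-based, on real data you would probably want to scale the features first.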

… coming soon …

  • Binning Numerical Features

Sometimes you have a numerical feature and want to break it up into discrete bins. Binning (or bucketing) converts a numerical feature into a categorical one. A popular example is binning ages into age ranges.

import numpy as np

# bin Age_of_Vehicle feature
accidents['Age_of_Vehicle'] = np.digitize(accidents['Age_of_Vehicle'], bins=[-1,0,2,5,10])

Note that by default the arguments for the bins parameter denote the left edge of each bin (right=False). You can switch this behavior by setting right=True.
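To see what the right parameter changes, here is a small sketch with made-up ages and the same bin edges as above:

```python
import numpy as np

ages = np.array([0, 1, 2, 5, 12])
bins = [-1, 0, 2, 5, 10]

# default right=False: a value equal to an edge falls into the bin that starts there
left_closed = np.digitize(ages, bins=bins)

# right=True: a value equal to an edge falls into the bin that ends there
right_closed = np.digitize(ages, bins=bins, right=True)

print(left_closed)   # [2 2 3 4 5]
print(right_closed)  # [1 2 2 3 5]
```

Only the values that sit exactly on an edge (0, 2 and 5) change their bin; values above the last edge end up in an extra bin in both cases.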

Handling Categorical Data

  • Encoding Categorical Features

As most machine learning algorithms require numerical input, categorical data needs to be transformed into numbers.

A convenient one-hot encoder is provided by Pandas:

cat_cols = ['Daytime', 'Age_Band_of_Driver', 'Urban_or_Rural_Area']

# use Pandas get_dummies() for one-hot encoding
dummies = pd.get_dummies(accidents[cat_cols], drop_first=True)

It’s recommended that after one-hot encoding a feature, we drop one of the resulting dummy columns to avoid linear dependence. Using the drop_first argument above did the trick.
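A small sketch on made-up data shows which column gets dropped and how the dummies can be joined back to the other features (the toy values and the name toy_encoded are assumptions for illustration):

```python
import pandas as pd

# toy data (values are made up)
toy = pd.DataFrame({
    'Daytime': ['morning', 'night', 'morning', 'evening'],
    'Number_of_Casualties': [1, 2, 1, 3],
})

# one-hot encode and drop the first level ('evening') as the reference category
dummies = pd.get_dummies(toy[['Daytime']], drop_first=True)

# replace the original column with its dummy columns
toy_encoded = pd.concat([toy.drop(columns=['Daytime']), dummies], axis=1)

print(toy_encoded)
```

A record with zeros in both dummy columns belongs to the dropped category, so no information is lost.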

  • Handling Imbalanced Classes

If we have a target with highly imbalanced classes and we can’t apply the best strategy, which is to simply collect more data (especially from the minority class), we need to find other ways to deal with the imbalanced classes:

  • Another strategy is to use a model evaluation metric better suited to imbalanced classes: confusion matrices, precision, recall, F1 scores, or ROC curves instead of accuracy.
  • A third strategy is to use the class weighting parameter included in implementations of some models. This allows us to have the algorithm adjust for imbalanced classes.
  • The fourth and fifth strategies are related: downsampling the majority class and upsampling the minority class. These resampling strategies are well summarized in this blog post by Chris Remmel.
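Why accuracy can mislead on imbalanced classes is easy to show with a few made-up labels. The sketch below assumes scikit-learn's metrics module and a deliberately lazy "model" that always predicts the majority class:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# made-up labels: 8 majority (0) and 2 minority (1) records
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# a lazy model that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)   # looks good
recall = recall_score(y_true, y_pred)       # reveals that no minority record was found
precision = precision_score(y_true, y_pred, zero_division=0)
matrix = confusion_matrix(y_true, y_pred)

print(accuracy, recall, precision)
print(matrix)
```

The accuracy of 0.8 hides the fact that recall for the minority class is 0.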

Let’s first demonstrate using the class weighting parameter:

Random Forest is a popular classification algorithm and includes a class_weight parameter. You can pass the argument 'balanced' to have the algorithm automatically create weights inversely proportional to class frequencies, or you can pass a dictionary specifying the desired class weights explicitly.

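from sklearn.ensemble import RandomForestClassifier

# train RandomForest with balanced class weights
rf = RandomForestClassifier(class_weight='balanced').fit(X_train, y_train)

# train RandomForest with explicitly specified weights
weights = {1:.33, 2:.33, 3:.33}
rf = RandomForestClassifier(class_weight=weights).fit(X_train, y_train)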
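For a self-contained run, here is a sketch on synthetic data. The dataset from make_classification, the 90/10 class ratio, and the stratified split are assumptions for illustration; with the accidents data you would use your own features and target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# synthetic data with a class balance of roughly 90% / 10%
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# train RandomForest with balanced class weights
rf = RandomForestClassifier(class_weight='balanced', random_state=0)
rf.fit(X_train, y_train)

# judge the result per class rather than by overall accuracy
report = classification_report(y_test, rf.predict(X_test))
print(report)
```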

Alternatively, we can use resampling strategies:

One of these is the Synthetic Minority Oversampling Technique (SMOTE), where we upsample the rarer class. Rather than simply redrawing existing records with replacement (bootstrapping), the SMOTE algorithm picks a record of the minority class, finds similar records using K-Nearest Neighbors, and creates a synthetic record between the original and one of its neighbors.
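To make the interpolation idea concrete, here is a hand-rolled sketch. It is not the imbalanced-learn implementation; the function name smote_like_samples and its parameters are hypothetical, and it only handles numerical features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    k = min(k, len(X_minority) - 1)

    # ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)  # records to upsample
    pick = rng.integers(1, k + 1, size=n_new)             # skip column 0 (the point itself)
    neighbors = X_minority[neighbor_idx[base, pick]]

    # place each synthetic record somewhere on the line between the two points
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (neighbors - X_minority[base])

# made-up minority records
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic)
```

Every synthetic record lies between two existing minority records, so the new points stay inside the region the minority class already occupies.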

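from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# perform the usual train-test split first
# (X and y are assumed to hold your features and target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# view previous class distribution
print(Counter(y_train))

# resample data ONLY using training data
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

# view synthetic sample class distribution
print(Counter(y_resampled))

# and train your algorithm again on the resampled training data,
# keeping the untouched test set for evaluation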

  • Imputing Missing Class Values

If you have a categorical feature containing missing values, you can fill them in with the feature’s most frequent value:

# load library
from sklearn.impute import SimpleImputer

# instantiate imputer
imputer = SimpleImputer(strategy='most_frequent')

# impute values
cat_features_imputed = imputer.fit_transform(accidents[cat_cols])

An even better option would be to predict missing values using a k-nearest neighbors (KNN) classifier.

… coming soon …

Handling Dates and Times

Dates and times are frequently encountered during preprocessing for machine learning.

  • Converting Strings to Dates

When dates and times come as strings, we need to convert them into a data type that Python can understand. As the format of the strings can vary, we need to specify the format parameter in Pandas:

accidents['Date']= pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
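
Real data often contains a few entries that don't match the expected format. A minimal sketch with made-up strings, assuming you prefer missing values over an exception in that case:

```python
import pandas as pd

# made-up date strings in day/month/year order; the last one is invalid
dates = pd.Series(['31/12/2015', '01/02/2016', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

print(parsed)
```

The resulting NaT values can then be handled like any other missing value.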

  • Selecting Date and Time

We can set the date column as the index and then slice the dataframe using loc:

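# index a sorted copy by date so that 'Date' stays available as a column
accidents_by_date = accidents.set_index('Date').sort_index()
accidents_by_date.loc['2015-01-01' : '2016-01-01']

  • Breaking up Date Data into Components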

If we have a column of dates and want to access the year, month, weekday etc., we can use Pandas datetime properties:

accidents['year'] = accidents['Date'].dt.year
accidents['month'] = accidents['Date'].dt.month
accidents['weekday'] = accidents['Date'].dt.weekday

  • Encoding Days of the Week

If you want to know the day of the week for each date in your date column, you can extract its name. It might then be helpful to group by it and compare, let’s say, the average number of accidents on each day of the week.

weekday = accidents['Date'].dt.day_name()
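
The grouping step could look like the sketch below. It uses the dt.day_name() accessor and counts records per weekday instead of averaging; the toy dates and the column name weekday_name are made up for illustration.

```python
import pandas as pd

# made-up accident dates
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-08'])
})

# name of the weekday for each record
toy['weekday_name'] = toy['Date'].dt.day_name()

# number of accidents per weekday
accidents_per_weekday = toy.groupby('weekday_name').size()

print(accidents_per_weekday)
```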

  • Using a Rolling Time Window

Rolling means are often used to smooth out time series because using the mean of a certain window dampens the effect of short-term fluctuations. (And it’s also impressive in a visual!)

The window parameter specifies the size of the window (here, the number of consecutive observations) we want to use, followed by the statistic we want:

accident['rolling_mean'] = accident['Accidents'].rolling(window=10).mean()
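
A tiny worked example shows what the rolling mean does at the start of a series. The dataframe daily and its counts are made up; a window of 3 is used so the result is easy to check by hand.

```python
import pandas as pd

# made-up daily accident counts
daily = pd.DataFrame(
    {'Accidents': [1, 2, 3, 4, 5]},
    index=pd.date_range('2015-01-01', periods=5, freq='D'),
)

# mean over the current and the two preceding observations
daily['rolling_mean'] = daily['Accidents'].rolling(window=3).mean()

print(daily)
```

The first two rows are NaN because a full window of three observations is not yet available there.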

. . . . . . . . . . . . . .

I hope these are some helpful recipes to cook up your machine learning analysis! If you think I’ve missed something or have made an error somewhere, please let me know.