In machine learning, we use training algorithms to learn the parameters of a model by minimizing a loss function. However, many algorithms also have hyperparameters that must be set outside of the learning process. For example, the number of decision trees in a random forest must be set before fitting. On top of that, we often want to try several learning algorithms.

Let’s start with defining the terms used in this blog post to make sure we’re on the same page:

**Terminology**

*Learning algorithm*

An algorithm used to learn the best parameters of a model, for example linear regression, naive Bayes, or decision trees.

*Hyperparameters*

The settings of a learning algorithm that need to be set before training.

*Model*

An output of a learning algorithm’s training. Learning algorithms train models, which we then use to make predictions.

*Parameters*

The weights or coefficients of a model that have been learned through training.

*Performance*

A metric used to evaluate a model.

*Loss*

A quantity that training minimizes (or, for some objectives, maximizes).

**Selecting Learning Algorithms**

Choosing a learning algorithm can be a difficult task. Assuming you know your problem and your data, you have likely already decided on a starting point: supervised or unsupervised techniques, regression or classification, etc. However, you are almost certainly still faced with choosing from a range of possible algorithms.

If you have the time, you can try them all. However, that’s not usually the case. You can approach the matter by asking yourself several questions to narrow down which learning algorithm you might want to use:

*Explainability*

Neural networks and ensemble models are often very accurate, but they are black boxes: hard for a non-technical audience to understand and even harder to explain. Linear regression and decision trees, on the other hand, produce models that are often not *that* accurate, but the way they predict is straightforward and easy to communicate.
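To make the "easy to communicate" point concrete, here is a minimal sketch that prints the decision rules of a shallow tree. The iris dataset and the depth limit of 2 are assumptions chosen for illustration; they are not part of the recipes in this post.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# load a small example dataset (assumption: iris, for illustration only)
iris = load_iris()
# train a shallow tree so that the rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
# turn the fitted tree into plain-text if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules are a handful of threshold comparisons, which is usually something you can walk a non-technical audience through.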

*In-memory vs. Out-of-memory*

Can your dataset be fully loaded into the RAM of your server or laptop? If yes, you can choose from a wide variety of candidate algorithms. If no, you might prefer incremental learning algorithms that can improve the model by adding data gradually.
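As a rough sketch of incremental learning, the snippet below feeds data to scikit-learn's *SGDClassifier* in small batches via *partial_fit*. The synthetic dataset and the batch size of 100 are assumptions; in practice each batch would be read from disk or a database.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# synthetic data standing in for a dataset too large for RAM (assumption)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
classes = np.unique(y)  # all class labels must be known before the first batch

# linear classifier trained with stochastic gradient descent
model = SGDClassifier(random_state=0)

# feed the data in chunks of 100 records, as if streaming from disk
batch_size = 100
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # partial_fit updates the existing model instead of retraining from scratch
    model.partial_fit(X_batch, y_batch, classes=classes)

# accuracy on the data seen so far (training accuracy, for illustration only)
train_accuracy = model.score(X, y)
print(train_accuracy)
```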

*Number of Records and Features*

How many training examples do you have? And how many features does each example have? Some algorithms, such as neural networks or gradient boosting, can handle a huge number of records and features. Others are quite modest in their capacity.

*Nonlinearity of the Data*

Can your data be modeled with a linear model, or is it linearly separable? If so, logistic regression, linear regression, or SVMs with a linear kernel may be good choices. Otherwise, (deep) neural networks or ensemble algorithms might fit better.
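One quick way to check this is to compare a linear and a nonlinear model with cross-validation. The sketch below does so on a toy dataset of two concentric circles (an assumption, picked because it is clearly not linearly separable):

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# two concentric circles: a classic example of data that is not linearly separable
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# 5-fold cross-validated accuracy of a linear and a nonlinear model
linear_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

print(f"logistic regression: {linear_score:.2f}")
print(f"random forest:       {forest_score:.2f}")
```

If the linear model scores about as well as the nonlinear one on your own data, the simpler model is probably the better starting point.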

*Training Speed*

How much time is the learning algorithm allowed for training a model? Neural networks are known to train slowly, while simpler algorithms, like linear and logistic regression or decision trees, train much faster.

*Prediction Speed*

How fast does the model need to generate predictions once it is in production? Linear and logistic regression and SVMs are very fast at predicting, while kNN, ensemble algorithms, and recurrent neural networks are slower.

*Validation Set*

A popular way to choose an algorithm is by testing it on the validation set. If a dataset is of sufficient size, we work with three distinct sets of records:

- the *training* set
- the *validation* set
- the *test* set

The *validation set* is used to choose the learning *algorithm* and find the best values for its hyperparameters.

On the other hand, the *test set* is used to evaluate the *model* before putting it in production.
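A minimal sketch of such a three-way split, using two calls to scikit-learn's *train_test_split*. The 60/20/20 proportions and the synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic data for illustration (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# first split off 20% of the records as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the remaining 80% into training (60% of total) and validation (20% of total)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

When you use cross-validation (for example `cv=5` in the recipes further down), the folds carved out of the training data take over the role of the validation set, so only the test set needs to be held out explicitly.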

**Selecting Best Models**

*Regularization*

If a model is too complex for the data or if we have too many features and only a small number of training examples, the model is likely to show signs of overfitting. To penalize complex models and have them reduce their variance, regularization is a widely applied approach.

Specifically, a penalty term is added to the loss function we are trying to minimize, typically the *L1* (lasso) or the *L2* (ridge) penalty.

To combat overfitting in neural networks, *weight* *regularization* is a strategy similar to L2. Others are *dropout*, *batch normalization*, and *early stopping*.
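The practical difference between the two penalties can be sketched with scikit-learn's *Lasso* and *Ridge* regressors. The synthetic data (only 3 of 10 features carry signal) and the penalty strength `alpha=1.0` are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# synthetic regression data: only 3 of the 10 features carry signal (assumption)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha controls the strength of the penalty in both models
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 can set coefficients exactly to zero, L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients (lasso):", n_zero_lasso)
print("zero coefficients (ridge):", n_zero_ridge)
```

Because L1 can drop features entirely, it doubles as a simple form of feature selection, while L2 keeps every feature but with smaller weights.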

**Selecting Best Models using GridSearch**

If you want to select the best model by searching through a range of hyperparameters, you can use scikit-learn’s *GridSearchCV*:

```
# load libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# create LogisticRegression (the liblinear solver supports both the l1 and l2 penalty)
logistic = LogisticRegression(solver='liblinear')
# create range of penalty hyperparameter values
penalty = ['l1', 'l2']
# create range of max. number of iterations taken to converge
max_iter = [100, 110, 120, 130, 140]
# create dictionary with hyperparameter options
hyperparameters = dict(penalty=penalty, max_iter=max_iter)
# create grid search
grid_search = GridSearchCV(logistic, hyperparameters, cv=5)
# fit grid search
best_model = grid_search.fit(X_train, y_train)
# view best hyperparameters
print(best_model.best_params_)
```

**Selecting Best Models using Randomized Search**

A computationally cheaper and more efficient method than an exhaustive GridSearch is a randomized search technique called *RandomizedSearchCV* which searches through a specific number of random combinations of hyperparameter values.

If we specify a distribution for at least one hyperparameter, scikit-learn will randomly sample (with replacement) hyperparameter values from that distribution. Alternatively, if we specify only lists of values, *RandomizedSearchCV* will randomly sample (without replacement) from the lists.

```
# load libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
# create RandomForestRegressor
forest = RandomForestRegressor()
# create range of candidate number of trees in random forest
n_estimators = [100, 150]
# create range of candidate max. numbers of levels in tree
max_depth = [3, 4, 5]
# create dictionary with hyperparameter options
hyperparameters = dict(n_estimators=n_estimators, max_depth=max_depth)
# create randomized search (n_iter=5: try 5 of the 6 possible combinations)
randomized_search = RandomizedSearchCV(forest, hyperparameters, n_iter=5, cv=5)
# fit randomized search
best_model = randomized_search.fit(X_train, y_train)
# view best hyperparameters
print(best_model.best_params_)
```
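The recipe above samples from lists. As a sketch of the distribution-based variant, the snippet below draws integer values from `scipy.stats.randint`. The value ranges, `n_iter=5`, `cv=3`, and the synthetic training data are all assumptions for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic training data for illustration (assumption)
X_train, y_train = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(random_state=0)
# distributions instead of lists: randint(low, high) draws integers from [low, high)
hyperparameters = {"n_estimators": randint(20, 60),
                   "max_depth": randint(2, 6)}
# n_iter sets how many random combinations are tried
randomized_search = RandomizedSearchCV(forest, hyperparameters, n_iter=5,
                                       cv=3, random_state=0)
randomized_search.fit(X_train, y_train)
print(randomized_search.best_params_)
```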

**Selecting Best Models from Multiple Algorithms (Pipeline)**

If we are not certain which learning algorithm to use, we are also allowed to define a search space that includes multiple learning algorithms and their respective hyperparameters:

```
# load libraries
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# set random seed
np.random.seed(0)
# create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])
# create search space with candidate learning algorithms and their hyperparameters
# (the liblinear solver supports both the l1 and l2 penalty)
search_space = [{"classifier": [LogisticRegression(solver='liblinear')],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(start=0, stop=4, num=10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [130, 150, 170],
                 "classifier__max_features": [4, 5, 6]}]
# create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5)
# fit grid search
best_model = gridsearch.fit(X_train, y_train)
# view best model's learning algorithm and hyperparameters
print(best_model.best_estimator_.get_params()["classifier"])
```

**Selecting Best Models when Preprocessing (Pipeline)**

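Some preprocessing methods have their own parameters, which often have to be supplied by the user. For example, dimensionality reduction using PCA requires you to define the number of principal components to use in order to produce the transformed feature set. Wrapping the preprocessing steps and the classifier in one pipeline lets the grid search tune these values together with the model's hyperparameters.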

```
# load libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# set random seed
np.random.seed(0)
# create preprocessing object that combines StandardScaler and PCA
# (FeatureUnion concatenates the scaled features with the principal components)
preprocess = FeatureUnion([("scaler", StandardScaler()),
                           ("pca", PCA())])
# create a pipeline (the liblinear solver supports both the l1 and l2 penalty)
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression(solver='liblinear'))])
# create search space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(start=0, stop=4, num=10)}]
# create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5)
# fit grid search
best_model = gridsearch.fit(X_train, y_train)
# view model's best number of principal components
best_model.best_estimator_.get_params()["preprocess__pca__n_components"]
```

**Speeding up Model Selection with Parallelization**

In the real world, we often have many thousands of models to train, which can take many hours. To speed up the process, scikit-learn lets us train multiple models simultaneously: up to as many models as our machine has cores. On a laptop with four cores, for example, we can train four models at the same time, which can substantially speed up our model selection process.

The parameter *n_jobs* defines the number of models to train in parallel. If we set *n_jobs* to *-1*, we tell scikit-learn to use *all* of the cores. By default, *n_jobs* is *None*, which means *one* core unless a joblib parallel backend is configured.

```
# create randomized search using ONE core (same as the default behavior)
randomized_search = RandomizedSearchCV(forest, hyperparameters, cv=5, n_jobs=1)
# create randomized search using ALL cores
randomized_search = RandomizedSearchCV(forest, hyperparameters, cv=5, n_jobs=-1)
```
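To see what parallelization buys you on your own machine, you can time both variants. The following is a minimal sketch; the helper *timed_search*, the synthetic data, and the small search space are assumptions for illustration:

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic training data for illustration (assumption)
X_train, y_train = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
forest = RandomForestRegressor(random_state=0)
hyperparameters = dict(n_estimators=[50, 100], max_depth=[3, 4, 5])

def timed_search(n_jobs):
    """Fit the same randomized search and return it with the elapsed seconds."""
    search = RandomizedSearchCV(forest, hyperparameters, n_iter=5, cv=5,
                                random_state=0, n_jobs=n_jobs)
    start = time.perf_counter()
    search.fit(X_train, y_train)
    return search, time.perf_counter() - start

serial_search, serial_seconds = timed_search(n_jobs=1)
parallel_search, parallel_seconds = timed_search(n_jobs=-1)

print(f"one core:  {serial_seconds:.2f}s")
print(f"all cores: {parallel_seconds:.2f}s")
# with fixed random seeds, the selected hyperparameters do not depend on n_jobs
print(serial_search.best_params_ == parallel_search.best_params_)
```

On a search this small, the overhead of starting worker processes can eat up much of the gain; the benefit tends to show once each individual fit takes noticeably longer.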

. . . . . . . . . . . . . .

I hope these recipes are useful in selecting your machine learning model! If you think I’ve missed something or have made an error somewhere, please let me know: bb@brittabettendorf.com