Predicting How Someone Will Vote on Basic Income

A practical use case for applying different classification algorithms

In machine learning, classification means assigning data points to predefined groups (classes) using a classifier built by a classification algorithm. There are many such algorithms, each with its own pros and cons. To predict how someone might vote on basic income, we will use four of them here: Logistic Regression, Random Forest, XGBoost and Support Vector Machine (SVM).

In April 2016, Dalia Research conducted a poll on basic income across the 28 EU member states. The dataset contains 9649 records and 15 columns and includes demographics such as age, gender and education. It also includes opinions on the effects a basic income would have on someone’s work choices, their familiarity with the idea of basic income, convincing arguments for and against basic income – and of course, whether they ultimately approve or reject the idea. You can check it out on kaggle: https://www.kaggle.com/daliaresearch/basic-income-survey-european-dataset/home

Our target variable is whether people vote “yes” or “no” on basic income, while all the remaining columns, or features, may or may not become our predictor variables. Following the OSEMiN approach (Obtaining, Scrubbing, Exploring, Modeling and iNterpreting) as a blueprint for working on data problems, let’s start with the first task: cleaning and preparing our data.

Preprocessing

The usual steps of importing a dataset, exploring its shape and datatypes, renaming columns, dropping duplicates etc. can be looked up in my notebook. Here I want to focus on two basic feature engineering steps in this preprocessing workflow.

1. Recoding the votes

I decided to condense the target variable: for our purposes, someone who claims they will “probably vote yes” is equivalent to someone who claims they will vote “yes”, since both are in favor. These two groups were therefore consolidated into one yes-group, and the same was done for the “no” and “probably no” voters. All respondents who claimed they would not vote were dropped, because I wanted my models to predict a clear “yes” or “no”; the undecided were of no interest here.

# recode voting
def vote_coding(row):
    if row == 'I would vote for it' : return('for')
    elif row == 'I would probably vote for it': return('for')
    elif row == 'I would vote against it': return('against')
    elif row == 'I would probably vote against it': return('against')
    elif row == 'I would not vote': return('no_action')

# apply function
df['vote'] = df['vote'].apply(vote_coding)

# drop all records that are not "for" or "against"
df = df.query("vote != 'no_action'")

2. Pulling out the most frequent arguments

One of the questions respondents were asked was: Which of the following arguments FOR a basic income do you find convincing? Choose all that apply. Seven options were given in randomized order, and people could choose as many as they found appropriate. The same format was used for the reverse question: Which of the following arguments AGAINST the basic income do you find convincing? Choose all that apply. For each question, I extracted the two most frequently chosen arguments. Here is the code for the FOR arguments:

arg_for = ['It reduces anxiety about financing basic needs',
           'It creates more equality of opportunity',
           'It encourages financial independence and self-responsibility',
           'It increases solidarity, because it is funded by everyone',
           'It reduces bureaucracy and administrative expenses',
           'It increases appreciation for household work and volunteering',
           'None of the above']

# count how often each argument was chosen
counter = [0] * len(arg_for)

for _, row in df.iterrows():
    for i in range(len(arg_for)):
        if arg_for[i] in row['arg_for'].split('|'):
            counter[i] += 1

# map short labels to the counts
dict_keys = ['less anxiety', 'more equality', 'financial independence',
             'more solidarity', 'less bureaucracy', 'appreciates volunteering', 'none']

arg_dict = dict(zip(dict_keys, counter))

# sub-df for counted arguments
sub_df = pd.DataFrame(list(arg_dict.items()), columns=['Arguments PRO basic income', 'count'])

# plotting
sub_df.sort_values(by=['count'], ascending=True).plot(kind='barh', x='Arguments PRO basic income', y='count',  
                                                      figsize=(10,6), legend=False,
                                                      title='Arguments PRO basic income')
plt.xlabel('Count');
Frequency of arguments PRO basic income

Now I knew which arguments occurred most often (apart from “none”). I used this knowledge to create two new boolean columns, TRUE if a participant chose the argument and FALSE if not:

df['anxiety'] = df['arg_for'].str.contains('anxiety')
df['equality'] = df['arg_for'].str.contains('equality')

The same was done for the CONTRA arguments. At the end, I dropped the original arguments’ columns. If you are interested in learning about the full preprocessing workflow, feel free to read my jupyter notebook.
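For orientation, here is a minimal sketch of that CONTRA step. The arg_against column name and the keyword placeholders are assumptions; pick the actual keywords from your own frequency count of the CONTRA arguments:

# sketch only: 'arg_against' and the keyword placeholders are assumptions,
# replace them with the two most frequent CONTRA keywords from your own count
top_against_1, top_against_2 = 'keyword_1', 'keyword_2'

df['contra_1'] = df['arg_against'].str.contains(top_against_1)
df['contra_2'] = df['arg_against'].str.contains(top_against_2)

# drop the original argument columns once the boolean flags exist
df = df.drop(['arg_for', 'arg_against'], axis=1)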

Data Visualization

Now it’s time to get familiar with our data by visualizing it. The most obvious question that comes to mind is: How did they vote?

df['vote'].value_counts(normalize=True).plot(kind='barh', figsize=(8,4), 
                                             color=['maroon','midnightblue']);
Votes for and against basic income in %

A glance at the number of records per voting class shows that far more respondents voted “for” a basic income – about 70% – than “against” it. Let’s keep this in mind, because it could skew our models’ ability to predict.

Although I had never used a mosaic plot before, I decided to finally try one to picture the differences in votes between women and men, people living in cities vs. rural areas, people with or without children, and people with or without a full-time job. The latter can be seen here:

from statsmodels.graphics.mosaicplot import mosaic

mosaic(df, ['full_time_job', 'vote'], gap=0.015, 
       title='Vote vs. Having a Full Time Job or not - Mosaic Chart');
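The other three comparisons follow the same pattern; here is a quick sketch (the column names gender, rural and children are assumptions based on how the dataset was renamed):

# sketch: loop over the remaining comparisons; column names are assumptions
for col in ['gender', 'rural', 'children']:
    mosaic(df, [col, 'vote'], gap=0.015,
           title=f'Vote vs. {col} - Mosaic Chart')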

Another highly interesting plot is a comparison of how four different countries vote, depending on education level. Let’s have a look:

# Plot votes in 4 countries - depending on education level

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16,12))

# create sub-df for Germany
sub_df_1 = df[df['country_code']=='DE'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_1.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax1, legend=False)
ax1.set_title('\nVotes in Germany depending on education level\n', fontsize=14, fontweight='bold')
ax1.set_xlabel("Education Level")
ax1.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)
ax1.set_ylabel("Percentage of Voters\n")

# create sub-df for France
sub_df_2 = df[df['country_code']=='FR'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_2.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax2, legend=False)
ax2.set_title('\nVotes in France depending on education level\n', fontsize=14, fontweight='bold')
ax2.set_xlabel("Education Level")
ax2.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)
ax2.set_ylabel("Percentage of Voters\n")

# create sub-df for Italy
sub_df_3 = df[df['country_code']=='IT'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_3.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax3, legend=False)
ax3.set_title('\nVotes in Italy depending on education level\n', fontsize=14, fontweight='bold')
ax3.set_xlabel("Education Level")
ax3.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)

# create sub-df for Slovakia
sub_df_4 = df[df['country_code']=='SK'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_4.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax4, legend=False)
ax4.set_title('\nVotes in Slovakia depending on education level\n', fontsize=14, fontweight='bold')
ax4.set_xlabel("Education Level")
ax4.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)

# create only one legend
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(1.0, 0.95))
plt.tight_layout()
plt.show();
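Since the four panels only differ in country code and title, the same figure can also be produced more compactly with a loop; here is a sketch that assumes the same dataframe:

countries = {'DE': 'Germany', 'FR': 'France', 'IT': 'Italy', 'SK': 'Slovakia'}

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

for ax, (code, name) in zip(axes.ravel(), countries.items()):
    # per-country vote shares by education level
    sub = (df[df['country_code'] == code]
           .groupby('education')['vote']
           .value_counts(normalize=True)
           .unstack())
    sub.plot(kind='bar', color=['midnightblue', 'maroon'], ax=ax, legend=False)
    ax.set_title(f'\nVotes in {name} depending on education level\n',
                 fontsize=14, fontweight='bold')
    ax.set_xlabel('Education Level')
    ax.set_xticklabels(['High', 'Low', 'Medium', 'No'], rotation=0)
    ax.set_ylabel('Percentage of Voters\n')

# create only one legend for the whole figure
handles, labels = axes[0, 1].get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(1.0, 0.95))
plt.tight_layout()
plt.show()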

If you are interested in seeing more data visualizations, feel free to read my jupyter notebook.

Machine Learning

1. Recoding categorical features

Machine learning algorithms generally need all data – including categorical data – in numeric form. A common way to represent categorical variables as numbers is one-hot encoding: each unique category of a categorical feature gets its own binary column (a so-called dummy variable), and each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns. To apply one-hot encoding, we use the pandas get_dummies function:

# define our features 
features = df.drop(["vote"], axis=1)

# define our target
target = df[["vote"]]

# create dummy variables
features = pd.get_dummies(features)
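As a quick illustration of what one-hot encoding does, here is a toy example (not part of the survey data):

# toy example: each unique category becomes its own binary column
toy = pd.DataFrame({'education': ['high', 'low', 'medium', 'low']})
print(pd.get_dummies(toy))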

2. Training Logistic Regression

Now the modeling can begin! The first algorithm used is plain old logistic regression. All we need to do is import a few libraries and apply a train-test split, and then the fun starts! While testing the different models, I store all accuracy scores in a new dataframe so that all models and scores are quick and easy to compare later:

# import train_test_split function
from sklearn.model_selection import train_test_split

# import LogisticRegression
from sklearn.linear_model import LogisticRegression

# import metrics
from sklearn.metrics import classification_report, accuracy_score

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")
# split our data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
# instantiate the logistic regression
logreg = LogisticRegression()

# train
logreg.fit(X_train, y_train)

# predict
train_preds = logreg.predict(X_train)
test_preds = logreg.predict(X_test)

# evaluate
train_accuracy_logreg = accuracy_score(y_train, train_preds)
test_accuracy_logreg = accuracy_score(y_test, test_preds)
report_logreg = classification_report(y_test, test_preds)

print("Logistic Regression")
print("------------------------")
print(f"Training Accuracy: {(train_accuracy_logreg * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_logreg * 100):.4}%")

# store accuracy in a new dataframe
score_logreg = ['Logistic Regression', train_accuracy_logreg, test_accuracy_logreg]
models = pd.DataFrame([score_logreg])
Logistic Regression
------------------------
Training Accuracy: 77.38%
Test Accuracy:     77.94%

3. Training a Random Forest

The next classifier, which has often proved to be a good choice, is a Random Forest. A promising way to improve a model’s performance is to find good parameters to set when creating the model; this can have a huge impact on how well it fits the data! Here we use a grid search over parameter combinations to find the best settings for our Random Forest:

# import random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# create a baseline
forest = RandomForestClassifier()

# create Grid              
param_grid = {'n_estimators': [80, 100, 120],
              'criterion': ['gini', 'entropy'],
              'max_features': [5, 7, 9],         
              'max_depth': [5, 8, 10], 
              'min_samples_split': [2, 3, 4]}

# instantiate the grid search
forest_grid_search = GridSearchCV(forest, param_grid, cv=3, n_jobs=-1)

# run the grid search on the training data
forest_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(forest_grid_search.best_params_)
{'criterion': 'gini', 'max_depth': 10, 'max_features': 9, 'min_samples_split': 4, 'n_estimators': 120}
# instantiate the tuned random forest with the best found parameters
forest = RandomForestClassifier(n_estimators=120, criterion='gini', max_features=9, 
                                max_depth=10, min_samples_split=4, random_state=4)

# train the random forest
forest.fit(X_train, y_train)

# predict
train_preds = forest.predict(X_train)
test_preds = forest.predict(X_test)

# evaluate
train_accuracy_forest = accuracy_score(y_train, train_preds)
test_accuracy_forest = accuracy_score(y_test, test_preds)
report_forest = classification_report(y_test, test_preds)

print("Random Forest")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_forest * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_forest * 100):.4}%")

# append accuracy score to our dataframe
score_forest = ['Random Forest', train_accuracy_forest, test_accuracy_forest]
models = pd.concat([models, pd.DataFrame([score_forest])], ignore_index=True)
Random Forest
-------------------------
Training Accuracy: 80.9%
Test Accuracy:     76.76%

4. Training an XGBoost classifier

Gradient boosting is one of the most powerful concepts in machine learning right now, and XGBoost (eXtreme Gradient Boosting) is among its most popular and best-performing implementations, which makes it a great choice for classification tasks. We’ll use a grid search again to find the best combination of parameters:

# import xgboost classifier
import xgboost as xgb

# create a baseline
booster = xgb.XGBClassifier()

# create Grid
param_grid = {'n_estimators': [100],
              'learning_rate': [0.05, 0.1], 
              'max_depth': [3, 5, 10],
              'colsample_bytree': [0.7, 1],
              'gamma': [0.0, 0.1, 0.2]}

# instantiate the grid search for the booster
booster_grid_search = GridSearchCV(booster, param_grid, scoring='accuracy', cv=3, n_jobs=-1)

# run the grid search on the training data
booster_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(booster_grid_search.best_params_)
{'colsample_bytree': 0.7, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
# instantiate tuned xgboost
booster = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100,
                            colsample_bytree=0.7, gamma=0.1, random_state=4)

# train
booster.fit(X_train, y_train)

# predict
train_preds = booster.predict(X_train)
test_preds = booster.predict(X_test)

# evaluate
train_accuracy_booster = accuracy_score(y_train, train_preds)
test_accuracy_booster = accuracy_score(y_test, test_preds)
report_booster = classification_report(y_test, test_preds)

print("XGBoost")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_booster * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_booster * 100):.4}%")

# append accuracy score to our dataframe
score_booster = ['XGBoost', train_accuracy_booster, test_accuracy_booster]
models = pd.concat([models, pd.DataFrame([score_booster])], ignore_index=True)
XGBoost
-------------------------
Training Accuracy: 81.6%
Test Accuracy:     77.99%

5. Training a Support Vector Machine (SVM)

Another popular classification technique is the Support Vector Machine (SVM). Let’s give it a try:

from sklearn.svm import SVC

# instantiate the Support Vector Classifier
svm = SVC(kernel='rbf', random_state=4)

# train
svm.fit(X_train, y_train)

# predict
train_preds = svm.predict(X_train)
test_preds = svm.predict(X_test)

# evaluate
train_accuracy_svm = accuracy_score(y_train, train_preds)
test_accuracy_svm = accuracy_score(y_test, test_preds)
report_svm = classification_report(y_test, test_preds)

print("Support Vector Machine")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_svm * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_svm * 100):.4}%")

# append accuracy score to our dataframe
score_svm = ['Support Vector Machine', train_accuracy_svm, test_accuracy_svm]
models = pd.concat([models, pd.DataFrame([score_svm])], ignore_index=True)
Support Vector Machine
-------------------------
Training Accuracy: 77.94%
Test Accuracy:     77.46%
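As a side note on the design: SVMs work with distances, so they usually benefit from feature scaling. Below is a minimal sketch of a scaled variant; it is not the model whose scores are reported above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# purely a sketch: scale the features before applying the RBF kernel
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=4))
scaled_svm.fit(X_train, y_train)
print(f"Scaled SVM test accuracy: {scaled_svm.score(X_test, y_test) * 100:.4}%")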

6. Model Comparison

Now that we have run several models, let’s check the accuracy scores we collected along the way, as well as the classification reports:

models
Raw dataframe with accuracy scores
# make the dataframe easier to read
models.columns = ['Classifier', 'Training Accuracy', "Testing Accuracy"]
models.set_index(['Classifier'], inplace=True)

# sort by testing accuracy
models.sort_values(['Testing Accuracy'], ascending=[False])
Sorted and labeled dataframe with accuracy scores
print('Classification Report XGBoost: \n', report_booster)
print('------------------------------------------------------')
print('Classification Report Logistic Regression: \n', report_logreg)
print('------------------------------------------------------')
print('Classification Report SVM: \n', report_svm)
print('------------------------------------------------------')
print('Classification Report Random Forest: \n', report_forest)

All our models have done similarly well, boasting an average F1 score between 72% and 76%. However, looking at the classification reports, we can see that “for” votes are classified fairly well, while “against” votes are disproportionately misclassified.

Why might this be the case? Earlier, we saw that we have far more records for “for” votes than for “against” votes, and we had a hunch that this might skew our models’ ability to predict. That is exactly what happened, and it tells us that most of our models’ accuracy is driven by their ability to classify “for” votes, which is less than ideal.
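One way to see this class-wise error directly is a confusion matrix; here is a quick sketch for the XGBoost test predictions:

from sklearn.metrics import confusion_matrix

# rows = true classes, columns = predicted classes (shown here for the XGBoost model)
cm = confusion_matrix(y_test, booster.predict(X_test), labels=['against', 'for'])
print(pd.DataFrame(cm, index=['true against', 'true for'],
                   columns=['pred against', 'pred for']))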

7. Balancing the data

To account for our imbalanced dataset, we can use an oversampling algorithm called SMOTE (Synthetic Minority Oversampling Technique). SMOTE creates synthetic data by interpolating between minority-class observations and their nearest neighbors. It’s important that we only oversample the training data; that way, none of the information in the validation data is used to create synthetic observations. Here we go:

# import the necessary library
from imblearn.over_sampling import SMOTE

# view previous class distribution
print(target['vote'].value_counts()) 

# resample data ONLY on training data
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train) 

# view synthetic sample class distribution
print(pd.Series(y_resampled).value_counts()) 
for        6073
against    2398
Name: vote, dtype: int64

for        3419
against    3419
dtype: int64
# then perform the usual train-test-split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, random_state=0)

After oversampling the minority class, we applied all of our models to the now balanced data. I have skipped this code here, but if you are interested, feel free to read it in the full jupyter notebook.
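For orientation, here is a condensed sketch of that step for a single model; the notebook repeats the same pattern for all four classifiers and collects the scores in models2:

# sketch: retrain one model on the balanced split; the notebook does this for all four
booster2 = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100,
                             colsample_bytree=0.7, gamma=0.1, random_state=4)
booster2.fit(X_train, y_train)

report_booster2 = classification_report(y_test, booster2.predict(X_test))

score_booster2 = ['XGBoost', accuracy_score(y_train, booster2.predict(X_train)),
                  accuracy_score(y_test, booster2.predict(X_test))]
models2 = pd.DataFrame([score_booster2])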

We’ve now balanced our dataset. Did it reduce the models’ bias?

models2.columns = ['Classifier balanced', 'Training Accuracy', "Testing Accuracy"]
models2.set_index(['Classifier balanced'], inplace=True)
models2.sort_values(['Testing Accuracy'], ascending=[False])
Accuracy scores for models run on balanced data

Great! The first thing we see is that balancing improved the overall accuracy for both XGBoost and Random Forest. But did it also remove our nasty bias?

print('Classification Report XGBoost: \n', report_booster2)
print('------------------------------------------------------')
print('Classification Report Logistic Regression: \n', report_logreg2)
print('------------------------------------------------------')
print('Classification Report SVM: \n', report_svm2)
print('------------------------------------------------------')
print('Classification Report Random Forest: \n', report_forest2)

Success! Balancing our data has removed the bias towards the more prevalent class. The F1 scores are almost even now. That’s what we wanted.

8. Feature Importance

Our last step today is to check which features are most important in predicting someone’s vote. Let’s pick the Random Forest algorithm for this:

# plot the important features - based on Random Forest
feat_importances = pd.Series(forest.feature_importances_, index=features.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh', color='darkgrey', figsize=(10,5))
plt.xlabel('Relative Feature Importance with Random Forest');

The plot shows that our (fairly basic) engineered features are the most important ones for our predictions! Knowing someone’s attitude towards specific arguments helps a lot in predicting their final verdict, and therefore their vote.

. . . . . . . . . . . . . .

Thank you for reading!

The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!