Detecting and Categorizing Toxic Comments in Online Conversations

A deep learning approach for text classification

Neural networks have proved to be very effective at text classification, achieving state-of-the-art results with less need for hand-engineered features. Using a Kaggle competition dataset, I’d like to present five such deep learning algorithms for classifying comments that are rude, disrespectful, or otherwise likely to make someone leave an online discussion.

In 2018, Google set up a Kaggle competition challenging participants to build better-performing models capable of detecting different types of toxicity in online comments, such as threats, obscenity, insults, or identity-based hate.

I used neural networks to approach this task and would like to take you on this journey with me! The corresponding notebook with all the code is here.

1. Obtaining and Viewing the Data

The available data consisted of a training dataset containing comments with their binary labels, a test dataset with only comments, and a test_labels dataset with the corresponding labels. We therefore had to merge the two test-related datasets with an inner join, and a substantial number of rows had to be dropped because those comments were never rated and thus could not be used for evaluating our models. If you are interested in the code, feel free to read my Jupyter notebook.
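As a sketch of that merge step, assuming the Kaggle file layout where test_labels marks unscored comments with -1 in every label column (the toy values below are made up):

```python
import pandas as pd

# toy stand-ins for test.csv and test_labels.csv (values are made up)
test = pd.DataFrame({"id": ["a", "b", "c"],
                     "comment_text": ["nice comment", "rude comment", "unscored comment"]})
test_labels = pd.DataFrame({"id": ["a", "b", "c"],
                            "toxic": [0, 1, -1], "severe_toxic": [0, 0, -1],
                            "obscene": [0, 1, -1], "threat": [0, 0, -1],
                            "insult": [0, 1, -1], "identity_hate": [0, 0, -1]})

# merge comments with their labels via an inner join on the id column ...
test = test.merge(test_labels, how="inner", on="id")

# ... and drop the rows that were never rated (label value -1)
test = test[test["toxic"] != -1].reset_index(drop=True)
print(test.shape)  # two rated comments: id + comment_text + six labels
```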

The training data consists of an id column, the comment itself, and six different labels qualifying this comment as toxic, severely toxic, obscene, a threat, an insult, or identity-based hate. Note that the comments can be attached to one, none, or several of these categories!

The training data warrants a closer look:

# check the number of records
print('The dataset contains', train.shape[0], 'records and', train.shape[1], 'columns.')

[Output:] The dataset contains 159571 records and 8 columns.
# check that there are no missing values in either training set
print('The dataset has', train.isna().sum().sum(), 'missing values.')

[Output:] The dataset has 0 missing values.
# check if there are any duplicates
print('The dataset has', train.duplicated().sum(), 'duplicates.')

[Output:] The dataset has 0 duplicates.

Let’s explore some comments:


[Output:] "You, sir, are my hero. Any chance you remember what page that's on?"

[Output:] "Before you start throwing accusations and warnings at me, lets review the edit itself-making ad hominem attacks isn't going to strengthen your argument, it will merely make it look like you are abusing your power as an admin. \nNow, the edit itself is relevant-this is probably the single most talked about event int he news as of late. His absence is notable, since he is the only living ex-president who did not attend. That's certainly more notable than his dedicating an aircracft carrier. \nI intend to revert this edit, in hopes of attracting the attention of an admin that is willing to look at the issue itself, and not throw accusations around quite so liberally. Perhaps, if you achieve a level of civility where you can do this, we can have a rational discussion on the topic and resolve the matter peacefully."

2. Data Visualizations

To familiarize ourselves with the data, it’s always practical to perform some data visualizations. To not bore my dear readers, I will only show the images – if you would like to see the corresponding code, please check out my notebook on GitHub.

One question that immediately pops up is: What type of comments occur most frequently? Let’s have a look!

Bar chart of the number of comments per label

The three types of comments that occur most often are: toxic comments, obscene comments, and insulting comments.

As mentioned earlier, comments can be labelled with none, one, or several categories. So let’s quickly check the number of frequent comment combinations using a simple groupby-statement:

Grouping by columns, counting the number of comments and sorting descending

Comments with no label at all, which seem to be completely fine, are clearly ahead with 143,346 comments. Toxic comments then occur in different combinations within the first 15 ranks, and the same holds for obscene comments as well as insulting comments. Interestingly, the number of comments per combination drops off sharply.
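The groupby-statement mentioned above can be sketched on a toy frame (label columns named as in the dataset; the counts here are made up):

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# toy frame: two unlabelled comments, one toxic+insult, one toxic
train = pd.DataFrame([[0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0],
                      [1, 0, 0, 0, 1, 0],
                      [1, 0, 0, 0, 0, 0]], columns=label_cols)

# group by every label combination, count comments, sort descending
combo_counts = train.groupby(label_cols).size().sort_values(ascending=False)
print(combo_counts)  # the all-zero combination comes out on top
```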

This leads to looking at a correlation matrix:

Heatmap of classified comments

The heatmap – or correlation matrix – illuminates interesting relationships: Toxic comments are clearly correlated with both obscene and insulting comments. Interestingly, toxic and severely toxic comments are only weakly correlated. Obscene comments and insulting comments are also highly correlated, which makes perfect sense.

Two Venn diagrams emphasize these findings:

But what does it mean for a comment to be obscene, threatening, or based on identity hate? To shed light upon these comments, WordClouds are a great visualization tool! They work in a simple way: the more often a specific word appears in the text, the bigger and bolder it appears in the WordCloud.

(Please, please accept my apologies in case you feel offended by the following text! These words and language used do not reflect my own views or opinions, and are used here for educational purposes only.)

WordCloud for comments classified as threatening
WordCloud for comments classified as identity hate

3. Preprocessing the Data

Now, we have to preprocess our text data. But how do we represent words to a deep learning algorithm?

Well, we can do this with word embeddings. Word embedding models map words to vectors of real numbers such that simple similarity metrics on those vectors reveal meaningful semantic relationships between words. Unlike simpler bag-of-words representations, word embeddings provide a dense representation of words and their relative meanings. There are two different ways to feed word embeddings into a neural network: train your own embedding layer or use a pre-trained embedding (like GloVe).

To convert our text to vectors, we use Keras’s convenient preprocessing tools to tokenize each example, convert it to a sequence, and then pad the sequences so they’re all the same length.

# importing libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

# define X
X_train = train["comment_text"].values
X_test  = test["comment_text"].values

# define y
y_train = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
y_test  = test[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values

# tokenizing the data: build the vocabulary from the training comments
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(list(X_train))

# turning the tokenized text into sequences
X_train = tokenizer.texts_to_sequences(X_train)
X_test  = tokenizer.texts_to_sequences(X_test)

# padding the sequences
X_train = sequence.pad_sequences(X_train, maxlen=200)
X_test  = sequence.pad_sequences(X_test,  maxlen=200)

print('X_train shape:', X_train.shape)
print('X_test shape: ', X_test.shape)

X_train shape: (159571, 200)
X_test shape:  (63978, 200)

That’s pretty much all there is to preprocessing!

We can now move on to building, training, and evaluating our neural networks.

4. Different Neural Networks

Before we start, I’d like to make it clear that I did not reinvent the wheel! I conducted a lot of research to find starting points and inspiration for useful “ingredients” for a neural network for this specific task. For example: how many layers to use, how many nodes, how many filters, which activation function to incorporate, what kind of regularization and where to place it, etc. In other words, it’s not all my idea.

Instead, my goal was to get comfortable with building different networks and gain a deeper understanding of the architecture!

The first step was to import the necessary Keras libraries. The second step was to set some vital model parameters:

from keras import initializers, regularizers, constraints, optimizers, layers
from keras.models import Model, Sequential
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, SpatialDropout1D, Activation
from keras.layers import Conv1D, Bidirectional, GlobalMaxPool1D, MaxPooling1D, BatchNormalization
from keras.optimizers import Adam
# number of unique words we want to use (or: number of rows in incoming embedding vector)
max_features = 20000 

# max number of words in a comment to use (or: number of columns in incoming embedding vector)
max_len = 200 

# dimension of the embedding variable (or: number of rows in output of embedding vector)
embedding_dims = 128

4.1. Baseline Neural Network

Let’s start by instantiating a classic densely connected neural network as a strong baseline, and review the layers it’s composed of:

  • The input layer is the first layer in the neural network. It takes input values and passes them on to the next layer without applying any operations to them. We begin by defining an input layer that accepts a list of words with a dimension of 200 (= max words in one comment).
  • Next, we pass our vectors to an embedding layer, where we project the words onto a defined vector space depending on the distance of the surrounding words in a sentence. Embedding allows us to reduce model size, and most importantly the huge dimensions we have to deal with.
  • The output of the embedding layer is just a list of the coordinates of the words in this vector space. We need to define the size of this vector space and the number of unique words we are using. It’s important to know that the embedding size is a parameter that one can tune and experiment with.
  • A pooling layer effectively downsamples the output of the prior layer, reducing the number of operations required for all following layers, but still passing on the valid information from the previous layer.

  • Dense layers (or fully connected layers) are simply a linear operation in which every input is connected to every output by a weight, generally followed by a non-linear activation function.

  • Regularization layers are used to overcome the over-fitting problem. In regularization we penalize our loss term by adding an L1 (LASSO) or an L2 (Ridge) norm on the weight vector. Alternatively, we can apply a dropout layer, where individual nodes are dropped out of the net so that a reduced network is left.

  • Finally, the output layer is the last layer in the network, and receives its input from the last hidden layer. With this layer we can get the desired number of values in a desired range. In our network we have 6 neurons in the output layer.

  • Activation functions are used to introduce non-linearity to neural networks. They squash the values into a smaller range. There are many activation functions used in the deep learning industry, such as ReLU, Sigmoid, or TanH.
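As a quick illustration of that squashing behaviour (toy inputs, plain NumPy rather than Keras):

```python
import numpy as np

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # passes positive values through and clips negatives to 0
    return np.maximum(0.0, x)

values = np.array([-3.0, 0.0, 3.0])
print(sigmoid(values))  # all outputs lie strictly between 0 and 1
print(relu(values))     # [0. 0. 3.]
```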

# instantiate NN model
base_model = Sequential()

# add embedding layer 
base_model.add(Embedding(input_dim=max_features, input_length=max_len,

# add pooling layer 
# ... which will extract features from the embeddings of all words in the comment

# add dense layer to produce an output dimension of 50 and apply relu activation
base_model.add(Dense(50, activation='relu'))

# set the regularizing dropout layer to drop out 30% of the nodes

# finally add a dense layer
# ... which projects the output into six units and squashes it with sigmoid activation
base_model.add(Dense(6, activation='sigmoid'))

Next, we compile the model. Some thoughts on the parameters that need to be defined in this stage:

  • The loss function computes the error for a single training example, while the cost function is the average of the loss over the entire training set. Our choice here is binary_crossentropy, i.e. a logarithmic loss applied to each of the six labels independently, since every comment can belong to several classes at once.
  • When we train neural networks, we usually use Gradient Descent to optimize the weights. At each iteration, we use back-propagation to calculate the derivative of the loss function with respect to each weight and subtract it from that weight. The learning rate determines how quickly or how slowly we update the weight values. It should be high enough to converge in an acceptable time, yet low enough to actually find the minimum.
  • The optimizer is a search technique, which is used to update weights in the model. Our choice here is Adaptive Moment Estimation (Adam) which uses adaptive learning rates.
  • Performance metrics are used to measure the performance of the neural network. Accuracy, loss, validation accuracy, validation loss, mean absolute error, precision, recall, and f1 score are some performance metrics. Our choice here is accuracy.
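The weight update described above can be sketched on a toy one-parameter loss (made-up numbers, not the model’s actual loss):

```python
# toy loss L(w) = (w - 3)^2 with derivative dL/dw = 2 * (w - 3)
w = 0.0
learning_rate = 0.1

for _ in range(50):
    grad = 2 * (w - 3)          # back-propagation would supply this derivative
    w -= learning_rate * grad   # the gradient descent update rule

print(w)  # converges towards the minimum at w = 3
```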
# compile the model
base_model.compile(loss='binary_crossentropy',
                   optimizer=Adam(0.01), metrics=['accuracy'])

# check the model with all our layers
base_model.summary()

Now we have to train our network and decide on some training parameters:

  • The batch size is the number of training examples in one forward/backward pass. In general, larger batch sizes result in faster progress in training, but don’t always converge as quickly. Smaller batch sizes train slower, but can converge faster. And the higher the batch size, the more memory space you’ll need.

  • The training epochs are the number of times that the model is exposed to the training dataset. One epoch equals one forward pass and one backward pass of all the training examples. In general, models improve with more epochs of training, up to a point; they’ll start to plateau in accuracy as they converge.

base_hist =, y_train, batch_size=32, 
                           epochs=3, validation_split=0.1)

It seems that the accuracy is pretty decent for a basic attempt! Let’s evaluate our model on the unseen test data:

# evaluate the algorithm on the test dataset
base_test_loss, base_test_acc = base_model.evaluate(X_test, y_test, batch_size=32)
print('Test Loss:    ', base_test_loss)
print('Test Accuracy:', base_test_acc)

63978/63978 [==============================] - 9s 144us/step
Test Loss:     0.0777554601263663
Test Accuracy: 0.971805410953583

Let’s move on to the next algorithm:

4.2. Convolutional Neural Network (CNN)

Convolutional neural networks (CNNs) have recently proved to be very effective at document classification, namely because they can pick out salient features (e.g. tokens or sequences of tokens) in a way that is invariant to their position within the input sequence. Simply put, a convolution is a sliding window function applied to a matrix. To set up a CNN, we have to add a convolutional layer:

  • A convolutional layer consists of a set of “filters”. These filters only take in a subset of the input data at a given time, but are applied across the full input by sweeping over it. The operations performed here are still linear, but they are generally followed by a non-linear activation function.

I also learned to add batch normalization, which, together with dropout, is one of the keys to preventing overfitting.

  • A batch normalization layer normalizes the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1. It will be added after the activation function, between a convolutional and a max-pooling layer.
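The normalization itself can be sketched in plain NumPy (one feature over a toy batch of four examples; the real layer additionally learns a scale and shift):

```python
import numpy as np

# toy batch of activations for a single feature
batch = np.array([2.0, 4.0, 6.0, 8.0])

# shift to zero mean and scale to unit standard deviation
eps = 1e-5  # avoids division by zero for constant batches
normalized = (batch - batch.mean()) / np.sqrt(batch.var() + eps)

print(normalized.mean())  # approximately 0
print(normalized.std())   # approximately 1
```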

Let’s instantiate, compile, and train the CNN:

# instantiate CNN model
cnn_model = Sequential()

# add embedding layer 
cnn_model.add(Embedding(input_dim=max_features, input_length=max_len,

# set the dropout layer to drop out 50% of the nodes

# add convolutional layer that has ...
# ... 100 filters with a kernel size of 4 so that each convolution will consider a window of 4 word embeddings
cnn_model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))

# add normalization layer

# add pooling layer 

# set the dropout layer to drop out 50% of the nodes

# add dense layer to produce an output dimension of 50 and apply relu activation
cnn_model.add(Dense(50, activation='relu'))

# finally add a dense layer
cnn_model.add(Dense(6, activation='sigmoid'))

# compile
cnn_model.compile(loss='binary_crossentropy', optimizer=Adam(0.01), metrics=['accuracy'])

# train
cnn_hist =, y_train, batch_size=32, 
                         epochs=3, validation_split=0.1)

Well, the validation accuracy is about 0.02 percentage points worse than in our baseline…

As a side note: I first ran this algorithm without batch normalization and had a training accuracy of 0.9736 and a validation accuracy of 0.9748. Now, with batch normalization, both training and validation accuracy scores have increased slightly – to 0.9807 and 0.9805 respectively.

# validate
cnn_test_loss, cnn_test_acc = cnn_model.evaluate(X_test, y_test, batch_size=32)
print('Test Loss:    ', cnn_test_loss)
print('Test Accuracy:', cnn_test_acc)

63978/63978 [==============================] - 135s 2ms/step
Test Loss:     0.07513055060917224
Test Accuracy: 0.9705471649182295

4.3. Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. The idea behind RNNs is to make use of sequential information.

Bidirectional RNNs are based on the idea that the output at a given point in time may depend not only on the previous elements in the sequence, but also on future elements. For example, to predict a missing word in a sequence, you want to look at the context on both the left and the right.

From this point onwards, I will drop some outputs to keep this blog post from ballooning out of proportion. Detailed code snippets are available here.

Let’s now instantiate, compile, train, and evaluate the RNN:

# instantiate RNN model
rnn_model = Sequential()

# add embedding layer 
rnn_model.add(Embedding(input_dim=max_features, input_length=max_len,

# set the dropout layer to drop out 50% of the nodes

# add bidirectional layer and pass in an LSTM()
rnn_model.add(Bidirectional(LSTM(25, return_sequences=True)))

# add normalization layer

# add pooling layer 

# set the dropout layer to drop out 50% of the nodes

# add dense layer to produce an output dimension of 50 and apply relu activation
rnn_model.add(Dense(50, activation='relu'))

# finally add a dense layer
rnn_model.add(Dense(6, activation='sigmoid'))

# compile
rnn_model.compile(loss='binary_crossentropy', optimizer=Adam(0.01), metrics=['accuracy'])

# train
rnn_hist =, y_train, batch_size=32, 
                          epochs=3, validation_split=0.1)
# evaluate
rnn_test_loss, rnn_test_acc = rnn_model.evaluate(X_test, y_test, batch_size=32)
print('Test Loss:    ', rnn_test_loss)
print('Test Accuracy:', rnn_test_acc)

63978/63978 [==============================] - 332s 5ms/step
Test Loss:     0.07362373834004522
Test Accuracy: 0.971680366852086

4.4. CNN with Pre-Trained GloVe Embedding

Let’s now explore the other way of adding an embedding layer to our neural network: using a pre-trained embedding. We’ll go with GloVe vectors, first in a CNN and then in an RNN.

Let’s start with loading and preparing the GloVe text data:

# load the glove840B embedding
import numpy as np

embeddings_index = dict()
with open('glove.840B.300d.txt') as f:
    for line in f:
        # Note: use split() instead of split(' ') if you get an error
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 2196016 word vectors.
# create a weight matrix
embedding_matrix = np.zeros((len(tokenizer.word_index)+1, 300))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
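One quick sanity check at this point is how much of the tokenizer’s vocabulary the GloVe file actually covers; words that are missing keep their all-zero rows in the weight matrix. A sketch with toy dictionaries (substitute tokenizer.word_index and embeddings_index for a real run):

```python
# toy stand-ins for tokenizer.word_index and the loaded GloVe index
word_index = {"the": 1, "toxic": 2, "zzzmadeupword": 3}
embeddings_index = {"the": [0.1], "toxic": [0.2]}

found = sum(1 for word in word_index if word in embeddings_index)
coverage = found / len(word_index)
print('GloVe covers {:.1%} of the vocabulary.'.format(coverage))
```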

Now, let’s instantiate, compile, train, and evaluate the pre-trained CNN. Note that a pre-trained embedding requires the arguments weights=[embedding_matrix] as well as trainable=False to freeze the weights.

# instantiate pretrained glove model
glove_model = Sequential()

# add embedding layer 
glove_model.add(Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                          input_length=max_len, weights=[embedding_matrix], trainable=False))

# set the dropout layer to drop out 50% of the nodes

# add convolutional layer that has ...
# ... 100 filters with a kernel size of 4 so that each convolution will consider a window of 4 word embeddings
glove_model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))

# add normalization layer

# add pooling layer 

# set the dropout layer to drop out 50% of the nodes

# add dense layer to produce an output dimension of 50 and apply relu activation
glove_model.add(Dense(50, activation='relu'))

# finally add a dense layer
glove_model.add(Dense(6, activation='sigmoid'))

# compile
glove_model.compile(loss='binary_crossentropy', optimizer=Adam(0.01), metrics=['accuracy'])

# train
glove_hist =, y_train, batch_size=32, 
                             epochs=3, validation_split=0.1)
# evaluate
glove_test_loss, glove_test_acc = glove_model.evaluate(X_test, y_test, batch_size=32)
print('Test Loss:    ', glove_test_loss)
print('Test Accuracy:', glove_test_acc)

63978/63978 [==============================] - 192s 3ms/step
Test Loss:     0.08019186425862686
Test Accuracy: 0.9684605098325205

4.5. RNN with Pre-Trained GloVe Embedding

Finally, we will instantiate, compile, train, and evaluate a pre-trained RNN:

# instantiate pretrained glove model
glove_2_model = Sequential()

# add embedding layer 
glove_2_model.add(Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                            input_length=max_len, weights=[embedding_matrix], trainable=False))

# set the dropout layer to drop out 50% of the nodes

# add bidirectional layer and pass in an LSTM()
glove_2_model.add(Bidirectional(LSTM(25, return_sequences=True)))

# add normalization layer

# add pooling layer 

# set the dropout layer to drop out 50% of the nodes

# add dense layer to produce an output dimension of 50 and apply relu activation
glove_2_model.add(Dense(50, activation='relu'))

# finally add a dense layer
glove_2_model.add(Dense(6, activation='sigmoid'))

# compile
glove_2_model.compile(loss='binary_crossentropy', optimizer=Adam(0.01), metrics=['accuracy'])

# train
glove_2_hist =, y_train, batch_size=32, 
                                 epochs=3, validation_split=0.1)
# evaluate
glove_2_test_loss, glove_2_test_acc = glove_2_model.evaluate(X_test, y_test, batch_size=32)
print('Test Loss:    ', glove_2_test_loss)
print('Test Accuracy:', glove_2_test_acc)

63978/63978 [==============================] - 442s 7ms/step
Test Loss:     0.072724116503393
Test Accuracy: 0.9723863393294703

5. Conclusions

It’s time to wrap up everything we’ve done so far. Here is a summary:

  • When we look at the training accuracy, the plain neural network performs best.
  • When we check the validation accuracy, it’s also the plain neural network that stands out.
  • But when we check how well the models classify unseen data – and therefore check the testing accuracy – it’s the pre-trained RNN that does this job with the highest accuracy.

In the actual competition, it really came down to the decimal places. The winner reached an accuracy score of 0.9877 with a pre-trained embedding (although I’m not certain whether this is a training or a validation accuracy). They also used a semi-supervised learning technique called Pseudo-Labelling!

I think I came quite close to these numbers, but some minor tweaks could certainly be made to improve the third decimal place even further, such as trying different dropout patterns or numbers of nodes in certain layers. It’s not the sky that’s the limit – it’s the runtime!

. . . . . . . . . . . . . .

Thank you for reading!

The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and I am always happy to get
critical and friendly feedback, or suggestions for improvement!