What Airbnb Visitors Like and Dislike

Applying Natural Language Processing techniques to look into the content of Airbnb reviews

Since its creation in 2008, Airbnb has seen enormous growth, with the number of rentals listed on its website growing exponentially each year. In Germany, no city is more popular than Berlin, which makes it one of the hottest markets for Airbnb in Europe, with 22,552 listings as of November 2018. Given the city's area of 891 km², that means roughly 25 homes per km² are being rented out on Airbnb in Berlin! With so many visitors, what can we find out about their likes and dislikes? And since certain things, such as location and furniture, are obviously fixed, is there anything else a host can influence to satisfy their guests – such as their communication patterns, the Airbnb description, or additional services?

In this project, I will use data from the reviews and combine it with selected features from the detailed Berlin listings data from the Inside Airbnb website. Both datasets were scraped on November 7th, 2018.

The corresponding notebook with the full code is here.

1. Obtaining and Viewing the Data

As the reviews data contains, well, only reviews, it may be valuable to add more details, such as the reviewed accommodation’s latitude and longitude, the neighbourhood it’s in, the host id, etc. To get this information, let’s combine our reviews_dataframe with the listings_dataframe, taking only the columns we need from the latter:
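The join itself is a standard pandas merge. Here is a minimal sketch on toy data – the column names (`listing_id`, `id`, `neighbourhood`, etc.) are assumptions standing in for the real Inside Airbnb schema:

```python
import pandas as pd

# toy stand-ins for the real datasets: each review references a listing via 'listing_id'
reviews_df = pd.DataFrame({'listing_id': [1, 2, 1],
                           'comments': ['Great stay', 'Too noisy', 'Clean flat']})
listings_df = pd.DataFrame({'id': [1, 2],
                            'neighbourhood': ['Mitte', 'Kreuzberg'],
                            'latitude': [52.53, 52.50],
                            'longitude': [13.40, 13.42]})

# keep only the listing columns we need, then join them onto the reviews
cols = ['id', 'neighbourhood', 'latitude', 'longitude']
df = reviews_df.merge(listings_df[cols], left_on='listing_id', right_on='id', how='left')
```

A left join keeps every review even if a listing were missing from the second dataframe.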

All in all, we’re dealing with 401,963 records and 12 columns.

2. Preprocessing the Data

Language Detection

Let’s first find out which languages are most frequently used to write reviews. We use Python’s langdetect module to figure that out:

# we use Python's langdetect
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# detect the language of a text, returning None if detection fails
def language_detection(text):
    try:
        return detect(text)
    except LangDetectException:
        return None

# apply the function and store values in a new column
df['language'] = df['comments'].apply(language_detection)

What are the top 6 languages?
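A quick way to answer this is a frequency count on the new `language` column – shown here on a toy dataframe standing in for the real one:

```python
import pandas as pd

# toy reviews dataframe with a 'language' column, standing in for the real df
df = pd.DataFrame({'comments': ['Great flat!', 'Tolle Wohnung!', 'Super séjour',
                                'Nice host', 'Lovely', 'Schön'],
                   'language': ['en', 'de', 'fr', 'en', 'en', 'de']})

# frequency of each language, most common first
top_languages = df['language'].value_counts()
print(top_languages.head(6))
```

On the real data, English, German, and French come out on top, which is why we keep sub-dataframes for those three below.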

# keep sub-dataframes for the three most frequent languages
df_eng = df[(df['language']=='en')]
df_de  = df[(df['language']=='de')]
df_fr  = df[(df['language']=='fr')]


Let’s look at WordClouds for the English, German, and French reviews to get a sense for which words are used most frequently in each language:

3. Sentiment Analysis

VADER package

VADER is an easy-to-use Python library built specifically for sentiment analysis of social media texts. It belongs to the family of sentiment analysis approaches based on lexicons of sentiment-related words: each word in the lexicon is rated as positive or negative, and in many cases, how positive or negative.
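The core idea can be sketched with a toy lexicon – the word ratings and scoring below are purely illustrative, not VADER’s actual lexicon or formula:

```python
# a toy word lexicon: positive words get positive ratings, negative words negative ones
lexicon = {'good': 1.9, 'great': 3.1, 'bad': -2.5, 'terrible': -2.1}

def lexicon_score(sentence):
    # sum the ratings of all lexicon words found in the sentence;
    # words not in the lexicon contribute 0
    words = sentence.lower().split()
    return sum(lexicon.get(w, 0.0) for w in words)

score = lexicon_score('The location was great but the mattress was bad')
```

VADER additionally handles negation, intensifiers, punctuation, and capitalization, and normalizes the summed score – but the lexicon lookup above is the heart of the approach.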

# load the SentimentIntensityAnalyser object in
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# assign it to another name to make it easier to use
analyzer = SentimentIntensityAnalyzer()

VADER produces four sentiment metrics from these word ratings, which you can see below. The first three – positive, neutral and negative – represent the proportion of the text that falls into those categories. As you can see, our example sentence was rated as 42% positive, 58% neutral, and 0% negative.

The final metric, the compound score, is the sum of all the lexicon ratings, normalized to range between -1 and 1. In this case, our example sentence has a compound score of 0.44, which is moderately positive.

# use the polarity_scores() method to get the sentiment metrics
def print_sentiment_scores(sentence):
    snt = analyzer.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(snt)))

print_sentiment_scores("This raspberry cake is good.")
This raspberry cake is good.------------ {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}

Let’s quickly define functions to get each score for a given sentence:

# getting only the negative score
def negative_score(text):
    negative_value = analyzer.polarity_scores(text)['neg']
    return negative_value

# getting only the neutral score
def neutral_score(text):
    neutral_value = analyzer.polarity_scores(text)['neu']
    return neutral_value

# getting only the positive score
def positive_score(text):
    positive_value = analyzer.polarity_scores(text)['pos']
    return positive_value

# getting only the compound score
def compound_score(text):
    compound_value = analyzer.polarity_scores(text)['compound']
    return compound_value

Let’s now have VADER produce all four scores for each of our English-language comments. As this takes roughly a quarter of an hour, it’s a good idea to save the dataframe.

df_eng['sentiment_neg'] = df_eng['comments'].apply(negative_score)
df_eng['sentiment_neu'] = df_eng['comments'].apply(neutral_score)
df_eng['sentiment_pos'] = df_eng['comments'].apply(positive_score)
df_eng['sentiment_compound'] = df_eng['comments'].apply(compound_score)
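Saving and reloading is a one-liner each with pandas pickling – sketched here with a toy frame and a temporary path (the real path would point into the project’s data folder):

```python
import os
import tempfile
import pandas as pd

# toy stand-in for the scored df_eng
df_eng = pd.DataFrame({'comments': ['Great stay'], 'sentiment_compound': [0.93]})

# save once so the ~15-minute scoring step need not be repeated...
path = os.path.join(tempfile.gettempdir(), 'df_eng_sentiment.pkl')
df_eng.to_pickle(path)

# ...and reload instantly later
df_eng_restored = pd.read_pickle(path)
```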

And let’s investigate the distribution of all the scores:

Remember what we said earlier: VADER produces four sentiment metrics from these word ratings: The first three – positive, neutral and negative – represent the proportion of the text that falls into those categories. And the final metric, the compound score, is the sum of all of the lexicon ratings which have been standardized to range between -1 and 1.

Comparing Negative and Positive Comments

Let’s use the describe() method to generate percentiles of our dataset’s compound score:

percentiles = df_eng.sentiment_compound.describe(percentiles=[.05, .1, .2, .3, .4, .5, .6, .7, .8, .9])
count    271836.000000
mean          0.831619
std           0.261954
min          -0.996800
5%            0.051600
10%           0.585900
20%           0.784700
30%           0.862000
40%           0.903200
50%           0.928700
60%           0.946800
70%           0.960800
80%           0.972100
90%           0.982500
max           0.999500
Name: sentiment_compound, dtype: float64

And let’s define all comments up to the 5th percentile as negative comments, all comments from the 5th to the 10th percentile as slightly positive, and all above as very positive. What percentage of each type of comment do we have?
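One way to bucket the comments along those percentile cut-offs, demonstrated on a toy series of compound scores (the labels are the ones defined above):

```python
import pandas as pd

# toy compound scores standing in for df_eng['sentiment_compound']
scores = pd.Series([-0.5, 0.03, 0.3, 0.9, 0.95, 0.99])

# cut-offs at the 5th and 10th percentiles of this sample
p05, p10 = scores.quantile([0.05, 0.10])

# assign each score to one of the three buckets
labels = pd.cut(scores, bins=[-1.0, p05, p10, 1.0],
                labels=['negative', 'slightly positive', 'very positive'])

# share of each bucket
shares = labels.value_counts(normalize=True)
```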

Clearly, the bulk of the reviews are tremendously positive. Wouldn’t it be interesting to know what the negative and positive comments are about?

However, to keep this blog post readable, I chose to focus only on the positive comments here. You can find the rest of the story in my notebook on GitHub, where you can check out the negative comments and much more besides.

Positive Comments

Let’s start with a WordCloud for positive comments only:

Another method for visually exploring text is with frequency distributions. In the context of a text corpus, such a distribution tells us the prevalence of certain words. Here we use the Yellowbrick library.

# importing libraries
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text.freqdist import FreqDistVisualizer

# vectorizing the text
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(pos_comments)
features = vectorizer.get_feature_names()

# preparing the plot
plt.title('The Top 30 most frequent words used in POSITIVE comments\n', fontweight='bold')

# instantiating and fitting the FreqDistVisualizer, plotting the top 30 most frequent terms
visualizer = FreqDistVisualizer(features=features, n=30)
visualizer.fit(docs)
visualizer.poof()

Next we’ll explore topic modeling, an unsupervised machine learning technique for abstracting topics from collections of documents or, in our case, for identifying which topic is being discussed in a comment.

Methods for topic modeling have evolved significantly over the last decade. In this section, we’ll explore Latent Dirichlet Allocation (LDA), a widely used topic modeling technique, in three steps: 1. Cleaning and Preprocessing, 2. Conducting LDA the Gensim way, and 3. Visualizing topics.

1. Cleaning and Preprocessing

In the first step we remove stopwords and punctuation, and normalize the corpus. Check the corresponding code I used here if you’re interested.
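As a minimal sketch of that cleaning step – with a tiny hand-rolled stopword set instead of a full list, and lemmatization omitted for brevity:

```python
import string

# a small illustrative stopword set; the notebook uses a full list
stopwords = {'the', 'a', 'an', 'and', 'was', 'is', 'in', 'to', 'we'}

def clean(doc):
    # strip punctuation, lowercase, and drop stopwords
    no_punct = doc.translate(str.maketrans('', '', string.punctuation))
    return [w for w in no_punct.lower().split() if w not in stopwords]

# each document becomes a list of normalized tokens
doc_clean = [clean(d) for d in ['The flat was great, and the host was friendly!']]
```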

2. Conducting LDA the Gensim way

The second step is to create a Gensim dictionary from the normalized data. We then convert this to a bag-of-words corpus, and save both dictionary and corpus for future use.

from gensim import corpora

# create the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(doc_clean)
corpus = [dictionary.doc2bow(text) for text in doc_clean]

# save both for future use
import pickle
pickle.dump(corpus, open('data/sentimentData/corpus.pkl', 'wb'))
dictionary.save('data/sentimentData/dictionary.gensim')

import gensim

# let LDA find 3 topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.024*"apartment" + 0.015*"great" + 0.015*"walk" + 0.014*"minute"')
(1, '0.017*"u" + 0.012*"place" + 0.012*"room" + 0.009*"even"')
(2, '0.038*"great" + 0.032*"place" + 0.027*"stay" + 0.022*"apartment"')
  1. The first topic includes words like apartment, great, walk, and minute. This sounds like a topic related to convenience in terms of the distances between the accommodation and interesting places to go to nearby.
  2. The second topic includes words like place, room, even, and a mysterious u (perhaps u-bahn for the underground, or simply an artifact of lemmatizing the word “us”?). It seems unclear to me what this topic was supposed to be about.
  3. And the third topic combines words like great, place, stay, and apartment, which sounds like a cluster related to overall satisfaction with the home.

We can also let LDA find 6 topics instead of 3, or 20 topics or however many we want. I ran LDA with 3, 5, and 10 topics to get a good sense for the data.

Putting it all together – the WordCloud, the different LDAs, and the frequency distribution – it appears that the following criteria often make someone rate an apartment positively:

  • The apartment is clean, the bathroom is clean, and the bed is comfortable.
  • The apartment is quiet, making it easy to get a good night’s sleep.
  • The area is centrally located with short walking distances, good public transport connections, and has cafes and restaurants nearby.

Now, finding an Airbnb that fulfills the last two criteria can be virtually impossible… but this is nonetheless what tourists all over the world look for!

Before we leave, let’s visualize the LDA model:

3. Visualizing topics

The pyLDAvis library is designed to provide a visual interface for interpreting the topics derived from a topic model by extracting information from a fitted LDA topic model.

dictionary = gensim.corpora.Dictionary.load('data/sentimentData/dictionary.gensim')
corpus = pickle.load(open('data/sentimentData/corpus.pkl', 'rb'))

import pyLDAvis.gensim

# visualizing the 5-topic model
lda = gensim.models.ldamodel.LdaModel.load('data/sentimentData/model5.gensim')
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Topic Modeling with pyLDAvis library

In short, the interface provides:

  • a left panel that depicts a global view of the model (how prevalent each topic is and how topics relate to each other);
  • a right panel containing a bar chart – the bars represent the terms that are most useful in interpreting the topic currently selected (what the meaning of each topic is).

On the left, the topics are plotted as circles, whose centers are defined by the computed distance between topics (projected into 2 dimensions). The prevalence of each topic is indicated by the circle’s area. On the right, two juxtaposed bars show the topic-specific frequency of each term (in red) and the corpus-wide frequency (in blueish gray). When no topic is selected, the right panel displays the top 30 most salient terms for the dataset.

Feel free to check out my notebook and play around a bit. Also, my investigation of negative comments can be found there. That’s it for today!

. . . . . . . . . . . . . .

Thank you for reading! The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!