What to Expect as an Airbnb Host in Berlin

An attempt to determine a daily rate for a new accommodation listing in Berlin.

Since its founding in 2008, Airbnb has seen enormous growth, with the number of rentals listed on its platform increasing rapidly every year. In Germany, no city is more popular than Berlin, which makes it one of the hottest Airbnb markets in Europe, with 22,552 listings as of November 2018. Spread over the city's 891 km², that works out to roughly 25 homes rented out per km²! However, it is difficult for potential hosts to know what their home is really worth and how much demand it might attract. Can we determine a reasonably accurate daily price for a new accommodation, given its specific market environment and its competition in Berlin?

Our question focuses on accommodation features and the decisions a new host can make regarding initial presentation, e.g. posting a picture of themselves on the website, setting a minimum night stay, offering instant bookings, etc.

I will perform an analysis of the detailed Berlin listings data, sourced from the Inside Airbnb website, in order to understand the rental landscape and try to recommend a daily rate for a newbie entering the market. The dataset was scraped on November 7th, 2018.

For this task, I will be using an XGBoost regressor, and would like to take you on this journey with me! The corresponding notebook with the full code is here.

1. Obtaining and Viewing the Data

The dataset contains 22,552 records and 96 columns – far more than we need. I kept 23 columns that looked relevant and dropped the rest, roughly as sketched below.
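The exact 23 columns aren't reproduced here; the snippet below is only a rough sketch of that subsetting step, with an illustrative (not complete) guess at column names from the Inside Airbnb schema:

# keep a hand-picked subset of the 96 raw columns (illustrative, not the actual 23)
cols_to_keep = ['price', 'room_type', 'property_type', 'latitude', 'longitude',
                'accommodates', 'bathrooms', 'bedrooms', 'description', 'amenities',
                'cleaning_fee', 'security_deposit', 'extra_people',
                'minimum_nights', 'maximum_nights', 'instant_bookable']
df_raw = df_raw[cols_to_keep]

So, what room types and property types are we up against?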

df_raw.room_type.value_counts(normalize=True)

Private room       0.511440
Entire home/apt    0.475435
Shared room        0.013125
Name: room_type, dtype: float64
df_raw.property_type.value_counts(normalize=True)

Apartment                 0.896816
Condominium               0.027137
Loft                      0.020397
House                     0.017648
Serviced apartment        0.007760
Hostel                    0.005676
Townhouse                 0.004390
Guest suite               0.003281
Bed and breakfast         0.002838
Guesthouse                0.002527
Hotel                     0.002217
Other                     0.002084
Boutique hotel            0.001907
Bungalow                  0.000887
Boat                      0.000754
Tiny house                0.000532
Houseboat                 0.000488
Camper/RV                 0.000488
Villa                     0.000443
Pension (South Korea)     0.000310
Aparthotel                0.000310
Cabin                     0.000266
Cottage                   0.000177
Resort                    0.000133
Castle                    0.000089
Train                     0.000089
Casa particular (Cuba)    0.000089
Cave                      0.000044
Island                    0.000044
Chalet                    0.000044
Tipi                      0.000044
Barn                      0.000044
In-law                    0.000044
Name: property_type, dtype: float64

2. Preprocessing the Data

Feature Engineering No. 1: Distance to Center of Berlin

Location is always an important factor for lodging services. To make it more descriptive, I decided to calculate each accommodation's distance to the so-called center of Berlin instead of relying only on the neighbourhoods or areas. (The coordinates of this center can easily be found by asking Google.) For our convenience, let's write a quick function that does this, apply it to each accommodation, and store the values in a new column:

# importing library
from geopy.distance import great_circle

# defining function
def distance_to_mid(lat, lon):
    berlin_centre = (52.5027778, 13.404166666666667)
    accommodation = (lat, lon)
    return great_circle(berlin_centre, accommodation).km

# store calculated values in new column
df_raw['distance'] = df_raw.apply(lambda x: distance_to_mid(x.latitude, x.longitude), axis=1)

Feature Engineering No. 2: Lodging Size

One of the most important pieces of information for predicting the rate is size. Since the square_feet column was mostly null, we dropped it in the previous section. (Besides, size in Germany is expressed in square meters, not square feet.) But the description column seems rich in content. Let's extract

  • all two- or three-digit numbers…
  • that are followed by the character “s” or “m” (covering “sqm”, “square meters”, “m2”, etc.) and…
  • that may or may not be separated from the number by a space.

Single-digit numbers, or numbers with more than three digits, are quite unlikely to describe an accommodation's size.

I know, it’s a bold move – but let’s give it a try…

# extract two- or three-digit numbers followed by "s"/"m" from the description
df_raw['size'] = df_raw['description'].str.extract(r'(\d{2,3}\s?[smSM])', expand=True)
# keep only the digits
df_raw['size'] = df_raw['size'].str.replace(r'\D', '', regex=True)

# change datatype of size into float
df_raw['size'] = df_raw['size'].astype(float)

Comparing a couple of the description rows with the results in the newly created size column shows that we did a pretty good job for most records, but extracted an incorrect size for some. Let's keep that in mind: there may be mistakes in the sizes we engineered from the text!

However, half of our records still don't have a size – that's a problem! Dropping these records isn't an option, as we would lose too much valuable information, and simply replacing them with the mean or median would make little sense. That leaves a third option: predicting the missing values with a machine learning algorithm. I took a few numerical features, split the data into a) a training set where we have sizes and b) a test set where we don't, and applied a linear regression to predict the missing values (a rough sketch of the idea follows). If you're interested in how exactly I did that, please check out my corresponding notebook with all the code here.
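As a minimal sketch of the idea (not the exact notebook code): the predictor columns below are hypothetical guesses at cleaned numerical features, not necessarily the ones actually used.

# a rough sketch: impute missing sizes with a linear regression
from sklearn.linear_model import LinearRegression

# hypothetical numerical predictors, assumed to be free of missing values themselves
predictors = ['accommodates', 'bathrooms', 'bedrooms', 'cleaning_fee']

train = df_raw[df_raw['size'].notnull()]   # records where a size was extracted
test  = df_raw[df_raw['size'].isnull()]    # records with a missing size

lr = LinearRegression()
lr.fit(train[predictors], train['size'])

# fill the gaps with the model's predictions
df_raw.loc[df_raw['size'].isnull(), 'size'] = lr.predict(test[predictors])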

Feature Engineering No. 3: Lodging Amenities

It could also be of interest to know what amenities hosts offer their guests. Moreover, to enrich our price prediction, it might be worth identifying some of the more special or rare amenities that make a property more desirable.

# import libraries
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# clean the amenities which are all written in one column
results = Counter()
df['amenities'].str.strip('{}')\
               .str.replace('"', '')\
               .str.lstrip('\"')\
               .str.rstrip('\"')\
               .str.split(',')\
               .apply(results.update)

# create a new dataframe
sub_df = pd.DataFrame(results.most_common(30), columns=['amenity', 'count'])

# plot the Top 30
sub_df.sort_values(by=['count'], ascending=True).plot(kind='barh', x='amenity', y='count',  
                                                      figsize=(10,7), legend=False, color='darkgrey',
                                                      title='Amenities')
plt.xlabel('Count');
Top 30 most frequently mentioned amenities

Let’s add columns with amenities that are somewhat unique and not offered by all hosts:

  • a laptop-friendly workspace
  • a TV
  • a family/kid-friendly accommodation
  • smoking being allowed, and
  • being greeted by the host.

After doing this, let’s drop the original column:

df['Laptop_friendly_workspace'] = df['amenities'].str.contains('Laptop friendly workspace')
df['TV'] = df['amenities'].str.contains('TV')
df['Family_kid_friendly'] = df['amenities'].str.contains('Family/kid friendly')
df['Host_greets_you'] = df['amenities'].str.contains('Host greets you')
df['Smoking_allowed'] = df['amenities'].str.contains('Smoking allowed')

df.drop(['amenities'], axis=1, inplace=True)

3. Modeling the Data

First we need to define the target and the features we want to work with. The string columns are converted into categorical columns, which in turn are turned into separate binary features, so-called dummy variables, since machine learning algorithms require numeric input. A minimal sketch of this encoding step follows.
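The sketch below assumes a dataframe features holding the predictor columns and target holding the daily price; the categorical column names are guesses, and the notebook may do this differently.

# one-hot encode the categorical string columns into binary dummy variables
import pandas as pd

categorical_cols = ['room_type', 'property_type']   # assumed categorical columns
features_recoded = pd.get_dummies(features, columns=categorical_cols, drop_first=True)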

One of the challenges when building models is mixing features that have very different scales – compare bathrooms with size or maximum_nights in our dataset. When feature ranges differ by orders of magnitude, a model may not be able to find the proper coefficients. To account for this problem, we standardize or normalize the features.

# split our data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_recoded, target, test_size=0.2)

# scale our data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

Now we're prepared to train an XGBoost model. To improve its performance, I used a grid search to find the best hyperparameters.

# create a baseline
import xgboost as xgb
booster = xgb.XGBRegressor()

from sklearn.model_selection import GridSearchCV

# create Grid
param_grid = {'n_estimators': [100, 150, 200],
              'learning_rate': [0.01, 0.05, 0.1], 
              'max_depth': [3, 4, 5, 6, 7],
              'colsample_bytree': [0.6, 0.7, 1],
              'gamma': [0.0, 0.1, 0.2]}

# set up the grid search over the booster
booster_grid_search = GridSearchCV(booster, param_grid, cv=3, n_jobs=-1)

# run the grid search on the training data
booster_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(booster_grid_search.best_params_)

{'colsample_bytree': 0.7, 'gamma': 0.2, 'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 200}

Some of the important hyperparameters to tune in an XGBoost model are:

  • n_estimators = the number of trees to build.
  • learning_rate = the rate at which the model learns; after every boosting round it shrinks the new feature weights, making the boosting process more conservative.
  • max_depth = how deep each tree is allowed to grow during any boosting round.
  • colsample_bytree = the fraction of features used per tree.
  • gamma = the minimum loss reduction required to make a further split.

Let’s run our XGBoost …

# instantiate xgboost with best parameters
booster = xgb.XGBRegressor(colsample_bytree=0.7, gamma=0.2, learning_rate=0.1, 
                           max_depth=6, n_estimators=200, random_state=4)

# train
booster.fit(X_train, y_train)

# predict
y_pred_train = booster.predict(X_train)
y_pred_test = booster.predict(X_test)

… and check the results.

(Remember Winston Churchill’s saying: “However beautiful the strategy, you should occasionally look at the results.”)

# evaluate on the test set
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

RMSE = np.sqrt(mean_squared_error(y_test, y_pred_test))
print(f"RMSE: {round(RMSE, 4)}")

r2 = r2_score(y_test, y_pred_test)
print(f"r2: {round(r2, 4)}")

RMSE: 22.537
r2: 0.7123

In order to build more robust models, it is common to conduct a k-fold cross validation, in which all entries of the original training dataset are used for both training and validation. A rough sketch of such a procedure follows; you can check the exact process I carried out here.
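This is only an illustration of the idea; the number of folds and the scoring metric are assumptions, not necessarily what the notebook uses.

# 5-fold cross validation of the tuned booster on the training data
from sklearn.model_selection import cross_val_score
import numpy as np

neg_mse = cross_val_score(booster, X_train, y_train,
                          scoring='neg_mean_squared_error', cv=5)
rmse_per_fold = np.sqrt(-neg_mse)
print(f"CV RMSE per fold: {np.round(rmse_per_fold, 2)}")
print(f"Mean CV RMSE: {round(rmse_per_fold.mean(), 2)}")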

4. Interpreting the Data

We can see that the average error (RMSE) of the initial XGBoost is around 22€, which improved to about 17.5€ with cross validation. Given that, even after cleaning up the price column, half of our lodgings cost at most 45€ per night (and 75% at most 70€), an error of 17.5€ is still quite a large inaccuracy that doesn't help much in recommending a price.

It turns out that the price isn’t only dependent on geography, size, and amenities. It stands to reason that

  • the furnishing (simple, luxurious),
  • the quality of presentation (e.g. pictures),
  • availability,
  • the number and content of reviews,
  • communication (e.g. acceptance rate, host response time) and/or
  • status (whether or not the host is a super host)

might also have a substantial influence. But the purpose of this analysis was to recommend a price to a “rookie” without any reviews, ratings, or status of any sort. With this in mind, we might say that although we can’t recommend an exact price, we can recommend a range instead.

The next step (and maybe an idea for the reader) would be to start all over again and include some of the features mentioned above to try to find out if the accuracy improves. That might help a beginner on Airbnb better know what price to aim for.

With what we have done here, we have explained 71% of the variance (R²) using the most important accommodation features, as shown in the feature-importance plot (a sketch of how to produce it is below):
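A minimal sketch of how such a chart can be built from the trained booster; it assumes the feature names come from the columns of features_recoded, in the same order as the training matrix.

# pair each feature with its importance score from the trained booster
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(booster.feature_importances_,
                        index=features_recoded.columns).sort_values()

# horizontal bar chart of the most influential features
importances.tail(15).plot(kind='barh', figsize=(10, 7), color='darkgrey',
                          title='Feature importances')
plt.xlabel('Relative importance');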

As we see, the most important features are size, distance, and cleaning fee, which together account for approximately 45% of the feature importance behind the daily price. Other top features are the number of people the apartment accommodates, other fees such as the security deposit or the price for extra people, the minimum night stay, and the number of bathrooms. That's it for today!

. . . . . . . . . . . . . .

Thank you for reading! The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback or suggestions for improvement!