**An approach to explaining the demand for, and expected earnings of, Airbnb hosts in Berlin**

Since its creation in 2008, Airbnb has seen enormous growth, with the number of rentals listed on its website growing exponentially each year. In Germany, no city is more popular than Berlin. That makes Berlin one of the hottest markets for Airbnb in Europe, with 22,552 listings as of November 2018. With an area of 891 km², this means there are roughly 25 homes being rented out per km² in Berlin on Airbnb! But what can a host expect with respect to occupancy and earnings here in Berlin? And what factors influence how in demand a listing is?

The dataset we use here contains detailed review data for Berlin and was sourced from the Inside Airbnb website (scraped on November 07th, 2018). Today, we will mainly use visualizations to explore how big the demand for Airbnb is likely to be. However, the most interesting part is the preprocessing!

The corresponding notebook with the full code is here.

### 1. Obtaining and Viewing the Data

The dataset contains 401,963 records and 6 columns. Let’s have a glance:

In only 5 rows of data, we already have 4 different languages to deal with! What we see is that each and every review is listed and we probably have dozens of them per accommodation.

### 2. Preprocessing the Data

Several things need to be done. First, we have to count the reviews per month and per lodging within a defined time period, say 12 months. Second, we have to enrich these review counts with the listings data for more details, e.g. price, cleaning fee, location, number of bathrooms etc. Only then will we have the data needed to estimate the occupancy and income.
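As a rough illustration of these two steps, here is a minimal sketch on toy data. The column names (`listing_id`, `date`, `id`, `price`) and the exact 12-month window are assumptions made for this example; the actual counting and merging in the notebook may differ.

```python
import pandas as pd

# Toy stand-ins for the reviews and listings tables (assumed column names)
reviews = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2, 3],
    'date': pd.to_datetime([
        '2017-12-01', '2018-03-15', '2018-06-20',
        '2018-01-05', '2016-05-01', '2018-10-30',
    ]),
})
listings = pd.DataFrame({'id': [1, 2, 3], 'price': [60.0, 45.0, 120.0]})

# Step 1: keep only reviews from the 12 months before the scrape date
scrape_date = pd.Timestamp('2018-11-07')
window_start = scrape_date - pd.DateOffset(months=12)
recent = reviews[(reviews['date'] > window_start) & (reviews['date'] <= scrape_date)]

# Count reviews per listing and convert the count to an average per month
counts = (
    recent.groupby('listing_id').size()
    .div(12)
    .rename('reviews_per_month')
    .reset_index()
)

# Step 2: enrich the counts with listing details such as the price
df = counts.merge(listings, left_on='listing_id', right_on='id', how='left')
print(df[['listing_id', 'reviews_per_month', 'price']])
```

The left merge keeps every listing that has at least one review in the window, even if a detail column happens to be missing for it.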

I would like to focus on obtaining those estimates here, but please feel free to look up how the reviews were counted and how the data was merged in my notebook.

One of the biggest issues with Airbnb is getting the occupancy rate for each host or for a market. *Inside Airbnb*, the website I sourced the data from, uses an occupancy model which they call the “San Francisco Model” with the following methodology:

- A **review rate** of 50% is used to convert reviews to estimated bookings. Other administrative authorities are said to use a review rate of 72% (however, this may be attributed to an unreliable source: Airbnb’s CEO and co-founder Brian Chesky) or one of 30.5% (based on comparing public data of reviews to the New York Attorney General’s report on Airbnb released in October 2014). *Inside Airbnb* chose 50% as it sits almost exactly between 72% and 30.5%. It basically means that only 50% of all visitors write a review, so the number of reviews per month divided by the review rate equals an estimate of the number of actual visitors.
- The **average length of stay** for each city is usually published by Airbnb. This number multiplied by the number of estimated bookings for each listing over a period of time gives **the occupancy rate**.
- Finally, the **income** can be calculated by multiplying the occupancy rate by the price and the time period of interest (here, 12 months).
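To get a feeling for how much the choice of review rate matters, the following sketch converts a count of 12 reviews into estimated bookings under each of the three rates mentioned above. The review count is made up purely for illustration.

```python
# Review rates mentioned above: NY Attorney General comparison, Inside Airbnb, Airbnb's CEO
review_rates = {'low': 0.305, 'inside_airbnb': 0.5, 'high': 0.72}

reviews = 12  # hypothetical number of reviews for one listing

# Estimated bookings = number of reviews / review rate
estimated_bookings = {name: reviews / rate for name, rate in review_rates.items()}

for name, bookings in estimated_bookings.items():
    print(f'{name}: {bookings:.1f} estimated bookings')

# Midpoint of the two outer rates, close to the 50% that Inside Airbnb chose
midpoint = (0.305 + 0.72) / 2
print(f'midpoint: {midpoint:.4f}')
```

The lower the assumed review rate, the more bookings each review stands for, so the same review count yields a higher occupancy estimate.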

So breaking it down into formulas, this is what we need to do:

**Monthly Occupancy Rate = Average Length of Stay * (No. of reviews per Month / Review Rate)**

*(According to the latest Airbnb update, guests who booked stays in Berlin in 2017 via Airbnb spent 4.2 nights here on average.)*

**Yearly Income = Monthly Occupancy Rate * Price * 12 Months**
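A quick worked example with a made-up listing shows how the two formulas chain together. Note that the first formula, taken literally, yields a number of occupied nights per month. Dividing by the length of an average month (365 / 12 days) is my own addition here, shown only as one possible way to express it as a share.

```python
avg_length_of_stay = 4.2   # nights per booking (Berlin, 2017)
review_rate = 0.5          # share of guests assumed to leave a review

# Hypothetical listing: 1.5 reviews per month at a price of 60 per night
reviews_per_month = 1.5
price = 60.0

# Reviews -> estimated bookings -> occupied nights per month
estimated_bookings = reviews_per_month / review_rate
monthly_occupancy = avg_length_of_stay * estimated_bookings

# Occupied nights per month * price per night * 12 months
yearly_income = monthly_occupancy * price * 12

# Assumption for illustration: share of an average month that is occupied
occupancy_share = monthly_occupancy / (365 / 12)

print(round(monthly_occupancy, 2), round(yearly_income, 2), round(occupancy_share, 3))
```

For this listing, 3 estimated bookings per month of 4.2 nights each come to about 12.6 nights per month, or roughly 9,072 per year at the assumed price.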

*Modest Occupancy Estimate*

With a very modest review rate of 0.5, we are assuming that only every second guest left a review. While I could see many more than just half the visitors actually writing feedback, let’s calculate this conservative estimate first:

```
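# df is the merged reviews/listings DataFrame prepared in the notebook
avg_length_of_stay_berlin = 4.2  # average nights per stay in Berlin (Airbnb, 2017)
review_rate_modest = 0.5         # assume every second guest leaves a review

# occupancy = average length of stay * estimated bookings per month, rounded to 2 decimals
df['modest_occupancy'] = (avg_length_of_stay_berlin * (df['reviews_per_month'] / review_rate_modest)).round(2)

# occupancy rates cannot be greater than 100% - let's drop those
df.drop(df[df['modest_occupancy'] > 100].index, axis=0, inplace=True)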
```

*Optimistic Occupancy Estimate*

Now let’s try for a more optimistic estimate of occupancy using a review rate of 0.4, which assumes that only 40% of all guests left a review. This way, the number of reviews points to a higher occupancy rate than in the modest estimate we had before:

```
review_rate_optimistic = 0.4  # assume only 40% of guests leave a review

# occupancy = average length of stay * estimated bookings per month, rounded to 2 decimals
df['optimistic_occupancy'] = (avg_length_of_stay_berlin * (df['reviews_per_month'] / review_rate_optimistic)).round(2)

# occupancy rates cannot be greater than 100% - let's drop those
df.drop(df[df['optimistic_occupancy'] > 100].index, axis=0, inplace=True)
```

Let’s compare both estimates:
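One quick numerical way to compare them: because both estimates are derived from the same `reviews_per_month` column, they differ only by the constant factor 0.5 / 0.4 = 1.25. A minimal sketch on made-up values (the `toy` DataFrame is hypothetical, not the article's data):

```python
import pandas as pd

avg_length_of_stay_berlin = 4.2
toy = pd.DataFrame({'reviews_per_month': [0.5, 1.0, 2.0, 3.0]})  # made-up values

# Compute both estimates with the same formula, varying only the review rate
for name, rate in {'modest': 0.5, 'optimistic': 0.4}.items():
    toy[f'{name}_occupancy'] = (
        avg_length_of_stay_berlin * toy['reviews_per_month'] / rate
    ).round(2)

# Row-wise ratio between the two estimates
ratio = toy['optimistic_occupancy'] / toy['modest_occupancy']

# Summary statistics side by side
summary = toy[['modest_occupancy', 'optimistic_occupancy']].describe()
print(summary.loc[['mean', 'min', 'max']])
print(ratio.round(2).tolist())
```

So the optimistic scenario shifts every listing up by a quarter. It does not change the ranking of listings.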

Using our modest and optimistic estimates for the occupancy rate, we’ll now do the same for income:

```
df['modest_income'] = df['modest_occupancy'] * df['price'] * 12
df['optimistic_income'] = df['optimistic_occupancy'] * df['price'] * 12
```

### 3. Exploratory Data Analysis (EDA)

Let’s start with a heat map:
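The numbers behind such a heat map are simply the pairwise correlations of the numeric columns. Here is a minimal sketch on invented data; the column names mirror those discussed in this article, while the values and the `room_type` column are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'reviews_per_month': [0.5, 1.0, 2.0, 3.0, 4.0],
    'accommodated': [2, 2, 4, 6, 8],
    'price': [40.0, 55.0, 50.0, 90.0, 120.0],
    'room_type': ['Private room', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt'],
})
# Same formula as the modest occupancy estimate above
toy['modest_occupancy'] = 4.2 * toy['reviews_per_month'] / 0.5

# Only numeric columns can be correlated; text columns are left out
corr = toy.select_dtypes(include='number').corr()
print(corr.round(2))
```

A plotting function such as seaborn's `heatmap` can then colour this matrix. The perfect correlation between `reviews_per_month` and the occupancy column is a by-product of how the estimate is constructed.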

There is reason to believe that

- the number of people that can be `accommodated` is an indicator for **size or capacity**, and
- the `latitude` as a proxy for **location**

may help in explaining the demand.

(The feature `reviews_per_month` is what we used to estimate the occupancy, so no wonder the correlation is a vivid red.)

I assume there might also be some **seasonality** that strongly influences the demand. Furthermore, I believe that the so-called **super host status** does benefit these hosts, who probably get far more guests than standard hosts. As the heat map only uses numeric columns, it can’t show any such relationship, so let’s walk through all of these factors to visually investigate their effect on demand.

Let’s first investigate the seasonal demand by splitting the data into years – after the reviews per lodging have been counted.
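A minimal sketch of such a split, counting reviews per year and month on invented dates. The `date` column name is an assumption, and the notebook's own approach may differ.

```python
import pandas as pd

reviews = pd.DataFrame({'date': pd.to_datetime([
    '2017-05-03', '2017-05-20', '2017-08-11', '2017-12-24',
    '2018-06-01', '2018-06-15', '2018-09-09',
])})

# Count reviews per (year, month), then spread the years into columns
monthly = (
    reviews.groupby([reviews['date'].dt.year.rename('year'),
                     reviews['date'].dt.month.rename('month')])
    .size()
    .unstack('year', fill_value=0)
)
print(monthly)
```

Each column of `monthly` can then be plotted as one line or bar chart per year, which makes recurring peaks easy to spot.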

Be aware that – unlike the other plots – the one for 2018 only ranges from January to October!

That being said, we can see the same pattern each and every year: visitor numbers peak from May to July, and then again in September and October. Demand drops significantly during August and the winter months. The pattern in August is interesting: tourists seem to avoid city trips in high summer, likely preferring to go on beach holidays.

Our next question is whether there are any differences in the demand for listings posted by super hosts vs. standard hosts. How many of each do we have in the Berlin dataset anyway?
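Counting the two groups and comparing their averages could look like the sketch below. The `host_is_superhost` column with `'t'`/`'f'` values is an assumption about how the flag is stored, and all numbers are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'host_is_superhost': ['t', 'f', 'f', 't', 'f', 'f'],
    'modest_occupancy': [20.0, 8.0, 10.0, 16.0, 12.0, 10.0],
    'modest_income': [14000.0, 5000.0, 6500.0, 11000.0, 7000.0, 5500.0],
})

# How many listings belong to each host type?
counts = toy['host_is_superhost'].value_counts()

# Average occupancy and income per host type
means = toy.groupby('host_is_superhost')[['modest_occupancy', 'modest_income']].mean()

print(counts)
print(means)
```

Looking at the group sizes first is worthwhile, because an average over a small group is far less reliable than one over a large group.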

It’s definitely worth aspiring to become a super host. The differences in occupancy and income are striking!

Next, let’s check out the neighbourhoods, i.e. the different districts:

It appears that *Reinickendorf* benefits from rather low room rates, which lead to a high occupancy. When it comes to earnings, *Mitte* is in the lead.

Last but not least, let’s look at the capacity:
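To compare capacities, it can help to group listings into a few size bands first. The band edges below (1-2, 3-5 and 6+ guests) are my own choice for illustration, and the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'accommodated': [1, 2, 2, 3, 4, 5, 6, 8],
    'modest_occupancy': [6.0, 8.0, 10.0, 9.0, 12.0, 13.0, 16.0, 18.0],
})

# Right-inclusive bins: (0, 2], (2, 5], (5, inf)
bins = [0, 2, 5, float('inf')]
labels = ['1-2', '3-5', '6+']
toy['capacity_group'] = pd.cut(toy['accommodated'], bins=bins, labels=labels)

# Mean estimated occupancy per capacity band
by_capacity = toy.groupby('capacity_group', observed=True)['modest_occupancy'].mean()
print(by_capacity)
```

Banding avoids drawing conclusions from capacities that appear only a handful of times in the data.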

Generally, bigger homes seem to be booked more often than smaller ones. Perhaps this is due to the fact that a group might be able to save more money than 1-2 persons would by using Airbnb. It follows that accommodations with a bigger capacity enjoy greater popularity.

### 4. Interpreting the Data

**> Seasonality**

The high season for lodgings in Berlin is during late spring and early summer (specifically from May to July) and during the autumn months of September and October. You could use your apartment yourself during the rest of the year and offer it in these periods to get the most out of it.

**> Super Host**

To be a superhost is to be a cash machine. The occupancy rate in superhost lodgings is almost twice as high as in standard host lodgings, and the income is 60% higher.

**> Location, Location, Location**

If you don’t want to live in the vibrant, loud center of Berlin yourself, well, that’s bad luck for you! That’s precisely where tourists are looking to rent Airbnb accommodations, particularly in *Mitte* or *Charlottenburg*, and they are willing to pay more for them than for lodgings in outlying districts.

**> Capacity**

Travelers in bigger groups save much more by using Airbnb than couples or small groups do. That implies that lodgings accommodating 6+ people tend to be more in demand than smaller ones.

That’s it for today!

. . . . . . . . . . . . . .

Thank you for reading! The complete Jupyter Notebook can be found here.

I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!