Understanding Road Traffic Data through Compelling Visuals

Storytelling with data by using great data visualizations is an art form I greatly admire. I’m constantly trying to improve my data visualization skills, and would like to share some of my latest achievements, one of which is an analysis of UK road traffic data.

The analysis was part of a challenge I was given during a job application process. The dataset consists of files with detailed road safety data about the circumstances of personal injury road accidents in the UK, the types of vehicles involved and the consequential casualties. The statistics only relate to personal injury accidents on public roads that have been reported to the police, and subsequently recorded using the STATS19 accident reporting form. The files I have chosen to work with span the years 2013 to 2017.

Obtaining the Data

I will only briefly touch on the process of obtaining and preprocessing the dataset, since it did not take much work. Feel free to check my full notebook if you’re interested. The last charts are borrowed from another notebook of mine.

The very first step – as always – is importing the libraries we need:

# import the usual suspects ...
import pandas as pd
import numpy as np
import glob

import matplotlib.pyplot as plt
import seaborn as sns

# suppress all warnings
import warnings
warnings.filterwarnings('ignore')

The accidents table I used contained 691641 records and 32 columns with data on accidents, their location, road conditions, weather conditions, and other characteristics.

The vehicles table spanned over a million records (1270711) and 23 columns containing data about the driver (age and sex), the vehicles (age, engine capacity, etc.), and the maneuvers the driver has performed.

Lastly, the tables on casualties, with 916713 records and 16 columns, consisted of data on the casualties themselves (e.g. age and sex) and the severity of injuries.

The date column in the accidents table was not stored in the proper format, which called for some action:

accidents['Date'] = pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
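
To see why the explicit format matters, here is a minimal sketch with two made-up date strings (the variable names `raw_dates` and `parsed` are only for illustration). A string such as 03/04/2015 could be read as 3 April or as 4 March, and the format string pins it to day first:

```python
import pandas as pd

# Hypothetical sample values in the day/month/year layout of the Date column
raw_dates = pd.Series(['03/04/2015', '25/12/2016'])

# The explicit format removes any guessing about which part is the day
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y')

print(parsed.dt.month.tolist())  # [4, 12]
print(parsed.dt.day.tolist())    # [3, 25]
```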

Exploratory Data Analysis

1. Has the number of accidents increased or decreased over the last few years?

To answer this, I went with a line plot, which is usually well-suited for time series. To add some spice, I added a line for the moving average:

# prepare plot
fig, ax = plt.subplots(figsize=(15,6))

# plot
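monthly_counts = accidents.set_index('Date').resample('M').size()
monthly_counts.plot(label='Total per Month', color='grey', ax=ax)
# smooth the monthly counts with a 10-month rolling mean
monthly_counts.rolling(window=10).mean()\
              .plot(color='darkorange', linewidth=5, label='10-Months Moving Average', ax=ax)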

ax.set_title('Accidents per Month', fontsize=14, fontweight='bold')
ax.set(ylabel='Total Count\n', xlabel='')
ax.legend(bbox_to_anchor=(1.1, 1.1), frameon=False)

# remove all spines except the bottom one
sns.despine(ax=ax, top=True, right=True, left=True, bottom=False);

So the answer is: the trend is pointing downwards!
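
For readers who have not used a moving average before, the mechanics can be sketched on a handful of made-up monthly counts. The numbers and the window of 3 are assumptions chosen to keep the example short; the chart above uses a window of 10:

```python
import pandas as pd

# Hypothetical monthly counts, only to illustrate the mechanics
monthly = pd.Series([10, 20, 30, 40, 50],
                    index=pd.period_range('2013-01', periods=5, freq='M'))

# Each value is the mean of the current month and the two months before it;
# the first two entries are NaN because the window is not yet full
smoothed = monthly.rolling(window=3).mean()

print(smoothed.tolist())  # [nan, nan, 20.0, 30.0, 40.0]
```

This also explains why the orange line starts later than the grey one: the first window has to fill up before a mean can be computed.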

While the above plot shows the data by month, we could also resample the dates by year and confirm the trend. This time I use a bar chart:

yearly_count = accidents['Date'].dt.year.value_counts().sort_index(ascending=False)

# prepare plot
fig, ax = plt.subplots(figsize=(12,5))

# plot
ax.bar(yearly_count.index, yearly_count.values, color='lightsteelblue')
ax.plot(yearly_count, linestyle=':', color='black')
ax.set_title('\nAccidents per Year\n', fontsize=14, fontweight='bold')
ax.set(ylabel='\nTotal Counts')

# remove all spines
sns.despine(ax=ax, top=True, right=True, left=True, bottom=True);

2. On which weekdays are accidents most likely to occur?

To answer this, I first built a new dataframe with the desired data:

weekday_counts = pd.DataFrame(accidents.set_index('Date').resample('1d')['Accident_Index'].size().reset_index())
weekday_counts.columns = ['Date', 'Count']
# Series.dt.weekday_name was removed in pandas 1.0; day_name() replaces it
weekday = weekday_counts['Date'].dt.day_name()

weekday_averages = pd.DataFrame(weekday_counts.groupby(weekday)['Count'].mean().reset_index())
weekday_averages.columns = ['Weekday', 'Average_Accidents']
weekday_averages.set_index('Weekday', inplace=True)
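
The two steps (count per calendar day, then average per weekday) are easier to follow on a tiny example. This is only a sketch with made-up dates; `toy`, `daily` and `avg_per_weekday` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical accident dates: two Mondays with one record each,
# one Friday with three records and one Friday with a single record
dates = pd.to_datetime(['2017-01-02', '2017-01-06', '2017-01-06', '2017-01-06',
                        '2017-01-09', '2017-01-13'])
toy = pd.DataFrame({'Date': dates})

# Step 1: one count per calendar day (days without records count as 0)
daily = toy.set_index('Date').resample('D').size()

# Step 2: average those daily counts per weekday name
avg_per_weekday = daily.groupby(daily.index.day_name()).mean()

print(avg_per_weekday)
```

Resampling by day before averaging keeps days without any accident in the calculation, so a quiet weekday lowers its average instead of silently dropping out.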


Now I can start building a bar chart and highlight the weekday where most accidents happened:

# reorder the weekdays beginning with Monday (backwards because of printing behavior!)
days = ['Sunday', 'Saturday', 'Friday', 'Thursday', 'Wednesday', 'Tuesday', 'Monday']

# prepare plot
fig, ax = plt.subplots(figsize=(10,5))
colors = ['lightsteelblue', 'lightsteelblue', 'navy', 'lightsteelblue',
          'lightsteelblue', 'lightsteelblue', 'lightsteelblue']

# plot (selecting the column yields a Series, so each bar gets its own color)
weekday_averages.reindex(days)['Average_Accidents'].plot(kind='barh', ax=ax, color=colors)
ax.set_title('\nAverage Accidents per Weekday\n', fontsize=14, fontweight='bold')
ax.set(xlabel='\nAverage Number', ylabel='')

# remove all spines
sns.despine(ax=ax, top=True, right=True, left=True, bottom=True);

I wanted to elaborate on this finding and further explore whether or not this weekday pattern holds true over the years.

Once again, I first build my dataframe …

weekday = accidents['Date'].dt.day_name()
year    = accidents['Date'].dt.year

accident_table = accidents.groupby([year, weekday]).size()
# pivot the weekdays into columns so the heatmap receives a 2-D table
accident_table = accident_table.rename_axis(['Year', 'Weekday']).unstack('Weekday')

… and then my heatmap:

sns.heatmap(accident_table, cmap='Reds')
plt.title('\nAccidents by Years and Weekdays\n', fontsize=14, fontweight='bold')

Without question, accidents happen most frequently on Fridays across all years.
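
A heatmap needs a two-dimensional table, here years as rows and weekdays as columns. The following sketch shows one way to get from raw dates to such a table; the dates are made up and the names `toy` and `table` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical accident dates spread over two years
toy = pd.DataFrame({'Date': pd.to_datetime(
    ['2016-01-01', '2016-01-04', '2016-01-08', '2017-01-06', '2017-01-09'])})

# Derive the two grouping keys and give them distinct names
year = toy['Date'].dt.year.rename('Year')
weekday = toy['Date'].dt.day_name().rename('Weekday')

# Count per (year, weekday) pair, then pivot the weekdays into columns
table = toy.groupby([year, weekday]).size().unstack('Weekday', fill_value=0)

print(table)
```

Each cell of `table` then maps directly to one colored tile in the heatmap.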

3. What share of accidents falls into each severity category?

To answer this question, I wanted to try creating a donut chart, which essentially means coding a pie chart and adding a circle in the middle:

# assign the data (severity codes: 1 = fatal, 2 = serious, 3 = slight)
severity_counts = accidents.Accident_Severity.value_counts()
fatal   = severity_counts[1]
serious = severity_counts[2]
slight  = severity_counts[3]

names = ['Fatal Accidents','Serious Accidents', 'Slight Accidents']
size  = [fatal, serious, slight]
#explode = (0.2, 0, 0)

# create a pie chart
plt.pie(x=size, labels=names, colors=['red', 'darkorange', 'silver'], 
        autopct='%1.2f%%', pctdistance=0.6, textprops=dict(fontweight='bold'),
        wedgeprops={'linewidth':7, 'edgecolor':'white'})

# create circle for the center of the plot to make the pie look like a donut
my_circle = plt.Circle((0,0), 0.6, color='white')

# plot the donut chart
fig = plt.gcf()
fig.gca().add_artist(my_circle)
plt.title('\nAccident Severity: Share in % (2013-2017)', fontsize=14, fontweight='bold')
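
The chart computes the percentages through `autopct`, but the same shares can be checked numerically. Below is a minimal sketch on made-up severity codes (the data and the name `share` are assumptions; the coding 1 = fatal, 2 = serious, 3 = slight follows the snippet above):

```python
import pandas as pd

# Hypothetical severity codes: 1 = fatal, 2 = serious, 3 = slight
severity = pd.Series([3, 3, 3, 3, 3, 3, 2, 2, 2, 1])

# normalize=True returns fractions instead of counts
share = severity.value_counts(normalize=True).sort_index() * 100

print(share.round(2))
```

Comparing these numbers with the labels on the donut is a quick sanity check that the wedges are in the intended order.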

4. How has the number of fatalities developed over the years?

The graph here is similar to the one I built when exploring accident trends, but this time I coded an area chart instead of a line chart:

# set the criterion to select the fatal accidents
criteria = accidents['Accident_Severity']==1
# create a new series with the weekly counts
weekly_fatalities = accidents.loc[criteria].set_index('Date').sort_index().resample('W').size()

# prepare plot
fig, ax = plt.subplots(figsize=(14,6))

# plot
weekly_fatalities.plot(label='Total Fatalities per Week', color='grey', ax=ax)
plt.fill_between(x=weekly_fatalities.index, y1=weekly_fatalities.values, color='grey', alpha=0.3)
# smooth the weekly counts with a 10-week rolling mean
weekly_fatalities.rolling(window=10).mean()\
                 .plot(color='darkorange', linewidth=5, label='10-Week Moving Average', ax=ax)

ax.set_title('\nFatalities', fontsize=14, fontweight='bold')
ax.set(ylabel='\nTotal Count', xlabel='')
ax.legend(bbox_to_anchor=(1.2, 1.1), frameon=False)

# remove all spines
sns.despine(ax=ax, top=True, right=True, left=True, bottom=True);

5. What is the most frequent age of casualties?

Nothing special here – just a clean, nice histogram:

# prepare plot
fig, ax = plt.subplots(figsize=(10,6))

# plot
casualties.Age_of_Casualty.hist(bins=20, ax=ax, color='lightsteelblue')
ax.set_title('\nCasualties by Age\n', fontsize=14, fontweight='bold')
ax.set(xlabel='Age', ylabel='Total Count of Casualties')

# remove all spines
sns.despine(top=True, right=True, left=True, bottom=True);

It looks like it’s the young folks that are the most affected!

6. What are the age and gender of the drivers involved in accidents?

Again, some coding was needed to first build the data we want to plot:

# create a new dataframe
drivers = vehicles.groupby(['Age_Band_of_Driver', 'Sex_of_Driver']).size().reset_index()

# drop rows whose age band or sex is coded as missing or unknown (-1, 3)
drivers.drop(drivers[(drivers['Age_Band_of_Driver'] == -1) |
                     (drivers['Sex_of_Driver'] == -1) |
                     (drivers['Sex_of_Driver'] == 3)]
                     .index, axis=0, inplace=True)
# rename the columns
drivers.columns = ['Age_Band_of_Driver', 'Sex_of_Driver', 'Count']

# rename the coded values to labels that are easier to read
drivers['Sex_of_Driver'] = drivers['Sex_of_Driver'].map({1: 'male', 2: 'female'})
drivers['Age_Band_of_Driver'] = drivers['Age_Band_of_Driver'].map({1: '0 - 5', 2: '6 - 10', 3: '11 - 15',
                                                                   4: '16 - 20', 5: '21 - 25', 6: '26 - 35',
                                                                   7: '36 - 45', 8: '46 - 55', 9: '56 - 65',
                                                                   10: '66 - 75', 11: 'Over 75'})

Next, I built a grouped bar chart:

# seaborn barplot
fig, ax = plt.subplots(figsize=(14, 7))
sns.barplot(y='Age_Band_of_Driver', x='Count', hue='Sex_of_Driver', data=drivers, palette='bone', ax=ax)
ax.set_title('\nDrivers Involved in Accidents by Age and Sex\n', fontsize=14, fontweight='bold')
ax.set(xlabel='Count', ylabel='Age Band of Driver')
ax.legend(bbox_to_anchor=(1.1, 1.), borderaxespad=0., frameon=False)

# remove all spines
sns.despine(top=True, right=True, left=True, bottom=True);

Watch out for men 😉, especially the ones between their mid-twenties and mid-thirties! Nevertheless, we need to keep in mind that we have far more men in our dataset than women, so comparing the raw counts is not entirely fair.
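
One way to soften the imbalance between male and female drivers is to look at shares within each sex instead of raw counts. The sketch below uses made-up numbers, and the `Share` column is an assumption added for illustration, not part of the original notebook:

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the normalization
toy = pd.DataFrame({
    'Age_Band_of_Driver': ['16 - 20', '26 - 35', '16 - 20', '26 - 35'],
    'Sex_of_Driver':      ['male', 'male', 'female', 'female'],
    'Count':              [200, 600, 100, 300],
})

# Total per sex, repeated on every row of that sex
total_per_sex = toy.groupby('Sex_of_Driver')['Count'].transform('sum')

# Share of each age band within its own sex, in percent
toy['Share'] = toy['Count'] / total_per_sex * 100

print(toy)
```

In this toy data men have twice as many records in every age band, yet the age distribution within each sex is identical. Shares only correct for the number of records per sex; they say nothing about how much each group actually drives.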

7. How do the accidents spread over the day?

This chart needs quite some preparation, as we need to define the daytime groups we want to use, slice the hours out of the time column, and assign each accident to one of the groups:

# convert date from string to datetime format
accidents['Date']= pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
# create a little dictionary to later look up the groups I will create
daytime_groups = {1: 'Morning (5-10)', 
                  2: 'Office Hours (10-15)', 
                  3: 'Afternoon Rush (15-19)', 
                  4: 'Evening (19-23)', 
                  5: 'Night (23-5)'}
# slice first and second string from time column
accidents['Hour'] = accidents['Time'].str[0:2]

# convert new column to a numeric data type
accidents['Hour'] = pd.to_numeric(accidents['Hour'])

# drop null values in our new column
accidents = accidents.dropna(subset=['Hour'])

# cast to integer values
accidents['Hour'] = accidents['Hour'].astype('int')
# define a function that turns the hours into daytime groups
def when_was_it(hour):
    if hour >= 5 and hour < 10:
        return "1"
    elif hour >= 10 and hour < 15:
        return "2"
    elif hour >= 15 and hour < 19:
        return "3"
    elif hour >= 19 and hour < 23:
        return "4"
    else:
        # all remaining hours (23 to 4) count as night
        return "5"
# apply this function to our temporary hour column
accidents['Daytime'] = accidents['Hour'].apply(when_was_it)

# drop old time column and temporary hour column
accidents = accidents.drop(columns=['Time', 'Hour'])
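
As an alternative to the if/elif chain, the same grouping could be sketched with `pd.cut`. This is not what the notebook does; it is a vectorized variant under the assumption that the hour boundaries stay the same. The only tricky part is the night group, which wraps around midnight, so the clock is shifted first:

```python
import pandas as pd

# Hypothetical hours, covering every boundary of the groups
hours = pd.Series([5, 9, 10, 14, 15, 18, 19, 22, 23, 0, 4])

# Shift the clock so that 5 o'clock becomes 0; the night group (23-5)
# then forms one contiguous interval at the end of the scale
shifted = (hours - 5) % 24

daytime = pd.cut(shifted,
                 bins=[0, 5, 10, 14, 18, 24],
                 right=False,  # intervals are [start, end)
                 labels=['Morning (5-10)', 'Office Hours (10-15)',
                         'Afternoon Rush (15-19)', 'Evening (19-23)',
                         'Night (23-5)'])

print(pd.DataFrame({'Hour': hours, 'Daytime': daytime}))
```

The result is an ordered categorical, so a later `groupby` keeps the groups in chronological order without relying on the "1" to "5" string codes.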

Now we’ve done all that’s necessary to plot our bar chart:

# define labels by accessing look up dictionary above
labels = tuple(daytime_groups.values())

# plot total no. of accidents by daytime
accidents.groupby('Daytime').size().plot(kind='bar', color='lightsteelblue', figsize=(12,5), grid=True)
plt.xticks(np.arange(5), labels, rotation='horizontal')
plt.xlabel(''), plt.ylabel('Count\n')
plt.title('\nTotal Number of Accidents by Daytime\n', fontweight='bold')
sns.despine(top=True, right=True, left=True, bottom=True);

While the above chart shows the total count of accidents per daytime, the one below shows the average number of casualties per accident:

# plot average no. of casualties by daytime
accidents.groupby('Daytime')['Number_of_Casualties'].mean().plot(kind='bar', color='slategrey', 
                                                                 figsize=(12,4), grid=False)
plt.xticks(np.arange(5), labels, rotation='horizontal')
plt.xlabel(''), plt.ylabel('Average Number of Casualties\n')
plt.title('\nAverage Number of Casualties by Daytime\n', fontweight='bold')
sns.despine(top=True, right=True, left=True, bottom=True);
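
The two charts answer different questions, which a small made-up example makes visible. The data and the name `summary` are assumptions for illustration; the column names follow the snippets above:

```python
import pandas as pd

# Hypothetical records: three daytime accidents with one casualty each,
# and a single night accident with four casualties
toy = pd.DataFrame({
    'Daytime': ['1', '1', '1', '5'],
    'Number_of_Casualties': [1, 1, 1, 4],
})

# Number of accidents and mean casualties per accident, side by side
summary = toy.groupby('Daytime')['Number_of_Casualties'].agg(['count', 'mean'])

print(summary)
```

A daytime group with few accidents can still show the highest average, so it helps to read both charts together.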

. . . . . . . . . . . . . .

That’s it for today. Thank you for reading! Feel free to check my full notebook if you’re interested in seeing more. I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!