Understanding, Reading, and Manipulating Geospatial Data

This is the first of four blog posts on geospatial data. In this chapter, you will be introduced to the concept of geospatial data and how to represent such data in Python using the GeoPandas library. You will then learn about common geospatial file formats and the basics to read, explore and manipulate such data.

What is Geospatial Data?

Geospatial data is information that has a geographic aspect to it. Two very different but common data formats used to store geospatial data are vector and raster representations:

  • Vector Representation

Vector data uses X and Y coordinates – or longitudinal and latitudinal information – to define the locations of points, lines, and areas (polygons) that correspond to map features. As such, vector data tends to define centers and edges of features.

Vector Representation
  • Raster Representation

Raster data, on the other hand, uses a matrix of square areas to define where features are located. These squares, also called pixels, cells, and grids, are typically of uniform size, and their size determines the amount of detail that can be maintained in the dataset. Because raster data represents square areas, it describes interiors rather than boundaries – as is the case with vector data.

Raster Representation

Vector data is excellent for capturing and storing spatial details, while raster data is well suited for capturing, storing, and analyzing data such as elevation, temperature, soil pH, etc. that vary continuously from location to location. Raster data formats also are used to store aerial and satellite imagery.

Creating a Virtual Environment for Geospatial Analysis

Python offers many packages for working with geospatial data and its relevant file types. Specifically, we’ll make use of GeoPandas, Shapely, and (later) Folium. Installing these packages is not always straightforward due to the many libraries that need to be installed as dependencies (e.g. FIONA or GDAL).

I learned that setting up a virtual environment was the simplest and cleanest way to successfully install all relevant packages. If this topic is new to you, please check this primer first. The setup of a particular geospatial environment is described here.

# import pandas, matplotlib, and seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import geospatial packages
import geopandas as gpd
from shapely.geometry import Point

Plotting Latitude and Longitude

Before we dive into specific geospatial file formats, let’s get a taste for traditional location plotting using simply longitude (always on x-axis!) and latitude (always on y-axis!). Lines of latitudes are those that run north to south, and lines of longitudes are those running east to west.

Style elements like color, marker type, and marker size help make your plot visually appealing, x- and y-labels add information, and grid lines give a more precise idea of where things are:

# read the airbnb csv file
airbnb = pd.read_csv('Data/airbnb_listings_berlin.csv')

# prepare plot
plt.figure(figsize=(12,6))

# plot all airbnb apartments in Berlin
plt.scatter(airbnb.longitude, airbnb.latitude, marker='p', s=0.8, color='navy')

# customize plot
plt.title('Airbnb Apartments in Berlin')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid()
sns.despine(top=True, right=True, left=True, bottom=True);

Geospatial File Formats

We just learned that in addition to point data (such as latitude and longitude) spatial data can be also made up of lines and polygons. Each object then consists of multiple points which cannot be easily represented in a csv file or a Pandas DataFrame. It’s therefore much more convenient to use specific spatial file formats:

  • The ESRI shapefile (.shp) is currently the most used vector format. However, despite its popularity, it comes with a drawback: it is a multi-file format, meaning one “file” consists of multiple files. If you receive a shapefile, you actually get a set of files (.shp, .dbf, .shx, .prj, ...). So when copying a shape file, make sure to copy all the files.

  • The GeoJSON (.geojson) is the lightweight format of JSON, often used in web applications. Unlike shapefiles, (.geojson) is a single file and thus easier to work with.

  • GeoPackage (.gpkg) is a relatively new open standard format with more advanced capabilities. In many cases, it’s a more modern and better working alternative to shapefiles.

Apart from these three, there are many other vector formats out in the wild.

To read such files and work with them in Python, we are going to use the GeoPandas library designed to work with vector data.

Reading Geospatial Files

GeoPandas can read many geospatial file formats with the read_file() method to which we pass the file path. In my example here, we use shape files that are stored in one directory, together with all the other files:

germany = gpd.read_file('Data/Cleansed_Data/Germany.shp')
germany.head()
type(germany)

Out[4]: geopandas.geodataframe.GeoDataFrame

You see that we now have a GeoPandas GeoDataFrame. You can think of it as a normal Pandas DataFrame supercharged with geospatial capabilities. A GeoDataFrame has always a geometry column that holds geometry information, while the other columns are attributes that describe each of the geometries.

Our dataset contains Germany’s federal states with the geometry made up of polygons representing them. Plotting a GeoDataFrame is as easy as calling GeoDataFrame.plot():

germany.plot(color='grey', figsize=(10,10));

It might look a bit skewed – we will come back to that issue in a minute!

Spatial Data and its Attributes

One of the specific aspects of a GeoDataFrame is that is has a .geometry attribute, which always returns the geometry column:

germany.geometry
Out[6]:
0     (POLYGON ((8.70837306976324 47.71555709838862,...
1     POLYGON ((10.13385963439953 50.5499992370606, ...
2     POLYGON ((13.17788982391357 52.39031982421886,...
3     POLYGON ((13.87950801849365 53.50106811523438,...
4     (POLYGON ((8.505060195922852 53.23289108276379...
5     POLYGON ((10.07161712646496 53.71823120117199,...
6     POLYGON ((9.49876976013195 51.63151931762707, ...
7     (POLYGON ((14.26472282409662 53.710693359375, ...
8     (POLYGON ((6.86527681350708 53.59597396850586,...
9     POLYGON ((8.666278839111328 52.52527999877941,...
10    POLYGON ((7.799629211425781 50.94301986694336,...
11    POLYGON ((7.037960052490234 49.64337921142589,...
12    POLYGON ((11.63250827789307 53.01641082763672,...
13    POLYGON ((12.8779993057251 51.67269897460938, ...
14    (POLYGON ((8.689722061157283 54.066806793213, ...
15    POLYGON ((10.77189064025873 51.6449089050293, ...
Name: geometry, dtype: object
type(germany.geometry)

Out[7]: geopandas.geoseries.GeoSeries

What is returned here is a GeoSeries object, the equivalent to a Pandas Series, but with additional specific spatial attributes and methods.

One example of one such method is the .area attribute of the GeoSeries, which returns the area of each geometry:

germany.geometry.area
Out[10]:
0     4.351014
1     8.651863
2     0.117153
3     3.896075
4     0.052821
5     0.104142
6     2.664529
7     3.159141
8     6.350174
9     4.434194
10    2.487793
11    0.324915
12    2.696926
13    2.361059
14    2.154141
15    2.077353
dtype: float64

Again, you might wonder what units these measurements are made in, as it looks as strange as the plot before. To understand this, we need familiarize ourselves with two additional important concepts: the Coordinate Reference System (or CRS) and map projections.

Map Projections

A map projection is a method for taking the curved surface of the earth and displaying it on something flat, like a computer screen or a piece of paper. Map projections are necessary to represent the earth in 2-dimensional space. There are several map projections which preserve some of the properties of the sphere at the expense of others:

Setting the CRS (Coordinate Reference System) for a GeoDataFrame tells GeoPandas how to interpet the longitude and latitude coordinates. Distance units are also dependent on the CRS being used.

The most common CRS systems are

  • EPSG:4326 and
  • EPSG:3857.

EPSG stands for European Petroleum Survey Group, the entity that developed these systems.

EPSG:4326 is used with applications like Google Earth, and its units here are decimal degrees, whereas EPSG:3857 is used in most map applications such as Google Map, Bing Maps, or Open Street Maps, and its units here are meters.

Using the .crs attribute returns the CRS used in our GeoDataFrame’s geometry attribute:

germany.geometry.crs
Out[8]: {'init': 'epsg:4326'}

Now we’ve learned that our GeoDataFrame’s geometry uses decimal degrees to measure the distances from the reference points. We can convert the geometry to measure distance in meters using the .to_crs() method.

Let’s convert the CRS to EPSG:3857 so that the resulting measurements are in meters. Note that the original latitude and longitude columns remain in decimal degree units. .to_crs() only changes the geometry column!

# convert geometry from decimal degrees to meters
germany.geometry = germany.geometry.to_crs(epsg=3857)
germany.head(2)

And now the returned value of the .area attribute is different from before:

germany.geometry.area
Out[10]:
0     8.146227e+10
1     1.633086e+11
2     2.384753e+09
3     7.927602e+10
4     1.092559e+09
5     2.172412e+09
6     5.203054e+10
7     6.621568e+10
8     1.300898e+11
9     8.823689e+10
10    4.788178e+10
11    6.185397e+09
12    5.430654e+10
13    4.654634e+10
14    4.562633e+10
15    4.082255e+10
dtype: float64

Look how the plot of Germany has changed after altering the CRS:

germany.plot(color='grey', figsize=(10,10));

Writing Geospatial Files

Geopandas can also write a GeoDataFrame to a file with the to_file() method. The first argument is the name of the resulting file or full path. In addition, you need to specify which file format you want to write the file in using the driver keyword:

# writing a Shapefile file
GeoDataFrame.to_file('mydata.shp', driver='ESRI Shapefile')

# writing a GeoJSON file
GeoDataFrame.to_file('mydata.geojson', driver='GeoJSON')

# writing a GeoPackage file
GeoDataFrame.to_file('mydata.gpkg', driver='GPKG')

. . . . . . . . . . . . . .

I hope this little tutorial helps to work with geospatial data! If you think I’ve missed something or have made an error somewhere, please let me know: bb@brittabettendorf.com