Plotting polygon Shapefiles on a Matplotlib Basemap with GeoPandas, Shapely and Descartes

I often use Python to plot data on a map and like to use the Matplotlib Basemap Toolkit. In practice, I use a lot of different libraries to access various data formats (raster, vector, serialized…), select and analyse them, generate, save and visualize outputs, and it’s not always obvious how to string one’s favourite tools together into efficient processing chains.

For example, I’ve also become fond of Pandas DataFrames, which offer a great interface to statistical analysis (e.g. via the statsmodels module), and of their geo-enabled counterpart, GeoPandas. Both Basemap and GeoPandas can deal with the popular (alas!) ESRI Shapefile format, which is what many, many (vector) GIS datasets are published in. But they aren’t made for working together. This post is about how I currently go about processing Shapefile data with GeoPandas first and then plotting it on a map using Basemap. I’m using an extremely simple example: a polygon Shapefile of the earth’s glaciated areas from the handy, and free, Natural Earth Data site. The data is already in geographic coordinates (latitudes/longitudes), with a WGS 84 datum. We therefore don’t have to worry about preprocessing the input with suitable coordinate transforms. (Often, you do have to think about this sort of thing before you get going…) All my code is available in an IPython (or Jupyter) Notebook, which should work with both Python 2 and 3.

So let’s say we have our glacier data in a file called ne_10m_glaciated_areas.shp. GeoPandas can read this file directly:

import geopandas as gp
glaciers = gp.GeoDataFrame.from_file(
    'ne_10m_glaciated_areas/ne_10m_glaciated_areas.shp')
glaciers.head()

The output looks something like this:

[Screenshot: the first rows of the glaciers GeoDataFrame, with its attribute columns and the geometry column]

The geometry column (a GeoSeries) contains Shapely geometries, which is very convenient for further processing. These are either of type Polygon, or MultiPolygon for glaciers with multiple disjoint parts.
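For a quick check of how the two types are distributed, we can use the GeoSeries’ geom_type attribute:

# count how many geometries are plain Polygons vs. MultiPolygons
glaciers.geometry.geom_type.value_counts()

GeoPandas GeoDataFrames or GeoSeries can be visualized extremely easily. However, for large global datasets, the result may be disappointing: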

glaciers.plot()

[Figure: the full global glacier dataset plotted with glaciers.plot()]

If we want to focus on a small area of the earth, we have a number of options: we can use Matplotlib to set the x- and y-limits of the plot, or we can filter the dataset geographically and plot only, say, those glaciers that intersect a rectangular area in the vicinity of Juneau, AK, that is, the Alaskan Panhandle and the adjacent western British Columbia. Filtering the dataset first also speeds up plotting, by a lot:

import matplotlib.pyplot as plt
import shapely.geometry

studyarea = shapely.geometry.box(-136., 56., -130., 60.)
ax1 = glaciers[glaciers.geometry.intersects(studyarea)].plot()
# an aspect ratio of ~2 (about 1/cos(58°)) compensates for meridian convergence
ax1.set_aspect(2)
fig = plt.gcf()
fig.set_size_inches(10, 10)

[Figure: only the glaciers intersecting the study area near Juneau, AK]
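For comparison, the first option, setting axis limits on a plot of the full dataset, would look something like this sketch; note that it still draws every glacier on earth and is therefore much slower:

# alternative: plot everything, then zoom in on the study area
ax = glaciers.plot()
ax.set_xlim(-136., -130.)
ax.set_ylim(56., 60.)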

Either way, this is remarkable for so few lines of code, but it’s also as far as we can get with GeoPandas alone. For more sophisticated maps, enter Basemap. The Basemap module offers two major tools:

  • a Basemap class that represents a map in a pretty good selection of coordinate systems and is able to transform arbitrary (longitude, latitude) coordinate pairs into the map’s coordinates
  • a rich database of country and state borders, water bodies, coast lines, all in multiple spatial resolutions

Features that add on to these include plotting parallels and meridians, scale bars, and reading Shapefiles. But we don’t want to use Basemap to read our Shapefile — we want to plot the selections we’ve already made from the Shapefile on top of it.

The basic syntax is to instantiate a Basemap with whatever options one finds suitable:

mm = Basemap(projection=..., width=..., height=...)

… and then to add whatever other features we want. To transform a (longitude, latitude) coordinate pair, we use mm(lon, lat). The resulting transformed coordinates can then be plotted on the map the usual Matplotlib way, for example via mm.scatter(x, y, size, ...). The code to plot our study area and the city of Juneau, in the Albers Equal Area conical projection (good for high- and low-latitude regions), at intermediate resolution, and including water, ocean, coastlines, country borders etc. is:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

water = 'lightskyblue'
earth = 'cornsilk'
juneau_lon, juneau_lat = -134.4167, 58.3

fig, ax1 = plt.subplots(figsize=(12, 10))
mm = Basemap(
    width=600000, height=400000,
    resolution='i',
    projection='aea',
    ellps='WGS84',
    lat_1=55., lat_2=65.,
    lat_0=58., lon_0=-134)
coast = mm.drawcoastlines()
rivers = mm.drawrivers(color=water, linewidth=1.5)
continents = mm.fillcontinents(
    color=earth,
    lake_color=water)
bound = mm.drawmapboundary(fill_color=water)
countries = mm.drawcountries()
merid = mm.drawmeridians(
    np.arange(-180, 180, 2),
    labels=[False, False, False, True])
parall = mm.drawparallels(
    np.arange(0, 80),
    labels=[True, True, False, False])
x, y = mm(juneau_lon, juneau_lat)  # transform (lon, lat) to map coordinates
juneau = mm.scatter(x, y, 80,
    label="Juneau", color='red', zorder=10)

[Figure: base map of the study area in the Albers Equal Area projection, with Juneau marked in red]

This result may even be quite suitable for publication-quality maps. To add our polygons, we need two more ingredients:

  • shapely.ops.transform is a function that applies a coordinate transformation (that is, a function that operates on coordinate pairs) to whole Shapely geometries
  • The Descartes library provides a PolygonPatch object suitable to be added to a Matplotlib plot

To put it together, we need to take into account the difference between Polygon and MultiPolygon types:

import shapely.ops
from descartes import PolygonPatch
from matplotlib.collections import PatchCollection

patches = []
selection = glaciers[glaciers.geometry.intersects(studyarea)]
for poly in selection.geometry:
    if poly.geom_type == 'Polygon':
        mpoly = shapely.ops.transform(mm, poly)
        patches.append(PolygonPatch(mpoly))
    elif poly.geom_type == 'MultiPolygon':
        for subpoly in poly:
            mpoly = shapely.ops.transform(mm, subpoly)
            patches.append(PolygonPatch(mpoly))
    else:
        print(poly, "is neither a polygon nor a multi-polygon. Skipping it.")
# use a new name so we don't clobber the glaciers GeoDataFrame;
# match_original=True keeps each patch's own style
glacier_coll = ax1.add_collection(
    PatchCollection(patches, match_original=True))

The final result, now in high resolution, looks like this:

[Figure: the final map, with the glacier polygons drawn on top of the Basemap]

We could do a lot more — add labels, plot glaciers in different colors, for example. Feel free to play with the code.
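One easy variation: PolygonPatch passes Matplotlib’s usual patch keywords through, so giving the glaciers a different style is a one-line change in the loop above (the colours here are just an illustration):

patches.append(PolygonPatch(mpoly, fc='white', ec='grey', alpha=0.8, zorder=5))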

Data type mapping when using Python/GDAL to write Numpy arrays to GeoTIFF

Numpy arrays are a fundamental tool for scientific data processing in Python. For spatial data that is geo-referenced on a rectangular raster grid, the GeoTIFF file format is similarly ubiquitous. Saving spatial data held in a Numpy array to a GeoTIFF file should therefore be an extremely common task, so it was surprising to me to run into some pitfalls. This post is a write-up on how to get around them.

To access GeoTIFF files I’m using the Geospatial Data Abstraction Library (GDAL), a powerful set of tools that comes with multiple command line utilities and bindings for the most common scripting languages used in science. As it is originally a C/C++ library, it can be quite unpythonic — one of many reasons why you might want to write your own library for your specific purpose.

Writing a Numpy array to a GeoTIFF file consists of these steps:

  • Figure out the spatial reference system (coordinate system and, if applicable, map projection), usually from the source data set, and get the Well-Known Text representation of it (examples).
  • Figure out the geotransform, that is the parameters that describe how the data has to be shifted and stretched to place it on the spatial reference system. This, too, will be derived from the source data and whatever manipulations were subsequently carried out.
  • Create a dataset object using GDAL’s “GTiff” driver, attach the spatial reference and geotransform, and write out the data.

The details are described in the GDAL API tutorial and elsewhere on the web. In the simplest case, if the data originates from another GeoTIFF file, has only one raster band, and we didn’t sub-set or re-scale it (geographically), we could do this [1]:

from osgeo import gdal

src_dataset = gdal.Open("[input GeoTIFF file path]")
src_data = src_dataset.ReadAsArray()
# final_data is a 2-D Numpy array of the same dimensions as src_data
final_data = some_complicated_scientific_stuff(src_data, other_data, ...)

# get parameters
geotransform = src_dataset.GetGeoTransform()
spatialreference = src_dataset.GetProjection()
ncol = src_dataset.RasterXSize
nrow = src_dataset.RasterYSize
nband = 1

# create dataset for output
fmt = 'GTiff'
driver = gdal.GetDriverByName(fmt)
dst_dataset = driver.Create([output_filepath], ncol, nrow, nband, gdal.GDT_Byte)
dst_dataset.SetGeoTransform(geotransform)
dst_dataset.SetProjection(spatialreference)
dst_dataset.GetRasterBand(1).WriteArray(final_data)
dst_dataset = None

Thus far, there’s nothing difficult about it. But the problem arises on line 18, where the data type is passed to the Create() method. gdal.GDT_Byte refers to a code for GDAL’s Byte data type, that is, an 8-bit unsigned integer. If the final data is of a different type, 16-bit signed integers, say, or floating-point numbers, I could use one of the other GDAL data types (gdal.GDT_Int16, gdal.GDT_Float32, and so on).

But I’m writing a library and am therefore unlikely to know the data type beforehand. So what is needed is a general mapping from Numpy dtype objects to GDALDataType objects. And that problem had me stumped for a moment.

OK, it would be possible to guess — there aren’t that many of them — but shouldn’t there be a function?

I found out that in the gdal_array module, there is a function called NumericTypeCodeToGDALTypeCode, which is supposed to translate a “numeric” type into a GDAL type code, for example:

>>> print(gdal_array.NumericTypeCodeToGDALTypeCode(numpy.float32))
6

But it turns out that passing in the dtype attribute of a Numpy array doesn’t work:

>>> print(gdal_array.NumericTypeCodeToGDALTypeCode(my_data.dtype))
...
TypeError: Input must be a type

Nonetheless:

>>> my_data.dtype == numpy.float32
True

Huh? Well, the first thing I learnt from the Python documentation is that for the == operator to return True, the two objects aren’t always required to have the same type. In some cases this seems to make more sense than in others.

The second is that evidently, gdal_array.NumericTypeCodeToGDALTypeCode expects an object of type type (that is, a Python type), which numpy.float32 is, whereas my_data.dtype is, surprise surprise, of type numpy.dtype.
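Incidentally, this points to an immediate workaround for a single array: every numpy.dtype object carries the corresponding scalar type in its type attribute, which the function does accept:

>>> print(gdal_array.NumericTypeCodeToGDALTypeCode(my_data.dtype.type))
6

For a library, though, a complete lookup table is more convenient.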

Apparently, the GDAL developers have recognized this behavior as a bug and fixed it in v. 2.0. What can we do meanwhile? The answer, from a StackOverflow post, is that we can instantiate a scalar of every possible Numpy data type and use numpy.asscalar to convert it to a native Python object with a native Python type, skipping the types where that fails. For example:

import numpy as np
from osgeo import gdal, gdal_array

typemap = {}
for name in dir(np):
    obj = getattr(np, name)
    # consider only names that can produce a value with a dtype
    if hasattr(obj, 'dtype'):
        try:
            npn = obj(0)            # instantiate a zero of this type
            nat = np.asscalar(npn)  # raises for types without a native Python equivalent
            if gdal_array.NumericTypeCodeToGDALTypeCode(npn.dtype.type):
                typemap[npn.dtype.name] = \
                    gdal_array.NumericTypeCodeToGDALTypeCode(npn.dtype.type)
        except Exception:
            pass

This generates a conversion dictionary that looks like this:

NP2GDAL_CONVERSION = {
  "uint8": 1,
  "int8": 1,
  "uint16": 2,
  "int16": 3,
  "uint32": 4,
  "int32": 5,
  "float32": 6,
  "float64": 7,
  "complex64": 10,
  "complex128": 11,
}

(If we want the GDAL Data Type labels, we can use gdal.GetDataTypeName(typecodeinteger).)
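For instance, for the float32 entry above:

>>> gdal.GetDataTypeName(6)
'Float32'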

That’s a start. Some hand-editing is in order, for example, mapping Booleans to 1 to make it possible to encode them as integers for persistence — clearly, GDAL has no notion of bit or binary objects. Also, it is odd that both int8 and uint8 are mapped to GDAL’s Byte type, that is, to unsigned integers; that needs to be taken into account when manipulating the data. In addition, several complex Numpy data types are missing and could be manually mapped to 10 or 11.
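Concretely, those hand-edits could look like the following sketch (my choices; whether downcasting extended-precision complex types is acceptable depends on the data):

# Booleans become Bytes: GDAL has no 1-bit data type
NP2GDAL_CONVERSION["bool"] = 1
# map extended-precision complex types (where the platform provides them,
# e.g. complex256) to CFloat64 (code 11), accepting the loss of precision
NP2GDAL_CONVERSION["complex256"] = 11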

But I can work with this. To get back to the first listing, in the “get parameters” section I add a line and then create the destination dataset as follows:

gdaltype = NP2GDAL_CONVERSION[final_data.dtype.name]
[...]
dst_dataset = driver.Create([output_filepath], ncol, nrow, nband, gdaltype)

Voilà.

NOTES:

[1] I am aware I could have used CreateCopy() in such a simple case, but Create() is the more generally versatile method.

[2] The problem isn’t specific to GeoTIFF files: it arises with GDAL for any file format whose driver supports a Create() method. But to be honest, GDAL is pretty unwieldy for most scientific data formats, so if I were to write NetCDF or HDF5 files, I would use the appropriate specialized libraries, most of which tend to be aware of Numpy and its data types.