Basic Usage

Introduction

Most tasks can be completed using the exact_extract function.

The following examples demonstrate the use of exact_extract to summarize population and elevation data from the Gridded Population of the World and EU-DEM datasets.

The exactextractr R package package includes samples of both of these datasets, cropped to the extent of São Miguel, the largest and most populous island of the Azores archipelago. The package also provides boundaries for the six municipalities, or concelhos, that cover the island. These files can be accessed from GitHub.

Loading the sample data

The exact_extract function can accept inputs from several popular libraries such as GDAL, rasterio, fiona, and GeoPandas. A filename can also be provided, in which case exact_extract will load the data using whichever of these libraries is available.

Although it is not needed to complete the analysis, the following code can be used to plot the sample datasets.

The plot below shows the population count file from GPW. This raster provides the total population in each pixel for the calendar year 2020. On top of the population grid, we plot boundaries for the six municipalities, or concelhos, into which the island is divided. We can see that the population is concentrated along the coastlines, with smaller communities located in volcanic calderas inland.

[1]:

import exactextract
import geopandas as gpd
import rasterio
import rasterio.plot
from matplotlib import pyplot

pop_count = 'sao_miguel/gpw_v411_2020_count_2020.tif'
concelhos = 'sao_miguel/concelhos.gpkg'

fig, ax = pyplot.subplots()

with rasterio.open(pop_count) as r:
    rasterio.plot.show(r, ax=ax)

admin = gpd.read_file(concelhos)
admin.plot(ax=ax, facecolor='none', edgecolor='black')

[1]:

<Axes: >

Calculating the total population

Because the population count raster has been cropped and contains no land area outside of São Miguel, we can calculate the total population of the island by simply loading the raster into a NumPy array and using the sum function:

[2]:

with rasterio.open(pop_count) as r:
    r.read(masked=True).sum()

Calculating the population per region

We can use the exact_extract function to see how the population is divided among concelhos. By default, the results are returned in a GeoJSON-like format, with the order corresponding to the order of input features.

[3]:

from exactextract import exact_extract

exact_extract(pop_count, concelhos, 'sum')

[3]:

[{'type': 'Feature', 'properties': {'sum': 14539.87501166888}},
 {'type': 'Feature', 'properties': {'sum': 4149.850504104552}},
 {'type': 'Feature', 'properties': {'sum': 66866.70795363998}},
 {'type': 'Feature', 'properties': {'sum': 5293.967633164103}},
 {'type': 'Feature', 'properties': {'sum': 31920.496896551656}},
 {'type': 'Feature', 'properties': {'sum': 9093.449173829771}}]

Output refinements

Before reviewing the results obtained above, we’ll demonstrate a few options that can be used to refine the output of exact_extract.

Output formats

Outputs can be returned in different formats, e.g. a pandas dataframe:

[4]:

exact_extract(pop_count, concelhos, 'sum', output='pandas')

[4]:

	sum
0	14539.875012
1	4149.850504
2	66866.707954
3	5293.967633
4	31920.496897
5	9093.449174

Including source columns

In many cases, it is useful to include some information from the source dataset. Columns to copy from the source dataset can be specified with include_cols or include_geom:

[5]:

exact_extract(pop_count, concelhos, 'sum', include_cols='name', output='pandas')

[5]:

	name	sum
0	Lagoa	14539.875012
1	Nordeste	4149.850504
2	Ponta Delgada	66866.707954
3	Povoação	5293.967633
4	Ribeira Grande	31920.496897
5	Vila Franca do Campo	9093.449174

Multiple statistics

Multiple statistics can be calculated by providing a list:

[6]:

exact_extract(pop_count, concelhos, ['sum', 'max', 'max_center_x', 'max_center_y'], include_cols='name', output='pandas')

[6]:

	name	sum	max	max_center_x	max_center_y
0	Lagoa	14539.875012	2523.639404	-25.579167	37.745833
1	Nordeste	4149.850504	414.669891	-25.145833	37.829167
2	Ponta Delgada	66866.707954	3102.277100	-25.654167	37.745833
3	Povoação	5293.967633	544.421692	-25.245833	37.745833
4	Ribeira Grande	31920.496897	4133.354492	-25.579167	37.812500
5	Vila Franca do Campo	9093.449174	1334.649658	-25.437500	37.712500

We can also specify a column name that is different from the stat:

[7]:

exact_extract(pop_count, concelhos, 'total_pop=sum', include_cols='name', output='pandas')

[7]:

	name	total_pop
0	Lagoa	14539.875012
1	Nordeste	4149.850504
2	Ponta Delgada	66866.707954
3	Povoação	5293.967633
4	Ribeira Grande	31920.496897
5	Vila Franca do Campo	9093.449174

Interpreting the results

We might reasonably expect the total population to equal the value of 145,603 we previously obtained by summing the entire raster, but it doesn’t. In fact, 9% of the population is unaccounted for in the concelho totals.

[8]:

_['total_pop'].sum()

[8]:

131864.34717295895

The cause of the discrepancy can be seen by looking closely at the densely populated Ponta Delgada region on the southern coast. Many of the cells containing population are only partially covered by the concelho boundaries, so some of the total population calculated from the full raster is missing from the totals.

[9]:

fig, ax = pyplot.subplots()

ax.set_xlim([-25.75, -25.55])
ax.set_ylim([37.70, 37.77])

with rasterio.open(pop_count) as r:
    rasterio.plot.show(r, ax=ax)

gpd.read_file(concelhos) \
    .plot(ax=ax, facecolor='none', edgecolor='red')

[9]:

<Axes: >

Calculating population from population density

It turns out that we need a somewhat more complex solution to get accurate population counts when our polygons follow coastlines. Instead of using the population count raster, we bring in the population density raster, which provides the number of persons per square kilometer of land area in each pixel.

[10]:

pop_density = 'sao_miguel/gpw_v411_2020_density_2020.tif'

When calculating a sum, exactextract multiples the value of each pixel by the fraction of the pixel that is covered by the polygon. To calculate the total population from a population density raster, we would need the product of the population density, the fraction of the pixel that is covered, and the pixel area. We could compute this by creating a raster of pixel areas, and using the weighted_sum operation to calculate the product of these three values.

A simpler alternative is to provide an argument to the sum operation, stating that we want pixel values to be multiplied by the area of the pixel that is covered, rather than the fraction of the pixel that is covered.

[11]:

exact_extract(pop_density, concelhos, 'total_pop=sum(coverage_weight=area_spherical_km2)', include_cols='name', output='pandas')

[11]:

	name	total_pop
0	Lagoa	15788.098349
1	Nordeste	4537.404879
2	Ponta Delgada	71370.694085
3	Povoação	5997.344061
4	Ribeira Grande	36131.403400
5	Vila Franca do Campo	11768.354484

The total population is then calculated as:

[12]:

_['total_pop'].sum()

[12]:

145593.29925670772

This much more closely matches the value of 145,603 we obtained by summing the entire raster.

Population-weighted statistics

Suppose that we are interested in calculating the average elevation of a residence in each of the six concelhos. Loading the EU-DEM elevation data for the island, we can see that each concelho is at least partly occupied by interior mountains, indicating that the results of a simple mean would be unrepresentative of the primarily coastal population.

[13]:

dem = 'sao_miguel/eu_dem_v11.tif'

fig, ax = pyplot.subplots()

with rasterio.open(dem) as r:
    rasterio.plot.show(r, ax=ax, cmap='terrain')

admin = gpd.read_file(concelhos)
admin.plot(ax=ax, facecolor='none', edgecolor='black')

[13]:

<Axes: >

As in the previous section, we avoid working with the population count raster to avoid losing population along the coastline. We can formulate the population-weighted average elevation as in terms of population density and pixel areas as:

\[\bar{x}_\mathrm{pop} = \frac{ \Sigma_{i=0}^n {x_ic_id_ia_i}}{\Sigma_{i=0}^n{c_id_ia_i}}\]

where \(x_i\) is the elevation of pixel \(i\), \(c_i\) is the fraction of pixel \(i\) that is covered by a polygon, \(d_i\) is the population density of pixel \(i\), and \(a_i\) is the area of pixel \(i\).

Constant pixel area

If we are working with projected data, or geographic data over a small area such as São Miguel, we can assume all pixel areas to be equivalent, in which case the \(a_i\) components cancel each other out and we are left with the direct usage of population density as a weighting raster:

[14]:

exact_extract(dem, concelhos, ['mean', 'weighted_mean'], weights=pop_density, include_cols='name', output='pandas')

[14]:

	name	mean	weighted_mean
0	Lagoa	233.720475	76.874733
1	Nordeste	453.808604	192.470741
2	Ponta Delgada	274.439509	97.739504
3	Povoação	375.483205	170.464390
4	Ribeira Grande	312.035054	74.849762
5	Vila Franca do Campo	418.779671	92.205743

Non-constant pixel area

What if pixel areas do vary across the region of our analysis? In this case, we can again use the coverage_weight argument to specify that \(c_i\) in the above formula should be the covered area, not the covered fraction. In this case, we can see that considering variations in pixel area does not significantly affect the result.

[15]:

exact_extract(dem, concelhos,
             ['mean', 'simple_weighted_mean=weighted_mean', 'weighted_mean(coverage_weight=area_spherical_m2)'],
              weights=pop_density,
              include_cols='name',
              output='pandas')

[15]:

	name	mean	simple_weighted_mean	weighted_mean
0	Lagoa	233.720475	76.874733	76.873205
1	Nordeste	453.808604	192.470741	192.475210
2	Ponta Delgada	274.439509	97.739504	97.718672
3	Povoação	375.483205	170.464390	170.454343
4	Ribeira Grande	312.035054	74.849762	74.849524
5	Vila Franca do Campo	418.779671	92.205743	92.201693