Basic Usage

Introduction

Most tasks can be completed using the exact_extract function.

The following examples demonstrate the use of exact_extract to summarize population and elevation data from the Gridded Population of the World and EU-DEM datasets.

The exactextractr R package package includes samples of both of these datasets, cropped to the extent of São Miguel, the largest and most populous island of the Azores archipelago. The package also provides boundaries for the six municipalities, or concelhos, that cover the island. These files can be accessed from GitHub.

Loading the sample data

The exact_extract function can accept inputs from several popular libraries such as GDAL, rasterio, fiona, and GeoPandas. A filename can also be provided, in which case exact_extract will load the data using whichever of these libraries is available.

Although it is not needed to complete the analysis, the following code can be used to plot the sample datasets.

The plot below shows the population count file from GPW. This raster provides the total population in each pixel for the calendar year 2020. On top of the population grid, we plot boundaries for the six municipalities, or concelhos, into which the island is divided. We can see that the population is concentrated along the coastlines, with smaller communities located in volcanic calderas inland.

[1]:
import exactextract
import geopandas as gpd
import rasterio
import rasterio.plot
from matplotlib import pyplot

pop_count = 'sao_miguel/gpw_v411_2020_count_2020.tif'
concelhos = 'sao_miguel/concelhos.gpkg'

fig, ax = pyplot.subplots()

with rasterio.open(pop_count) as r:
    rasterio.plot.show(r, ax=ax)

admin = gpd.read_file(concelhos)
admin.plot(ax=ax, facecolor='none', edgecolor='black')
[1]:
<Axes: >
_images/basic_usage_2_1.png

Calculating the total population

Because the population count raster has been cropped and contains no land area outside of São Miguel, we can calculate the total population of the island by simply loading the raster into a NumPy array and using the sum function:

[2]:
with rasterio.open(pop_count) as r:
    r.read(masked=True).sum()

Calculating the population per region

We can use the exact_extract function to see how the population is divided among concelhos. By default, the results are returned in a GeoJSON-like format, with the order corresponding to the order of input features.

[3]:
from exactextract import exact_extract

exact_extract(pop_count, concelhos, 'sum')
[3]:
[{'type': 'Feature', 'properties': {'sum': 14539.87501166888}},
 {'type': 'Feature', 'properties': {'sum': 4149.850504104552}},
 {'type': 'Feature', 'properties': {'sum': 66866.70795363998}},
 {'type': 'Feature', 'properties': {'sum': 5293.967633164103}},
 {'type': 'Feature', 'properties': {'sum': 31920.496896551656}},
 {'type': 'Feature', 'properties': {'sum': 9093.449173829771}}]

Output refinements

Before reviewing the results obtained above, we’ll demonstrate a few options that can be used to refine the output of exact_extract.

Output formats

Outputs can be returned in different formats, e.g. a pandas dataframe:

[4]:
exact_extract(pop_count, concelhos, 'sum', output='pandas')
[4]:
sum
0 14539.875012
1 4149.850504
2 66866.707954
3 5293.967633
4 31920.496897
5 9093.449174

Including source columns

In many cases, it is useful to include some information from the source dataset. Columns to copy from the source dataset can be specified with include_cols or include_geom:

[5]:
exact_extract(pop_count, concelhos, 'sum', include_cols='name', output='pandas')
[5]:
name sum
0 Lagoa 14539.875012
1 Nordeste 4149.850504
2 Ponta Delgada 66866.707954
3 Povoação 5293.967633
4 Ribeira Grande 31920.496897
5 Vila Franca do Campo 9093.449174

Multiple statistics

Multiple statistics can be calculated by providing a list:

[6]:
exact_extract(pop_count, concelhos, ['sum', 'max', 'max_center_x', 'max_center_y'], include_cols='name', output='pandas')
[6]:
name sum max max_center_x max_center_y
0 Lagoa 14539.875012 2523.639404 -25.579167 37.745833
1 Nordeste 4149.850504 414.669891 -25.145833 37.829167
2 Ponta Delgada 66866.707954 3102.277100 -25.654167 37.745833
3 Povoação 5293.967633 544.421692 -25.245833 37.745833
4 Ribeira Grande 31920.496897 4133.354492 -25.579167 37.812500
5 Vila Franca do Campo 9093.449174 1334.649658 -25.437500 37.712500

We can also specify a column name that is different from the stat:

[7]:
exact_extract(pop_count, concelhos, 'total_pop=sum', include_cols='name', output='pandas')
[7]:
name total_pop
0 Lagoa 14539.875012
1 Nordeste 4149.850504
2 Ponta Delgada 66866.707954
3 Povoação 5293.967633
4 Ribeira Grande 31920.496897
5 Vila Franca do Campo 9093.449174

Interpreting the results

We might reasonably expect the total population to equal the value of 145,603 we previously obtained by summing the entire raster, but it doesn’t. In fact, 9% of the population is unaccounted for in the concelho totals.

[8]:
_['total_pop'].sum()
[8]:
131864.34717295895

The cause of the discrepancy can be seen by looking closely at the densely populated Ponta Delgada region on the southern coast. Many of the cells containing population are only partially covered by the concelho boundaries, so some of the total population calculated from the full raster is missing from the totals.

[9]:
fig, ax = pyplot.subplots()

ax.set_xlim([-25.75, -25.55])
ax.set_ylim([37.70, 37.77])

with rasterio.open(pop_count) as r:
    rasterio.plot.show(r, ax=ax)

gpd.read_file(concelhos) \
    .plot(ax=ax, facecolor='none', edgecolor='red')
[9]:
<Axes: >
_images/basic_usage_18_1.png

Calculating population from population density

It turns out that we need a somewhat more complex solution to get accurate population counts when our polygons follow coastlines. Instead of using the population count raster, we bring in the population density raster, which provides the number of persons per square kilometer of land area in each pixel.

[10]:
pop_density = 'sao_miguel/gpw_v411_2020_density_2020.tif'

When calculating a sum, exactextract multiples the value of each pixel by the fraction of the pixel that is covered by the polygon. To calculate the total population from a population density raster, we would need the product of the population density, the fraction of the pixel that is covered, and the pixel area. We could compute this by creating a raster of pixel areas, and using the weighted_sum operation to calculate the product of these three values.

A simpler alternative is to provide an argument to the sum operation, stating that we want pixel values to be multiplied by the area of the pixel that is covered, rather than the fraction of the pixel that is covered.

[11]:
exact_extract(pop_density, concelhos, 'total_pop=sum(coverage_weight=area_spherical_km2)', include_cols='name', output='pandas')
[11]:
name total_pop
0 Lagoa 15788.098349
1 Nordeste 4537.404879
2 Ponta Delgada 71370.694085
3 Povoação 5997.344061
4 Ribeira Grande 36131.403400
5 Vila Franca do Campo 11768.354484

The total population is then calculated as:

[12]:
_['total_pop'].sum()
[12]:
145593.29925670772

This much more closely matches the value of 145,603 we obtained by summing the entire raster.

Population-weighted statistics

Suppose that we are interested in calculating the average elevation of a residence in each of the six concelhos. Loading the EU-DEM elevation data for the island, we can see that each concelho is at least partly occupied by interior mountains, indicating that the results of a simple mean would be unrepresentative of the primarily coastal population.

[13]:
dem = 'sao_miguel/eu_dem_v11.tif'

fig, ax = pyplot.subplots()

with rasterio.open(dem) as r:
    rasterio.plot.show(r, ax=ax, cmap='terrain')

admin = gpd.read_file(concelhos)
admin.plot(ax=ax, facecolor='none', edgecolor='black')
[13]:
<Axes: >
_images/basic_usage_27_1.png

As in the previous section, we avoid working with the population count raster to avoid losing population along the coastline. We can formulate the population-weighted average elevation as in terms of population density and pixel areas as:

\[\bar{x}_\mathrm{pop} = \frac{ \Sigma_{i=0}^n {x_ic_id_ia_i}}{\Sigma_{i=0}^n{c_id_ia_i}}\]

where \(x_i\) is the elevation of pixel \(i\), \(c_i\) is the fraction of pixel \(i\) that is covered by a polygon, \(d_i\) is the population density of pixel \(i\), and \(a_i\) is the area of pixel \(i\).

Constant pixel area

If we are working with projected data, or geographic data over a small area such as São Miguel, we can assume all pixel areas to be equivalent, in which case the \(a_i\) components cancel each other out and we are left with the direct usage of population density as a weighting raster:

[14]:
exact_extract(dem, concelhos, ['mean', 'weighted_mean'], weights=pop_density, include_cols='name', output='pandas')
[14]:
name mean weighted_mean
0 Lagoa 233.720475 76.874733
1 Nordeste 453.808604 192.470741
2 Ponta Delgada 274.439509 97.739504
3 Povoação 375.483205 170.464390
4 Ribeira Grande 312.035054 74.849762
5 Vila Franca do Campo 418.779671 92.205743

Non-constant pixel area

What if pixel areas do vary across the region of our analysis? In this case, we can again use the coverage_weight argument to specify that \(c_i\) in the above formula should be the covered area, not the covered fraction. In this case, we can see that considering variations in pixel area does not significantly affect the result.

[15]:
exact_extract(dem, concelhos,
             ['mean', 'simple_weighted_mean=weighted_mean', 'weighted_mean(coverage_weight=area_spherical_m2)'],
              weights=pop_density,
              include_cols='name',
              output='pandas')
[15]:
name mean simple_weighted_mean weighted_mean
0 Lagoa 233.720475 76.874733 76.873205
1 Nordeste 453.808604 192.470741 192.475210
2 Ponta Delgada 274.439509 97.739504 97.718672
3 Povoação 375.483205 170.464390 170.454343
4 Ribeira Grande 312.035054 74.849762 74.849524
5 Vila Franca do Campo 418.779671 92.205743 92.201693