{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Basic Usage" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Introduction\n", "\n", "Most tasks can be completed using the `exact_extract` function.\n", "\n", "The following examples demonstrate the use of `exact_extract` to summarize population and elevation data from the\n", "[Gridded Population of the World](https://sedac.ciesin.columbia.edu/data/collection/gpw-v4)\n", "and [EU-DEM](https://www.eea.europa.eu/data-and-maps/data/copernicus-land-monitoring-service-eu-dem)\n", "datasets.\n", "\n", "The `exactextractr` R package package includes samples of both of these datasets,\n", "cropped to the extent of São Miguel, the largest and most populous island of the Azores archipelago.\n", "The package also provides boundaries for the six municipalities, or _concelhos_, that cover the island.\n", "These files can be accessed from [GitHub](https://github.com/isciences/exactextractr/tree/master/inst/sao_miguel).\n", "\n", "## Loading the sample data\n", "\n", "The `exact_extract` function can accept inputs from several popular libraries such as GDAL, rasterio, fiona, and GeoPandas.\n", "A filename can also be provided, in which case `exact_extract` will load the data using whichever of these libraries is available. \n", "\n", "Although it is not needed to complete the analysis, the following code can be used to plot the sample datasets.\n", "\n", "The plot below shows the population count file from GPW. This raster provides the total population in each pixel for the calendar year 2020. On top of the population grid, we plot boundaries for the six municipalities, or concelhos, into which the island is divided. We can see that the population is concentrated along the coastlines, with smaller communities located in volcanic calderas inland." ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "import exactextract\n", "import geopandas as gpd\n", "import rasterio\n", "import rasterio.plot\n", "from matplotlib import pyplot\n", "\n", "pop_count = 'sao_miguel/gpw_v411_2020_count_2020.tif'\n", "concelhos = 'sao_miguel/concelhos.gpkg'\n", "\n", "fig, ax = pyplot.subplots()\n", "\n", "with rasterio.open(pop_count) as r:\n", " rasterio.plot.show(r, ax=ax)\n", " \n", "admin = gpd.read_file(concelhos)\n", "admin.plot(ax=ax, facecolor='none', edgecolor='black')" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Calculating the total population\n", "\n", "Because the population count raster has been cropped and contains no land area outside of São Miguel,\n", "we can calculate the total population of the island by simply loading the raster into a NumPy array\n", "and using the `sum` function:" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "with rasterio.open(pop_count) as r:\n", " r.read(masked=True).sum()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Calculating the population per region\n", "\n", "We can use the `exact_extract` function to see how the population is divided among _concelhos_. By default, the results are returned in a GeoJSON-like format, with the order corresponding to the order of input features." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from exactextract import exact_extract\n", "\n", "exact_extract(pop_count, concelhos, 'sum')" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Output refinements\n", "\n", "Before reviewing the results obtained above, we'll demonstrate a few options that can be used to refine the output of `exact_extract`.\n", "\n", "### Output formats\n", "\n", "Outputs can be returned in different formats, e.g. a pandas dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "exact_extract(pop_count, concelhos, 'sum', output='pandas')" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "### Including source columns\n", "\n", "In many cases, it is useful to include some information from the source dataset. Columns to copy from the source dataset can be specified with `include_cols` or `include_geom`:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "exact_extract(pop_count, concelhos, 'sum', include_cols='name', output='pandas')" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "### Multiple statistics\n", "\n", "Multiple statistics can be calculated by providing a list:" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "exact_extract(pop_count, concelhos, ['sum', 'max', 'max_center_x', 'max_center_y'], include_cols='name', output='pandas')" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "We can also specify a column name that is different from the stat:" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "exact_extract(pop_count, concelhos, 'total_pop=sum', include_cols='name', output='pandas')" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## Interpreting the results\n", "\n", "We might reasonably expect the total population to equal the value of 145,603 we previously obtained by summing the entire raster,\n", "but it doesn’t. In fact, 9% of the population is unaccounted for in the concelho totals." ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "_['total_pop'].sum()" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "The cause of the discrepancy can be seen by looking closely at the densely populated Ponta Delgada region on the southern coast. Many of the cells containing population are only partially covered by the _concelho_ boundaries, so some of the total population calculated from the full raster is missing from the totals." ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "fig, ax = pyplot.subplots()\n", "\n", "ax.set_xlim([-25.75, -25.55])\n", "ax.set_ylim([37.70, 37.77])\n", "\n", "with rasterio.open(pop_count) as r:\n", " rasterio.plot.show(r, ax=ax)\n", " \n", "gpd.read_file(concelhos) \\\n", " .plot(ax=ax, facecolor='none', edgecolor='red')" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "## Calculating population from population density\n", "\n", "It turns out that we need a somewhat more complex solution to get accurate population counts when our polygons follow coastlines.\n", "Instead of using the population count raster, we bring in the population density raster, which provides the number of persons per square kilometer of land area in each pixel." ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "pop_density = 'sao_miguel/gpw_v411_2020_density_2020.tif'" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "When calculating a sum, exactextract multiples the value of each pixel by the fraction of the pixel that is covered by the polygon.\n", "To calculate the total population from a population density raster, we would need the product of the population density, the fraction of the pixel that is covered, and the pixel area.\n", "We could compute this by creating a raster of pixel areas, and using the `weighted_sum` operation to calculate the product of these three values.\n", "\n", "A simpler alternative is to provide an argument to the `sum` operation, stating that we want pixel values to be multiplied by the _area_ of the pixel that is covered, rather than the fraction of the pixel that is covered." ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "exact_extract(pop_density, concelhos, 'total_pop=sum(coverage_weight=area_spherical_km2)', include_cols='name', output='pandas')" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "The total population is then calculated as:" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "_['total_pop'].sum()" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "This much more closely matches the value of 145,603 we obtained by summing the entire raster." ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "## Population-weighted statistics\n", "\n", "Suppose that we are interested in calculating the average elevation of a residence in each of the six concelhos.\n", "Loading the EU-DEM elevation data for the island, we can see that each concelho is at least partly occupied by interior mountains,\n", "indicating that the results of a simple mean would be unrepresentative of the primarily coastal population." ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "dem = 'sao_miguel/eu_dem_v11.tif'\n", "\n", "fig, ax = pyplot.subplots()\n", "\n", "with rasterio.open(dem) as r:\n", " rasterio.plot.show(r, ax=ax, cmap='terrain')\n", " \n", "admin = gpd.read_file(concelhos)\n", "admin.plot(ax=ax, facecolor='none', edgecolor='black')" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "As in the previous section, we avoid working with the population count raster to\n", "avoid losing population along the coastline. We can formulate the\n", "population-weighted average elevation as in terms of population density and\n", "pixel areas as:\n", "\n", "$$\n", "\\bar{x}_\\mathrm{pop} = \\frac{ \\Sigma_{i=0}^n {x_ic_id_ia_i}}{\\Sigma_{i=0}^n{c_id_ia_i}}\n", "$$\n", "where $x_i$ is the elevation of pixel $i$, $c_i$ is the fraction of pixel $i$ \n", "that is covered by a polygon, $d_i$ is the population density of pixel $i$, and\n", "$a_i$ is the area of pixel $i$.\n", "\n", "### Constant pixel area\n", "\n", "If we are working with projected data, or geographic data over a small area such\n", "as São Miguel, we can assume all pixel areas to be equivalent, in which case the\n", "$a_i$ components cancel each other out and we are left with the direct usage\n", "of population density as a weighting raster:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "exact_extract(dem, concelhos, ['mean', 'weighted_mean'], weights=pop_density, include_cols='name', output='pandas')" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "### Non-constant pixel area\n", "\n", "What if pixel areas do vary across the region of our analysis? In this case, we can again use the `coverage_weight` argument to specify that $c_i$ in the above formula should be the covered area, not the covered fraction.\n", "In this case, we can see that considering variations in pixel area does not significantly affect the result." ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "exact_extract(dem, concelhos,\n", " ['mean', 'simple_weighted_mean=weighted_mean', 'weighted_mean(coverage_weight=area_spherical_m2)'],\n", " weights=pop_density,\n", " include_cols='name',\n", " output='pandas')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }