7. Standard tools for spatial data analysis II

Foreword on working directory, data and packages

The Data for this tutorial are provided via Github. In order to reproudce the code with your own data, replace the url with your local filepath

If you work locally, you may define the working directory

setwd("YOUR/FILEPATH") 
# this will not work. if you want to use it, define a path on your computer

This should be the place where your data and other folders are stored. more Information can be found here

Code chunks for saving data and results are disabeled in this tutorial (using a #). Please replace (“YOUR/FILEPATH”) with your own path.

The code is written in such a way, that packages will be automatically installed (if not already installed) and loaded. Packages will be stored in the default location of your computer.

Intro

Simple features, i.e. records that come with a geometry, can be manipulated or changed. for the case where these manipulations involve geometries, different functions than with data frames are necessary. This is the content of this tutorial.

We will look at different ways to check and change the geometries of sfobjects. Additionally, we will create new geometries

We will work with the sf package. The Tutorial is inspired by the sf package vignettes and this tutorial by Edzer Pebesma

if (!require(sf)){install.packages("sf"); library(sf)}
## Lade nötiges Paket: sf
## Warning: Paket 'sf' wurde unter R Version 3.6.2 erstellt
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

Methods for simple feature objects

A complete list of methods for a particular class can be obtained with methods(class = "sf"). To see al objects in the package use ls("package:sf") and for all functions us lsf("package:sf")

methods(class = "sf")
##  [1] $<-                   [                     [[<-                 
##  [4] aggregate             as.data.frame         cbind                
##  [7] coerce                dbDataType            dbWriteTable         
## [10] identify              initialize            merge                
## [13] plot                  print                 rbind                
## [16] show                  slotsFromS3           st_agr               
## [19] st_agr<-              st_area               st_as_sf             
## [22] st_bbox               st_boundary           st_buffer            
## [25] st_cast               st_centroid           st_collection_extract
## [28] st_convex_hull        st_coordinates        st_crop              
## [31] st_crs                st_crs<-              st_difference        
## [34] st_geometry           st_geometry<-         st_interpolate_aw    
## [37] st_intersection       st_intersects         st_is                
## [40] st_join               st_line_merge         st_nearest_points    
## [43] st_node               st_normalize          st_point_on_surface  
## [46] st_polygonize         st_precision          st_segmentize        
## [49] st_set_precision      st_simplify           st_snap              
## [52] st_sym_difference     st_transform          st_triangulate       
## [55] st_union              st_voronoi            st_wrap_dateline     
## [58] st_write              st_zm                
## see '?methods' for accessing help and source code

We will use the nc data set from Cressie (1991)1, which is stored in datasource. The data set contains data of sudden infant Death Syndrom (SIDS) from North Carolina’s counties between 1974-78 and 1979-84). For more information about the contents of the collumns refer to this page.

# load from Github
download.file("https://raw.githubusercontent.com/RafHo/teaching/master/angewandte_geodatenverarbeitung/datasource/nc.zip", destfile = "nc.zip")

# unzip shapefile
unzip("nc.zip")


nc_sf <- st_read("nc.shp")
## Reading layer `nc' from data source `C:\Users\Nils Riach\Github\spatial_thinking_website\blogdown\content\courses\angewandte_geodatenverarbeitung\nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs

Recap: using the right crs

The crs of nc_sfis EPSG: 4267 which has its units in Lon/Lat. We will transform it to “NAD83 / North Carolina” which has its units in metres. This is done for more convenient handling of the data.

st_crs(nc_sf)
## Coordinate Reference System:
##   EPSG: 4267 
##   proj4string: "+proj=longlat +datum=NAD27 +no_defs"
nc_sf <- st_transform(nc_sf, 32119) # NC state plane (a2), m
st_crs(nc_sf )
## Coordinate Reference System:
##   EPSG: 32119 
##   proj4string: "+proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
#in order to find out the units of the crs
st_crs(32119)$units
## [1] "m"

Geometric confirmation or logical queries

Tests for geometric characteristics e.g. topology.

  • st_intersects: touch or overlap
  • st_disjoint: !intersects
  • st_touches: touch
  • st_crosses: cross (do not touch)
  • st_within: within
  • st_contains: contains
  • st_overlaps: overlaps
  • st_covers: cover
  • st_covered_by: covered by
  • st_equals: equals
  • st_equals_exact: equals, with some fuzz

All these functions return a sparse matrix with matching (TRUE) indexes, or a full logical matrix. This can be specified in the functions. To illustrate this, we will check if two geometries intersect. First we reduce the object to only five geometries to make it more understandable and take a look at thier geometries:

nc1 <- nc_sf[1:5,] # only use rows 1 to 5 of the original data set
plot(st_geometry(nc1))

# default usage of the function
# sparse index list is returned
st_intersects(nc1, nc1)
## Sparse geometry binary predicate list of length 5, where the predicate was `intersects'
##  1: 1, 2
##  2: 1, 2, 3
##  3: 2, 3
##  4: 4
##  5: 5
#dense logical matrix with element i,j TRUE
st_intersects(nc1, nc1, sparse = FALSE)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE  TRUE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE FALSE FALSE
## [3,] FALSE  TRUE  TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE  TRUE FALSE
## [5,] FALSE FALSE FALSE FALSE  TRUE

Check out the other functions and their functionalities!

Geometric operations

While the functions above return matrixes, the following functions return new geometries

  • st_boundary: return the boundary of a geometry
  • st_buffer: compute a buffer around this geometry/each geometry
  • st_centroid: gives centroid of geometry
  • st_convex_hull: creates convex hull of set of points
  • st_line_merge: merges lines
  • st_polygonize: creates polygon from lines
  • st_segmentize: adds points to straight lines
  • st_simplify: simplifies lines by removing vertices
  • st_union: creates one geometry from several geometries
  • st_difference: difference between pairs of geometries that do not
  • st_interpolate: Areal-weighted interpolation of polygon data

Let’s take a look at the more basic geometric operations.

par(mfrow=c(1,2)) # both plots together
ncg <- st_geometry(nc_sf[1:3,]) # Get, set, or replace geometry from an sf object- here the first three objects
plot(ncg, col = sf.colors(3, categorical = TRUE))
u <- st_union(ncg) # unionize the three objects to one

plot(u, lwd = 2)
#check where the first and second object overlap or touch
plot(st_intersection(ncg[1], ncg[2]), col = 'red', add = TRUE) 
# set buffers inside and outside object
# dist/units depend on crs! Here it is 
plot(st_buffer(u, 10000), border = 'blue', add = TRUE)
# st_buffer(u, units::set_unit(10, km)) # with sf devel
plot(st_buffer(u, -5000), border = 'yellow', add = TRUE)

Geometric creation

  • st_triangulate: triangulates set of points (not constrained)
  • st_voronoi: creates voronoi tesselation
  • st_point: creates point geometry from XYZ numeric data.
  • st_ …. functions for creating geometries lines, polygons and their respective ‘multi’ versions

More advanced: aggregate, interpolate, relationships and join

Aggregate

Aggregation means condensating data. Aggregated data show some statistics about a variable (typically average or total) in relation to some kind of grouping variable.

In the following example, we will divide the counties into four regions and dissolve them. Assignment to the region is performed by extracting the longtitudinal coordinates of the centroids of each county.

After the counties are dissolved, the area of the new regions need to be calculated.

Meanwhile, the values of "SID79" are summarized and then the density of SID for the new regions is calculated.

cent <- st_centroid(nc_sf$geometry) # get centroids of counties
coords <- st_coordinates(cent) # get coordinates of centoids

nc_new <- nc_sf
nc_new$longitude <- coords[,1] #get x values which are the longtitudes 
nc_new$groups <- cut(nc_new$longitude, 4, labels = c("coast","center_east","center_west", "west")) # name the groups

#library(tidyverse, lib.loc= "T:/AG2019/R/win-library/3.5")

#nc_regions <- nc_new %>% 
 # dplyr::group_by(groups) %>%
  #dplyr::summarize() %>%
  #dplyr::mutate(area = as.numeric(st_area(.)))# as numeric damit m^2 weggeht

nc_regions <- nc_new
nc_regions <- as.numeric(st_area(nc_regions))
nc_regions <- nc_new %>% dplyr::group_by(groups) %>% dplyr::summarize()
# uses dplyr package

nc_avg_area <- aggregate(nc_sf["SID79"], nc_regions , FUN = mean)
plot(nc_avg_area)

Interpolation

Create a grid over the bounding box of nc with 20 columns and 10 rows

g <-  st_make_grid(nc_sf, n = c(20,10)) 
plot(g)

perform Areal-weighted interpolation of births in 1974 on to the grid

a1 <- st_interpolate_aw(nc_sf["BIR74"], g, extensive = FALSE) 
## Warning in st_interpolate_aw.sf(nc_sf["BIR74"], g, extensive = FALSE):
## st_interpolate_aw assumes attributes are constant over areas of x

From ?st_interpolate_aw we can see that extensive = logical; if TRUE, the attribute variables are assumed to be spatially extensive (like population) and the sum is preserved, otherwise, spatially intensive (like population density) and the mean is preserved.

To learn more about spatially extensive vs spatially intensive variables check this page

Intensive Data are independent of the spatial units (e.g., population density, percentages); data that has been normalized in some fashion.

Extensive Data are dependent on the spatial units (e.g., population totals)

To demonstrate the difference, a2 is interpolated with “extensive=TRUE”

a2 <- st_interpolate_aw(nc_sf["BIR74"], g, extensive = TRUE)
## Warning in st_interpolate_aw.sf(nc_sf["BIR74"], g, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant over areas of x
## Warning in st_interpolate_aw(nc["BIR74"], g, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant over areas of x

Differences in interpolation are demonstrated by dividing a1 and a2 through nc and mapping the result

sum(a1$BIR74) / sum(nc_sf$BIR74) #not close to one: property is assumed spatially intensive
## [1] 1.430389
sum(a2$BIR74) / sum(nc_sf$BIR74)
## [1] 1
#a1$intensive = a1$BIR74
#a1$extensive = a2$BIR74
#plot(a1[c("intensive", "extensive")])
a1$what = "intensive"
a2$what = "extensive"

if (!require(ggplot2)){install.packages("ggplot2"); library(ggplot2)}
## Lade nötiges Paket: ggplot2
## Warning: Paket 'ggplot2' wurde unter R Version 3.6.2 erstellt
l = st_cast(nc_sf, "MULTILINESTRING") # simplify the polygon to multilinestring

ggplot() + geom_sf(data = rbind(a1,a2), aes(fill = BIR74)) + 
    geom_sf(data = l, col = 'lightgray') + facet_grid(what~.) +
    scale_fill_gradientn(colors = sf.colors(10))

Joins

Joining different data to one another is one of the most needed operations in (geographic) data analysis! There are spatial joins, where the common denominator is space or joins based on attributes.

Spatial joins

Spatial joins. First we need to find counties matching a coordinate (in map units). Create point, make it spatial and join it to nc_sf.

pt <- st_point(c(472038, 210385))
pt_sfc <- st_sfc(pt, crs = st_crs(nc_sf)) # create simple feature geometry list column
pt_sf <- st_sf(a = 1, geom = pt_sfc) #transform to sf object
st_join(pt_sf, nc_sf)
## Simple feature collection with 1 feature and 15 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 472038 ymin: 210385 xmax: 472038 ymax: 210385
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   a  AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 1 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##   NWBIR74 BIR79 SID79 NWBIR79                  geom
## 1    1057  6427     8    1504 POINT (472038 210385)
st_join(nc_sf, pt_sf, left = FALSE)
## Simple feature collection with 1 feature and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 449767.8 ymin: 194348 xmax: 502410 ymax: 234481
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##     AREA PERIMETER CNTY_ CNTY_ID  NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 50 0.134      1.59  1980    1980 Rowan 37159  37159       80  4606     3
##    NWBIR74 BIR79 SID79 NWBIR79 a                       geometry
## 50    1057  6427     8    1504 1 MULTIPOLYGON (((491842.7 19...

The first time the join is performed on the points, the second time on the Multipolygon.

Why left = TRUE? Because the MUltipoligon on which we joun has more than one row. So we need to specify how the join is performed.

The difference of “left” being TRUE or False is that either a left join or else an inner join is performed.

Left Join returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

Inner Join returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Join with self:

nrow(nc_sf)
## [1] 100
x <- st_join(nc_sf,nc_sf)
nrow(x)
## [1] 590

Since we didn’t specify, a left join is performed; this means that all combination of the matches are returned, leading to 590 rows.

If we only want neigbours excluding nc_sf itself. we can specify the argument “join” within the st_join function. Values for argument join are the geometric confirmation or logical queries from above:

st_disjoint st_touches st_crosses st_within st_contains st_overlaps st_covers st_covered_by st_equals or st_equals_exact, or user-defined functions of the same profile.

We will use st_touches:

x <- st_join(nc_sf, nc_sf, join = st_touches)
nrow(x)
## [1] 490

Now only 490 rows are returned

Now we want to join only neighnours that touch over a line, not in a point. We have to define a function which we can retrieve from the explanation of the st_relate function. It is called st_rook, which refers to the so called DE9-IM relationship. Further info can be found here. The short version of the patterns can be found by using ?st_relate().

st_rook <- function(a, b, prepared) st_relate(a, b, pattern = "F***1****") # define function
# pattern means that relations touch over a line, not in a point
x <- st_join(nc_sf, nc_sf, join = st_rook) # use it on the nc data set
nrow(x)
## [1] 462

The number of rows is reduced again

If we want to find out which counties touch another county in a point only, we simply have to change the pattern in the function. After that the result is mapped:

sel <- unlist(st_relate(nc_sf, nc_sf, pattern = "F***0****")) #simplifies it to produce a vector which contains all the atomic components which occur in nc_sf e.g.that match the pattern
# The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw"

plot(st_geometry(nc_sf)) # st_geometry: set, or retrieve sfc from an sf object
plot(st_geometry(nc_sf[sel,]), add = TRUE, col = 'red')

Joins based on attributes

It is even more common, and way more simple, to join features based on their attributes. Here, geometries are ignored. If geometries do not match, an empty geometry will be returned. Normally, a data frame is joined on to a sf object. The base R function merge or the dplyr functions left_join etc. work here:

We will use the nc1 object which contains the first 5 rows of the nc_sf object.

nc1 <- nc_sf[1:5,] # only use rows 1 to 5 of the original data set
names(nc1)
##  [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"      "FIPS"     
##  [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
## [13] "SID79"     "NWBIR79"   "geometry"

we also create a data frame from nc1. Then we add a column with random letters.

nc_data_frame <- as.data.frame(nc1$CNTY_ID)
View(nc_data_frame)
nc_data_frame$random_numbers <- c(10,20,30,40,50)
View(nc_data_frame)

Now we can merge the two :

nc_merge <- merge(nc1, nc_data_frame) # sf object comes first
nc_merge
## Simple feature collection with 25 features and 16 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 363713.3 ymin: 262348.5 xmax: 899687.6 ymax: 318259.9
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## First 10 features:
##     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
## 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
## 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
## 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
## 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
## 6  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
## 7  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
## 8  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
## 9  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
## 10 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
##    NWBIR74 BIR79 SID79 NWBIR79 nc1$CNTY_ID random_numbers
## 1       10  1364     0      19        1825             10
## 2       10   542     3      12        1825             10
## 3      208  3616     6     260        1825             10
## 4      123   830     2     145        1825             10
## 5     1066  1606     3    1197        1825             10
## 6       10  1364     0      19        1827             20
## 7       10   542     3      12        1827             20
## 8      208  3616     6     260        1827             20
## 9      123   830     2     145        1827             20
## 10    1066  1606     3    1197        1827             20
##                          geometry
## 1  MULTIPOLYGON (((387344.8 27...
## 2  MULTIPOLYGON (((408601.8 29...
## 3  MULTIPOLYGON (((478717.1 27...
## 4  MULTIPOLYGON (((878193.9 28...
## 5  MULTIPOLYGON (((769835.3 27...
## 6  MULTIPOLYGON (((387344.8 27...
## 7  MULTIPOLYGON (((408601.8 29...
## 8  MULTIPOLYGON (((478717.1 27...
## 9  MULTIPOLYGON (((878193.9 28...
## 10 MULTIPOLYGON (((769835.3 27...

nc_merge suddenly has 25 rows instead of the original 5.

We can specify the join further by using the arguments available in the merge function. By specifying the common denominator; the county id; only one match is possible so we keep 5 rows.

nc_merge <- merge(nc1, nc_data_frame, by.x="CNTY_ID", by.y="nc1$CNTY_ID") # specify the columns by which to join
nc_merge
## Simple feature collection with 5 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 363713.3 ymin: 262348.5 xmax: 899687.6 ymax: 318259.9
## epsg (SRID):    32119
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   CNTY_ID  AREA PERIMETER CNTY_        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1    1825 0.114     1.442  1825        Ashe 37009  37009        5  1091     1
## 2    1827 0.061     1.231  1827   Alleghany 37005  37005        3   487     0
## 3    1828 0.143     1.630  1828       Surry 37171  37171       86  3188     5
## 4    1831 0.070     2.968  1831   Currituck 37053  37053       27   508     1
## 5    1832 0.153     2.206  1832 Northampton 37131  37131       66  1421     9
##   NWBIR74 BIR79 SID79 NWBIR79 random_numbers                       geometry
## 1      10  1364     0      19             10 MULTIPOLYGON (((387344.8 27...
## 2      10   542     3      12             20 MULTIPOLYGON (((408601.8 29...
## 3     208  3616     6     260             30 MULTIPOLYGON (((478717.1 27...
## 4     123   830     2     145             40 MULTIPOLYGON (((878193.9 28...
## 5    1066  1606     3    1197             50 MULTIPOLYGON (((769835.3 27...

——-End——-


  1. Cressie, N (1991), Statistics for spatial data. New York: Wiley, pp. 386–389; Cressie, N, Chan NH (1989) Spatial modelling of regional variables. Journal of the American Statistical Association, 84, 393–401; Cressie, N, Read, TRC (1985) Do sudden infant deaths come in clusters? Statistics and Decisions Supplement Issue 2, 333–349; http://sal.agecon.uiuc.edu/datasets/sids.zip

Previous
Next