1. R (Studio) Basics

This tutorial introduces basic concepts of R. It does not specifically consider spatial data, although most of the functionality shown here can also be applied to spatial data. The tutorial is inspired by Robert J. Hijmans's Introduction to R (2019).

In this tutorial, only the “base” package is needed.

Foreword on working directory, data and packages

The data for this tutorial are provided via GitHub. To reproduce the code with your own data, replace the URL with your local file path.

If you work locally, you may define the working directory:

setwd("YOUR/FILEPATH") 
# this will not work. if you want to use it, define a path on your computer

This should be the place where your data and other folders are stored. More information can be found here.

Code chunks for saving data and results are disabled in this tutorial (using a #). To run them, remove the # and replace "YOUR/FILEPATH" with your own path.

The code is written in such a way that packages are automatically installed (if not already installed) and loaded. Packages are stored in the default library location on your computer.
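
One common way to install and load packages on demand is sketched below. It is only an illustration and not necessarily the exact code used in this tutorial. The package names in pkgs are placeholders; base R packages are used here so that the sketch runs without an internet connection.

```r
# Placeholder package names (assumption); replace them with the packages you need.
pkgs <- c("stats", "utils")

for (p in pkgs) {
  # install the package only if it is not available yet
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
  # load the package; character.only = TRUE lets library() accept a string
  library(p, character.only = TRUE)
}
```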

Intro and useful commands

Thousands of packages extend the core functionality of R and contribute to its popularity and versatility. A package needs to be installed once (install.packages("tidyverse")) and then loaded in each session (library(tidyverse)) before its functions can be used. All packages can be found on the CRAN website. Currently, there are >15k packages available and the number is growing. They are assigned to 18 topics, such as statistics, spatial, machine learning etc.

To get built-in help for packages and functions, type a question mark followed by the name in the console. To search all of R, use ??.

#   ?             help (unary and binary)
?base
?sf
# only loaded packages are searched
# to search all of R use ??
??sf

Most problems can be solved with a little experience. This website serves as an excellent starting point to solve your problems.

Kidding aside, here are some useful help pages.

Operators

The following operators are similar to those of other programming languages. By using an operator, you tell R to perform a certain task. An operator is a function that takes one or two arguments and can be written without parentheses. Most of R’s operators are either arithmetic or logical. We will take a look at them in the two “basics tutorials”.

Arithmetic operators

Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^           exponentiation
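
As a quick illustration of the table above, the arithmetic operators can be tried directly in the console. The variable names x and y are arbitrary.

```r
x <- 7
y <- 2

x + y   # addition:       9
x - y   # subtraction:    5
x * y   # multiplication: 14
x / y   # division:       3.5
x ^ y   # exponentiation: 49
```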

Logical operators

Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
$           component
[           indexing
:           sequence operator
~           as in formulae
<-          assignment (right to left)
?           help (unary and binary)
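
A short sketch of how the logical operators behave, using arbitrary example values. Note that comparisons and the & and | operators work element by element on vectors.

```r
x <- TRUE
y <- FALSE

!x          # not x:   FALSE
x | y       # x OR y:  TRUE
x & y       # x AND y: FALSE
isTRUE(x)   # TRUE

# comparisons are applied to each element of a vector
v <- c(1, 5, 10)
res <- v > 3 & v != 10   # FALSE TRUE FALSE
```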

Basic data types

This section discusses the basic data types used in R and mainly shows how they are created. Normally, the data and their types already exist and only need to be manipulated for further analysis. This is where knowledge of the basic data types comes in handy.

Numeric values

Assign a number to a variable. Then check the value of the variable and the class (data type) of the variable.

#   <-          assignment (right to left)
a  <- 7
print(a)
## [1] 7
class(a)
## [1] "numeric"

numeric means that a is a real (decimal) number. Its value is equivalent to 7.000. In a few cases it can be useful, or even necessary, to use integer (whole number) values. Using integers may reduce processing times, especially in large data sets. Use as.integer to convert the numeric variable.

a  <- as.integer(7)
class(a)
## [1] "integer"
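
Besides as.integer, an integer can also be written directly by appending L to the number. The following sketch (with the arbitrary names a_num and a_int) shows that the value is the same while the data type differs.

```r
a_num <- 7    # numeric (double precision)
a_int <- 7L   # the L suffix creates an integer directly

class(a_num)              # "numeric"
class(a_int)              # "integer"
a_num == a_int            # TRUE: the values are equal
identical(a_num, a_int)   # FALSE: the data types differ
```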

To create a vector of numbers, use the c function (combine). The decimal separator is ".", and the individual numbers are separated by ",":

b <- c(1.35, 7.5, 3.0)
print(b)
## [1] 1.35 7.50 3.00

To create a regular sequence use “:”.

d <- 1:10
print(d)
##  [1]  1  2  3  4  5  6  7  8  9 10
# change order to descending:
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1

Create sequences more flexibly by using the seq function:

e <- seq(from = 0, to= 100, by = 10)
print(e)
##  [1]   0  10  20  30  40  50  60  70  80  90 100

Character values

Character values, often referred to as “strings”, are letters or words. Character values need to be distinguished from variable names through single quotes (') or double quotes ("). R is case-sensitive: a is not the same as A. In most computing contexts, a and A are entirely different and, for most intents and purposes, unrelated symbols.

x <- "jkl"
y  <- "hello world"
class(x)
## [1] "character"

Now let’s create a variable countries holding a character vector of five elements.

countries <-c('China', 'China', 'Japan', 'South Korea', 'Japan')
class(countries)
## [1] "character"
print(countries)
## [1] "China"       "China"       "Japan"       "South Korea" "Japan"

Check the length of the vector (number of elements) or the number of characters in each element of the vector:

length(countries)
## [1] 5
nchar(countries)
## [1]  5  5  5 11  5

letters returns the alphabet in lower case (LETTERS returns it in upper case), and toupper and tolower can be used to change case.

z <- letters
print(z)
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
upper <- toupper(z)
print(upper)
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"

To concatenate strings use paste. This may be useful if one wants to combine names in a data set.

g <- "Geography"
m <- "matters"
paste (g, "really", m)
## [1] "Geography really matters"

To replace characters in a string use gsub. To identify patterns, use grep.

gsub('l', '!!', 'Hello World')
## [1] "He!!!!o Wor!!d"
gsub('Hello', 'Bye bye', 'Hello World')
## [1] "Bye bye World"
f <-c('az20', 'az21', 'az22', 'ba30', 'ba31', 'ba32')
i <-grep('b', f)

print(i)
## [1] 4 5 6
f[i] # print all elements in f where the pattern was found
## [1] "ba30" "ba31" "ba32"
# the "[]" brackets refer to "indexing" which is discussed later
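
Two related variants are often convenient: grep with value = TRUE returns the matching strings instead of their positions, and grepl returns a logical vector. A minimal sketch using the same example vector (the names matches, is_b and starts_az are arbitrary):

```r
f <- c('az20', 'az21', 'az22', 'ba30', 'ba31', 'ba32')

# return the matching elements themselves instead of their positions
matches <- grep('b', f, value = TRUE)

# return TRUE/FALSE for every element
is_b <- grepl('b', f)

# "^" anchors the pattern at the start of the string
starts_az <- grepl('^az', f)
f[starts_az]
```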

Logical values (boolean)

These values are either TRUE or FALSE. Although one could use the abbreviations T and F, writing TRUE and FALSE is recommended: TRUE and FALSE are reserved constants that cannot be changed, whereas T and F are ordinary variables that can be overwritten.

z <-FALSE
z
## [1] FALSE
class(z)
## [1] "logical"
z <-c(TRUE,TRUE,FALSE)
z
## [1]  TRUE  TRUE FALSE
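
The difference between T and TRUE can be demonstrated with a small sketch. The helper names t_overwritten and true_protected are arbitrary; tryCatch is only used so that the expected error does not stop the script.

```r
T <- FALSE                             # allowed: T is an ordinary variable
t_overwritten <- identical(T, FALSE)   # TRUE, T now means FALSE
rm(T)                                  # remove our copy, T refers to TRUE again

# assigning to TRUE fails; record whether the error occurred
true_protected <- tryCatch({
  eval(parse(text = "TRUE <- FALSE"))
  FALSE
}, error = function(e) TRUE)
true_protected
```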

Logicals are often the result of computation. One can use them to check whether conditions are met or not.

x <- 5
x > 3 # larger?
## [1] TRUE
x == 3 # equal?
## [1] FALSE
x <= 2 # smaller or equal?
## [1] FALSE

Factors

Factors are categorical variables such as gender (in the 20th century sense, i.e. male and female). In R you typically need to convert (cast) a character variable to a factor to identify groups for use in statistical tests and models. Numbers may also be used, for example to indicate group membership. In that case, mathematical operations cannot be performed with the numbers. They are merely labels!

f1 <- as.factor(countries)
f1
## [1] China       China       Japan       South Korea Japan      
## Levels: China Japan South Korea
f2 <-c(5:7, 5:7, 5:7)
f2
## [1] 5 6 7 5 6 7 5 6 7
f2 <- as.factor(f2) # overwrite
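
The following sketch illustrates why the numbers in a factor are only labels. Converting a factor with as.integer returns the internal level codes, not the original values; to recover the values, convert to character first. The names counts, codes and values are arbitrary.

```r
f2 <- as.factor(c(5:7, 5:7, 5:7))

levels(f2)             # the labels: "5" "6" "7"
counts <- table(f2)    # number of elements per group

codes  <- as.integer(f2)                 # internal codes 1 2 3 ..., not 5 6 7
values <- as.numeric(as.character(f2))   # the original numbers 5 6 7 ...
```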

Missing values

Many, if not most, real-life data sets have missing values or “NA”s (not available).

m <-c(2,NA, 5, 2,NA, 2)
m
## [1]  2 NA  5  2 NA  2

Properly treating missing values is very important. The first question to ask when they appear is whether they should be missing (or did you make a mistake in the data manipulation?). If they should be missing, the second question becomes how to treat them. Can they be ignored? Should the records with NAs be removed? NAs can significantly influence your (statistical) analysis. This can be shown with a correlation analysis, where the result depends on the method used to handle them.

x <- matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2),5)
cor(x)
##      [,1] [,2] [,3]
## [1,]    1    0   NA
## [2,]    0    1   NA
## [3,]   NA   NA    1
#?cor
cor(x, use = "everything")
##      [,1] [,2] [,3]
## [1,]    1    0   NA
## [2,]    0    1   NA
## [3,]   NA   NA    1
cor(x, use = "complete.obs")
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
cor(x, use = "na.or.complete")
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
cor(x, use = "pairwise.complete.obs")
##      [,1] [,2] [,3]
## [1,]    1    0    1
## [2,]    0    1    1
## [3,]    1    1    1

The in-package description gives more details on NA treatment: “If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA. If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). "na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.”
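
Many other base functions handle missing values through an na.rm argument rather than use. A minimal sketch with the same values as the vector m from the beginning of this section (the names n_missing, mean_ok and m_clean are arbitrary):

```r
m <- c(2, NA, 5, 2, NA, 2)

is.na(m)                     # TRUE where a value is missing
n_missing <- sum(is.na(m))   # count the missing values

mean(m)                            # NA, because NAs propagate
mean_ok <- mean(m, na.rm = TRUE)   # ignore NAs: (2 + 5 + 2 + 2) / 4

m_clean <- m[!is.na(m)]      # keep only the non-missing values
```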

Time

Time in R is tricky business… There are different calendars, hours, days, months, and leap years to consider. As a basic introduction, here is a simple way to create date values.

d1 <-as.Date('2015-4-11')
d2 <-as.Date('2015-3-11')
class(d1)
## [1] "Date"
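
Once values are stored as Date objects, simple date arithmetic becomes possible. The sketch below reuses the two dates from above; the format string is just one example of a custom output format.

```r
d1 <- as.Date('2015-4-11')
d2 <- as.Date('2015-3-11')

diff_days <- d1 - d2          # time difference between the two dates
as.numeric(diff_days)         # 31 (days)

d1 + 30                       # add 30 days: "2015-05-11"
format(d1, "%d.%m.%Y")        # custom output format: "11.04.2015"
```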

Create a sequence of dates, for example for grid lines in a plot:

d3 <- seq(from=as.Date("2006-01-01"), to=as.Date("2011-01-01"), by=365/4)

d3
##  [1] "2006-01-01" "2006-04-02" "2006-07-02" "2006-10-01" "2007-01-01"
##  [6] "2007-04-02" "2007-07-02" "2007-10-01" "2008-01-01" "2008-04-01"
## [11] "2008-07-01" "2008-09-30" "2008-12-31" "2009-04-01" "2009-07-01"
## [16] "2009-09-30" "2009-12-31" "2010-04-01" "2010-07-01" "2010-09-30"
## [21] "2010-12-31"
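
Because 365/4 is a fixed number of days, the dates above drift away from the start of each quarter. If calendar quarters are wanted instead, seq also accepts a character value for by when used with dates. A sketch of this alternative (the name d4 is arbitrary):

```r
# step by calendar quarter instead of a fixed number of days
d4 <- seq(from = as.Date("2006-01-01"), to = as.Date("2011-01-01"), by = "quarter")

length(d4)   # 21 dates
head(d4, 3)  # "2006-01-01" "2006-04-01" "2006-07-01"
```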

More advanced datetime manipulations can be performed with the “POSIXlt” and “POSIXct” classes. Note that the printed time zone depends on the settings of your computer.

as.POSIXlt(d1)
## [1] "2015-04-11 UTC"
as.POSIXct(d1)
## [1] "2015-04-11 02:00:00 CEST"
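
The two classes store time differently: a POSIXlt object is a list of components (year, month, day, ...), whereas a POSIXct object is the number of seconds since 1970-01-01 UTC. A short sketch using d1 from above (the names lt and ct are arbitrary):

```r
d1 <- as.Date('2015-4-11')

lt <- as.POSIXlt(d1)
lt$year + 1900   # years are counted from 1900: 2015
lt$mon + 1       # months are counted from 0:   4
lt$mday          # day of the month:            11

ct <- as.POSIXct(d1)
as.numeric(ct)   # seconds since 1970-01-01 UTC
```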

Data structures

In the previous chapter we got to know the most common data types in R. They were all stored in a one-dimensional structure called a vector. This section focuses on multi-dimensional data structures.

Matrix

Think of a matrix as a primitive spreadsheet similar to those in Excel. However, a matrix can only store a single data type.

Create a (two-dimensional) matrix from the numbers 1 to 6 with 3 columns and 2 rows.

matrix(1:6, ncol = 3, nrow = 2)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

To switch the number of columns and rows use the transpose function t.

m <- matrix(1:6, ncol = 3, nrow = 2)
t(m)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

Most of the time matrices are created by row- and/or column-binding vectors. See the difference between the two options?

#create 2 vectors
a <-c(1,2,3)
b <- 5:7

# bind columns a and b
m1 <- cbind(a, b)

# bind rows a and b
m2 <- rbind(a,b)
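
To make the difference visible, one can compare the dimensions of the two results. In this sketch the vectors are recreated so that it runs on its own.

```r
a <- c(1, 2, 3)
b <- 5:7

m1 <- cbind(a, b)   # vectors become columns
m2 <- rbind(a, b)   # vectors become rows

dim(m1)   # 3 rows, 2 columns
dim(m2)   # 2 rows, 3 columns

# m2 holds the same values as the transpose of m1
all(t(m1) == m2)
```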

Using rbind and cbind one can also combine matrices, as long as the number of rows (for cbind) or the number of columns (for rbind) is the same.

# create a second matrix and bind it to m1 by columns
m3 <- cbind(b,b,a)
m <- cbind(m1, m3)
m
##      a b b b a
## [1,] 1 5 5 5 1
## [2,] 2 6 6 6 2
## [3,] 3 7 7 7 3

To check out the structure of a matrix use the following functions:

#number of rows or columns?
nrow(m)
## [1] 3
ncol(m)
## [1] 5
# dimensions (number of rows and columns)
dim(m)
## [1] 3 5
# length (total number of cells)
length(m)
## [1] 15

Columns have names that can be changed. The same is true for rows.

#get column names
colnames(m)
## [1] "a" "b" "b" "b" "a"
#define column names
colnames(m) <- c('ID', 'X', 'Y', 'v1', 'v2')
rownames(m) <- paste0('row', 1:nrow(m))
m
##      ID X Y v1 v2
## row1  1 5 5  5  1
## row2  2 6 6  6  2
## row3  3 7 7  7  3

Lists

Lists are very flexible containers to store different types of data. A list element may contain any type of R object, such as vectors, matrices, data frames or even other lists.

A simple list.

list(1:3)
## [[1]]
## [1] 1 2 3

The first (and only) element in the list [[1]] contains a vector of 1,2,3.

A list with two types of data:

c <- list(c(2,5), 'abc')
c
## [[1]]
## [1] 2 5
## 
## [[2]]
## [1] "abc"

List elements can be named:

names(c) <- c("first", "last")
c
## $first
## [1] 2 5
## 
## $last
## [1] "abc"

A more complex list, containing a numeric vector, a matrix and a character string.

m <-matrix(1:6, ncol=3, nrow=2)
f <-list(e, m, 'abc')
f
## [[1]]
##  [1]   0  10  20  30  40  50  60  70  80  90 100
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## [[3]]
## [1] "abc"

Data frame

Most of the time we use data frames in R. A data frame looks like a matrix but, unlike a matrix, it can store columns (variables) of different data types. Data frames are what you get when you import tabular files (such as CSV files exported from Excel) via the read.table or read.csv functions. But they can also be created from scratch.

# four vectors
ID <-as.integer(1:4)
name <-c('Ana', 'Rob', 'Liu', 'Veronica')
sex <-as.factor(c('F','M','M','F'))
score <-c(10.2, 9, 13.5, 18)
d <-data.frame(ID, name, sex, score, stringsAsFactors=FALSE)

d
##   ID     name sex score
## 1  1      Ana   F  10.2
## 2  2      Rob   M   9.0
## 3  3      Liu   M  13.5
## 4  4 Veronica   F  18.0
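
To verify that each column really keeps its own data type, one can ask for the class of every column. The sketch below rebuilds the data frame from above so that it runs on its own; col_types is an arbitrary name.

```r
ID    <- as.integer(1:4)
name  <- c('Ana', 'Rob', 'Liu', 'Veronica')
sex   <- as.factor(c('F', 'M', 'M', 'F'))
score <- c(10.2, 9, 13.5, 18)
d <- data.frame(ID, name, sex, score, stringsAsFactors = FALSE)

# apply class() to every column of the data frame
col_types <- sapply(d, class)
col_types

nrow(d)   # number of records
ncol(d)   # number of variables
```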

Indexing your data

Indexing means extracting data from all kinds of different data structures. Brackets [ ] are used for indexing, whereas parentheses ( ) are used to call a function.

Indexing vectors

First, we create a vector. Then we access different elements of the vector.

b <- 15:30
b
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
# get first element
b[1]
## [1] 15
#get fifth element
b[5]
## [1] 19
# get 2nd to 8th element
b[2:8]
## [1] 16 17 18 19 20 21 22
#this produces the same result
b[c(2,3,4,5,6,7,8)]
## [1] 16 17 18 19 20 21 22
# leave out one element (the second)
b[c(1,3:16)]
##  [1] 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30
# or alternatively use the - symbol
b[-2]
##  [1] 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Change values by simply assigning new ones:

b[1] <- 2
b
##  [1]  2 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
b[3:6] <- -99
b
##  [1]   2  16 -99 -99 -99 -99  21  22  23  24  25  26  27  28  29  30

Indexing matrices

Matrices can be accessed like vectors, but since they have two dimensions, both of them need to be considered. This means one needs to specify row i and column j, separated by a comma, like this: [i,j]. To select an entire row use [i,], to select an entire column use [,j].

# create matrix
m <-matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
colnames(m) <-c('a', 'b', 'c')
m
##      a b c
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
# value in 2nd row and 2nd column
m[2,2]
## b 
## 5
# first row, second column
m[1,2]
## b 
## 2
# 2 rows and 2 columns
m[1:2,1:2]
##      a b
## [1,] 1 2
## [2,] 4 5
#entire row
m[1,]
## a b c 
## 1 2 3
# entire column
m[,2]
## [1] 2 5 8

One may also use the column names for subsetting:

m[,'b']
## [1] 2 5 8
# two columns
m[, c('a', 'c')]
##      a c
## [1,] 1 3
## [2,] 4 6
## [3,] 7 9

Indexing lists

Indexing lists may be a little bit tricky, as you can refer either to the elements of the list or to the elements of the data (perhaps a matrix) in one of the list elements. Single brackets [] return a list of length 1, double brackets [[]] return what is inside that list element:

m <-matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
colnames(m) <-c('a', 'b', 'c')
e <-list(list(1:3),c('a', 'b', 'c', 'd'), m)

#structure of the list
str(e)
## List of 3
##  $ :List of 1
##   ..$ : int [1:3] 1 2 3
##  $ : chr [1:4] "a" "b" "c" "d"
##  $ : int [1:3, 1:3] 1 4 7 2 5 8 3 6 9
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "a" "b" "c"
e[2]
## [[1]]
## [1] "a" "b" "c" "d"
e[[2]]
## [1] "a" "b" "c" "d"
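
Once a list element has been extracted with double brackets, it can be indexed further like any other object of its type. A sketch using the same list e (recreated here so that the code runs on its own):

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
colnames(m) <- c('a', 'b', 'c')
e <- list(list(1:3), c('a', 'b', 'c', 'd'), m)

class(e[2])      # "list": single brackets keep the list structure
class(e[[2]])    # "character": double brackets return the content

e[[2]][3]        # third element of the character vector: "c"
e[[3]][2, 'b']   # row 2, column b of the matrix: 5
e[[1]][[1]][2]   # second value of the vector inside the nested list: 2
```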

List elements can have names. Elements may be extracted by their name, either as an index or by using the $ symbol:

names(e) <- c('zzz', 'xyz', 'abc')

e$zzz
## [[1]]
## [1] 1 2 3
e[['xyz']]
## [1] "a" "b" "c" "d"

The $ symbol can also be used to access the columns of data frames.

Indexing data frames

Indexing data frames is similar to accessing lists or matrices. First, we need a data frame, which we create by converting matrix m:

d <- data.frame(m)

class(d)
## [1] "data.frame"

Similar to matrices, columns can be extracted by column number.

d[,2]
## [1] 2 5 8

Column names or column numbers may be used for indexing. However, the $ symbol is the most common form of indexing data frames.

d[,'b']
## [1] 2 5 8
d[['b']]
## [1] 2 5 8
d$b
## [1] 2 5 8

All these return vector data structures. If you want to keep the complexity of the data frame, the following indexing methods can be used:

d['b']
##   b
## 1 2
## 2 5
## 3 8
#or
d[,'b', drop = FALSE]
##   b
## 1 2
## 2 5
## 3 8

This may seem like a small difference. However, it may produce errors if functions expect certain data structures, such as matrices or data frames, and get a vector as input.
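
The difference can be checked directly. In the sketch below, the result without drop = FALSE has no columns at all, which is the kind of situation that makes functions such as ncol return NULL. The names v and df are arbitrary.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
colnames(m) <- c('a', 'b', 'c')
d <- data.frame(m)

v  <- d[, 'b']                 # simplified to a vector
df <- d[, 'b', drop = FALSE]   # stays a data frame with one column

is.vector(v)        # TRUE
is.data.frame(df)   # TRUE
ncol(v)             # NULL: a vector has no columns
ncol(df)            # 1
```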

Indexing by conditions

Sometimes, indexing is performed on values whose position and number are not known in advance. In that case, indexing is based on conditions, for example all values > 15.

#create sequence
x <- 1:100

#use which as condition
i <- which(x > 15)
i
##  [1]  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
## [20]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
## [39]  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
## [58]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
## [77]  92  93  94  95  96  97  98  99 100

One can also use a logical vector for indexing (is the value TRUE or FALSE?):

b <- x > 15
b
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE
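
The logical vector can then be placed inside the brackets to extract the values for which the condition is TRUE. Conditions can also be combined. A minimal sketch (the name sub is arbitrary):

```r
x <- 1:100
b <- x > 15

head(x[b])   # first values for which the condition is TRUE: 16 17 18 19 20 21
sum(b)       # number of TRUE values: 85

# same result as indexing with which()
identical(x[b], x[which(x > 15)])

# combine two conditions with &
sub <- x[x > 15 & x <= 20]
```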

If one wants to find out whether certain values are present in the data, the %in% operator comes in handy.

x <- 1:100
j <- c(200, 10,20,30, 40,50, 101,102)

# return a logical vector
j %in% x
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# returns the position of values in j that are present in x
which(j %in% x)
## [1] 2 3 4 5 6

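With the function match we can detect which values in j and x match. match(j,x) returns, for each element of j, the position of its first match in x (or NA if there is none). The function is asymmetric: match(j,x) does not equal match(x,j)!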

match(j,x)
## [1] NA 10 20 30 40 50 NA NA
match(x,j)
##   [1] NA NA NA NA NA NA NA NA NA  2 NA NA NA NA NA NA NA NA NA  3 NA NA NA NA NA
##  [26] NA NA NA NA  4 NA NA NA NA NA NA NA NA NA  5 NA NA NA NA NA NA NA NA NA  6
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
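
A typical use of match is to look up values in a second vector. The sketch below uses made-up example data (codes, country and obs are hypothetical and not part of the tutorial data) to translate country codes into names.

```r
# hypothetical lookup table: codes and the corresponding names
codes   <- c("CN", "JP", "KR")
country <- c("China", "Japan", "South Korea")

# hypothetical observations, including one unknown code
obs <- c("JP", "CN", "JP", "XX")

pos <- match(obs, codes)   # position of each observation in codes: 2 1 2 NA
looked_up <- country[pos]  # "Japan" "China" "Japan" NA
```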

——-End——-
