2. R (Studio) Basics II

This tutorial aims to introduce basic concepts of R. It does not specifically consider spatial data; however, most of the functionality can be applied to spatial data. The tutorial is inspired by Robert J. Hijmans' Introduction to R (2019).

Building on the very basics from the first tutorial, this one applies those concepts and extends them. Only the "base" package is needed for this tutorial.

Foreword on working directory, data and packages

The data for this tutorial are provided via GitHub. In order to reproduce the code with your own data, replace the URL with your local file path.

If you work locally, you may define the working directory:

setwd("YOUR/FILEPATH") 
# this will not work as written; if you want to use it, define a path on your own computer

This should be the place where your data and other folders are stored. More information can be found here.

Code chunks for saving data and results are disabled in this tutorial (using a #). Please replace "YOUR/FILEPATH" with your own path.

The code is written in such a way that packages will be installed automatically (if not already installed) and loaded. Packages will be stored in the default library location on your computer.
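
A common way to implement this is the install-if-missing pattern, sketched below (the package name raster is just an arbitrary example, not one needed for this tutorial):

if (!require("raster")) {
  # install the package only if it cannot be loaded, then load it
  install.packages("raster")
  library(raster)
}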

Algebra in R

Using algebraic expressions, new vectors and matrices can be computed from existing vectors (and matrices). Let's start with vectors.

Vector algebra

Let’s create two vectors and multiply them with each other. Notice that multiplication (or any mathematical operation) works element by element.

a <- 1:10
b <- 11:20

d <- a*b
# check what we have
a
##  [1]  1  2  3  4  5  6  7  8  9 10
b
##  [1] 11 12 13 14 15 16 17 18 19 20
d
##  [1]  11  24  39  56  75  96 119 144 171 200
# now multiply all elements in a vector with one value
a*3
##  [1]  3  6  9 12 15 18 21 24 27 30

In the examples above, both vectors were equally long, or one of them had length 1. What happens when they are not?

a+c(1,10)
##  [1]  2 12  4 14  6 16  8 18 10 20

The shorter vector is "recycled": it is repeated until it matches the length of the longer one. Be cautious, as this can be a source of errors.
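
If the longer vector's length is not a multiple of the shorter one's, R still recycles but issues a warning. A minimal sketch (the exact warning text may differ between R versions):

a + c(1, 10, 100)
# the values 1, 10, 100 are recycled over all ten elements of a,
# and R warns that the longer object length is not a multiple of the shorter one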

Logical comparisons

Remember that == is used as a test for equality.

a
##  [1]  1  2  3  4  5  6  7  8  9 10
# test if elements in a equal 2
a == 2
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# test which elements in a are larger than 2
f <- a > 2
f
##  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

The & symbol is the equivalent of the Boolean AND, and | stands for the Boolean OR.

a 
##  [1]  1  2  3  4  5  6  7  8  9 10
b
##  [1] 11 12 13 14 15 16 17 18 19 20
b > 6 & b < 8
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# now combine a and b
b > 9 | a < 2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

“Less than or equal” is <=, and “more than or equal” is >=.

b >= 9
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
a <= 2
##  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
b >= 9 | a <=2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
b >= 9 & a <= 2
##  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
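
Logical vectors like these are frequently used to subset other vectors. A small sketch:

# keep only the elements of a that are greater than 2
a[a > 2]
# TRUE counts as 1, so sum() tells us how many elements satisfy the condition
sum(a > 2)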

Basic mathematical functions

Vectorized mathematical functions, such as the square root or the exponential function, can be applied to a whole vector at once:

sqrt(a)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278
exp(a)
##  [1]     2.718282     7.389056    20.085537    54.598150   148.413159
##  [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795

The following functions do not return vectors but a single value. They form the basis of descriptive statistics. If in doubt about what a function calculates, use the help files (e.g. ?mean) to find out more.

min(a)
## [1] 1
max(a)
## [1] 10
range(a)
## [1]  1 10
mean(a)
## [1] 5.5
median(a)
## [1] 5.5
prod(a)
## [1] 3628800
sd(a)
## [1] 3.02765
# a handy function that summarizes several of the above statistics
summary(a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00
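
Two related functions, which.min and which.max, return the position of the smallest or largest value rather than the value itself. A small sketch:

which.min(a) # position of the smallest element
which.max(a) # position of the largest element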

Creating random numbers

In data analysis it is quite common to create a vector of random numbers, for example to demonstrate how a procedure works. To get 10 random numbers from a uniform distribution between 0 and 1 use:

r <- runif(10)

For normally distributed random numbers use rnorm:

r <- rnorm(10, mean = 10, sd = 2)
# if mean and sd are not specified, the default values of 0 and 1 are used
r
##  [1] 10.122100 13.313314 11.497967 12.423787  7.451405  9.002163  9.435267
##  [8]  6.912408 10.284431 11.685427
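
If random whole numbers are needed instead, sample() draws values from a given set. A small sketch:

# ten simulated rolls of a die; replace = TRUE allows a value to occur more than once
sample(1:6, size = 10, replace = TRUE)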

The "random" numbers are not truly random. Every time you run the code the numbers will be different; they are, however, pseudo-randomly generated numbers that approximate a sequence of truly random numbers. To learn more, click this link. In order to exactly reproduce examples or a data analysis, for example while debugging, it is often necessary to produce exactly the same "random" sample every time the code is run. More info can be found here. The set.seed function initializes the random number generator at a specific point.

#not the same
runif(3)
## [1] 0.6673927 0.8886813 0.4988101
runif(3)
## [1] 0.2611575 0.4209163 0.9136672
# same sequence (the seed 45 was chosen arbitrarily)
set.seed(45) 
runif(3)
## [1] 0.6333728 0.3175366 0.2409218
set.seed(45)
runif(3)
## [1] 0.6333728 0.3175366 0.2409218

Matrix algebra

Computation with a matrix works similarly to vectors: operations are applied element by element.

# set up matrix
m <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
#multiply by two
m*2
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    8   10   12
#square the values
m^2
##      [,1] [,2] [,3]
## [1,]    1    4    9
## [2,]   16   25   36

Mathematical operations may also be performed between a matrix and a vector. Note that shorter vectors are recycled, column by column, because R stores matrices in column-major order! Multiplying two matrices element by element is also possible.

m *1:2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    8   10   12
m*m
##      [,1] [,2] [,3]
## [1,]    1    4    9
## [2,]   16   25   36
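
Note that * multiplies matrices element by element. True matrix multiplication uses the %*% operator, as sketched here:

# matrix multiplication: the 2x3 matrix m times its 3x2 transpose yields a 2x2 matrix
m %*% t(m)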

Reading and writing files

Up to now we have created vectors, matrices, data frames and lists from scratch. In data analysis it is more common to read existing data files. Depending on the complexity of the file format, reading files can be more or less complicated. Reading spreadsheet-like data structures, however, is quite straightforward (spreadsheet-like data in our case means .csv and .txt formats). A few details still need to be considered. First we check what the working directory is, so we do not have to specify the full path every time. Notice that in R file paths are separated by forward slashes /, whereas Windows usually displays backslashes.

getwd()
# this will show you a filepath on your computer
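
As a small illustration (the paths shown are made up), forward slashes or doubled backslashes both work on Windows, while single backslashes do not:

# setwd("C:/Users/me/projects")     # forward slashes work on all systems
# setwd("C:\\Users\\me\\projects")  # backslashes must be doubled in R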

Now we will create a data frame and save it to the working directory as csv (comma separated values) and as txt (text) file.

# create data frame with three named columns
d <- data.frame(id=1:10, name = letters[1:10], value=seq(10,28,2))
d
##    id name value
## 1   1    a    10
## 2   2    b    12
## 3   3    c    14
## 4   4    d    16
## 5   5    e    18
## 6   6    f    20
## 7   7    g    22
## 8   8    h    24
## 9   9    i    26
## 10 10    j    28

If not specified otherwise, the file is saved to the location of the working directory. In order to write the file we need to at least provide the data frame and the name under which we want to save the file.

write.csv(d, "test.csv", row.names = FALSE)

#and the same for the .txt
write.table(d, "test.txt", row.names = FALSE)

If you want to save the file to a different location, provide the filepath before the file name: write.csv(d, "YOUR/FILEPATH/FILE.csv", row.names = FALSE)

Now read the files. Normally you would read the file from your local working directory like this: d2 <- read.csv("test.csv", stringsAsFactors = FALSE). In our case we load it from a publicly available Github repository.

urlfile <- "https://raw.githubusercontent.com/RafHo/teaching/master/angewandte_geodatenverarbeitung/datasource/test.csv" # get url from github repository

d2 <- read.csv(url(urlfile), stringsAsFactors = FALSE)
head(d2)
##   id name value
## 1  1    a    10
## 2  2    b    12
## 3  3    c    14
## 4  4    d    16
## 5  5    e    18
## 6  6    f    20
#now the txt d2 is overwritten
#d2 <- read.table("test.txt", stringsAsFactors = FALSE)
urlfile2 <- "https://raw.githubusercontent.com/RafHo/teaching/master/angewandte_geodatenverarbeitung/datasource/test.txt" # get url from github repository
d2 <- read.table(url(urlfile2), stringsAsFactors = FALSE)
head(d2)
##   V1   V2    V3
## 1 id name value
## 2  1    a    10
## 3  2    b    12
## 4  3    c    14
## 5  4    d    16
## 6  5    e    18

d2 has been overwritten. Notice how, the second time, the column names were not automatically recognized (read.table does not treat the first row as a header by default). We specify them:

d2 <- read.table(url(urlfile2), header = TRUE, stringsAsFactors = FALSE)
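
A quick way to check whether a file was read as intended is str(), which compactly displays the column names, types and first values:

str(d2)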

read.table can also read csv files if the separator is specified:

d2 <- read.table(url(urlfile), sep = ',', header = TRUE, stringsAsFactors = FALSE)

In RStudio you can also navigate manually to the files you want to read. This may be helpful in the beginning, since there is a preview of how the data will be loaded into R and the corresponding code is also provided. To do so, use the Import Dataset button in the Environment pane. Additional help on reading files can be found on the help pages (e.g. ?read.table).

Exploring data

After we've read data into R, we finally get to do some analysis. This is where the fun starts :) Exploratory data analysis, i.e. inspecting the data, is a very important step to gain an understanding of your data. It is important to understand your data before you start specific analyses; otherwise errors are very likely. During exploratory data analysis, errors within the data can be corrected. Here lies one of the strengths of R: the corrections take place within the software, so the original data are not changed. Additionally, all changes are reproducible via the code.

Summarizing the Data set

First, we create the data frame d:

d <-data.frame(id=1:10,name=c('Bob', 'Bobby', '???', 'Bob', 'Bab', 'Jim', 'Jim', 'jim', '', 'Jim'),
               score1=c(8, 10, 7, 9, 2, 5, 1, 6, 3, 4),
               score2=c(3,4,5,-999,5,5,-999,2,3,4), stringsAsFactors=FALSE)
d
##    id  name score1 score2
## 1   1   Bob      8      3
## 2   2 Bobby     10      4
## 3   3   ???      7      5
## 4   4   Bob      9   -999
## 5   5   Bab      2      5
## 6   6   Jim      5      5
## 7   7   Jim      1   -999
## 8   8   jim      6      2
## 9   9            3      3
## 10 10   Jim      4      4

Since d only has 10 rows, it is easy to get an overview of the data. In many cases we deal with data frames of hundreds or thousands of rows and/or many columns. That makes it increasingly hard to spot errors or get an overview of the data. Here the function summary comes in handy:

summary(d)
##        id            name               score1          score2       
##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :-999.00  
##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:   2.25  
##  Median : 5.50   Mode  :character   Median : 5.50   Median :   3.50  
##  Mean   : 5.50                      Mean   : 5.50   Mean   :-196.70  
##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:   4.75  
##  Max.   :10.00                      Max.   :10.00   Max.   :   5.00

Each column is summarized. In the variable score2, the minimum value is -999. Such values are often used to indicate missing data. In R, however, -999 is treated as a regular number. To avoid errors, we change these values to NA. Compare how the summary values for score2 change before and after the replacement.

summary(d)
##        id            name               score1          score2       
##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :-999.00  
##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:   2.25  
##  Median : 5.50   Mode  :character   Median : 5.50   Median :   3.50  
##  Mean   : 5.50                      Mean   : 5.50   Mean   :-196.70  
##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:   4.75  
##  Max.   :10.00                      Max.   :10.00   Max.   :   5.00
# identify the values in score2 that are -999
i <- d$score2 == -999
#change -999 to NAs
d$score2[i] <- NA
# Those two steps are normally done in a single line: d$score2[d$score2 == -999] <- NA

#check difference of mean, median etc.
summary(d)
##        id            name               score1          score2     
##  Min.   : 1.00   Length:10          Min.   : 1.00   Min.   :2.000  
##  1st Qu.: 3.25   Class :character   1st Qu.: 3.25   1st Qu.:3.000  
##  Median : 5.50   Mode  :character   Median : 5.50   Median :4.000  
##  Mean   : 5.50                      Mean   : 5.50   Mean   :3.875  
##  3rd Qu.: 7.75                      3rd Qu.: 7.75   3rd Qu.:5.000  
##  Max.   :10.00                      Max.   :10.00   Max.   :5.000  
##                                                     NA's   :2

For character and integer values it may be helpful to use unique and table:

unique(d$name)
## [1] "Bob"   "Bobby" "???"   "Bab"   "Jim"   "jim"   ""
table(d$name)
## 
##         ???   Bab   Bob Bobby   jim   Jim 
##     1     1     1     2     1     1     3

Often, names are spelled with small variations. Let's correct some of the names here to Bob.

d$name[d$name %in% c('Bab', 'Bobby')] <- 'Bob'
table(d$name)
## 
##     ??? Bob jim Jim 
##   1   1   4   1   3

jim should be Jim. It is easy enough to replace it as done above, but what if there were many cases like that? It would be easy to make all character values lower- or uppercase with d$name <- toupper(d$name), but here only the first letter should be uppercase.

# get the first letters
first <-substr(d$name, 1, 1)
# get the remainder
remainder <-substr(d$name, 2,nchar(d$name))
# assure that the first letter is upper case
first <-toupper(first)
#combine them
name <-paste0(first, remainder)

# assign back to the variable
d$name <- name

table(d$name)
## 
##     ??? Bob Jim 
##   1   1   4   4
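
If this has to be done repeatedly, the steps above can be wrapped in a small helper function. A sketch (the function name capitalize_first is made up here):

capitalize_first <- function(x) {
  # uppercase the first letter and paste the remainder back on
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}
capitalize_first(c("jim", "bob"))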

The ??? should also be changed to NAs.

d$name[d$name == '???'] <-NA
table(d$name)
## 
##     Bob Jim 
##   1   4   4

By default, NAs are not counted. If we want to count them, we adjust the table function:

table(d$name, useNA = 'ifany')
## 
##       Bob  Jim <NA> 
##    1    4    4    1

There is also an empty value which should be replaced:

d$name [9]
## [1] ""
# use the '' with nothing in between for empty values
d$name[d$name == ''] <- NA

table(d$name, useNA = 'ifany')
## 
##  Bob  Jim <NA> 
##    4    4    2
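
The function is.na() returns a logical vector that can be used to locate or count missing values, as in this small sketch:

is.na(d$name)      # TRUE where a name is missing
sum(is.na(d$name)) # number of missing names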

Summarizing statistics

Some functions are useful for getting an understanding of the value range or the distribution of the data, namely quantile, range and mean:

quantile(d$score1)
##    0%   25%   50%   75%  100% 
##  1.00  3.25  5.50  7.75 10.00
range(d$score1)
## [1]  1 10
mean(d$score1)
## [1] 5.5

Some functions require you to specify how to treat NAs. If you fail to do so, a single NA in the data will produce NA as the result. Keep this in mind when a computation unexpectedly returns NA.

#quantile(d$score2)
# produces an error
#do not include NAs in computation
quantile(d$score2, na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##    2    3    4    5    5
range(d$score2, na.rm = TRUE)
## [1] 2 5
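
The same applies to mean(). A minimal sketch of the difference:

mean(d$score2)               # returns NA because score2 contains missing values
mean(d$score2, na.rm = TRUE) # NAs are dropped before the mean is computed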

During data exploration, plots are very useful to summarize data. Some very basic plots are shown below:

par(mfrow=c(2,2))
plot(d$score1, d$score2)
boxplot(d[, c('score1','score2')])
hist(d$score2)

Outlook

After a first exploration of the data, we often need to tidy the data in order to conduct further analysis. For this we will use the packages from the so-called tidyverse. These packages provide powerful tools for data analysis and will be discussed later.
