The following code is required for proper rendering of the document. Please do not modify it. When working in the Rmd version of this file, do not attempt to run this chunk.

knitr::opts_chunk$set(error = TRUE)

Module 3: Advanced R and basic statistics

Last updated on 2023-Feb-03.

Original text: Chad M. Eliason, PhD

Revisions: Nick M. A. Crouch, PhD; Lucas J. Legendre, PhD; and Carlos A. Rodriguez-Saltos, PhD

Exercises: Lucas J. Legendre, PhD

Rmarkdown implementation: Carlos A. Rodriguez-Saltos, PhD

Principal course instructor: Julia A. Clarke, PhD

These modules are part of the course “Curiosity to Question: Research Design, Data Analysis and Visualization”, taught by Dr. Julia A. Clarke and Dr. Adam Papendieck at UT Austin.

For questions or comments, please send an email to Dr. Clarke ().

How to cite

Eliason, C. M., Proffitt, J. V., Crouch, N. M. A., Legendre, L. J., Rodriguez-Saltos, C. A., Papendieck, A., & Clarke, J. A. (2020). The Clarke Lab’s Introductory Course to R. Retrieved from https://juliaclarke.squarespace.com

Manipulating data

For this section we will work with the mtcars dataset, which comes preloaded with R.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Sorting

Let’s sort the dataset by mpg and cyl. First we plot mpg in its original row order; after sorting, we will plot it again to see the effect.

plot(mtcars$mpg)

newdata <- mtcars[order(mtcars$mpg, mtcars$cyl), ]

order(mtcars$mpg, mtcars$cyl)
##  [1] 15 16 24  7 17 31 14 23 22 29 12 13 11  6  5 10 25 30  1  2 32  4 21  3  9
## [26]  8 27 26 19 28 18 20
plot(newdata$mpg)

We can sort by mpg and cyl in ascending and descending order, respectively.

newdata <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

plot(newdata$cyl)

head(newdata)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8

Merging

Let’s say you have two datasets and want to merge them based on a common ID variable.

df1 <- data.frame(id=c(1:6), prod=c(rep("toaster", 3), rep("radio", 3)))
df1
##   id    prod
## 1  1 toaster
## 2  2 toaster
## 3  3 toaster
## 4  4   radio
## 5  5   radio
## 6  6   radio
df2 <- data.frame(id = c(2, 4, 6), state = c(rep('Alabama', 2), 'Ohio'))
df2
##   id   state
## 1  2 Alabama
## 2  4 Alabama
## 3  6    Ohio

We can merge them as follows:

merge(df1, df2, by="id")
##   id    prod   state
## 1  2 toaster Alabama
## 2  4   radio Alabama
## 3  6   radio    Ohio
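
By default, merge keeps only the rows whose id appears in both data frames (an inner join). If you also want to keep unmatched rows, filling the missing values with NA, you can use the all arguments; a minimal sketch:

# Keep every row of df1; ids without a match in df2 get NA for state
merge(df1, df2, by = "id", all.x = TRUE)
# Keep all rows from both data frames
merge(df1, df2, by = "id", all = TRUE)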

Adding observations

Use rbind.

rbind(df1, df1)
##    id    prod
## 1   1 toaster
## 2   2 toaster
## 3   3 toaster
## 4   4   radio
## 5   5   radio
## 6   6   radio
## 7   1 toaster
## 8   2 toaster
## 9   3 toaster
## 10  4   radio
## 11  5   radio
## 12  6   radio

rbind, however, won’t work if your column names differ.

rbind(df1, df2)
## Error in match.names(clabs, names(xi)): names do not match previous names
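
If you need to stack data frames whose columns only partially overlap, one option (assuming the dplyr package, used later in this module, is installed) is bind_rows, which fills the missing columns with NA:

library(dplyr)
# Columns absent from one data frame are filled with NA
bind_rows(df1, df2)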

Tabulating your data

table returns a table with the number of occurrences of each level in a factor.

mtcars$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

You can use it for more than one column at the same time. The output is called a contingency table.

table(mtcars$cyl, mtcars$am)
##    
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2

To show the names of the columns in the table, use the following code:

xtabs(~ cyl + am, data = mtcars)
##    am
## cyl  0  1
##   4  3  8
##   6  4  3
##   8 12  2

You can pass the output of xtabs to plot and summary.

xt <- xtabs(~cyl + am, data = mtcars)
xt
##    am
## cyl  0  1
##   4  3  8
##   6  4  3
##   8 12  2
plot(xt)

summary(xt)
## Call: xtabs(formula = ~cyl + am, data = mtcars)
## Number of cases in table: 32 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 8.741, df = 2, p-value = 0.01265
##  Chi-squared approximation may be incorrect

Looping over rows or columns

We often want to apply a function to each row or column of a data frame.

Let’s try some examples on the iris dataset.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To get row-wise averages:

iris[, 1:4]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1            5.1         3.5          1.4         0.2
## 2            4.9         3.0          1.4         0.2
## 3            4.7         3.2          1.3         0.2
## 4            4.6         3.1          1.5         0.2
## 5            5.0         3.6          1.4         0.2
## 6            5.4         3.9          1.7         0.4
## 7            4.6         3.4          1.4         0.3
## 8            5.0         3.4          1.5         0.2
## 9            4.4         2.9          1.4         0.2
## 10           4.9         3.1          1.5         0.1
## 11           5.4         3.7          1.5         0.2
## 12           4.8         3.4          1.6         0.2
## 13           4.8         3.0          1.4         0.1
## 14           4.3         3.0          1.1         0.1
## 15           5.8         4.0          1.2         0.2
## 16           5.7         4.4          1.5         0.4
## 17           5.4         3.9          1.3         0.4
## 18           5.1         3.5          1.4         0.3
## 19           5.7         3.8          1.7         0.3
## 20           5.1         3.8          1.5         0.3
## 21           5.4         3.4          1.7         0.2
## 22           5.1         3.7          1.5         0.4
## 23           4.6         3.6          1.0         0.2
## 24           5.1         3.3          1.7         0.5
## 25           4.8         3.4          1.9         0.2
## 26           5.0         3.0          1.6         0.2
## 27           5.0         3.4          1.6         0.4
## 28           5.2         3.5          1.5         0.2
## 29           5.2         3.4          1.4         0.2
## 30           4.7         3.2          1.6         0.2
## 31           4.8         3.1          1.6         0.2
## 32           5.4         3.4          1.5         0.4
## 33           5.2         4.1          1.5         0.1
## 34           5.5         4.2          1.4         0.2
## 35           4.9         3.1          1.5         0.2
## 36           5.0         3.2          1.2         0.2
## 37           5.5         3.5          1.3         0.2
## 38           4.9         3.6          1.4         0.1
## 39           4.4         3.0          1.3         0.2
## 40           5.1         3.4          1.5         0.2
## 41           5.0         3.5          1.3         0.3
## 42           4.5         2.3          1.3         0.3
## 43           4.4         3.2          1.3         0.2
## 44           5.0         3.5          1.6         0.6
## 45           5.1         3.8          1.9         0.4
## 46           4.8         3.0          1.4         0.3
## 47           5.1         3.8          1.6         0.2
## 48           4.6         3.2          1.4         0.2
## 49           5.3         3.7          1.5         0.2
## 50           5.0         3.3          1.4         0.2
## 51           7.0         3.2          4.7         1.4
## 52           6.4         3.2          4.5         1.5
## 53           6.9         3.1          4.9         1.5
## 54           5.5         2.3          4.0         1.3
## 55           6.5         2.8          4.6         1.5
## 56           5.7         2.8          4.5         1.3
## 57           6.3         3.3          4.7         1.6
## 58           4.9         2.4          3.3         1.0
## 59           6.6         2.9          4.6         1.3
## 60           5.2         2.7          3.9         1.4
## 61           5.0         2.0          3.5         1.0
## 62           5.9         3.0          4.2         1.5
## 63           6.0         2.2          4.0         1.0
## 64           6.1         2.9          4.7         1.4
## 65           5.6         2.9          3.6         1.3
## 66           6.7         3.1          4.4         1.4
## 67           5.6         3.0          4.5         1.5
## 68           5.8         2.7          4.1         1.0
## 69           6.2         2.2          4.5         1.5
## 70           5.6         2.5          3.9         1.1
## 71           5.9         3.2          4.8         1.8
## 72           6.1         2.8          4.0         1.3
## 73           6.3         2.5          4.9         1.5
## 74           6.1         2.8          4.7         1.2
## 75           6.4         2.9          4.3         1.3
## 76           6.6         3.0          4.4         1.4
## 77           6.8         2.8          4.8         1.4
## 78           6.7         3.0          5.0         1.7
## 79           6.0         2.9          4.5         1.5
## 80           5.7         2.6          3.5         1.0
## 81           5.5         2.4          3.8         1.1
## 82           5.5         2.4          3.7         1.0
## 83           5.8         2.7          3.9         1.2
## 84           6.0         2.7          5.1         1.6
## 85           5.4         3.0          4.5         1.5
## 86           6.0         3.4          4.5         1.6
## 87           6.7         3.1          4.7         1.5
## 88           6.3         2.3          4.4         1.3
## 89           5.6         3.0          4.1         1.3
## 90           5.5         2.5          4.0         1.3
## 91           5.5         2.6          4.4         1.2
## 92           6.1         3.0          4.6         1.4
## 93           5.8         2.6          4.0         1.2
## 94           5.0         2.3          3.3         1.0
## 95           5.6         2.7          4.2         1.3
## 96           5.7         3.0          4.2         1.2
## 97           5.7         2.9          4.2         1.3
## 98           6.2         2.9          4.3         1.3
## 99           5.1         2.5          3.0         1.1
## 100          5.7         2.8          4.1         1.3
## 101          6.3         3.3          6.0         2.5
## 102          5.8         2.7          5.1         1.9
## 103          7.1         3.0          5.9         2.1
## 104          6.3         2.9          5.6         1.8
## 105          6.5         3.0          5.8         2.2
## 106          7.6         3.0          6.6         2.1
## 107          4.9         2.5          4.5         1.7
## 108          7.3         2.9          6.3         1.8
## 109          6.7         2.5          5.8         1.8
## 110          7.2         3.6          6.1         2.5
## 111          6.5         3.2          5.1         2.0
## 112          6.4         2.7          5.3         1.9
## 113          6.8         3.0          5.5         2.1
## 114          5.7         2.5          5.0         2.0
## 115          5.8         2.8          5.1         2.4
## 116          6.4         3.2          5.3         2.3
## 117          6.5         3.0          5.5         1.8
## 118          7.7         3.8          6.7         2.2
## 119          7.7         2.6          6.9         2.3
## 120          6.0         2.2          5.0         1.5
## 121          6.9         3.2          5.7         2.3
## 122          5.6         2.8          4.9         2.0
## 123          7.7         2.8          6.7         2.0
## 124          6.3         2.7          4.9         1.8
## 125          6.7         3.3          5.7         2.1
## 126          7.2         3.2          6.0         1.8
## 127          6.2         2.8          4.8         1.8
## 128          6.1         3.0          4.9         1.8
## 129          6.4         2.8          5.6         2.1
## 130          7.2         3.0          5.8         1.6
## 131          7.4         2.8          6.1         1.9
## 132          7.9         3.8          6.4         2.0
## 133          6.4         2.8          5.6         2.2
## 134          6.3         2.8          5.1         1.5
## 135          6.1         2.6          5.6         1.4
## 136          7.7         3.0          6.1         2.3
## 137          6.3         3.4          5.6         2.4
## 138          6.4         3.1          5.5         1.8
## 139          6.0         3.0          4.8         1.8
## 140          6.9         3.1          5.4         2.1
## 141          6.7         3.1          5.6         2.4
## 142          6.9         3.1          5.1         2.3
## 143          5.8         2.7          5.1         1.9
## 144          6.8         3.2          5.9         2.3
## 145          6.7         3.3          5.7         2.5
## 146          6.7         3.0          5.2         2.3
## 147          6.3         2.5          5.0         1.9
## 148          6.5         3.0          5.2         2.0
## 149          6.2         3.4          5.4         2.3
## 150          5.9         3.0          5.1         1.8
apply(iris[, 1:4], MARGIN = 1, FUN = mean)
##   [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
##  [13] 2.325 2.125 2.800 3.000 2.750 2.575 2.875 2.675 2.675 2.675 2.350 2.650
##  [25] 2.575 2.450 2.600 2.600 2.550 2.425 2.425 2.675 2.725 2.825 2.425 2.400
##  [37] 2.625 2.500 2.225 2.550 2.525 2.100 2.275 2.675 2.800 2.375 2.675 2.350
##  [49] 2.675 2.475 4.075 3.900 4.100 3.275 3.850 3.575 3.975 2.900 3.850 3.300
##  [61] 2.875 3.650 3.300 3.775 3.350 3.900 3.650 3.400 3.600 3.275 3.925 3.550
##  [73] 3.800 3.700 3.725 3.850 3.950 4.100 3.725 3.200 3.200 3.150 3.400 3.850
##  [85] 3.600 3.875 4.000 3.575 3.500 3.325 3.425 3.775 3.400 2.900 3.450 3.525
##  [97] 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.150 4.375 4.825 3.400 4.575
## [109] 4.200 4.850 4.200 4.075 4.350 3.800 4.025 4.300 4.200 5.100 4.875 3.675
## [121] 4.525 3.825 4.800 3.925 4.450 4.550 3.900 3.950 4.225 4.400 4.550 5.025
## [133] 4.250 3.925 3.925 4.775 4.425 4.200 3.900 4.375 4.450 4.350 3.875 4.550
## [145] 4.550 4.300 3.925 4.175 4.325 3.950

To get column-wise averages:

apply(iris[, 1:4], MARGIN = 2, FUN = mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Standard deviation for each column:

apply(iris[, 1:4], MARGIN = 2, FUN = sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

When you want to apply a function to each level in a factor:

# Mean of sepal length for each species
tapply(iris$Sepal.Length, iris$Species, FUN = mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588

The aggregate function produces similar output, but with a nicer format. It splits the dataset into subsets, according to the levels of the factors, and computes summary stats for each subset.

aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588

You can use it for all variables at once.

aggregate(. ~ Species, data = iris, FUN = mean)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

Or just some of the variables:

aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris, FUN = mean)
##      Species Sepal.Length Sepal.Width
## 1     setosa        5.006       3.428
## 2 versicolor        5.936       2.770
## 3  virginica        6.588       2.974

If, in general, you are interested in applying a function to each element of an object, check the functions sapply and lapply. They can be used on the elements of a vector or a list (the latter is a special type of vector that can include elements of different types).
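
As a minimal sketch of the difference between the two: lapply always returns a list, while sapply simplifies the result to a vector or matrix when possible.

mylist <- list(a = 1:3, b = 4:10)
lapply(mylist, FUN = mean) # returns a list
sapply(mylist, FUN = mean) # returns a named numeric vector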

The package dplyr can also help to apply functions to each column of a data frame. It includes the function summarise, which makes it easy to calculate several statistics at once to summarize our dataset. We will use the function summarise in conjunction with the “pipe” operator (%>%). The latter feeds the output of a function to the input of another function (think of it as water flowing through several segments of a pipe).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris %>% 
  group_by(Species) %>% 
  summarise(
    sepal.mean = mean(Sepal.Length), 
    sepal.sd = sd(Sepal.Length)
  )
## # A tibble: 3 × 3
##   Species    sepal.mean sepal.sd
##   <fct>           <dbl>    <dbl>
## 1 setosa           5.01    0.352
## 2 versicolor       5.94    0.516
## 3 virginica        6.59    0.636

For more information, check the help files of the above functions.

?group_by
?summarise

The function mutate, also from dplyr, allows you to add a variable based on a calculation performed on other variables in the dataset.

mutate(iris, myvar = Sepal.Length * Sepal.Width)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species myvar
## 1            5.1         3.5          1.4         0.2     setosa 17.85
## 2            4.9         3.0          1.4         0.2     setosa 14.70
## 3            4.7         3.2          1.3         0.2     setosa 15.04
## 4            4.6         3.1          1.5         0.2     setosa 14.26
## 5            5.0         3.6          1.4         0.2     setosa 18.00
## 6            5.4         3.9          1.7         0.4     setosa 21.06
## 7            4.6         3.4          1.4         0.3     setosa 15.64
## 8            5.0         3.4          1.5         0.2     setosa 17.00
## 9            4.4         2.9          1.4         0.2     setosa 12.76
## 10           4.9         3.1          1.5         0.1     setosa 15.19
## 11           5.4         3.7          1.5         0.2     setosa 19.98
## 12           4.8         3.4          1.6         0.2     setosa 16.32
## 13           4.8         3.0          1.4         0.1     setosa 14.40
## 14           4.3         3.0          1.1         0.1     setosa 12.90
## 15           5.8         4.0          1.2         0.2     setosa 23.20
## 16           5.7         4.4          1.5         0.4     setosa 25.08
## 17           5.4         3.9          1.3         0.4     setosa 21.06
## 18           5.1         3.5          1.4         0.3     setosa 17.85
## 19           5.7         3.8          1.7         0.3     setosa 21.66
## 20           5.1         3.8          1.5         0.3     setosa 19.38
## 21           5.4         3.4          1.7         0.2     setosa 18.36
## 22           5.1         3.7          1.5         0.4     setosa 18.87
## 23           4.6         3.6          1.0         0.2     setosa 16.56
## 24           5.1         3.3          1.7         0.5     setosa 16.83
## 25           4.8         3.4          1.9         0.2     setosa 16.32
## 26           5.0         3.0          1.6         0.2     setosa 15.00
## 27           5.0         3.4          1.6         0.4     setosa 17.00
## 28           5.2         3.5          1.5         0.2     setosa 18.20
## 29           5.2         3.4          1.4         0.2     setosa 17.68
## 30           4.7         3.2          1.6         0.2     setosa 15.04
## 31           4.8         3.1          1.6         0.2     setosa 14.88
## 32           5.4         3.4          1.5         0.4     setosa 18.36
## 33           5.2         4.1          1.5         0.1     setosa 21.32
## 34           5.5         4.2          1.4         0.2     setosa 23.10
## 35           4.9         3.1          1.5         0.2     setosa 15.19
## 36           5.0         3.2          1.2         0.2     setosa 16.00
## 37           5.5         3.5          1.3         0.2     setosa 19.25
## 38           4.9         3.6          1.4         0.1     setosa 17.64
## 39           4.4         3.0          1.3         0.2     setosa 13.20
## 40           5.1         3.4          1.5         0.2     setosa 17.34
## 41           5.0         3.5          1.3         0.3     setosa 17.50
## 42           4.5         2.3          1.3         0.3     setosa 10.35
## 43           4.4         3.2          1.3         0.2     setosa 14.08
## 44           5.0         3.5          1.6         0.6     setosa 17.50
## 45           5.1         3.8          1.9         0.4     setosa 19.38
## 46           4.8         3.0          1.4         0.3     setosa 14.40
## 47           5.1         3.8          1.6         0.2     setosa 19.38
## 48           4.6         3.2          1.4         0.2     setosa 14.72
## 49           5.3         3.7          1.5         0.2     setosa 19.61
## 50           5.0         3.3          1.4         0.2     setosa 16.50
## 51           7.0         3.2          4.7         1.4 versicolor 22.40
## 52           6.4         3.2          4.5         1.5 versicolor 20.48
## 53           6.9         3.1          4.9         1.5 versicolor 21.39
## 54           5.5         2.3          4.0         1.3 versicolor 12.65
## 55           6.5         2.8          4.6         1.5 versicolor 18.20
## 56           5.7         2.8          4.5         1.3 versicolor 15.96
## 57           6.3         3.3          4.7         1.6 versicolor 20.79
## 58           4.9         2.4          3.3         1.0 versicolor 11.76
## 59           6.6         2.9          4.6         1.3 versicolor 19.14
## 60           5.2         2.7          3.9         1.4 versicolor 14.04
## 61           5.0         2.0          3.5         1.0 versicolor 10.00
## 62           5.9         3.0          4.2         1.5 versicolor 17.70
## 63           6.0         2.2          4.0         1.0 versicolor 13.20
## 64           6.1         2.9          4.7         1.4 versicolor 17.69
## 65           5.6         2.9          3.6         1.3 versicolor 16.24
## 66           6.7         3.1          4.4         1.4 versicolor 20.77
## 67           5.6         3.0          4.5         1.5 versicolor 16.80
## 68           5.8         2.7          4.1         1.0 versicolor 15.66
## 69           6.2         2.2          4.5         1.5 versicolor 13.64
## 70           5.6         2.5          3.9         1.1 versicolor 14.00
## 71           5.9         3.2          4.8         1.8 versicolor 18.88
## 72           6.1         2.8          4.0         1.3 versicolor 17.08
## 73           6.3         2.5          4.9         1.5 versicolor 15.75
## 74           6.1         2.8          4.7         1.2 versicolor 17.08
## 75           6.4         2.9          4.3         1.3 versicolor 18.56
## 76           6.6         3.0          4.4         1.4 versicolor 19.80
## 77           6.8         2.8          4.8         1.4 versicolor 19.04
## 78           6.7         3.0          5.0         1.7 versicolor 20.10
## 79           6.0         2.9          4.5         1.5 versicolor 17.40
## 80           5.7         2.6          3.5         1.0 versicolor 14.82
## 81           5.5         2.4          3.8         1.1 versicolor 13.20
## 82           5.5         2.4          3.7         1.0 versicolor 13.20
## 83           5.8         2.7          3.9         1.2 versicolor 15.66
## 84           6.0         2.7          5.1         1.6 versicolor 16.20
## 85           5.4         3.0          4.5         1.5 versicolor 16.20
## 86           6.0         3.4          4.5         1.6 versicolor 20.40
## 87           6.7         3.1          4.7         1.5 versicolor 20.77
## 88           6.3         2.3          4.4         1.3 versicolor 14.49
## 89           5.6         3.0          4.1         1.3 versicolor 16.80
## 90           5.5         2.5          4.0         1.3 versicolor 13.75
## 91           5.5         2.6          4.4         1.2 versicolor 14.30
## 92           6.1         3.0          4.6         1.4 versicolor 18.30
## 93           5.8         2.6          4.0         1.2 versicolor 15.08
## 94           5.0         2.3          3.3         1.0 versicolor 11.50
## 95           5.6         2.7          4.2         1.3 versicolor 15.12
## 96           5.7         3.0          4.2         1.2 versicolor 17.10
## 97           5.7         2.9          4.2         1.3 versicolor 16.53
## 98           6.2         2.9          4.3         1.3 versicolor 17.98
## 99           5.1         2.5          3.0         1.1 versicolor 12.75
## 100          5.7         2.8          4.1         1.3 versicolor 15.96
## 101          6.3         3.3          6.0         2.5  virginica 20.79
## 102          5.8         2.7          5.1         1.9  virginica 15.66
## 103          7.1         3.0          5.9         2.1  virginica 21.30
## 104          6.3         2.9          5.6         1.8  virginica 18.27
## 105          6.5         3.0          5.8         2.2  virginica 19.50
## 106          7.6         3.0          6.6         2.1  virginica 22.80
## 107          4.9         2.5          4.5         1.7  virginica 12.25
## 108          7.3         2.9          6.3         1.8  virginica 21.17
## 109          6.7         2.5          5.8         1.8  virginica 16.75
## 110          7.2         3.6          6.1         2.5  virginica 25.92
## 111          6.5         3.2          5.1         2.0  virginica 20.80
## 112          6.4         2.7          5.3         1.9  virginica 17.28
## 113          6.8         3.0          5.5         2.1  virginica 20.40
## 114          5.7         2.5          5.0         2.0  virginica 14.25
## 115          5.8         2.8          5.1         2.4  virginica 16.24
## 116          6.4         3.2          5.3         2.3  virginica 20.48
## 117          6.5         3.0          5.5         1.8  virginica 19.50
## 118          7.7         3.8          6.7         2.2  virginica 29.26
## 119          7.7         2.6          6.9         2.3  virginica 20.02
## 120          6.0         2.2          5.0         1.5  virginica 13.20
## 121          6.9         3.2          5.7         2.3  virginica 22.08
## 122          5.6         2.8          4.9         2.0  virginica 15.68
## 123          7.7         2.8          6.7         2.0  virginica 21.56
## 124          6.3         2.7          4.9         1.8  virginica 17.01
## 125          6.7         3.3          5.7         2.1  virginica 22.11
## 126          7.2         3.2          6.0         1.8  virginica 23.04
## 127          6.2         2.8          4.8         1.8  virginica 17.36
## 128          6.1         3.0          4.9         1.8  virginica 18.30
## 129          6.4         2.8          5.6         2.1  virginica 17.92
## 130          7.2         3.0          5.8         1.6  virginica 21.60
## 131          7.4         2.8          6.1         1.9  virginica 20.72
## 132          7.9         3.8          6.4         2.0  virginica 30.02
## 133          6.4         2.8          5.6         2.2  virginica 17.92
## 134          6.3         2.8          5.1         1.5  virginica 17.64
## 135          6.1         2.6          5.6         1.4  virginica 15.86
## 136          7.7         3.0          6.1         2.3  virginica 23.10
## 137          6.3         3.4          5.6         2.4  virginica 21.42
## 138          6.4         3.1          5.5         1.8  virginica 19.84
## 139          6.0         3.0          4.8         1.8  virginica 18.00
## 140          6.9         3.1          5.4         2.1  virginica 21.39
## 141          6.7         3.1          5.6         2.4  virginica 20.77
## 142          6.9         3.1          5.1         2.3  virginica 21.39
## 143          5.8         2.7          5.1         1.9  virginica 15.66
## 144          6.8         3.2          5.9         2.3  virginica 21.76
## 145          6.7         3.3          5.7         2.5  virginica 22.11
## 146          6.7         3.0          5.2         2.3  virginica 20.10
## 147          6.3         2.5          5.0         1.9  virginica 15.75
## 148          6.5         3.0          5.2         2.0  virginica 19.50
## 149          6.2         3.4          5.4         2.3  virginica 21.08
## 150          5.9         3.0          5.1         1.8  virginica 17.70

Crash course in using R for statistics

A commonly asked question in research is whether two or more populations differ in a statistic, such as the mean. We can use parametric statistics to ask this question when each population is normally distributed. In statistics, the word normal, rather than meaning “typical”, is used to describe frequency curves that have the shape of a bell.

T-tests

t-tests are used to compare two groups. We will present an example of how to use these tests by working with the cats dataset. Specifically, we will test whether cat weight depends on sex.

First, we load the dataset.

install.packages("MASS")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(cats)
head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

This is the distribution of weight according to sex.

?cats
boxplot(Bwt ~ Sex, data = cats)

We can also visualize the distributions using a histogram.

malewt <- with(cats, Bwt[Sex=='M'])
femalewt <- with(cats, Bwt[Sex=='F'])

hist(
  x = femalewt, 
  # The function `rgb` lets you set colors for each plot
  col = rgb(r=0.25, g=0, b=1, 0.5), 
  ylim = c(0, 25),
  xlim = range(c(femalewt, malewt))
)

hist(
  x = malewt, 
  col = rgb(r=0, g=1, b=0, 0.5), 
  ylim = c(0, 25), 
  add=TRUE
)

legend(
  x = "topright", 
  legend=c("Female", "Male"),
  fill=c(
    rgb(r=0.25, g=0, b=1, 0.5), 
    rgb(r=0, g=1, b=0, 0.5))
)

?hist

From the two types of plots, it looks like the distribution of weight varies according to sex. Males seem to be larger than females. Beyond what we seem to see in the plots, can we objectively confirm that there are differences? For that, we use a t-test.

t-tests require that the data be normally distributed. That means that within each group, the data should appear as if they come from a bell-shaped curve (to learn more about normal distributions, check this link: https://www.analyticsvidhya.com/blog/2020/04/statistics-data-science-normal-distribution/).

Do the data look normally distributed? Apart from trying to guess the shape of the distributions, we can use a quantile-quantile (Q-Q) plot to find out.

qqnorm(femalewt)
qqline(femalewt)

In this type of plot, if the dots fall along the line, then the data are normally distributed. As we can see, female weights do not seem to be normally distributed.

qqnorm(malewt)
qqline(malewt)

Male weights look pretty good as assessed with a Q-Q plot.

Assuming the data are close enough to normally distributed, we can use a t-test to look for a difference between the sexes. One version of the test assumes that the variance of the two groups is equal:

t.test(malewt, femalewt, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  malewt and femalewt
## t = 7.3307, df = 142, p-value = 1.59e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3946927 0.6861584
## sample estimates:
## mean of x mean of y 
##  2.900000  2.359574

Or one that assumes unequal variance:

t.test(malewt, femalewt, var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  malewt and femalewt
## t = 8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4177242 0.6631268
## sample estimates:
## mean of x mean of y 
##  2.900000  2.359574

If your data are not normally distributed, you can use a non-parametric test. Most parametric tests have a non-parametric counterpart; for the two-sample t-test, it is the Wilcoxon rank-sum test (also known as the Mann-Whitney U test).

wilcox.test(malewt, femalewt)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  malewt and femalewt
## W = 3801.5, p-value = 8.201e-11
## alternative hypothesis: true location shift is not equal to 0

Tip: If your sample size is large enough, you may still be able to use a parametric test, even when your data are not normally distributed. Be careful; although many statisticians will say that the sample size needs to be at least 30, that number may need to be much larger depending on your case. To learn more, read about the central limit theorem, which you can do by following this link: http://strata.uga.edu/8370/lecturenotes/means.html
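
You can see the central limit theorem in action with a quick simulation: even when the raw data come from a skewed distribution, the means of repeated samples pile up into a bell shape. The sample sizes below are arbitrary choices for illustration.

set.seed(1) # for reproducibility
# Means of 1000 samples of size 30 drawn from a skewed (exponential) distribution
sample.means <- replicate(1000, mean(rexp(n = 30)))
hist(sample.means) # roughly bell-shaped despite the skewed source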

Most tests in R return several pieces of information as a list, so we have to do a bit of extra work to extract one particular piece. At a minimum, we want the p value, which tells us how likely we would be to observe a difference as large as the measured one if there were no real difference between the populations of males and females (consequently, the lower the p value, the stronger the evidence that the difference is real).

mod <- wilcox.test(malewt, femalewt)
str(mod)
## List of 7
##  $ statistic  : Named num 3802
##   ..- attr(*, "names")= chr "W"
##  $ parameter  : NULL
##  $ p.value    : num 8.2e-11
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "location shift"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Wilcoxon rank sum test with continuity correction"
##  $ data.name  : chr "malewt and femalewt"
##  - attr(*, "class")= chr "htest"
mod$p.value
## [1] 8.200502e-11
mod
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  malewt and femalewt
## W = 3801.5, p-value = 8.201e-11
## alternative hypothesis: true location shift is not equal to 0

As you can see, the p value is very low, and therefore, we can be confident that body weight differs between males and females.

Many other tests are also available in R, including the Kolmogorov-Smirnov, Mann-Whitney, Shapiro-Wilk, Fisher-Snedecor, and Bartlett tests, among others. A quick online search is usually enough to find a basic function to perform them!
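
For reference, base R ships functions for several of these; as always, read each help file before using them:

?ks.test       # Kolmogorov-Smirnov test
?shapiro.test  # Shapiro-Wilk normality test
?var.test      # F test to compare two variances (uses the Fisher-Snedecor distribution)
?bartlett.test # Bartlett test of homogeneity of variances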

Warning: Knowing the function in R does NOT always mean that you know what the test does! ALWAYS take the time to read about the actual statistical method you want to use. Make sure you understand its assumptions and parameters.

ANOVAs

t-tests are good for comparing two groups. If you want to compare more than two, use an ANOVA (analysis of variance).

We will work again with the iris dataset.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Let’s say that our hypothesis is that petal length varies significantly among species. Therefore:

  • Alternative hypothesis: setosa != versicolor != virginica.

  • Null hypothesis: setosa = versicolor = virginica.

Let’s first look at the distribution of our data.

boxplot(Petal.Length ~ Species, data = iris)

Just by eyeballing the data we seem to get a good idea of which hypothesis is supported by the data. Now, let’s run the statistical analysis:

iris.anova <- aov(Petal.Length ~ Species, data = iris)
summary(iris.anova)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  437.1  218.55    1180 <2e-16 ***
## Residuals   147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given our p-value, we can reject the null hypothesis even at an alpha level of 0.001. Therefore, at least one of the species is significantly different from the others. But the ANOVA does NOT tell us which one(s)! To know that, we need to run post-hoc pairwise comparisons between species. They are called post-hoc because they are done after the ANOVA.

The Tukey HSD test allows us to run post-hoc comparisons between all levels of our variable.

TukeyHSD(iris.anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
## 
## $Species
##                       diff     lwr     upr p adj
## versicolor-setosa    2.798 2.59422 3.00178     0
## virginica-setosa     4.090 3.88622 4.29378     0
## virginica-versicolor 1.292 1.08822 1.49578     0

Here, the adjusted p values are displayed as 0 (they are extremely small), which means that all groups are significantly different from each other.

What does “p adj” mean? When you do a lot of significance tests on a dataset, you increase the likelihood that you will get p < 0.05 just by chance. This comic explains it quite well: https://xkcd.com/882/. The Tukey HSD test adjusts p values to account for the multiple tests; “p adj” simply means adjusted p value.
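
If you ever need to adjust a set of p values yourself, base R provides the p.adjust function; a minimal sketch with made-up p values:

pvals <- c(0.01, 0.03, 0.04, 0.20) # hypothetical raw p values
p.adjust(pvals, method = "bonferroni")
p.adjust(pvals, method = "holm")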

The output of TukeyHSD allows us to plot confidence intervals for the differences between species means.

plot(TukeyHSD(iris.anova))

As a rule of thumb, if the confidence interval for a particular pairwise comparison does not overlap zero, then the means of the two groups are different.

What if we need a non-parametric test (e.g., because of non-normal data and small sample sizes)? Then we use the Kruskal-Wallis test, the non-parametric equivalent of the one-way ANOVA:

kruskal.test(Petal.Length ~ Species, data = iris)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Petal.Length by Species
## Kruskal-Wallis chi-squared = 130.41, df = 2, p-value < 2.2e-16

2-way ANOVA

Let’s work with the crabs dataset.

library(MASS)
data(crabs)

Let’s explore different ways in which the size of the frontal lobes of the crabs may be affected by different factors.

head(crabs)
##   sp sex index   FL  RW   CL   CW  BD
## 1  B   M     1  8.1 6.7 16.1 19.0 7.0
## 2  B   M     2  8.8 7.7 18.1 20.8 7.4
## 3  B   M     3  9.2 7.8 19.0 22.4 7.7
## 4  B   M     4  9.6 7.9 20.1 23.1 8.2
## 5  B   M     5  9.8 8.0 20.3 23.0 8.2
## 6  B   M     6 10.8 9.0 23.0 26.5 9.8
boxplot(FL ~ sp, data = crabs)

boxplot(FL ~ sex, data = crabs)

Now that we are using more factors in our plots, it is a good time to introduce ggplot2, which is a package for advanced plotting. Look how easy it is to produce boxplots for 2 different factors in the same plot.

library(ggplot2)
ggplot(data = crabs, aes(x = sp, y = FL, fill = sex)) +
  geom_boxplot()

We can try different colors.

ggplot(data = crabs, aes(x = sp, y = FL, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c('F' = 'purple', 'M' = 'chartreuse3'))

(A cheat sheet for colors in R is available at: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf)

We want to know whether frontal lobe size depends on species or sex, or on a particular combination of species and sex. The effect of each variable by itself is called a main effect, whereas the effect of a combination between two variables is called an interaction. In the aov function, we can ask R to test main effects and interactions simply by separating the variables with an asterisk (*).

crab.anova <- aov(FL ~ sex * sp, data =crabs)
summary(crab.anova)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## sex           1    4.6     4.6   0.476  0.49128    
## sp            1  466.3   466.3  48.627 4.63e-11 ***
## sex:sp        1   80.6    80.6   8.409  0.00416 ** 
## Residuals   196 1879.7     9.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From above, we see that frontal lobe size is affected by species and by an interaction between species and sex. As with the one-way ANOVA, we will do post-hoc tests.

TukeyHSD(crab.anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = FL ~ sex * sp, data = crabs)
## 
## $sex
##      diff        lwr      upr     p adj
## M-F 0.302 -0.5617106 1.165711 0.4912819
## 
## $sp
##      diff      lwr      upr p adj
## O-B 3.054 2.190289 3.917711     0
## 
## $`sex:sp`
##           diff         lwr       upr     p adj
## M:B-F:B  1.572 -0.03289788 3.1768979 0.0572712
## F:O-F:B  4.324  2.71910212 5.9288979 0.0000000
## M:O-F:B  3.356  1.75110212 4.9608979 0.0000010
## F:O-M:B  2.752  1.14710212 4.3568979 0.0000868
## M:O-M:B  1.784  0.17910212 3.3888979 0.0227162
## M:O-F:O -0.968 -2.57289788 0.6368979 0.4022824
plot(TukeyHSD(crab.anova), las=1)

We see that the confidence interval overlaps zero when comparing sexes, which means that we cannot be confident about a difference between the sexes. For the remaining comparisons, we can be confident about the difference between species (the bar does not overlap zero) and about four of the sex-by-species comparisons (can you identify which ones?).

Correlation and linear regression

We use correlation and linear regression when our independent and dependent variables are continuous.

Using the cats dataset, we want to know whether body weight and heart weight are correlated.

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
plot(Hwt ~ Bwt, cats)

We will use cor.test.

cor.cats <- cor.test(x = cats$Bwt, y = cats$Hwt)
cor.cats$estimate
##       cor 
## 0.8041274

The estimate returned by cor.test is the (Pearson) linear correlation coefficient, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). In this case, we have a positive correlation.

By squaring the correlation coefficient, we can estimate R^2, which in this case is the proportion of the variation in heart weight that can be explained by body weight.

cor.cats$estimate ^ 2
##       cor 
## 0.6466209

Linear models in R give results similar to cor.test. Models represent hypothesized relationships among variables. Usually one is the response variable (y) and one or more are the predictors (x’s). With a model you can build an equation that will allow you to predict the dependent variable based on the independent variable(s).

The formula for a linear model is as follows:

y = mx + b + error

In the equation above, m is often called the slope, and b the intercept.

We will apply a linear model to the cats dataset…

cats.lm <- lm(Hwt ~ Bwt, data = cats)
summary(cats.lm) ## same r-squared value
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

… and plot the model.

plot(Hwt ~ Bwt, cats)
abline(cats.lm, lty = 2)

We can extract the coefficients of the model (the intercept and the slope; the slope is named after the predictor).

coef(cats.lm)
## (Intercept)         Bwt 
##  -0.3566624   4.0340627

If you want to predict the average value of y for a given x, all you need to do is plug these coefficients into the equation of the linear model shown above, without the error term.

For example, when x is 3, we predict that the average value of y will be… (check the plot of the linear model to confirm the prediction)

y <- coef(cats.lm)[2] * 3 + coef(cats.lm)[1]
print(y)
##      Bwt 
## 11.74553
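
R can do this plugging-in for us: the predict function takes the fitted model and a data frame of new predictor values.

# Same prediction as above, using predict()
predict(cats.lm, newdata = data.frame(Bwt = 3))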

Of course, due to expected error, when x is 3, y will not be exactly equal to 11.7455257 (that is why we called that number an “average value of y”). When the error is larger, the observed values of y spread further from the predicted average, and the ability of our equation to predict y decreases. We can use R^2 as a measure of that ability. It can be calculated from sums of squares of residuals; a residual is the difference between an observed y-value and its corresponding value on the line fitted to the data. Just for fun, we can write a function to compute sums of squares:

ss <- function(x) {  
  sum((x - mean(x))^2)   
}
ss(x = c(1, 2, 3))
## [1] 2

R^2 is computed from two numbers: (1) the sum of squares of the residuals of our fitted model and (2) the sum of squares of the residuals of a null model in which body weight does not predict heart weight. In other words, we compare the alternative hypothesis to the null hypothesis. To come up with a model for the null hypothesis, we simply draw a horizontal line at the mean of y.

plot(Hwt ~ Bwt, cats)
abline(h = mean(cats$Hwt))

Just by eyeballing the graph we can see that such a model is unlikely to explain the data. When that is the case, we expect the sum of squares of residuals to be high.

ss1 <- ss(cats$Hwt - mean(cats$Hwt))
ss1
## [1] 847.6256

In contrast, look at the sums of squares of residuals for the model that we fitted to the data.

ss2 <- ss(residuals(cats.lm))
ss2
## [1] 299.5331

To calculate R^2, we use the following formula:

(SS1 - SS2) / SS1

(ss1 - ss2) / ss1
## [1] 0.6466209

As you can see, it is the same value that we got before. R^2 can take any value from 0 to 1. The higher it is, the higher the predictive power of the model will be.

Multiple regression

Multiple regression is used when you have multiple x-variables but only one y-variable.

We will use multiple regression with the airquality dataset to test whether air temperature is correlated with solar radiation and wind speed.

data(airquality)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

To explore patterns, let’s plot all pairs of variables against each other.

pairs(airquality)

lm.airquality <- lm(Temp ~ Wind + Solar.R, data = airquality)

summary(lm.airquality)
## 
## Call:
## lm(formula = Temp ~ Wind + Solar.R, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6521  -5.0358   0.9309   5.0250  18.0943 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 84.899708   2.476595  34.281  < 2e-16 ***
## Wind        -1.155726   0.188311  -6.137 7.82e-09 ***
## Solar.R      0.025697   0.007337   3.503 0.000615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.944 on 143 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.2687, Adjusted R-squared:  0.2585 
## F-statistic: 26.27 on 2 and 143 DF,  p-value: 1.916e-10

mfrow defines the numbers of rows and columns of plots in the plotting window (here, 2 and 2). To design more complex arrangements of plots, with non-matching numbers of rows and columns, you can also check the layout function; a sketch follows.
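
A minimal layout sketch (the matrix entries indicate which plot number goes in each cell; here, one wide plot on top and two plots below):

layout(matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE))
plot(1:10)       # fills the wide top cell
plot(rnorm(10))  # bottom left
hist(rnorm(100)) # bottom right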

par(mfrow=c(2, 2)) ##the 'par' function changes parameters of a plot
?layout

termplot(lm.airquality, partial = TRUE)[1]
## [1] 2
plot(Temp ~ Wind, data = airquality)
plot(Temp ~ Solar.R, data = airquality)

How much is the position of the regression line affected by chance? Had we added a couple more of observations to our sample, or had we taken measurements at different times, would we expect the regression line to stay in the same exact position? We can answer these questions by visualizing the confidence intervals of the regression lines.

library(visreg)

par(mfrow=c(1,2))
visreg(lm.airquality, "Wind")
visreg(lm.airquality, "Solar.R")

In this case, the confidence intervals indicate regions in the plot where we expect the regression line to lie if we decided to repeat the analysis with a new sample. The regions seem to be tightly embracing each respective regression line; thus, we can be confident about using the fitted regression model to explain the relationship between our variables or to predict the average value of y.

Testing assumptions of statistical tests (independence, normality, etc.)

Normality

We have already seen how to check graphically for normality using a Q-Q plot. Now let us check normality formally with one of the many available normality tests, the Shapiro-Wilk test, again using the crabs dataset:

library(MASS)
data(crabs)
shapiro.test(crabs$FL)
## 
##  Shapiro-Wilk normality test
## 
## data:  crabs$FL
## W = 0.99037, p-value = 0.2023
shapiro.test(crabs$CW)
## 
##  Shapiro-Wilk normality test
## 
## data:  crabs$CW
## W = 0.99106, p-value = 0.2542

The Shapiro-Wilk test, like many other tests of normality, tests whether the distribution of the data differs significantly from a normal distribution; a significant p value therefore indicates that the data are not normally distributed. In our case, p > 0.05 for both variables, so we can treat the data as normally distributed.

In a regression model, normality is checked for the residuals of the regression and not the raw data (Remember: residuals are the differences between raw data and values predicted by the regression model).

crabs.lm <- lm(FL ~ CL, data = crabs)
summary(crabs.lm)
## 
## Call:
## lm(formula = FL ~ CL, data = crabs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86395 -0.51746 -0.02826  0.50456  1.77009 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.15316    0.23477   0.652    0.515    
## CL           0.48060    0.00714  67.313   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.717 on 198 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9579 
## F-statistic:  4531 on 1 and 198 DF,  p-value: < 2.2e-16
residuals <- crabs.lm$residuals
residuals
##            1            2            3            4            5            6 
##  0.209214133 -0.051982222 -0.084520582 -0.213178578 -0.109298213 -0.406913293 
##            7            8            9           10           11           12 
## -0.491391835 -0.327810559  0.016368894 -0.464229284 -1.073485457 -0.733186368 
##           13           14           15           16           17           18 
## -0.865724728 -0.425425639 -0.521545275 -0.133186368 -0.606023817 -0.654083635 
##           19           20           21           22           23           24 
## -0.213784546 -0.286621995 -0.895878168 -0.884237074 -0.243937986 -0.724536163 
##           25           26           27           28           29           30 
## -0.772595981 -0.476476346 -0.612895070 -1.658569968 -0.593493248 -0.297373612 
##           31           32           33           34           35           36 
## -0.826031608 -1.254689603 -0.962450332 -0.910510150 -1.863945776 -0.750809239 
##           37           38           39           40           41           42 
## -1.179467234 -0.546928874 -0.691108327 -1.023646687 -1.052304683 -1.332902861 
##           43           44           45           46           47           48 
## -0.992603772 -1.586338487 -0.846039398 -0.509620674 -0.590218852 -1.114996483 
##           49           50           51           52           53           54 
## -0.730517940 -1.489329376 -0.017948418 -0.428700035  0.055778507 -0.280640218 
##           55           56           57           58           59           60 
## -0.072879489 -0.157358031 -0.097657120 -0.089896391 -0.182135662 -0.166614204 
##           61           62           63           64           65           66 
##  0.033385796  0.089206343 -0.523930195 -0.375870377  0.056667983 -0.420049831 
##           67           68           69           70           71           72 
## -0.748707826  0.024129623 -0.360348919 -0.096767644 -0.377365821 -0.185126550 
##           73           74           75           76           77           78 
## -0.761844364 -0.273485457 -0.317664910 -0.361844364 -0.402143453 -0.198263088 
##           79           80           81           82           83           84 
## -0.671100537  0.080839645  0.280839645 -0.388117439 -0.916775434 -0.336177257 
##           85           86           87           88           89           90 
## -0.384237074 -0.288117439 -1.245433430 -0.520655799 -0.324536163 -0.705134341 
##           91           92           93           94           95           96 
## -0.316775434 -0.845433430 -0.601253977 -0.933792337 -0.533792337 -0.189612883 
##           97           98           99          100          101          102 
## -1.043048510 -1.112005594 -1.204244865 -0.609620674  0.920855227  0.338761605 
##          103          104          105          106          107          108 
##  0.598462516  0.337266160  1.196967072  0.716368894  0.051292174  0.022634179 
##          109          110          111          112          113          114 
##  0.462933267  0.330394907 -0.186621995  0.057557459 -0.038562177  0.153677094 
##          115          116          117          118          119          120 
##  0.253677094  0.686215454  0.013378005  0.076959281  0.609497641  0.432779828 
##          121          122          123          124          125          126 
##  0.096361103  0.732779828  0.311882561  0.523523654  0.959942379  0.125908575 
##          127          128          129          130          131          132 
## -0.102749421  0.814267481  0.718147846  0.345310397  0.012772037  0.208891673 
##          133          134          135          136          137          138 
## -0.600364501  0.032173859  0.280233677  0.712772037  0.768592584 -0.192603772 
##          139          140          141          142          143          144 
##  0.455456046  0.119037322  0.434558779 -0.122757211  0.229182971 -0.520372291 
##          145          146          147          148          149          150 
## -0.516491927  0.588883882 -0.072312474  0.512166069  0.162610807  0.070371536 
##          151          152          153          154          155          156 
##  0.262043791  0.817864338  0.764428712  0.431890352  0.347411810  0.870693996 
##          157          158          159          160          161          162 
##  0.005617276  0.642036001  0.321138734  0.328899463  0.380839645  0.452181650 
##          163          164          165          166          167          168 
##  0.259942379  0.548301285  0.648301285  0.759942379  0.856062014  0.859942379 
##          169          170          171          172          173          174 
##  1.023523654  1.039045112  0.766207663  0.766207663  1.342925477  1.106506752 
##          175          176          177          178          179          180 
##  1.170088028 -0.367826141  1.277848757  1.137549668  0.705011308  0.564712219 
##          181          182          183          184          185          186 
##  1.770088028  1.441430032  1.301130944  0.780233677  0.732173859  0.370977504 
##          187          188          189          190          191          192 
##  0.811276593  0.474857868  0.430678415  0.819037322  1.211276593  1.122917686 
##          193          194          195          196          197          198 
##  0.502020420  1.570977504  0.096644611  1.446199873  1.505900784  1.273362424 
##          199          200 
##  1.681123153  0.743208984
shapiro.test(residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.99496, p-value = 0.7446

Homoscedasticity

Many parametric analyses require homoscedasticity; that is, variances between groups in a dataset should be homogeneous. Homoscedasticity can be checked graphically by plotting the residuals.

par(mfrow=c(1,1))
plot(
  x = crabs$CL, 
  y = residuals, 
  ylab="Residuals", 
  xlab="Carapace length", 
  main="Normalized Residuals v Fitted Values"
)
abline(c(0,0))

…but it looks better when using a function from the car package:

library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
residualPlot(crabs.lm)

If the assumption of homoscedasticity holds, the points in the previous plots should be evenly scattered around zero, with no clear trend or funnel shape and no obvious outliers.

To check for outliers, you can use studentized residuals, which are residuals divided by an estimate of their standard deviation.

spreadLevelPlot(crabs.lm)

## 
## Suggested power transformation:  -0.8908105

The data do not look very homogeneous.
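
If you prefer a numerical check for outliers, the function rstudent returns the studentized residuals of a fitted model directly. Below is a minimal sketch that flags observations whose studentized residuals exceed 2 in absolute value; the cutoff of 2 is a common rule of thumb, not a universal standard.

# Studentized residuals of the fitted model
stud.res <- rstudent(crabs.lm)

# Flag potential outliers (|studentized residual| > 2 is a common,
# though arbitrary, cutoff)
which(abs(stud.res) > 2)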

If you want to test for homoscedasticity, use a non-constant error variance test, such as the Breusch-Pagan test.

ncvTest(crabs.lm)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 22.77853, Df = 1, p = 1.8178e-06

Here, the p-value is significant, which means that the residuals are not homoscedastic. In this case, using weighted least squares (WLS) may be better than ordinary least squares when estimating the linear model. If you want to know more about this topic, check the following link: https://towardsdatascience.com/when-and-how-to-use-weighted-least-squares-wls-models-a68808b1a89d
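
To illustrate, here is a minimal sketch of one common WLS recipe, assuming the spread of the residuals grows with the fitted values: estimate that spread with an auxiliary regression, then use its inverse (squared) as weights. The object names aux, w, and crabs.wls are our own, and other weighting schemes are possible.

# Auxiliary regression: model the absolute residuals as a function
# of the fitted values to estimate how the spread changes
aux <- lm(abs(residuals(crabs.lm)) ~ fitted(crabs.lm))

# Weights: inverse of the squared estimated spread
w <- 1 / fitted(aux)^2

# Refit the same model using those weights
crabs.wls <- lm(formula(crabs.lm), data = crabs, weights = w)
summary(crabs.wls)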

Multicollinearity

When you have multiple predictors, you need to check whether the independent variables (x variables) are correlated with each other. If they are, the individual effect of each predictor on the dependent variable cannot be properly estimated.

Correlation among independent variables can be assessed by estimating the variance inflation factor (VIF):

library(car)
# Fitting the linear model
crabs.lm <- lm(FL~CL+CW,data=crabs)
summary(crabs.lm)
## 
## Call:
## lm(formula = FL ~ CL + CW, data = crabs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73089 -0.36896 -0.05988  0.28899  1.82283 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.59245    0.22046   2.687  0.00782 ** 
## CL           0.92406    0.06443  14.342  < 2e-16 ***
## CW          -0.40305    0.05827  -6.917 6.29e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6448 on 197 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.966 
## F-statistic:  2826 on 2 and 197 DF,  p-value: < 2.2e-16
# Testing for multicollinearity
vif(crabs.lm)
##       CL       CW 
## 100.7036 100.7036
sqrt(vif(crabs.lm))
##       CL       CW 
## 10.03512 10.03512

The square root of the VIF indicates how much larger the standard error of a coefficient is compared to a model in which the predictors are uncorrelated. Here, it is about 10 times larger! Carapace length and width are highly collinear.
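
To see where the VIF comes from, you can compute it by hand: regress one predictor on the other(s), then plug the resulting R-squared into 1 / (1 - R^2). This sketch should reproduce the vif output above.

# R-squared from regressing CL on the other predictor, CW
r2 <- summary(lm(CL ~ CW, data = crabs))$r.squared

# VIF for CL: 1 / (1 - R^2); should match vif(crabs.lm)
1 / (1 - r2)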

Non-independence of errors (autocorrelation)

Observations for each variable in a linear model are assumed to be independent of each other. If you want to test for autocorrelation in your dataset, use a Durbin-Watson test:

durbinWatsonTest(crabs.lm)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.5563061     0.8747409       0
##  Alternative hypothesis: rho != 0

A p-value that small (close to 0!) suggests that the samples are not independent. In cases such as this, it is better to use generalized least squares (GLS) when fitting a regression. The function gls from the package nlme allows you to implement this type of analysis in R.
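
As a minimal sketch, and assuming the rows of the dataset are stored in a meaningful order (e.g., collected sequentially), you could refit the model with a first-order autoregressive (AR1) error structure; corAR1(form = ~ 1) uses the row order of the data. The object name crabs.gls is our own.

library(nlme)

# Refit the model, allowing errors of consecutive observations
# to be correlated (AR1 structure based on row order)
crabs.gls <- gls(FL ~ CL + CW, data = crabs,
                 correlation = corAR1(form = ~ 1))
summary(crabs.gls)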

A few other tricks

The function gvlma can help you quickly check several assumptions of a linear model:

install.packages("gvlma")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(gvlma)
model<-gvlma(crabs.lm)
summary(model)
## 
## Call:
## lm(formula = FL ~ CL + CW, data = crabs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73089 -0.36896 -0.05988  0.28899  1.82283 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.59245    0.22046   2.687  0.00782 ** 
## CL           0.92406    0.06443  14.342  < 2e-16 ***
## CW          -0.40305    0.05827  -6.917 6.29e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6448 on 197 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.966 
## F-statistic:  2826 on 2 and 197 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = crabs.lm) 
## 
##                     Value   p-value                   Decision
## Global Stat        31.580 2.331e-06 Assumptions NOT satisfied!
## Skewness            7.355 6.686e-03 Assumptions NOT satisfied!
## Kurtosis            2.130 1.445e-01    Assumptions acceptable.
## Link Function       0.187 6.655e-01    Assumptions acceptable.
## Heteroscedasticity 21.908 2.861e-06 Assumptions NOT satisfied!

In the following link, find a lot of great examples on how to analyze linear models, using a geological dataset on soil content indicators: http://www.css.cornell.edu/faculty/dgr2/_static/files/R_PDF/mhw.pdf

A few words on style

Coding with style can help make your scripts readable. Hadley Wickham, creator of many popular packages such as ggplot2, has written this very useful post on coding with style in R: http://adv-r.had.co.nz/Style.html (also a good example of well-formatted Markdown output!). In particular, we recommend that you read the “Line length” section first, because, in our experience, most problems with code readability are related to having very long lines of code. But do check other parts of the post, and use the document as a reference that you go back to as your code starts increasing in complexity.

The first time that you write a piece of code, it does not need to be perfect, and you do not need to spend much time on formatting. Revising your code, however, will make it easier for your readers to understand your analyses. Programmers have a special name for this process: “refactoring”, that is, restructuring code so that it is cleaner and clearer without changing what it does. This is one of the many ways in which coding is like writing a manuscript.

Tip: When you have some spare time, go back to your code and ask yourself: How can I make it more readable? Do I need to change the style? Can I rewrite variables and functions in a way that produces the same output with less code? If there is something you can improve, do it. For example, see the sketch below.
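
Here is the same plotting call written twice: once as a single long line, and once reformatted with one argument per line (the variables plotted are chosen only for illustration):

# Hard to read: one long line
plot(x = crabs$CL, y = crabs$FL, xlab = "Carapace length", ylab = "Frontal lobe size", main = "Crab morphometrics", pch = 19)

# Easier to read: one argument per line
plot(
  x    = crabs$CL,
  y    = crabs$FL,
  xlab = "Carapace length",
  ylab = "Frontal lobe size",
  main = "Crab morphometrics",
  pch  = 19
)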

HOMEWORK - DUE NEXT CLASS

Download the dataset Weather from this link: https://vincentarelbundock.github.io/Rdatasets/csv/mosaicData/Weather.csv.

This dataset belongs to the mosaicData package. To learn more about the dataset, install and load mosaicData and then search for the corresponding help file.

In each corresponding chunk:

  • Import the dataset into R as a data frame

  • What was the maximum temperature ever reached in Mumbai in December 2017? On which days of the month did it occur?

  • For how many days was there fog, rain, and a thunderstorm in Auckland in 2016?

  • Does average humidity influence average visibility in Beijing?

  • Are the data used for the previous question normal? Homoscedastic? Check these assumptions graphically and with a test.

  • Focusing on the last three months of 2017 in Chicago: does minimal sea pressure vary with any specific weather event (column “events”)?

  • Do a boxplot of the average temperature for each city.

  • Test if cities have significantly different average temperatures.

  • If yes, which cities have different temperatures?

  • We want to compare how temperature profiles differ between Beijing and Chicago through 2017. Plot two histograms side by side of the average temperature, one for each city. The histograms must have different colors.