Correlations
A correlation is performed when we want to investigate the potential association between two continuous quantitative variables, or between some ordinal variables. We assume that the association is linear, like in a linear regression, and that one variable increases or decreases by a constant amount for a corresponding unit increase or decrease in the other variable. This does not suggest that one variable explains the other – that is the purpose of regression, as seen in the previous Chapter. Like all statistical tests, correlation requires a series of assumptions as well:
- pair-wise data
- absence of outliers
- linearity
- normality of distribution
- homoscedasticity
- level (type) of measurement
- continuous data (Pearson correlation)
- ordinal data (Spearman correlation)
The degree of association is measured by a correlation coefficient, denoted by \(r\) (note, in a regression we use the \(r^{2}\), which must not be confused for \(r\)). The \(r\) statistic is sometimes called the correlation coefficient; it is a measure of linear association. The value for \(r\) varies from -1 to 1, with 0 indicating that there is absolutely no association, 1 showing a perfect positive association, and -1 a perfect inverse correlation.
Pearson correlation
In order to investigate correlations in biological data lets load the
ecklonia
dataset.
# Load libraries
library(tidyverse)
library(ggpubr)
library(corrplot)
# Load data
<- read.csv("../data/ecklonia.csv") ecklonia
We will also create a subsetted version of our data by removing all of the categorical variables. If we have a dataframe where each column represents pair-wise continuous/ordinal measurements with all of the other columns we may very quickly and easily perform a much wider range of correlation analyses.
<- ecklonia %>%
ecklonia_sub select(-species, - site, - ID)
When the values we are comparing are continuous, we may use a Pearson test. This is the default and so requires little work on our end. The resulting statistic from this test is known as the Pearson correlation coefficient:
# Perform correlation analysis on two specific variables
# Note that we do not need the final two arguments in this function to be stated
# as they are the defaut settings.
# They are only shown here to illustrate that they exist.
cor.test(x = ecklonia$stipe_length, ecklonia$frond_length,
use = "everything", method = "pearson")
R>
R> Pearson's product-moment correlation
R>
R> data: ecklonia$stipe_length and ecklonia$frond_length
R> t = 4.2182, df = 24, p-value = 0.0003032
R> alternative hypothesis: true correlation is not equal to 0
R> 95 percent confidence interval:
R> 0.3548169 0.8300525
R> sample estimates:
R> cor
R> 0.6524911
Above we have tested the correlation between the length of
Ecklonia maxima stipes and the length of their fronds. A
perfect positive (negative) relationship would produce a value of 1
(-1), whereas no relationship would produce a value of 0. The result
above, cor = 0.65
is relatively strong.
As is the case with everything else we have learned thus far, a good visualisation can go a long way to help us understand what the statistics are doing. Below we visualise the stipe length to frond length relationship.
# Calculate Pearson r beforehand for plotting
<- paste0("r = ",
r_print round(cor(x = ecklonia$stipe_length, ecklonia$frond_length),2))
# Then create a single panel showing one correlation
ggplot(data = ecklonia, aes(x = stipe_length, y = frond_length)) +
geom_smooth(method = "lm", colour = "grey90", se = F) +
geom_point(colour = "mediumorchid4") +
geom_label(x = 300, y = 240, label = r_print) +
labs(x = "Stipe length (cm)", y = "Frond length (cm)")
Just by eye-balling this scatterplot it should be clear that these data tend to increase at a roughly similar rate. Our Pearson r value is an indication of what that is.
Should our dataset contain multiple variables, as
ecklonia
does, we may investigate all of the correlations
simultaneously. Remember that in order to do so we want to ensure that
we may perform the same test on each of our paired variables. In this
case we will use ecklonia_sub
as we know that it contains
only continuous data and so are appropriate for use with a Pearson test.
By default R will use all of the data we give it and perform a Pearson
test so we do not need to specify any further arguments. Note however
that this will only output the correlation coefficients, and does not
produce a full test of each correlation. This will however be useful for
us to have just now.
<- round(cor(ecklonia_sub), 2)
ecklonia_pearson ecklonia_pearson
R> stipe_length stipe_diameter frond_length digits
R> stipe_length 1.00 0.59 0.65 0.24
R> stipe_diameter 0.59 1.00 0.39 0.24
R> frond_length 0.65 0.39 1.00 0.36
R> digits 0.24 0.24 0.36 1.00
R> primary_blade_width 0.34 0.83 0.28 0.14
R> primary_blade_length 0.13 0.32 -0.02 0.10
R> stipe_mass 0.58 0.82 0.39 0.07
R> frond_mass 0.51 0.51 0.57 0.28
R> epiphyte_length 0.61 0.54 0.61 0.05
R> primary_blade_width primary_blade_length stipe_mass
R> stipe_length 0.34 0.13 0.58
R> stipe_diameter 0.83 0.32 0.82
R> frond_length 0.28 -0.02 0.39
R> digits 0.14 0.10 0.07
R> primary_blade_width 1.00 0.34 0.83
R> primary_blade_length 0.34 1.00 0.16
R> stipe_mass 0.83 0.16 1.00
R> frond_mass 0.36 0.15 0.47
R> epiphyte_length 0.41 0.26 0.51
R> frond_mass epiphyte_length
R> stipe_length 0.51 0.61
R> stipe_diameter 0.51 0.54
R> frond_length 0.57 0.61
R> digits 0.28 0.05
R> primary_blade_width 0.36 0.41
R> primary_blade_length 0.15 0.26
R> stipe_mass 0.47 0.51
R> frond_mass 1.00 0.44
R> epiphyte_length 0.44 1.00
write.csv(ecklonia_pearson, "../data/ecklonia_pearson.csv")
The output, called a pairwise correlation matrix, can be seen in Excel where it looks like this:
Task 1: What does the outcome of this test show? Please describe in detail how to interpret the output, and discuss the interesting characteristics of the correlation matrix.
How would be visualise this matrix of correlations? It is relatively straightforward to quickly plot correlation results for all of our variables in one go. In order to show which variables associate most with which other variables all at once, without creating chaos, we will create what is known as a heat map. This visualisation uses a range of colours, usually blue to red, to demonstrate where more of something is. In this case, we use it to show where more correlation is occurring between morphometric properties of the kelp Ecklonia maxima.
corrplot(ecklonia_pearson, method = "circle")
Task 2: What does the series of dark blue circles through the middle of this plot mean?
Task 3: Why does the scale bar on the right show the colour red?
Task 4: Which morphometric properties correlate best/worst?
Task 5: Browse the internet and find two or three other ways (hint: find other packages, for example) to visualise pairwise correlations, and create corresponding figures for the Ecklonia data.
Spearman rank correlation
When the data we want to compare are not continuous, but rather
ordinal, we will need to use Spearman’s rank correlation. This is not
often a test one uses in biology because we tend to want to compare
continuous data within categories. In the code below we will add a
column of ordinal data to our ecklonia
data to so that we
may look at this test.
# Create ordinal data
$length <- as.numeric(cut((ecklonia$stipe_length + ecklonia$frond_length), breaks = 3))
ecklonia
# What does this new column look like?
head(select(ecklonia, c(species, site, stipe_length, frond_length, length)), 10)
R> species site stipe_length frond_length length
R> 1 maxima Boulders Beach 456 116 1
R> 2 maxima Boulders Beach 477 141 2
R> 3 maxima Boulders Beach 427 144 1
R> 4 maxima Boulders Beach 347 127 1
R> 5 maxima Boulders Beach 470 160 2
R> 6 maxima Boulders Beach 478 181 2
R> 7 maxima Boulders Beach 472 174 2
R> 8 maxima Boulders Beach 459 95 1
R> 9 maxima Boulders Beach 397 87 1
R> 10 maxima Boulders Beach 541 127 2
Task 6: Explain in writing what the code above does. Focus specifically on the line of code with the
cut()
function in it.
Now let us correlate the new length
variable with any
one of the other variables:
cor.test(ecklonia$length, ecklonia$digits, method = "spearman")
R>
R> Spearman's rank correlation rho
R>
R> data: ecklonia$length and ecklonia$digits
R> S = 1930, p-value = 0.08906
R> alternative hypothesis: true rho is not equal to 0
R> sample estimates:
R> rho
R> 0.3401765
Note that unlike Pearson’s correlation where the correlation coefficient is denoted by \(r\), a Spearman correlation return the \(\rho\) coefficient. The interpretation is the same is for Pearson’s correlation coefficient, however.
Kendall rank correlation
This test will work for both continuous and ordinal data. A sort of
dealers choice of correlation tests. It’s main purpose is to allow us to
perform a correlation on non-normally distributed data. Let’s look at
the normality of our ecklonia
variables and pull out those
that are not normal in order to see how the results of this test may
differ from our Pearson tests.
<- ecklonia_sub %>%
ecklonia_norm gather(key = "variable") %>%
group_by(variable) %>%
summarise(variable_norm = as.numeric(shapiro.test(value)[2]))
ecklonia_norm
R> # A tibble: 9 × 2
R> variable variable_norm
R> <chr> <dbl>
R> 1 digits 0.0671
R> 2 epiphyte_length 0.626
R> 3 frond_length 0.202
R> 4 frond_mass 0.277
R> 5 primary_blade_length 0.00393
R> 6 primary_blade_width 0.314
R> 7 stipe_diameter 0.170
R> 8 stipe_length 0.213
R> 9 stipe_mass 0.817
Task 7: Explain in written langauge what each line of the code, above, does.
From this analysis we may see that the values for primary blade length are not normally distributed. In order to make up for this violation of our assumption of normality we may use the Kendall test.
cor.test(ecklonia$primary_blade_length, ecklonia$primary_blade_width, method = "kendall")
R>
R> Kendall's rank correlation tau
R>
R> data: ecklonia$primary_blade_length and ecklonia$primary_blade_width
R> z = 2.3601, p-value = 0.01827
R> alternative hypothesis: true tau is not equal to 0
R> sample estimates:
R> tau
R> 0.3426171
Here the correlation coefficient is called Kendall’s \(\tau\) but it is interpreted as we would Pearson’s.
Exercises
Exercise 1 Find your own datasets suitable for i) Pearson’s, ii) Spearman’s, and iii) Kendall’s correlations. For each, set the null and alternative hypotheses, apply the correlations, and create suitable visualisations. Ensure that one of the datasets lends itself to some pairwise comparisons.