Chapter 10 Testing assumptions or: How I learned to stop worrying and transform the data

10.1 Background

In this chapter we will learn how to test for the two assumptions we encounter most often in the biological sciences: normality and homoscedasticity.

Let us recap—the most important assumptions that we need to make sure are met before doing t-tests, ANOVAs or linear regressions are:

  • The dependent variable must be continuous.
  • The data must be independent of each other.
  • The data must be normally distributed.
  • The data must be homoscedastic.

For data conforming to these expectations, we say that the data are independent and identically distributed, or i.i.d. We will deal in particular with the assumptions of normality and homoscedasticity in this chapter. Whether or not the dependent data are continuous and independent comes down to proper experimental design, so if these are violated then… (I’ll say no more).

How do we know whether these assumptions are met? Here are our options, as a quick refresher:

  • Perform any of the diagnostic plots we covered in the earlier chapters.
  • Compare the variances and see if they differ by more than a factor of four.
  • Do a Levene’s test for equal variances.
  • Do a Shapiro-Wilk test for normality.

10.1.1 Normality

Before we begin, let’s go ahead and activate our packages and load our data.

library(tidyverse)
chicks <- as_tibble(ChickWeight)

The quickest method of testing the normality of a variable is with the Shapiro-Wilk normality test. This will return two values, a W score and a p-value. For the purposes of this course we may safely ignore the W score and focus on the p-value. When p >= 0.05 we may assume that the data are normally distributed. If p < 0.05 then the data are not normally distributed. Let’s look at all of the chicks data without filtering it:

shapiro.test(chicks$weight)
R> 
R>  Shapiro-Wilk normality test
R> 
R> data:  chicks$weight
R> W = 0.90866, p-value < 2.2e-16

Are these data normally distributed? How do we know? Now let’s filter the data based on the different diets for only the weights taken on the last day (21):

chicks %>% 
  filter(Time == 21) %>% 
  group_by(Diet) %>% 
  summarise(norm_wt = as.numeric(shapiro.test(weight)[2]))
R> # A tibble: 4 x 2
R>   Diet  norm_wt
R>   <fct>   <dbl>
R> 1 1       0.591
R> 2 2       0.949
R> 3 3       0.895
R> 4 4       0.186

How about now?

10.1.2 Homoscedasticity

Here we need no fancy test. We simply calculate the variance of each group we want to compare and check that the largest variance is not more than three to four times greater than the smallest.

chicks %>% 
  filter(Time == 21) %>% 
  group_by(Diet) %>% 
  summarise(var_wt = var(weight))
R> # A tibble: 4 x 2
R>   Diet  var_wt
R>   <fct>  <dbl>
R> 1 1      3446.
R> 2 2      6106.
R> 3 3      5130.
R> 4 4      1879.
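
If we would rather use a formal test than eyeball the variances, we can run the Levene’s test listed among our options above. Below is a minimal sketch, assuming the car package is installed (the object name chicks_21 is just for illustration):

library(car)

chicks_21 <- chicks %>% 
  filter(Time == 21)

# Levene's test for homogeneity of variances across the four diets
leveneTest(weight ~ Diet, data = chicks_21)

As with the Shapiro-Wilk test, a p-value above 0.05 suggests that the assumption (here, of equal variances) is met.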

Well, do these data pass the two main assumptions?

10.1.3 Epic fail. Now what?

After we have tested our data for the two key assumptions we are faced with a few choices. The basic guidelines below apply to paired tests, one- and two-sample tests, as well as one- and two-sided hypotheses (i.e. t-tests and their ilk):

Assumption | R function | Note
Equal variances | t.test(..., var.equal=TRUE) | Student’s t-test
Unequal variances | t.test(...) | Using Welch’s approximation of variances
Normal data | t.test(...) | As per the equal/unequal variance cases, above
Data not normal | wilcox.test(...) | Wilcoxon (1-sample) or Mann-Whitney (2-sample) tests
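
As a minimal sketch of what these calls look like in practice, here we compare, say, Diets 1 and 2 at day 21 using the chicks data loaded above (the object name chicks_sub is just for illustration):

# Day 21 weights for Diets 1 and 2 only
chicks_sub <- chicks %>% 
  filter(Time == 21, Diet %in% c(1, 2))

# Normal data, equal variances: Student's t-test
t.test(weight ~ Diet, data = chicks_sub, var.equal = TRUE)

# Normal data, unequal variances: Welch's t-test (the default)
t.test(weight ~ Diet, data = chicks_sub)

# Data not normal: Mann-Whitney test (the two-sample Wilcoxon test)
wilcox.test(weight ~ Diet, data = chicks_sub)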

When we compare two or more groups we usually do an ANOVA, and the same situation is true. For ANOVAs our options include (but are not limited to):

Assumption | R function | Note
Normal data, equal variances | aov(...) | A vanilla analysis of variance
Normal data, unequal variances | oneway.test(...) | Uses Welch’s approximation of variances; robust if the variances differ by no more than a factor of four; one could also stabilise the variances with a square-root transformation, or use kruskal.test()
Data not normal (and/or heteroscedastic) | kruskal.test(...) | Kruskal-Wallis rank sum test
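
Again as a minimal sketch, here is what the three options look like for the day 21 chick weights, this time keeping all four diets (the object name chicks_21 is illustrative):

chicks_21 <- chicks %>% 
  filter(Time == 21)

# Normal data, equal variances: a plain analysis of variance
summary(aov(weight ~ Diet, data = chicks_21))

# Normal data, unequal variances: Welch's ANOVA
oneway.test(weight ~ Diet, data = chicks_21)

# Data not normal and/or heteroscedastic: Kruskal-Wallis rank sum test
kruskal.test(weight ~ Diet, data = chicks_21)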

See this discussion if you would like to know about some more advanced options when faced with heteroscedastic data.

Our tests for these two assumptions often fail with real data. Therefore, we must often identify the way in which our data are distributed (refer to Chapter 5) so we may better decide how to transform them in an attempt to coerce them into a format that will pass the assumptions.

10.2 Transforming data

When transforming data, one performs a mathematical operation on the observations and then uses these transformed numbers in the statistical tests. After one has conducted the statistical analysis and calculated the mean ± SD (or ± 95% CI), these values are back-transformed (i.e. by applying the reverse of the transformation function) to the original scale before being reported. Note that in back-transformed data the SD (or CI) are not necessarily symmetrical, so one cannot simply compute one of them (e.g. the upper) and then assume that the lower one will be the same distance away from the mean.
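
Here is a minimal sketch of this workflow, using simulated log-normal data purely for illustration (the object names are made up):

# Simulate right-skewed (log-normal) data, summarise on the log scale,
# then back-transform the summary statistics for reporting
set.seed(13)
dat <- data.frame(mass = rlnorm(n = 50, meanlog = 2, sdlog = 0.5))

log_mean <- mean(log(dat$mass))
log_sd <- sd(log(dat$mass))

# Back-transformed mean on the original scale
exp(log_mean)

# Back-transformed mean - SD and mean + SD; note that these two limits
# are not the same distance away from the back-transformed mean
exp(log_mean - log_sd)
exp(log_mean + log_sd)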

When transforming data, it is a good idea to know a bit about how data within your field of study are usually transformed, and to use the same approach in your own work. Don’t try all the various transformations until you find one that works, else it might seem as if you are trying to massage the data into an acceptable outcome. The effects of transformations are often difficult to see in the shape of data distributions, especially when you have few samples, so trust that what you are doing is correct. Unfortunately, as I said before, transforming data requires a bit of experience and knowledge of the subject matter, so read widely before you commit to one.

(Some of the text below comes from here). Also see here for more detailed discussion around these issues.

10.2.1 Log transform

The logarithm, \(x\) to \(\log_{10}(x)\), or \(x\) to \(\log_{e}(x)\) (\(\ln x\), also known as the natural log), or \(x\) to \(\log_{2}(x)\), is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It cannot be applied to zero or negative values. One unit on a logarithmic scale means a multiplication by the base of the logarithm being used. In biology many response variables (i.e. measurements of metabolic or growth rate processes, etc.) have a log-normal distribution, and applying a log-transform to such data will make them normal. It can also be used to transform exponential data to linearity.

The back-transformation of \(\log_{10}\)-transformed data is to raise \(10\) to the power of the number. For example, if the mean of the base-10 log-transformed data is \(1.43\), the back-transformed mean is \(10^{1.43}=26.9\). If the mean of your \(\log_{e}\)-transformed data is \(3.65\), the back-transformed mean is \(e^{3.65}=38.5\).
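
These back-transformations are easily verified in R:

10^1.43   # back-transform a base-10 log mean of 1.43; approximately 26.9
exp(3.65) # back-transform a natural log mean of 3.65; approximately 38.5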

10.2.2 Arcsine transform

The arcsine transformation stabilises variance and normalises proportional data (0 to 1). The use of this transformation is today seen as a bit controversial, and we shall not cover it here. Use a statistical test such as a Generalised Linear Model (not covered in this course) or a logistic regression when dealing with these kinds of data.

10.2.3 Cube root

The cube root, \(x\) to \(x^{1/3}\), is a fairly strong transformation with a substantial effect on distribution shape, though it is weaker than the logarithm. It is also used for reducing right skewness, and has the advantage that it can be applied to zero and negative values. Note that the cube root of a volume has the units of a length. It is commonly applied to rainfall data.
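
One practical note: in R the ^ operator returns NaN when a negative number is raised to a fractional power, so a sign-preserving cube root is needed if negative values are present. A minimal sketch:

x <- c(-8, 0, 27)
x^(1/3)                 # the negative value comes back as NaN
sign(x) * abs(x)^(1/3)  # sign-preserving cube root: -2, 0, 3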

10.2.4 Square root transform

The square root, \(x\) to \(x^{0.5} = \sqrt{x}\), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to count data, i.e. data that typically follow a Poisson distribution. In a Poisson distribution the mean equals the variance, so group means will always be correlated with the within-group variances.

If you have negative numbers, you obviously cannot take the square root, and in such instances it is advisable to add a constant to each number to make them positive. The square-root transformation is used when the measurement is a count of something, such as bacterial colonies per petri dish, blood cells going through a capillary per minute, mutations per generation, etc.

The back transformation is to square the number.
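
Below is a minimal sketch with simulated count data, purely for illustration (the object names are made up):

# Simulated counts of, say, colonies per dish; square-root transform,
# summarise, and back-transform by squaring
set.seed(13)
colonies <- rpois(n = 30, lambda = 4)

sqrt_mean <- mean(sqrt(colonies))

# Back-transformed mean on the original (count) scale
sqrt_mean^2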

10.3 Exercises

10.3.1 Exercise 1

Find one of the days of measurement on which the chicken weights do not pass the assumption of normality, and another day (not day 21!) on which they do.

10.3.2 Exercise 2

Transform the data so that they may pass the assumptions.