Testing assumptions or: How I learned to stop worrying and transform the data

[Figure: "Paranormal" cartoon]

Assumptions

Throughout the preceding sections I have stressed the importance of testing the assumptions underlying some statistical tests, in particular t-tests, ANOVAs, regressions, and correlations. These statistics are called parametric statistics, and they require that the assumptions of normality and homogeneity of variances are met. These are the kinds of statistics you would normally be required to calculate, and because they are commonly used, most people are familiar with them.

Let us recap—the most important assumptions that we need to make sure are met before doing t-tests, ANOVAs or linear regressions are:

  • The dependent variable must be continuous.
  • The data must be independent of each other.
  • The data must be normally distributed.
  • The data must be homoscedastic.

For data conforming to this expectation, we say that the data are independent and identically distributed, or i.i.d. We will deal in particular with the assumptions of normality and homoscedasticity in this chapter. Whether or not the dependent data are continuous and independent comes down to proper experimental design, so if these are violated then… (I’ll say no more).

How do we know whether these assumptions are met? Here are your options, followed by a quick refresher:

  • Perform any of the diagnostic plots we covered in the earlier Chapters (a quick sketch follows this list).
  • Compare the variances and see if they differ by more than a factor of four.
  • Do a Levene’s test to test for equal variances.
  • Do a Shapiro-Wilk test for normality.
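
To illustrate the first option, here is a minimal sketch of two of the diagnostic plots (a histogram and a quantile-quantile plot). It assumes only the tidyverse and the built-in ChickWeight data, which we load again properly in the next section:

library(tidyverse)
chicks <- as_tibble(ChickWeight)

# a histogram of the weights; look for obvious skewness
ggplot(chicks, aes(x = weight)) +
  geom_histogram(bins = 30)

# a quantile-quantile plot; normally distributed data fall close to the line
ggplot(chicks, aes(sample = weight)) +
  stat_qq() +
  stat_qq_line()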

However, when the data are not normal (i.e. skewed) or the variances are unequal — as sometimes happens — the resultant parametric test statistics cannot be used. When this happens, we have two options:

  • Apply the non-parametric equivalent for the statistical test in question.
  • Transform the data.

It is the intention of this Chapter to discuss some options for checking the assumptions and to show some data transformations. But before we do that, please revise the non-parametric options available as replacements for the main parametric approaches as may be seen in our online textbook and the succinct summary presented in the Methods Cheatsheet.

Checking assumptions

Normality

Before we begin, let’s go ahead and activate our packages and load our data.

library(tidyverse)               # activate the tidyverse packages
chicks <- as_tibble(ChickWeight) # the built-in ChickWeight data as a tibble

The quickest method of testing the normality of a variable is with the Shapiro-Wilk normality test. This will return two values, a W score and a p-value. For the purposes of this course we may safely ignore the W score and focus on the p-value. When p >= 0.05 we may assume that the data are normally distributed. If p < 0.05 then the data are not normally distributed. Let’s look at all of the chicks data without filtering it:

shapiro.test(chicks$weight)
R> 
R>  Shapiro-Wilk normality test
R> 
R> data:  chicks$weight
R> W = 0.90866, p-value < 2.2e-16

Are these data normally distributed? How do we know? Now let’s filter the data based on the different diets for only the weights taken on the last day (21):

chicks %>% 
  filter(Time == 21) %>% 
  group_by(Diet) %>% 
  summarise(norm_wt = as.numeric(shapiro.test(weight)[2]))
R> # A tibble: 4 × 2
R>   Diet  norm_wt
R>   <fct>   <dbl>
R> 1 1       0.591
R> 2 2       0.949
R> 3 3       0.895
R> 4 4       0.186

How about now?

Homoscedasticity

Here we need no fancy test. We must simply calculate the variance of the variables we want to use and see that they are not more than 3-4 times greater than one another.

chicks %>% 
  filter(Time == 21) %>% 
  group_by(Diet) %>% 
  summarise(var_wt = var(weight))
R> # A tibble: 4 × 2
R>   Diet  var_wt
R>   <fct>  <dbl>
R> 1 1      3446.
R> 2 2      6106.
R> 3 3      5130.
R> 4 4      1879.

Well, do these data pass the two main assumptions?
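
The Levene’s test mentioned earlier in the chapter could also be used here. The code below is a sketch only; it assumes that the car package is installed (it is not used elsewhere in this chapter):

library(car)

chicks_21 <- chicks %>%   # keep only the day-21 weights
  filter(Time == 21)

# H0: the variances are equal across the four diet groups
leveneTest(weight ~ Diet, data = chicks_21)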

Epic fail. Now what?

After we have tested our data for the two key assumptions we are faced with a few choices. The basic guidelines below apply to paired tests, one- and two-sample tests, as well as one- and two-sided hypotheses (i.e. t-tests and their ilk):

Assumption        | R function                    | Note
------------------|-------------------------------|------------------------------------------------------
Equal variances   | t.test(..., var.equal=TRUE)   | Student’s t-test
Unequal variances | t.test(...)                   | Using Welch’s approximation of variances
Normal data       | t.test(...)                   | As per equal/unequal variance cases, above
Data not normal   | wilcox.test(...)              | Wilcoxon (1-sample) or Mann-Whitney (2-sample) tests
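
To make the table above concrete, here is a minimal sketch; the choice of Diets 1 and 2 at day 21 is mine, purely for illustration:

# the day-21 weights for two of the diets
d1 <- chicks %>% filter(Time == 21, Diet == "1") %>% pull(weight)
d2 <- chicks %>% filter(Time == 21, Diet == "2") %>% pull(weight)

t.test(d1, d2, var.equal = TRUE)  # Student's t-test (equal variances)
t.test(d1, d2)                    # Welch's t-test (unequal variances; the default)
wilcox.test(d1, d2)               # Mann-Whitney test (data not normal)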

When we compare two or more groups we usually do an ANOVA, and the same considerations apply. For ANOVAs our options include (but are not limited to):

Assumption                                 | R function       | Note
-------------------------------------------|------------------|--------------------------------------------
Normal data, equal variances               | aov(...)         | A vanilla analysis of variance
Normal data, unequal variances             | oneway.test(...) | Using Welch’s approximation of variances, if needed, but robust if variances differ by no more than 4-fold; one could also stabilise the variances using a square-root transformation; kruskal.test() may also be used
Data not normal (and/or unequal variances) | kruskal.test(...) | Kruskal-Wallis rank sum test
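
Again, a minimal sketch of the three options applied to the day-21 chick weights; the subsetting is mine and is for illustration only:

chicks_21 <- chicks %>% filter(Time == 21)

summary(aov(weight ~ Diet, data = chicks_21))  # normal data, equal variances
oneway.test(weight ~ Diet, data = chicks_21)   # normal data, unequal variances (Welch)
kruskal.test(weight ~ Diet, data = chicks_21)  # data not normal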

See this discussion if you would like to know about some more advanced options when faced with heteroscedastic data.

Our tests for these two assumptions fail often with real data. Therefore, we must often identify the way in which our data are distributed (refer to Chapter 5) so we may better decide how to transform them in an attempt to coerce them into a format that will pass the assumptions.

Data transformations

When transforming data, one does a mathematical operation on the observations and then uses these transformed numbers in the statistical tests. After one has conducted the statistical analysis and calculated the mean ± SD (or ± 95% CI), these values are back-transformed (i.e. by applying the reverse of the transformation function) to the original scale before being reported. Note that in back-transformed data the SD (or CI) are not necessarily symmetrical, so one cannot simply compute one (e.g. the upper) and then assume that the lower one is the same distance away from the mean.
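
As a small illustration of that last point, here is a sketch (mine, not part of the original text) using a base-10 log transformation of the chick weights; notice that the back-transformed limits sit at different distances from the back-transformed mean:

log_wt <- log10(chicks$weight)   # transform
wt_mean <- mean(log_wt)
wt_ci <- qt(0.975, df = length(log_wt) - 1) * sd(log_wt) / sqrt(length(log_wt))

10^wt_mean            # back-transformed mean
10^(wt_mean - wt_ci)  # back-transformed lower limit
10^(wt_mean + wt_ci)  # back-transformed upper limit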

When transforming data, it is a good idea to know a bit about how data within your field of study are usually transformed—try and use the same approach in your own work. Don’t try all the various transformations until you find one that works, else it might seem as if you are trying to massage the data into an acceptable outcome. The effects of transformations are often difficult to see in the shape of data distributions, especially when you have few samples, so trust that what you are doing is correct. Unfortunately, as I said before, transforming data requires a bit of experience and knowledge of the subject matter, so read widely before you commit to one.

Some of the text below comes from this discussion and from John H. McDonald. Below (i.e. the text on log transformation, square-root transformation, and arcsine transformation) I have extracted mostly verbatim the excellent text produced by John H. McDonald from his Handbook of Biological Statistics. Please attribute this text directly to him. I have made minor editorial changes to point towards some R code, but aside from that the text is more-or-less used verbatim. I strongly suggest reading the preceding text under his Data transformations section, as well as consulting the textbook for in-depth reading about biostatistics. Highly recommended!

Log transformation

This consists of taking the log of each observation. You can use either base-10 logs (log10(x)) or base-e logs, also known as natural logs (log(x)). It makes no difference for a statistical test whether you use base-10 logs or natural logs, because they differ by a constant factor; the natural log of a number is just 2.303…× the base-10 log of the number. You should specify which log you’re using when you write up the results, as it will affect things like the slope and intercept in a regression. I prefer base-10 logs, because it’s possible to look at them and see the magnitude of the original number: \(\log_{10}(1) = 0\), \(\log_{10}(10) = 1\), \(\log_{10}(100) = 2\), etc.

The back-transformation is to raise 10 or e to the power of the number; if the mean of your base-10 log-transformed data is 1.43, the back-transformed mean is \(10^{1.43} = 26.9\) (in R, 10^1.43). If the mean of your base-e log-transformed data is 3.65, the back-transformed mean is \(e^{3.65} = 38.5\) (in R, exp(3.65)). If you have zeros or negative numbers, you can’t take the log; you should add a constant to each number to make them positive and non-zero (i.e. log10(x + 1)). If you have count data, and some of the counts are zero, the convention is to add 0.5 to each number.

Many variables in biology have log-normal distributions, meaning that after log-transformation, the values are normally distributed. This is because if you take a bunch of independent factors and multiply them together, the resulting product is log-normal. For example, let’s say you’ve planted a bunch of maple seeds, then 10 years later you see how tall the trees are. The height of an individual tree would be affected by the nitrogen in the soil, the amount of water, amount of sunlight, amount of insect damage, etc. Having more nitrogen might make a tree 10% larger than one with less nitrogen; the right amount of water might make it 30% larger than one with too much or too little water; more sunlight might make it 20% larger; less insect damage might make it 15% larger, etc. Thus the final size of a tree would be a function of nitrogen×water × sunlight × insects, and mathematically, this kind of function turns out to be log-normal.
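
A sketch of what a log transformation might look like for the day-21 chick weights, re-using the code pattern from earlier in the chapter (whether a log transformation is the right choice for these particular data is a separate question):

chicks %>% 
  filter(Time == 21) %>% 
  mutate(log_wt = log10(weight)) %>%   # log10-transform the weights
  group_by(Diet) %>% 
  summarise(norm_log_wt = as.numeric(shapiro.test(log_wt)[2]))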

Square-root transformation

This consists of taking the square root of each observation. The back transformation is to square the number. If you have negative numbers, you can’t take the square root; you should add a constant to each number to make them all positive.

People often use the square-root transformation when the variable is a count of something, such as bacterial colonies per petri dish, blood cells going through a capillary per minute, mutations per generation, etc.
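
A minimal sketch with some made-up counts (hypothetical values, for illustration only):

counts <- c(0, 3, 1, 8, 2, 15, 4)   # hypothetical counts
sqrt_counts <- sqrt(counts)         # transform
mean(sqrt_counts)^2                 # back-transform the mean by squaring it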

Arcsine transformation

This consists of taking the arcsine of the square root of a number (in R, asin(sqrt(x))). (The result is given in radians, not degrees, and can range from −π/2 to π/2.) The numbers to be arcsine transformed must be in the range 0 to 1. This is commonly used for proportions, which range from 0 to 1, […] the back-transformation is to square the sine of the number (in R, sin(x)^2).
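
A minimal sketch with some made-up proportions (hypothetical values, for illustration only):

p <- c(0.10, 0.25, 0.50, 0.90)   # hypothetical proportions
p_trans <- asin(sqrt(p))         # arcsine-square-root transformation (in radians)
sin(p_trans)^2                   # back-transformation recovers the original proportions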

Other transformations

These are by no means the only types of transformations available. Let us classify the above transformations, and a few others, into categories of the types of corrective actions needed:

Slightly skewed data

  • sqrt(x) for positively skewed data
  • sqrt(max(x + 1) - x) or x^2 for negatively skewed data

Moderately skewed data

  • log10(x) for positively skewed data
  • log10(max(x + 1) - x) or x^3 for negatively skewed data

Severely skewed data

  • 1/x for positively skewed data
  • 1/(max(x + 1) - x) or higher powers than cubes for negatively skewed data

Deviations from linearity and heteroscedasticity

  • log(x) when the dependent variable starts to increase more and more rapidly with increasing independent variable values
  • x^2 when the dependent variable values decrease more and more rapidly with increasing independent variable values
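
As one worked example of the reflect-and-transform recipes above, a sketch with made-up, negatively skewed values:

x <- c(2, 6, 7, 8, 8, 9, 9, 10)   # hypothetical, negatively skewed values
x_trans <- sqrt(max(x + 1) - x)   # reflect about the maximum, then take the square root
hist(x_trans)                     # inspect the shape of the transformed values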

– Regression models do not necessarily require data transformations to deal with heteroscedasticity. Generalised Linear Models (GLM) can be used with a variety of variance and error structures in the residuals via so-called link functions. Please consult the glm() function for details.

– The linearity requirement specifically applies to linear regressions. However, regressions do not have to be linear. Some degree of curvature can be accommodated by additive (polynomial) models, which are like linear regressions, but with additional terms (you already have the knowledge you need to fit such models). More complex departures from linearity can be modelled by non-linear models (e.g. exponential, logistic, Michaelis-Menten, Gompertz, von Bertalanffy and their ilk) or Generalised Additive Models (GAM) — these more complex relationships will not be covered in this module. The gam() function in the mgcv package fits GAMs. After fitting these parametric or semi-parametric models to accommodate non-linear regressions, the residual error structure still needs to meet the normality requirements, and this can be tested as before with simple linear regressions.
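
By way of illustration only, a sketch of what these alternatives might look like with the chicks data; the model formulas here are my own hypothetical choices, not prescriptions:

# a GLM with a Gamma error distribution and a log link, rather than transforming the response
glm_mod <- glm(weight ~ Diet, data = chicks, family = Gamma(link = "log"))
summary(glm_mod)

# a GAM with a smooth term for Time; assumes the mgcv package is installed
library(mgcv)
gam_mod <- gam(weight ~ s(Time), data = chicks)
summary(gam_mod)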

Knowing how to successfully implement transformations can be as much art as science and requires a great deal of experience to get right. Due to the multitude of options I cannot offer comprehensive examples to deal with all eventualities — so I will not even try! I suggest reading widely on the internet or textbooks, and practising by yourselves on your own datasets.

Exercises

Exercise 1 Find one of the days of measurement where the chicken weights do not pass the assumptions of normality, and another day (not day 21!) in which they do.

Exercise 2 Transform the data so that they may pass the assumptions.