Types of data
“The plural of anecdote is not data.”
— Roger Brinner
In this chapter we will, firstly, recap how to work with data in R. Then we will look at the different kinds of biological and environmental data that are typically encountered by most biologists. The data seen here are not an exhaustive list of all the various types of data out there, but they should represent the bulk of our needs.
After we have become familiar with the different kinds of data, we will look at summaries of these data, which are generally required as the starting point for our analyses. After summarising the data in tables and so forth, we may want to produce graphical summaries to see broad patterns and trends; visual data representations, which complement the tabulated data, will be covered in a later chapter (Chapter 4). Both of these approaches form the basis of ‘exploratory data analysis.’
Recap: working with data
We revisit the process for reading data into R. R will read in many types of data, including spreadsheets, text files, binary files, and files from other statistical packages.
Preparing data
For R to be able to analyse your data, they need to be in a consistent format, with each variable in its own column and each sample in its own row. This is called tidy data. The format within each variable (column) needs to be consistent and is commonly one of the following types: a continuous numeric variable (e.g., fish length in m: 0.133, 0.145); a factor or categorical variable (e.g., algal colour: red, green, brown); a logical variable (i.e., TRUE or FALSE); or a date (e.g., 1981-09-01 or some other accepted date or time format). You can also use other, more specific formats, or general text (string) formats.
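As a minimal sketch of what these column types look like once they are in R, the block below builds a tiny dataframe by hand. The object name fish, its column names and all of its values are invented for illustration; they are not part of the course data.

```r
# Hypothetical example data; names and values are invented for illustration
fish <- data.frame(
  length_m  = c(0.133, 0.145, 0.128),              # continuous numeric
  colour    = factor(c("red", "green", "brown")),  # factor (categorical)
  tagged    = c(TRUE, FALSE, TRUE),                # logical
  sampled   = as.Date(c("1981-09-01", "1981-09-02", "1981-09-03")),  # date
  collector = c("AJ", "RS", "AJ"),                 # general text (string)
  stringsAsFactors = FALSE                         # keep text as character
)

# Report the class of each column (one class per variable)
col_classes <- sapply(fish, class)
print(col_classes)
```

Each column holds exactly one type, which is the consistency that tidy data ask for.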
Naming variables
R has pedantic requirements for naming variables. It is safest to not use spaces, special characters (e.g., commas, semicolons, any of the shift characters above the numbers), or function names (e.g., mean). In fact, I strongly suggest never to use spaces in variables, filenames, etc.
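One way to check whether a set of column names is safe is the base R function make.names(), which converts arbitrary strings into syntactically valid names. The names in raw_names below are hypothetical spreadsheet headers chosen for illustration.

```r
# Hypothetical column headers as they might be typed into a spreadsheet
raw_names <- c("flush dist", "land-dist", "Site", "mean")

# make.names() replaces characters that are not allowed in R names with "."
clean_names <- make.names(raw_names)
print(clean_names)

# Note: "mean" passes through unchanged. It is a valid name, but it masks
# nothing useful and is easily confused with the function mean(), so
# avoiding such names remains up to you.
```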
Dataframes
Generally, the best way to store your data is to put all your biological and environmental data into one dataframe so that you can analyse them together. This requires having the first few variables (columns) as descriptors of each of your samples (e.g., Site, Species, Sex, flush.dist, etc.). Each column has the same number of rows, so that it resembles a matrix. In essence, each row contains a data point (an observation; this will often reflect the response variable in your analysis), plus as many descriptors for that data point as are available (these are generally the explanatory variables in an analysis). In this module we will rely on the tidyverse approach for handling data, and in this context a dataframe is called a tibble.
R>   Site       Species    Sex flush.dist land.dist
R> 1    1 Oystercatcher Female       12.8     150.9
R> 2    1 Oystercatcher   Male        4.4     114.1
R> 3    1 Oystercatcher Female        9.5     153.7
R> 4    1 Oystercatcher Female       12.2     137.7
R> 5    1 Oystercatcher Female       10.1     143.3
R> 6    1 Oystercatcher   Male        6.6     142.8
Setting the R Project
An important aspect of any program is its working directory (or folder). This is where R will read files from and write files to. RStudio displays the current working directory within the title region of the Console. The recommended way to deal with working directories is to use the RStudio Project feature: click on the File menu, select New Project, and give it an appropriate name (again, don’t use spaces). A new RStudio window will then open and the Files pane within RStudio will show all the files associated with this particular set of analyses (typically a Project is associated with a suite of analyses, e.g., you’d want one for all the analyses that will contribute towards your Hons project). All your data files will also be within the Project folder. The Project will reflect an actual place in your computer’s file system; make sure you know how to navigate there using Windows Explorer (Windows) or Finder (Mac).
Once the Project has been set up correctly, you’ll see the name of the project reflected in the top right corner of the RStudio window (from there you can also select other projects that you have set up previously).
Read in data
We are going to read in the Beach Birds dataset provided. These data reflect results of an experiment on beaches designed to measure the influence of off-road vehicles (ORVs) on shorebirds. Some colleagues visited five different beaches (Site) on the Sunshine Coast, Queensland, Australia, and at each site drove along the shoreline in an ORV. As they drove along, they identified birds in the distance, and drove at them until they took flight. They recorded the species (Species) and sex (Sex) of the bird, the distance of the ORV from the bird at which it took flight (flush.dist), as well as the distance the bird flew before settling again (land.dist). In instances where sex could not be determined, or where birds flew out of sight before landing, they marked observations NA.
Often the first task is to convert an Excel file into a .csv file, which is my recommended format for getting data into R. I have already done this, but you would open the Excel file (e.g. BeachBirds.xlsx), then select “Save As” from the File menu. In the Format drop-down menu, select the option called “Comma Separated Values”, then hit Save. You’ll get a warning that formatting will be removed and that only one sheet will be exported; simply Continue. When saving this new file, be sure to select a folder that you can find. Look in that folder (directory) and make sure that you see BeachBirds.csv. Sometimes this file might not actually have commas between the fields (columns), but semi-colons (“;”) instead; use a text editor to see what’s inside. Set up the read.csv() function appropriately to deal with commas or semi-colons (i.e. by using the sep argument, for example sep = ";").
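To illustrate the separator issue without relying on any particular file, the sketch below writes a small semicolon-separated file to a temporary location and reads it back. The file contents are made up and only loosely mimic the bird data.

```r
# Write a small semicolon-separated file to a temporary location
# (hypothetical contents standing in for a ";"-separated export)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Site;Species;flush.dist",
             "1;Oystercatcher;12.8",
             "1;Gull;4.4"), tmp)

# Peek at the raw text to see which separator the file uses
print(readLines(tmp, n = 2))

# Tell read.csv() that the fields are separated by semicolons
semi <- read.csv(tmp, sep = ";")
print(semi)
```

Base R also provides read.csv2(), whose defaults are sep = ";" and dec = ",", which suits files exported under locales that use a decimal comma.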
Now let’s start writing the script by clicking on the New Document button (in the top-left corner of the RStudio window) and selecting R Script. It is recommended to start a script with some basic information for you to refer back to later. Start with a comment line (the line begins with a #) that tells you the name of the script, something about the script, who created it, and the date it was created. In the Source Editor enter:
# Beachbirds.R. Reads in and manipulates bird data
# <YourName> <CurrentDate>
TIP: Comments The hash (#) tells R not to run any of the text on that line to the right of the symbol. This is the standard way of commenting/annotating R code; it is VERY good practice to comment in detail so that you can understand later what you have done. Note that you can comment out entire blocks of code by highlighting them in the Source Editor, going to the Code menu, and choosing Comment/Uncomment Lines.
TIP: Splitting lines of code If you have long lines of code, then you can spread them over multiple lines. You just have to make sure that R knows something is coming, either by leaving a bracket open, or ending the line with an operator.
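Both ways of continuing a line can be sketched as follows. The flush_dist values are the six flush distances shown in the dataframe printout earlier; the object names are chosen here for illustration.

```r
# Flush distances (m) taken from the first six rows shown earlier
flush_dist <- c(12.8, 4.4, 9.5, 12.2)

# An open bracket tells R that the call continues on the next line
mean_dist <- mean(
  flush_dist,
  na.rm = TRUE
)

# A trailing operator (here +) also tells R that more is coming
total <- 12.8 + 4.4 +
  9.5 + 12.2

print(mean_dist)
print(total)
```

If the + were placed at the start of the second line instead, R would treat the first line as complete and evaluate the two lines separately.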
So now we can save our script. Choose File/Save As and type in “BeachBirds”. RStudio will automatically add a “.R” extension. But where will it save it? Yes, that’s right: the Working Directory specified by the Project we set up initially.
Importing data
Now we have the Project set, and R will know where to look for the files we read. The function read.csv() is the most convenient way to read in most biological data. There are several other ways to read in data, but .csv is usually the easiest. To find out what it does, we will read its help entry:
?read.csv
All R Help items are in the same format. A short Description (of what it does), Usage, Arguments (the different inputs it requires), Details (of what it does), Value (what it returns) and Examples. Arguments (the parameters that are passed to the function) are the lifeblood of any function, as this is how you provide information to R. You do not need to specify all arguments, as many have appropriate default values, and others might not be needed for your particular case.
There are many arguments that you can use to customize reading of your data, but most important are:
file: the name of the data file to be read (this needs to include its path if it is not in your specified working directory); note that file names must be placed within quotation marks
header: is a logical argument (TRUE/FALSE) that specifies whether R reads the first line of your file as the names of the variables it contains
quote: by default, character strings can be quoted by either single ’ or double ” quotes; this usually does not need to be changed when exporting data as .csv from Excel
Let’s assign the data in the file BeachBirds.csv to a variable called birds:
birds <- read.csv("../data/BeachBirds.csv", header = TRUE)
Remember that specifying header = TRUE indicates to R that the first row in the spreadsheet contains variable (column) names (headers). Note, also, that you can omit this argument, as header = TRUE is the default in read.csv().
We can see that we have an object called birds. We can find out what sort of object birds is by typing:
class(birds)
R> [1] "data.frame"
In this case, birds is a dataframe.
TIP: Stick with .csv files There are packages in R to read in Excel spreadsheets (e.g., xlsx), but remember there are likely to be problems reading in formulae, graphs, macros and multiple worksheets. We recommend exporting data deliberately to .csv files (which are also commonly used in other programs). This not only avoids complications, but also allows you to unambiguously identify the data you based your analysis on. This last statement should give you the hint that it is good practice to name your .csv slightly differently each time you export it from Excel, perhaps by appending a reference to the date it was exported. Also, for those of you who use commas in Excel as the decimal separator, or to separate 1000s, undo these features now.
TIP: Dealing with missing data The .csv file format is usually the most robust for reading data into R. Where you have missing data (blanks), the .csv format separates these by commas. However, there can be problems with blanks if you read in a space-delimited format file. If you are having trouble reading in missing data as blanks, try replacing them in your spreadsheet with NA, the missing data code in R. In Excel, highlight the area of the spreadsheet that includes all the cells you need to fill with NA. Do an Edit/Replace… and leave the “Find what:” textbox blank and in the “Replace with:” textbox enter NA, the missing value code. Once imported into R, the NA values will be recognised as missing data.
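As an alternative to editing the spreadsheet, read.csv() can be told which strings to treat as missing through its na.strings argument. The sketch below uses a small made-up file with both blank cells and explicit NA codes.

```r
# A small hypothetical file with blank cells and an explicit NA code
tmp <- tempfile(fileext = ".csv")
writeLines(c("Species,Sex,flush.dist",
             "Gull,Female,9.4",
             "Gull,,10.1",
             "Gull,NA,"), tmp)

# Treat both empty fields and the text "NA" as missing values
miss <- read.csv(tmp, na.strings = c("", "NA"))
print(miss)

# Count the missing values in each column
print(colSums(is.na(miss)))
```

Without "" in na.strings, a blank in a text column such as Sex would be read as an empty string rather than as NA.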
Viewing our data
There are many ways of finding broad views of our data in R. The first few functions that we will look at were designed simply to scrutinise the contents of a tibble, which is the ‘tidyverse’ name for the general ‘container’ that holds our data in the software’s environment (i.e. in a block of the computer’s memory dedicated to the R software). Whatever data are in R’s environment will be seen in the ‘Environment’ tab in the top right of RStudio’s four panes.
From the Environment pane
The first way to see what’s in the tibble is not really a function at all, but a convenient (and lazy) way of quickly seeing a few basic things about our data. Let us look at the BeachBirds data. Load it like so (you’ll remember from the Intro R Workshop):
# Load the data
birds <- read.csv("../data/BeachBirds.csv")
In the Environment pane, the object named birds will now appear under the panel named Data. To the left of it is a small white arrow in a blue circular background. By default the arrow points to the right. Clicking on it causes it to point down, which denotes that the data contained within the tibble have been expanded. The names of the columns (more correctly called ‘variables’) can now be seen. There you can see the variables Site, Species, Sex, flush.dist, and land.dist. The class of data they represent can be seen too: there are continuous data of class num, variables of class chr, and one of class int. Beneath these there are a number of attributes that denote some meta-data, which you may safely ignore for now.
head() and tail()
The head() and tail() functions simply display the top and bottom portions of the tibble, and you may add the n argument and an integer to request that only a certain number of rows are returned; by default the top or bottom six rows are displayed.
There are various bits of additional information printed out. The display will change somewhat if there are many more variables than can comfortably fit within the width of the output window (typically the Console). When the data are held in a tibble, the same kinds of information as were returned with the Environment pane expansion arrow are displayed, but the data class is now accompanied by an angle bracket (i.e. <...>) notation. For example, num in the Environment pane and <dbl> as per the head() or tail() methods are exactly the same: both denote continuous (or ‘double precision’) data. A plain dataframe, such as the one returned by read.csv(), prints without these class labels.
head(birds)
R>   Site       Species    Sex flush.dist land.dist
R> 1    1 Oystercatcher Female       12.8     150.9
R> 2    1 Oystercatcher   Male        4.4     114.1
R> 3    1 Oystercatcher Female        9.5     153.7
R> 4    1 Oystercatcher Female       12.2     137.7
R> 5    1 Oystercatcher Female       10.1     143.3
R> 6    1 Oystercatcher   Male        6.6     142.8
tail(birds, n = 2)
R>     Site Species    Sex flush.dist land.dist
R> 398    5    Gull Female        9.4      50.5
R> 399    5    Gull Female       10.1      52.1
As an alternative to head(), you may also simply type the name of the object (here birds) in the Console (or write it in the Source Editor if it is necessary to retain the function for future use) and the top portion of the tibble will be displayed, again trimmed to account for the width of the display.
colnames()
This function simply returns a listing of the variable (column) names.
colnames(birds)
R> [1] "Site" "Species" "Sex" "flush.dist" "land.dist"
There is an equivalent function called rownames() that may be used to show the names of rows in your tibble, if these are present. Row names are generally discouraged, and we will refrain from using them here.
summary()
The next way to see the contents of the tibble is to apply the summary() function. Here we see something else. Some descriptive statistics that describe properties of the full set of data are now visible. These summary statistics condense each of the variables into numbers that describe some properties of the data within each column. You will already know the concepts of the ‘minimum,’ ‘median,’ ‘mean,’ and ‘maximum.’ These are displayed here.
summary(birds)
R> Site Species Sex flush.dist
R> Min. :1.00 Length:399 Length:399 Min. : 0.00
R> 1st Qu.:2.00 Class :character Class :character 1st Qu.: 5.30
R> Median :3.00 Mode :character Mode :character Median : 8.30
R> Mean :2.89 Mean : 8.14
R> 3rd Qu.:4.00 3rd Qu.:10.90
R> Max. :5.00 Max. :20.00
R> land.dist
R> Min. : 5.60
R> 1st Qu.: 42.50
R> Median : 64.40
R> Mean : 73.12
R> 3rd Qu.: 90.10
R> Max. :199.90
This will serve well as an introduction to the next chapter, which is about descriptive statistics. What are they, and how do we calculate them?
Task 1: Now that we have refreshed our memory you should start to remember how to work with our data. In a new R script, list a few other approaches available for interrogating our dataframe and finding some more details about it.
Task 2: The next thing we want to do is to subset and filter our data. In the newly created R script, write down some lines of code that will subset and filter the beach birds data in interesting and useful ways. Always provide ample comments to indicate what you are doing, and why you’re doing it. Hint: you’ll want to use the tidyverse package.
This will show you what the data for Sex and for flush.dist look like.
Data classes
In biology we will encounter many kinds of data, and the kind of data we have determines which type of statistical analysis is appropriate.
Numerical data
Numerical data are quantitative in nature. They represent things that can be objectively counted, measured or calculated.
Discrete data
Integer data (discrete numbers or whole numbers), such as counts. For example, family A may have 3 children and family B may have 1 child, but neither can have 2.3 children. Integer data usually answer the question, “how many?” In R integer data are called int or <int>.
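A quick sketch of how R distinguishes integers from other numbers: whole numbers typed without the L suffix are stored as doubles, so counts are only of class integer if they are created or converted as such. The counts below are hypothetical.

```r
# Hypothetical counts of children per family; the L suffix makes integers
n_children <- c(3L, 1L, 2L)
print(class(n_children))

# The same values typed without L are stored as doubles ("numeric")
n_children_dbl <- c(3, 1, 2)
print(class(n_children_dbl))

# as.integer() converts whole-number doubles to true integers
n_children_int <- as.integer(n_children_dbl)
print(class(n_children_int))
```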
Continuous data
These usually represent measured ‘things,’ such as something’s heat content (temperature, measured in degrees Celsius) or distance (measured in metres or similar), etc. They can be rational numbers including integers and fractions, but typically they have an infinite number of ‘steps’ that depends on rounding (they can even be rounded to whole integers) or considerations such as measurement precision and accuracy. Often, continuous data have upper and lower bounds that depend on the characteristics of the phenomenon being studied or the measurement being taken. In R, continuous data are denoted num or <dbl>.
The kinds of summaries that lend themselves to continuous data are:
- Frequency distributions
- Relative frequency distributions
- Cumulative frequency distributions
- Bar graphs
- Box plots
- Scatter plots
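The first three summaries in this list can be sketched in base R by binning a continuous variable with cut() and counting with table(). The values are the six flush distances printed by head(birds) above; the bin edges are an arbitrary choice made for this illustration.

```r
# Flush distances (m) from the first six rows of the birds data
flush_dist <- c(12.8, 4.4, 9.5, 12.2, 10.1, 6.6)

# Bin the continuous values; the break points are an arbitrary assumption
bins <- cut(flush_dist, breaks = c(0, 5, 10, 15))

# Frequency distribution: number of observations per bin
freq <- table(bins)
print(freq)

# Relative frequency distribution: proportion of observations per bin
rel_freq <- prop.table(freq)
print(rel_freq)

# Cumulative frequency distribution: running total across bins
cum_freq <- cumsum(freq)
print(cum_freq)
```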
Dates
Dates are a special class of continuous data, and there are many different representations of the date classes. This is a complex group of data, and we will not cover much of it in this course.
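As a small taste of why dates behave like continuous data, the sketch below uses base R’s Date class, which supports arithmetic in units of days. The object names are chosen for illustration.

```r
# Convert text in the ISO format (YYYY-MM-DD) to the Date class
d <- as.Date("1981-09-01")
print(class(d))

# Dates support arithmetic: adding 30 moves the date forward by 30 days
later <- d + 30
print(later)

# The difference between two dates is a time interval in days
gap <- as.numeric(later - d)
print(gap)

# format() extracts components, here the four-digit year
print(format(d, "%Y"))
```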
Qualitative data
Qualitative data may be well-defined categories or they may be subjective, and generally include descriptive words for classes (e.g. mineral, animal, plant) or rankings (e.g. good, better, best).
Categorical data
Because there are categories, the number of members belonging to each of the categories can be counted. For example, there are three red flowers, 66 purple flowers, and 13 yellow flowers. The categories cannot be ranked relative to each other; in the example just provided, for instance, no value judgement can be assigned to the different colours. It is not better to be red than it is to be purple. There are just fewer red flowers than purple ones. Contrast this to another kind of categorical data called ‘ordinal data’ (see next). This class of data in an R dataframe (or in a ‘tibble’) is indicated by Factor or <fct>.
The kinds of summaries that lend themselves to categorical data are:
- Frequency distributions
- Relative frequency distributions
- Bar graphs
- Pie graphs (!!!)
- Category statistics
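The flower example above can be sketched with a factor and table(), which gives the frequency distribution, and prop.table(), which gives the relative frequencies. The counts are those quoted in the text; the object names are chosen for illustration.

```r
# Rebuild the flower example: 3 red, 66 purple and 13 yellow flowers
colour <- factor(rep(c("red", "purple", "yellow"), times = c(3, 66, 13)))

# Factor levels are unordered categories (sorted alphabetically by default)
print(levels(colour))

# Frequency distribution: how many flowers fall in each category
counts <- table(colour)
print(counts)

# Relative frequency distribution: the share of each category
props <- prop.table(counts)
print(round(props, 3))
```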
Ordinal data
This is a type of categorical data where the classes are ordered (a synonym is “ranked”), typically from low to high (or vice versa), but where the magnitude between the ordered classes cannot be precisely measured or quantified. In other words, the difference between them is somewhat subjective (i.e. it is qualitative rather than quantitative). These data are on an ordinal scale. The data may be entered as descriptive character strings (i.e. as words), or they may have been translated to an ordered vector of integers; for example, “1” for terrible, “2” for so-so, “3” for average, “4” for good and “5” for brilliant. Irrespective of how the data are presented in the dataframe, computationally (for some calculations) they are treated as an ordered sequence of integers, but they are simultaneously treated as categories (say, where the number of responses that report “so-so” can be counted). Ordinal data usually answer questions such as, “how many categories can the phenomenon be divided into, and how does each category rank with respect to the others?” Columns containing this kind of data are denoted Ord.factor or <ord>.
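The rating scale described above can be sketched as an ordered factor. The level names come from the text; the five responses in ratings are invented for illustration.

```r
# Hypothetical survey responses, entered as words
ratings <- c("good", "terrible", "so-so", "brilliant", "so-so")

# The ranking from low to high, as given in the text
rating_levels <- c("terrible", "so-so", "average", "good", "brilliant")

# ordered = TRUE tells R that the levels have a rank order
ord <- factor(ratings, levels = rating_levels, ordered = TRUE)
print(ord)

# Underneath, the categories are stored as an ordered sequence of integers
print(as.integer(ord))

# They can still be counted as categories ...
print(table(ord))

# ... and, unlike plain factors, compared by rank
print(ord < "good")
```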
Binary data
Right or wrong? True or false? Accept or reject? Black or white? Positive or negative? Good or bad? You get the idea… In other words, these are observations or responses that can take only one of two mutually exclusive outcomes. In R these are treated as ‘Logical’ data that take the values of TRUE or FALSE (note the case). In R, and computing generally, logical data are often denoted with 1 for TRUE and 0 for FALSE. This class of data is indicated by logi or <lgl>.
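A brief sketch of the link between logical values and 1/0: a comparison returns a logical vector, and because TRUE counts as 1, sum() gives the number of TRUE values. The flush distances are those printed by head(birds) earlier; the 10 m threshold is an arbitrary choice.

```r
# Flush distances (m) from the first six rows of the birds data
flush_dist <- c(12.8, 4.4, 9.5, 12.2, 10.1, 6.6)

# A comparison yields logical data (threshold chosen arbitrarily)
far <- flush_dist > 10
print(far)
print(class(far))

# TRUE is treated as 1 and FALSE as 0 in arithmetic
print(as.integer(far))

# So sum() counts the TRUE values and mean() gives their proportion
n_far <- sum(far)
print(n_far)
print(mean(far))
```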
Character values
As the name implies, these are not numbers. Rather, they are human words that have found their way into R for one reason or another. In biology we most commonly encounter character values when we have a list of things, such as sites or species. These values will often be used as categorical or ordinal data.
Missing values
Unfortunately, one of the most reliable aspects of any biological dataset is that it will contain some missing data. But how can something contain missing data? One could be forgiven for assuming that if the data are missing, then they obviously aren’t contained in the dataset. To better understand this concept we must think back to the principles of tidy data. Every observation must be in a row, and every column in that row must contain a value. The combination of multiple observations then makes up our matrix of data. Because data are therefore presented in a two-dimensional format, any missing values from an observation will need to have an empty place-holder to ensure the consistency of the matrix. These are what we are referring to when we speak of “missing values”. In R these appear as NA in a dataframe and are slightly lighter than the other values. These data are indicated in the Environment as NA and if a column contains only missing values it will be denoted as <NA>.
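How NA place-holders behave in calculations can be sketched as follows. Three of the landing distances are taken from the printout earlier; the NA has been inserted here for illustration.

```r
# Landing distances (m) with one value deliberately set to missing
land_dist <- c(150.9, NA, 153.7, 137.7)

# is.na() flags the place-holders
print(is.na(land_dist))
n_missing <- sum(is.na(land_dist))
print(n_missing)

# Most summary functions return NA if any value is missing ...
mean_with_na <- mean(land_dist)
print(mean_with_na)

# ... unless told to remove the missing values first
mean_no_na <- mean(land_dist, na.rm = TRUE)
print(mean_no_na)
```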
Complex numbers
“And if you gaze long enough into an abyss, the abyss will gaze back into you.”
— Friedrich Nietzsche
In an attempt to allow the shreds of our sanity to remain stitched together we will end here with data types. But be warned, ye who enter, that below countless rocks, and around a legion of corners, lie in wait a myriad of complex data types. We will encounter many of these at the end of this course when we encounter modelling, but by then we will have learned a few techniques that will prepare us for the encounter.