R datasets

class: center, middle, inverse, title-slide

# R datasets
## Built-in datasets
### AJ Smit
### 2020/07/06 (updated: 2020-07-07)

---

## Loading and discovering built-in datasets

To find out which datasets are available to you on your system, execute the following command:

```r
data()
```

A new tab will open and the list of datasets will be displayed. For example, the following might be displayed:

```
Data sets in package ‘datasets’:

AirPassengers                       Monthly Airline Passenger Numbers 1949-1960
BJsales                             Sales Data with Leading Indicator
BJsales.lead (BJsales)              Sales Data with Leading Indicator
BOD                                 Biochemical Oxygen Demand
CO2                                 Carbon Dioxide Uptake in Grass Plants
ChickWeight                         Weight versus age of chicks on different diets
DNase                               Elisa assay of DNase
EuStockMarkets                      Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde                        Determination of Formaldehyde
HairEyeColor                        Hair and Eye Color of Statistics Students
Harman23.cor                        Harman Example 2.3
[...]
```

---
The exact list might vary from user to user, as this depends on which packages have been installed on your system. For example, scrolling all the way to the bottom of the list produced above using the `data()` command, you'll see the following information and code that can be executed to show all the datasets in all the packages available on your system:

```
Use ‘data(package = .packages(all.available = TRUE))’
```

A particular dataset can be loaded as follow:

```r
data(acacia, package = "ade4") # the package name is listed also in the data() display
```

Below I will list a few interesting datasets that you may use to practice some of the statistical tests we have covered in the module. I will recommend some of the analyses you may wish to do on each dataset.

---
# Head Dimensions in Brothers

Package **boot**, dataset `frets`: The data consist of measurements of the length and breadth of the heads of pairs of adult brothers in 25 randomly sampled families. All measurements are expressed in millimetres.

Please consult the dataset's help file (i.e., load the package **boot** package and type `?frets` on the command line).

This datasets lends itself to paired *t*-tests.

- Why paired *t*-tests?
- Which pairings make sense? State the H0 for each of the options.
- Which assumptions to test?
- Test the H0s and describe the findings.

---
# Results from an Experiment on Plant Growth

Dataset `PlantGrowth`: Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

This dataset lends itself to a *t*-test.

- Which *t*-test?
- What is the H0?
- Which assumptions to test?
- Test the H0 and describe the findings.

---
# Student's Sleep Data

Dataset `sleep`: Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.

This dataset lends itself to a *t*-test.

- Which *t*-test?
- What is the H0?
- Which assumptions to test?
- Test the H0 and describe the finding.

---
# Diameter, Height and Volume for Black Cherry Trees

Dataset `trees`: This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees.

This dataset lends itself to regressions and correlations.

- Why can both regressions and correlations apply here? Provide your reasoning for when you'd apply each (i.e. for which pairs of variables, and why those?)
- What are the H0s? There can be a few, and the exact H0 depends on whether you will do a correlation or regression.
- Which assumptions to test for each H0?
- Test the H0s and describe the findings.
- Provide graphs in support of the above analyses.

---
# Average Heights and Weights for American Women

Dataset `women`: This data set gives the average heights and weights for American women aged 30–39.

- Which test?
- What is the H0?
- Which assumptions to test?
- Test the H0 and describe the findings.

---
# Acceleration Due to Gravity

Package **boot**, dataset `gravity`: Between May 1934 and July 1935, the National Bureau of Standards in Washington D.C. conducted a series of experiments to estimate the acceleration due to gravity, g, at Washington. Each experiment produced a number of replicate estimates of g using the same methodology. Although the basic method remained the same for all experiments, that of the reversible pendulum, there were changes in configuration.

- Which test?
- What is the H0?
- Which assumptions to test?
- Test the H0 and describe the findings.
- Produce a figure to support your findings.

---
# Toxicity of Nitrofen in Aquatic Systems

Package **boot**, dataset `nitrofen`: Nitrofen is a herbicide that was used extensively for the control of broad-leaved and grass weeds in cereals and rice. Although it is relatively non-toxic to adult mammals, nitrofen is a significant tetragen and mutagen. It is also acutely toxic and reproductively toxic to cladoceran zooplankton.

This dataset lends itself to a 2-way ANOVA and an ANCOVA (read about it!).

- What are the H0s?
- Which assumptions to test?
- Test the H0s and describe the findings.
- Produce figures to support your findings (one for the ANOVA, another for the ANCOVA).

---
# Urine Analysis Data

Package **boot**, dataset `urine`: 79 urine specimens were analyzed in an effort to determine if certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals.

- What is the H0?
- Which physical characteristics of urine are related the formation of oxalate crystals?

---
# Tooth Strength Data

Package **bootstrap**, dataset `tooth`: Thirteen accident victims have had the strength of their teeth measured. It is desired to predict teeth strength from measurements not requiring destructive testing. Four such variables have been obtained for each subject -- (D1,D2) are difficult to obtain, (E1,E2) are easy to obtain.

- Do the easy to obtain variables give as good prediction as the difficult to obtain ones?
- Provide H0, the full analysis, and supporting figures.
- Please provide a justification for why you selected the test you used here.

---
# The Cholostyramine Data

Package **bootstrap**, dataset `cholost`: *n* = 164 men took part in an experiment to see if the drug cholostyramine lowered blood cholesterol levels. The men were supposed to take six packets of cholostyramine per day, but many actually took much less.

- Model the dependence of Improvement upon Compliance. Provide a linear model equation that describes this relationship.
- Provide H0, the full analysis, and supporting figures.