This document provides an overview of creating lineups in R with ggplot2 and nullabor. Hopefully, you can find a recipe that fits your class needs. I don’t have my intro students create their own lineups, rather I include lineups on slides and activity prompts. That said, I have also created shiny apps so that students can make their own lineups without the need to teach all of the commands or load specialty packages.
library(nullabor) # for lineup and null plot functions
library(ggplot2) # for visualization
library(dplyr) # for data manipulation
library(ggthemes) # for colorblind-safe palette
To illustrate how to create lineups to compare a proportions/distributions of a categorical response across groups, consider the fly
data set found in the ggmosaic R package. In this example, let’s consider how to create a lineup to investigate whether responses to the question “In general, is it rude to bring a baby on a plane?” (rude_to_bring_baby
) varies across gender (gender
).
To begin, let’s select only the variables of interest and omit the missing values (NA
s) to avoid transparent segments representing the missing values. (An alternative strategy is to use forcats::fct_explicit_na
to make the missing values explicit levels of the variable.)
data("fly", package = "ggmosaic")
fly_data <- fly %>%
select(gender, rude_to_bring_baby) %>%
na.omit()
glimpse(fly_data)
## Rows: 843
## Columns: 2
## $ gender <fct> male, male, male, male, male, male, male, male, ma…
## $ rude_to_bring_baby <fct> no, somewhat, somewhat, somewhat, yes, no, somewha…
Notice that fly_data
consists of two columns and 843 rows.
Next, we need to create a data set with one copy of the original data set and 19 null data sets generated under the null hypothesis of independence. To do this, we can use the nullabor::lineup()
and nullabor::null_permute()
functions.
lineup_data <- lineup(method = null_permute("rude_to_bring_baby"), true = fly_data)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 eV")
glimpse(lineup_data)
## Rows: 16,860
## Columns: 3
## $ gender <fct> male, male, male, male, male, male, male, male, ma…
## $ rude_to_bring_baby <fct> somewhat, somewhat, somewhat, no, no, no, no, no, …
## $ .sample <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
The resulting lineup_data
now has 16860 (\(20 \times 843\)) rows and an additional column, .sample
, indicating data set membership. The observed data have been assigned a number uniformly at random, and is printed as an encrypted messaged that you can decrpyt by running the decrypt(...)
message in the console.
Once the data have been generated, the lineup is constructed via faceting:
lineup_data %>%
ggplot() +
geom_bar(mapping = aes(x = gender, fill = rude_to_bring_baby), position = "fill") +
facet_wrap(~ .sample, ncol = 5) +
scale_fill_colorblind()
Note: I use ggthemes::scale_fill_colorblind()
throughout this tutorial, and in my classroom, to avoid common perception issues.
One of the perks of using ggplot2
as your graphics engine is that once the lineup data are generated, it’s easy to switch geometries (i.e plot types). For example, if you wanted to create mosaic plots rather than stacked bar charts, only minor changes to the plotting code chunk are needed:
library(ggmosaic)
lineup_data %>%
ggplot() +
geom_mosaic(aes(x = product(rude_to_bring_baby, gender), fill = rude_to_bring_baby)) +
facet_wrap(~ .sample, ncol = 5) +
scale_fill_colorblind()
To illustrate how to create lineups to compare distributions of a quantitative response across groups, consider the case0101
data set found in the Sleuth3 R package. The data are from an experiment designed to explore whether creativity scores were impacted by the type of motivation (intrinsic or extrinsic). To evaluate this, creative writers were randomly assigned to a questionnaire where they ranked reasons they write: one questionnaire listed intrinsic motivations and the other listed extrinsic motivations. After completing the questionnaire, all subjects wrote a Haiku about laughter which was graded for creativity by a panel of poets.
data("case0101", package = "Sleuth3")
Rather than building the lineup piece-by-piece as before, in this example I will pipe the commands together using the pipe operator %>%
. If you are unfamiliar with the pipe operator, you can think of it as the phrase “and then”. The pipe operator is designed to take the object/result to the left and pass it to the function on the right as its first argument.
The code chunk below creates side-by-side boxplots of creativity scores by treatment group:
case0101 %>%
lineup(method = null_permute("Treatment"), true = ., n = 16) %>%
ggplot(aes(x = Treatment, y = Score, fill = Treatment)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~ .sample, ncol = 4) +
scale_fill_colorblind()
## decrypt("EnXL zNTN Z6 PJfZTZJ6 ee")
Again, it’s easy to change the plot type. For example, we can easily create a violin plots by swapping out the geom
:
case0101 %>%
lineup(method = null_permute("Treatment"), true = ., n = 16) %>%
ggplot(aes(x = Treatment, y = Score, fill = Treatment)) +
geom_violin(alpha = 0.5) +
facet_wrap(~ .sample, ncol = 4) +
scale_fill_colorblind()
## decrypt("EnXL zNTN Z6 PJfZTZJ6 Wi")
Notice that the null plots and position of the data plot changed because we reran the lineup()
command, generating a new lineup data set. To avoid this behavior you can hard code the position of the data plot by adding the pos
argument to lineup()
.
The lineup protocol is also a natural way to help students interpret Q-Q plots while they are still honing their intuition. In this example, I’ll use a simulated sample of 30 observations drawn from a \(\chi^2_2\) distribution, and create normal Q-Q plots to compare the sample to the standard normal distribution.
First, let’s generate the data:
dframe <- data.frame(x = rchisq(30, df = 2))
Next, we’ll use lineup()
to create a data frame with the data set and 19 null data sets. We still need to specify the method
and true
data set, but we use null_dist()
to simulate data from a specified distribution. The null_dist()
command requires three arguments:
var
: the variable name (as a string)
dist
: the distribution name (as a string). See ?nullabor::null_dist
for all accepted distribution names.
params
: a list of parameters that will be passed to the distribution function.
Since we wish to compare the data to a standard normal distribution, we set dist = "norm"
and params = list(mean = 0, sd = 1))
.
lineup_data <- lineup(
method = null_dist("x", dist = "norm", params = list(mean = 0, sd = 1)),
true = dframe
)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 WW")
Now that we have the lineup data set in hand, we can construct the lineup as usual:
lineup_data %>%
ggplot(aes(sample = x)) +
geom_qq_line() +
geom_qq() +
facet_wrap(~ .sample)
If you were testing the data against the \(\chi^2_2\) distribution, then we can easily regenerate the lineup data setting ist = "chi-squared"
and params = list(df = 2))
and create the lineup:
lineup_data2 <- lineup(method = null_dist("x", dist = "chi-squared", params = list(df = 2)), true = dframe)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 et")
lineup_data2 %>%
ggplot(aes(sample = x)) +
geom_qq_line(distribution = qchisq, dparams = list(df = 2)) +
geom_qq(distribution = qchisq, dparams = list(df = 2)) +
facet_wrap(~ .sample)
Lineups also help students to hone their intuition for and ability to interpret residual plots. To construct residual plot lineups using nullabor, I recommend first loading the broom package to make extracting information from your regression model into a data frame easier. (For simple linear regression without transformations this isn’t necessary, but it will help with more complicated models!)
library(broom)
To begin, let’s consider modeling the results of the voltage experiment discussed in Chapter 8 of The Statistical Sleuth.
Let’s consider the simple linear regression model where we regression the time until breakdown of the insulating fluid on the voltage:
data("case0802", package = "Sleuth3")
volt_mod <- lm(Time ~ Voltage, data = case0802)
and extract the data used to fit the model using augment()
, along with some superfluous elements of the fitted model:
aug_volt <- augment(volt_mod)
head(aug_volt)
## # A tibble: 6 x 8
## Time Voltage .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.79 26 483. -478. 0.0820 294. 0.125 -1.67
## 2 1580. 26 483. 1096. 0.0820 268. 0.660 3.84
## 3 2324. 26 483. 1840. 0.0820 198. 1.86 6.45
## 4 68.8 28 375. -307. 0.0488 297. 0.0286 -1.06
## 5 108. 28 375. -267. 0.0488 298. 0.0217 -0.920
## 6 110. 28 375. -265. 0.0488 298. 0.0214 -0.913
Now, we can create the lineup data set using the lineup()
and null_lm()
commands. The null_lm()
command requires two arguments:
A formula (f
) for the model, and
a method
for generating the null residuals (as a string). Currently, "rotate"
(residual rotation), "pboot"
(parametric bootstrap), and "boot"
(nonparametric bootstrap) are available. For intro classes using a simulation-based curriculum, I recommend using method = "boot"
for consistency with the curriculum.
lineup_resids <- lineup(null_lm(Time ~ Voltage, method = "boot"), true = aug_volt)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 eV")
Finally, we can create the residual plot lineup:
lineup_resids %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, linetype = 2, color = "blue") +
geom_point() +
facet_wrap(~ .sample)
The observed residual plot pops out, indicating a clear need for remedial measures.
Transformations are a little trickier because you need to name the variables in your model formula identically to the output from augment()
. To see how this works, let’s consider a simple linear regression model relating the average metabolic rate and mass for 95 mammals. The data can be found in the Sleuth3 R package:
data("ex0826", package = "Sleuth3")
As before we fit our model and extract the key components into a data frame.
metab_mod <- lm(log(Metab) ~ log(Mass), data = ex0826)
aug_metab <- augment(metab_mod) %>%
rename(log.Metab. = `log(Metab)`, log.Mass. = `log(Mass)`) %>%
mutate(.resid = resid(metab_mod))
head(aug_metab)
## # A tibble: 6 x 8
## log.Metab. log.Mass. .fitted .hat .sigma .cooksd .std.resid .resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.71 0.916 6.32 0.0109 0.455 0.00978 -1.33 -0.605
## 2 6.39 2.33 7.36 0.0148 0.448 0.0347 -2.15 -0.974
## 3 5.43 0.262 5.83 0.0105 0.458 0.00408 -0.876 -0.398
## 4 5.28 -0.208 5.48 0.0108 0.459 0.00112 -0.454 -0.206
## 5 5.70 0.285 5.85 0.0105 0.459 0.000568 -0.327 -0.149
## 6 6.25 1.18 6.51 0.0114 0.459 0.00187 -0.571 -0.259
Note: In former version of broom, the column names substituted a period for parentheses in the transformed variable columns. An update to broom changed this, but we need it for to make the code easier below, so we manually rename the columns and extract residuals. Hopefully, this will be updated in broom
soon.
To properly create the lineup data set for a transformed model, be sure to match the column names in the augmented data set. Here, we pass the formula log.Metab. ~ log.Mass.
for null_lm()
:
lineup_metab <- lineup(null_lm(log.Metab. ~ log.Mass., method = "boot"), true = aug_metab)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 yi")
Finally, lineup creation proceeds as expected:
lineup_metab %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, linetype = 2, color = "blue") +
geom_point(shape = 1) +
facet_wrap(~ .sample)
As of this writing, the nullabor package only has a few null_*
functions:
null_dist()
: generates null data from a specified probability distributionnull_lm()
: generates null data from a linear regression modelnull_permute()
: generates null data by permuting a variablenull_ts()
: generates null data by simulating from a time series modelHowever, there may be situations in your classes where lineups would be useful to explore new plots (of familiar plots in new situations). A prime example is diagnosing generalized linear models (GLMs). In this section I will give an example of manually constructing the null data frame so that the lineup plot can be constructed using the same plotting recipe as shown above.
In this example, we’ll consider the wells
data set discussed by Gelman and Hill. The code chunk below loads the data set, fits a simple logistic regression model, and extracts the deviance residuals.
library(arm)
wells <- read.table("../data/wells.dat")
wells$dist100 <- wells$dist/100
well_glm <- glm(switch ~ arsenic, family = binomial, data = wells)
well_aug <- augment(well_glm, type.residuals = "deviance")
The well_aug
object now contains the true
data set we will pass to lineup()
, so we must next turn our efforts to simulating null data sets. To generate null data sets, we need to
To begin, we can use replicate()
and simulate()
to simulate \(n=19\) sets of null responses. Here, the null data sets will correspond to a correctly specified model. Note that the result of replicate is a list of length 19.
well_sim_y <- replicate(19, expr = simulate(well_glm), simplify = FALSE)
Next, we refit the proposed logistic regression model for each new response and extract the residuals:
well_nulls <- lapply(well_sim_y, FUN = function(x) {
augment(glm(x[[1]] ~ arsenic, data = wells, family = binomial))
})
Finally, we bind the rows together, and add a .sample
column:
well_nulls <- bind_rows(well_nulls, .id = ".n")
str(well_nulls)
## tibble [57,380 × 9] (S3: tbl_df/tbl/data.frame)
## $ .n : chr [1:57380] "1" "1" "1" "1" ...
## $ x[[1]] : num [1:57380] 1 0 0 1 1 1 1 1 1 1 ...
## $ arsenic : num [1:57380] 2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
## $ .fitted : num [1:57380] 0.5771 -0.0665 0.464 0.1052 0.0857 ...
## $ .resid : num [1:57380] 0.944 -1.149 -1.38 1.133 1.141 ...
## $ .std.resid: num [1:57380] 0.944 -1.15 -1.38 1.133 1.142 ...
## $ .hat : num [1:57380] 0.000552 0.000609 0.00043 0.000407 0.000423 ...
## $ .sigma : num [1:57380] 1.15 1.15 1.15 1.15 1.15 ...
## $ .cooksd : num [1:57380] 0.000155 0.000285 0.000342 0.000183 0.000194 ...
## - attr(*, "terms")=Classes 'terms', 'formula' language x[[1]] ~ arsenic
## .. ..- attr(*, "variables")= language list(x[[1]], arsenic)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "x[[1]]" "arsenic"
## .. .. .. ..$ : chr "arsenic"
## .. ..- attr(*, "term.labels")= chr "arsenic"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x7fee1a435da8>
## .. ..- attr(*, "predvars")= language list(x[[1]], arsenic)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "x[[1]]" "arsenic"
and convert the .n
columns of sample IDs to numeric
well_nulls$.n <- as.numeric(well_nulls$.n)
Now the we have our true
data set and our samples
data set, we can use the same plotting recipe with one minor alteration:
lineup(true = well_aug, n = 20, samples = well_nulls) %>%
ggplot(aes(x = arsenic, y = .resid)) +
geom_hline(yintercept = 0, linetype = 2, color = "blue") +
geom_point(shape = 1) +
facet_wrap(~ .sample)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 Wb")
The resulting lineup can be a great conversation starter about how difficult it is to interpret residual plots for binary logistic regression. You can also use it as a springboard into binned residual plots, which I construct below.
With the true
and samples
data sets in hand (or a new set of simulations) we can use the arm::binned.resids()
function to create binned residuals plots, which Gelman and Hill suggest are more useful in model diagnosis. To create binned residual plots, we must calculate the binned residuals by group (i.e. .sample
):
library(purrr)
well_lineup_df <- lineup(true = well_aug, n = 20, samples = well_nulls)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 Ws")
str(well_lineup_df)
## tibble [60,400 × 10] (S3: tbl_df/tbl/data.frame)
## $ x[[1]] : num [1:60400] 1 0 0 1 1 1 1 1 1 1 ...
## $ arsenic : num [1:60400] 2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
## $ .fitted : num [1:60400] 0.5771 -0.0665 0.464 0.1052 0.0857 ...
## $ .resid : num [1:60400] 0.944 -1.149 -1.38 1.133 1.141 ...
## $ .std.resid: num [1:60400] 0.944 -1.15 -1.38 1.133 1.142 ...
## $ .hat : num [1:60400] 0.000552 0.000609 0.00043 0.000407 0.000423 ...
## $ .sigma : num [1:60400] 1.15 1.15 1.15 1.15 1.15 ...
## $ .cooksd : num [1:60400] 0.000155 0.000285 0.000342 0.000183 0.000194 ...
## $ .sample : num [1:60400] 1 1 1 1 1 1 1 1 1 1 ...
## $ switch : int [1:60400] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "terms")=Classes 'terms', 'formula' language x[[1]] ~ arsenic
## .. ..- attr(*, "variables")= language list(x[[1]], arsenic)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "x[[1]]" "arsenic"
## .. .. .. ..$ : chr "arsenic"
## .. ..- attr(*, "term.labels")= chr "arsenic"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x7fee1a435da8>
## .. ..- attr(*, "predvars")= language list(x[[1]], arsenic)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "x[[1]]" "arsenic"
## - attr(*, "pos")= int 16
well_binned <-
well_lineup_df %>%
group_by(.sample) %>%
group_map(~ data.frame(binned.resids(.x$arsenic, .x$.resid)$binned)) %>%
bind_rows(.id = ".sample") %>%
mutate(.sample = as.numeric(.sample))
str(well_binned)
## 'data.frame': 1080 obs. of 7 variables:
## $ .sample: num 1 1 1 1 1 1 1 1 1 1 ...
## $ xbar : num 0.51 0.529 0.556 0.575 0.596 ...
## $ ybar : num 0.1455 -0.0695 -0.1542 -0.0115 0.2585 ...
## $ n : num 28 83 51 57 39 62 62 45 66 44 ...
## $ x.lo : num 0.51 0.52 0.55 0.57 0.59 0.61 0.63 0.65 0.67 0.7 ...
## $ x.hi : num 0.51 0.54 0.56 0.58 0.6 0.62 0.64 0.66 0.69 0.71 ...
## $ X2se : num 0.452 0.259 0.328 0.314 0.376 ...
Notice that the arm::binned.resids()
function renames the variables in the data set. When you are constructing your plots be sure to use the variables xbar
and ybar
rather than the previous variable names.
With the well_binned
data set in hand, we can construct the lineup directly using only plotting functions:
well_binned %>%
ggplot(aes(x = xbar, y = ybar)) +
geom_hline(yintercept = 0, linetype = 2, color = "blue") +
geom_point(shape = 1) +
facet_wrap(~ .sample)
When you render final versions of your lineups, it’s best to resize the plotting canvas so that each panel appears to be square.
If you are not comfortable using *ggplot2, then I recommend checking out the ggformula package. This uses the formula interface from the mosaic package to create plots using the ggplot2 architecture.
I don’t make my students write the code to create lineups from scratch, but I could see providing them with an R Markdown file with the code they can simply run on an RStudio server during an in-class activity.
If you want to avoid coding, then check out my Shiny apps for creating lineups in the familiar situations discussed above.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.