vignettes/influence_diagnostics.Rmd
influence_diagnostics.Rmd
library(HLMdiag)
#>
#> Attaching package: 'HLMdiag'
#> The following object is masked from 'package:stats':
#>
#> covratio
The HLMdiag
package provides functionality to examine diagnostics for hierarchical linear models, including residuals values and influence diagnostics. This vignette aims to:
hlm_influence()
, which provides any easy way to obtain multiple influence diagnostics in one tibblehlm_augment()
, which combines hlm_influence()
and hlm_resid()
to return influence diagnostics and residualslme
created from the nlme
packageHLMdiag
and those returned by the lme4
and car
packageshlm_influence()
functionThe hlm_influence()
function provides a wrapper that returns influence diagnostics appended to the original model frame. It is useful for assessing the influence of individual observations, or groups of observations, and can also be used with dotplot_diag()
to visually explore the influence diagnostics. The diagnostics returned by hlm_influence()
include Cook’s distance, MDFFITS, covariance trace (covtrace), covariance ratio (covratio), leverage, and relative variance change (RVC).
Cook’s distance and MDFFITS both measure the distance between fixed effects estimated from the full data set and those obtained from the reduced data set. For Cook’s distance, the change in parameter estimates is scaled by the estimated covariance matrix of the original parameter estimates, while MDFFITS scales this change by the estimated covariance matrix of the deletion estimates. The covariance trace and covariance ratio both measure how precision is affected by the deletion of a particular observation or group of observations i. Covariance trace is a measure of the ratio between the covariance matrices with and without unit i to the identity matrix, while covariance ratio is a comparison of the two covariance matrices with and without unit i using their determinants. Relative variance change (RVC) is a measurement of the ratio of estimates of the l th variance component with and without unit i. The final influence diagnostic returned by hlm_influence()
, leverage, is the rate of change in the predicted response with respect to the observed response. For a full explanation of these influence diagnostics, including formulas, please refer to Loy and Hofmann (2014).
To explore the functionality of hlm_influence()
, we will use the data set classroom
(West et. al, 2014). The data set consists of 1,190 observations of students. Students are grouped within classes, which are nested within schools. There are 312 distinct classes nested within 107 schools.
data("classroom", package = "WWGbook")
#> # A tibble: 1,190 x 12
#> sex minority mathkind mathgain ses yearstea mathknow housepov mathprep
#> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 448 32 0.46 1 NA 0.082 2
#> 2 0 1 460 109 -0.27 1 NA 0.082 2
#> 3 1 1 511 56 -0.03 1 NA 0.082 2
#> 4 0 1 449 83 -0.38 2 -0.11 0.082 3.25
#> 5 0 1 425 53 -0.03 2 -0.11 0.082 3.25
#> 6 1 1 450 65 0.76 2 -0.11 0.082 3.25
#> 7 0 1 452 51 -0.03 2 -0.11 0.082 3.25
#> 8 0 1 443 66 0.2 2 -0.11 0.082 3.25
#> 9 1 1 422 88 0.64 2 -0.11 0.082 3.25
#> 10 0 1 480 -7 0.13 2 -0.11 0.082 3.25
#> classid schoolid childid
#> <int> <int> <int>
#> 1 160 1 1
#> 2 160 1 2
#> 3 160 1 3
#> 4 217 1 4
#> 5 217 1 5
#> 6 217 1 6
#> 7 217 1 7
#> 8 217 1 8
#> 9 217 1 9
#> 10 217 1 10
#> # … with 1,180 more rows
We will fit a simple three-level hierarchical linear model using the lme4
package:
class.lmer <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classroom)
level
hlm_influence()
calculates influence diagnostics for individual cases, or larger groups. For example, to obtain a tibble with influence diagnostics for all 1,190 students we use the following:
infl <- hlm_influence(class.lmer, level = 1)
#> # A tibble: 1,190 x 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00143 0.00246 1.00 0.121
#> 3 0.000336 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.000225 0.00169 1.00 0.0780
#> 5 0.000173 0.000172 0.00178 1.00 0.0780
#> 6 0.000000807 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000130 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00108 0.00147 1.00 0.0778
#> # … with 1,180 more rows
Note that the parameter used to select individual observations is now called level
, not group
. Previous versions of HLMdiag
used the group parameter for influence diagnostics, where group = NULL
was the default, and users could choose to delete groups instead by setting group
equal to level names. As of HLMdiag
version 0.4.0, group
has been replaced by level
. level
defaults to level = 1
which deletes individual observations, exactly how group = NULL
used to work.
The resulting tibble can be used along with dotplot_diag()
to identify potentially influential observations using either an internal cutoff of 3*IQR or a user-specified cutoff. For example, the plot shown below reveals and labels all observations above the internally calculated cutoff for Cook’s distance.
dotplot_diag(infl$cooksd, name = "cooks.distance", cutoff = "internal")
The plot illustrates that there are many observations with a Cook’s distance value above the internally calculated cutoff. dotplot_diag()
labels the top 5, which is difficult to see in this case. Rather than investigate all observations above the cutoff, we may be interested in the observations with the highest Cook’s distance values. The tibble returned by hlm_influence()
provides an easy way to do this when used with the arrange()
function from dplyr
.
tb1 <- infl %>%
arrange(desc(cooksd))
#> # A tibble: 1,190 x 14
#> id mathgain mathkind sex minority ses housepov schoolid classid cooksd
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 0.0536
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 0.0265
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 0.0256
#> 4 664 -84 629 1 0 2.33 0.17 62 22 0.0245
#> 5 312 138 542 1 0 2.27 0.05 27 104 0.0221
#> 6 754 218 368 0 1 -1.14 0.43 70 152 0.0202
#> 7 337 197 290 1 1 -0.46 0.214 28 203 0.0194
#> 8 723 214 290 1 1 -0.67 0.209 68 244 0.0187
#> 9 812 147 510 0 0 -0.47 0.367 75 42 0.0148
#> 10 1146 11 376 0 1 0.62 0.126 102 44 0.0132
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0527 0.0172 1.02 0.112
#> 2 0.0264 0.00406 1.00 0.143
#> 3 0.0250 0.0250 1.03 0.147
#> 4 0.0242 0.0125 1.01 0.0869
#> 5 0.0219 0.00988 1.01 0.0919
#> 6 0.0200 0.00702 1.01 0.0896
#> 7 0.0190 0.0198 1.02 0.147
#> 8 0.0183 0.0195 1.02 0.155
#> 9 0.0147 0.00797 1.01 0.0722
#> 10 0.0131 0.00825 1.01 0.0962
#> # … with 1,180 more rows
There does not appear to be a clear pattern among students who have relatively higher Cook’s distance values; they do not tend to be from a particular school or class. In order to investigate these observations further, one could look to summary statistics for the different explanatory variables. Additionally, a similar analysis could be done with the other influence diagnostics provided in the tibble from hlm_influence()
.
In addition to identifying influential observations, it may also be useful to identify influential groups. The level
parameter in hlm_influence()
can be used for this purpose. For example, we can obtain influence diagnostics for each class:
infl.classes <- hlm_influence(class.lmer, level = "classid:schoolid")
#> # A tibble: 312 x 6
#> `classid:schoolid` cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 160:1 0.00224 0.00224 0.00990 1.01 0.0980
#> 2 217:1 0.00133 0.00131 0.0238 1.02 0.128
#> 3 197:2 0.00323 0.00320 0.0106 1.01 0.123
#> 4 211:2 0.00724 0.00719 0.0202 1.02 0.0891
#> 5 307:2 0.00346 0.00343 0.0197 1.02 0.152
#> 6 11:3 0.00116 0.00115 0.0141 1.01 0.0960
#> 7 137:3 0.000873 0.000868 0.0131 1.01 0.150
#> 8 145:3 0.00517 0.00515 0.0248 1.02 0.0983
#> 9 228:3 0.000479 0.000478 0.00797 1.01 0.126
#> 10 48:4 0.00371 0.00367 0.0212 1.02 0.134
#> # … with 302 more rows
Note that the level
parameter is set as classid:schoolid
, not classid
. The level parameter should be specified as found in model@flist
for lmerMod objects. For three level models, this usually takes the form of “level2:level3”, where level2 is the name of the second level variable, and level3 is the name of the third level variable.
Using dotplot_diag()
again with one of the columns from the resulting tibble of hlm_influence
reveals that class 218 has a relatively high Cook’s distance value.
dotplot_diag(infl.classes$cooksd, name = "cooks.distance", cutoff = "internal")
We can repeat a similar analysis to flag influential schools.
infl.schools <- hlm_influence(class.lmer, level = "schoolid")
#> # A tibble: 107 x 6
#> schoolid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000561 0.000557 0.0381 1.04 0.0901
#> 2 2 0.00681 0.00672 0.0554 1.06 0.115
#> 3 3 0.0343 0.0333 0.0723 1.07 0.108
#> 4 4 0.0248 0.0246 0.0360 1.04 0.125
#> 5 5 0.00224 0.00223 0.0583 1.06 0.0998
#> 6 6 0.0109 0.0107 0.0690 1.07 0.104
#> 7 7 0.00576 0.00560 0.0785 1.08 0.108
#> 8 8 0.0121 0.0118 0.0785 1.08 0.0926
#> 9 9 0.00624 0.00590 0.0793 1.08 0.141
#> 10 10 0.00848 0.00826 0.0723 1.07 0.0928
#> # … with 97 more rows
Similarly, we use dotplot_diag
to visually represent the differences. In this case, there are only four observations above the cutoff, so we set modify = "dotplot"
in order to better visualize the influential observations. modify = "dotplot"
works well when there are a relatively small number of observations above the cutoff.
dotplot_diag(infl.schools$cooksd, name = "cooks.distance", cutoff = "internal", modify = "dotplot")
The plot reveals that schools 68, 75, 70, and 27 have relatively high Cook’s distance values.
delete
The delete
parameter allows the user to calculate influence diagnostics for a specified group of observations, or group factor level. For example, influence diagnostics for the influential schools flagged above (68, 75, 70, and 27) can be calculated as shown below, deleting all four schools at once:
hlm_influence(class.lmer, level = "schoolid", delete = c("27", "70", "75", "68"))
#> # A tibble: 1 x 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.238 0.222 0.370 1.43
Note that in this case, delete
is specified as a character vector consisting of the school ID’s of interest. delete
can also be set as a numeric vector of indices; in this case, setting delete
to the row numbers of all students in the data frame who attend schools 68, 75, 70, or 27 would be equivalent to the line above.
leverage
argumentIn the previous examples, hlm_influence()
only returned overall leverage. However, hlm_influence()
also allows for other types of leverage, including the leverage corresponding to the fixed effects, the leverage corresponding to the random effects as proposed by Demidenko and Stukel (2005), and the unconfounded leverage corresponding to the random effects proposed by Nobre and Singer (2011). These types of leverage are referred to as fixef
, ranef
, and ranef.uc
, respectively.
None of our observations are flagged for high leverage when looking only at overall leverage:
dotplot_diag(infl$leverage.overall, name = "leverage", cutoff = "internal")
However, we can obtain the other types of leverage as follows:
infl2 <- hlm_influence(class.lmer, level = 1, leverage = c("overall", "fixef", "ranef", "ranef.uc"))
This example illustrates how to select all four types of leverage, but any one or more may also be selected.
We can then use dotplot_diag()
with one of the new leverage columns to investigate outlier for that type of leverage.
dotplot_diag(infl2$leverage.fixef, name = "leverage", cutoff = "internal", modify = "dotplot")
However, further analysis reveals that several observations have high leverage when considering only the leverage corresponding to the fixed effects.
approx
hlm_influence()
defaults to calculating influence diagnostics based off a one step approximation (Christensen et.al 1992; Shi and Chen 2008; Zewotir 2008). However, the approx
parameter allows the user to calculate influence diagnostics based off a full refit of the data using hlm_influence()
. For example, if we wished to calculate influence diagnostics for each school by fully refitting the model each time, we could use:
infl3 <- hlm_influence(class.lmer, level = "schoolid", approx = FALSE)
#> # A tibble: 107 x 9
#> schoolid cooksd mdffits covtrace covratio rvc.sigma2 rvc.D11 rvc.D22
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000642 0.000639 0.0565 1.06 -0.00262 0.0154 0.0183
#> 2 2 0.00684 0.00676 0.0423 1.04 -0.00409 0.0161 -0.00403
#> 3 3 0.0389 0.0386 0.0256 0.973 0.00299 0.000124 -0.0952
#> 4 4 0.0249 0.0253 0.0971 0.906 -0.0315 -0.0293 0.0174
#> 5 5 0.00222 0.00221 0.0691 1.07 -0.00229 0.0160 0.00997
#> 6 6 0.0111 0.0109 0.0671 1.07 0.00443 0.00323 -0.0198
#> 7 7 0.00564 0.00546 0.0970 1.10 0.00436 0.0230 -0.0119
#> 8 8 0.0117 0.0115 0.0653 1.07 -0.00744 -0.0804 0.0574
#> 9 9 0.00634 0.00598 0.0922 1.09 0.00313 0.00326 -0.00216
#> 10 10 0.00816 0.00790 0.109 1.11 0.00796 0.0220 -0.00935
#> leverage.overall
#> <dbl>
#> 1 0.0901
#> 2 0.115
#> 3 0.108
#> 4 0.125
#> 5 0.0998
#> 6 0.104
#> 7 0.108
#> 8 0.0926
#> 9 0.141
#> 10 0.0928
#> # … with 97 more rows
In most cases, using the default of approx = TRUE
is sufficient for influence analysis. Setting approx = FALSE
also takes much longer than the default setting since the model must be refit each time with each group or observation deleted. However, the full refit method also allows for relative variance change (RVC) to be returned. If this is desired, approx = FALSE
should be used.
na.action
and the data
argumenthlm_influence()
was written to respect the na.action
parameter from the lme4
package. This argument defaults to na.omit
, which means all rows in the data sets with NA
s present are simply deleted from the model. However, there is also an na.exclude
option, which pads the resulting tibbles with NA
s in the the rows that contained NA
s in the original data set in place of deleting them altogether. In order for hlm_influence()
to do this, the original data set must be passed into hlm_influence()
via the data
argument whenever the na.action
was set to na.exclude
in the model fitting process.
For example, while the class
data set does not have any NA
s in the data set, we can introduce a couple for the purposes of an example.
classNA <- classroom
classNA[2,3] <- NA
classNA[3,4] <- NA
We can then fit the same model using the lme4
package as before. Below, we fit two models, one with na.action = "na.exclude"
and one with the default na.action = "na.omit"
.
class.lmer.exclude <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.exclude")
class.lmer.omit <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.omit")
We then run hlm_influence()
on the model where na.action = "na.omit"
.
infl4 <- hlm_influence(class.lmer.omit, level = 1)
#> # A tibble: 1,188 x 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 83 449 0 1 -0.38 0.082 1 217
#> 3 3 53 425 0 1 -0.03 0.082 1 217
#> 4 4 65 450 1 1 0.76 0.082 1 217
#> 5 5 51 452 0 1 -0.03 0.082 1 217
#> 6 6 66 443 0 1 0.2 0.082 1 217
#> 7 7 88 422 1 1 0.64 0.082 1 217
#> 8 8 -7 480 0 1 0.13 0.082 1 217
#> 9 9 60 502 0 1 0.83 0.082 1 217
#> 10 10 -2 502 1 1 0.06 0.082 2 197
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 0.000295 0.000294 0.00182 1.00 0.0801
#> 3 0.000140 0.000140 0.00183 1.00 0.0801
#> 4 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 5 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 6 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 7 0.000184 0.000183 0.00400 1.00 0.0821
#> 8 0.00111 0.00111 0.00163 1.00 0.0799
#> 9 0.000375 0.000374 0.00338 1.00 0.0815
#> 10 0.00250 0.00248 0.00508 1.01 0.136
#> # … with 1,178 more rows
Note than there are only 1,188 rows in the returned tibble, although there were 1,190 observations in the original data set. The two rows with NAs were deleted from the returned tibble.
We repeat this with the model where na.action = "na.exclude"
.
infl5 <-hlm_influence(class.lmer.exclude, level = 1, data = classNA)
#> # A tibble: 1,190 x 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 NA 0 1 -0.27 0.082 1 160
#> 3 3 NA 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 0.000295 0.000294 0.00182 1.00 0.0801
#> 5 0.000140 0.000140 0.00183 1.00 0.0801
#> 6 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 7 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 8 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 9 0.000184 0.000183 0.00400 1.00 0.0821
#> 10 0.00111 0.00111 0.00163 1.00 0.0799
#> # … with 1,180 more rows
In this tibble, there are 1,190 rows. Furthermore, the two rows with NAs display NAs for the influence diagnostics, instead of being entirely absent as in the above example. It is important to note that the data
argument is necessary. Failing to provide the data set through the data
argument in this situation will result in an error.
The hlm_augment()
function combines the hlm_resid()
and hlm_influence()
functions to return a tibble containing information about the residuals and the influence diagnostics appended to the data. For example, we can obtain residuals and influence diagnostics at once for all students in the class
data set with the following:
aug <- hlm_augment(class.lmer, level = 1)
#> # A tibble: 1,190 x 20
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> .resid .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -37.7 69.7 0 32 -34.6 66.6 0.00106
#> 2 47.5 61.5 0 109 50.6 58.4 0.00143
#> 3 18.5 37.5 0 56 21.6 34.4 0.000336
#> 4 23.3 59.7 35.4 47.6 20.0 63.0 0.000225
#> 5 -19.9 72.9 -15.4 68.4 -23.1 76.1 0.000173
#> 6 1.01 64.0 -4.18 69.2 -2.22 67.2 0.000000807
#> 7 -9.14 60.1 -1.19 52.2 -12.4 63.4 0.0000233
#> 8 0.413 65.6 4.24 61.8 -2.82 68.8 0.0000000504
#> 9 11.5 76.5 4.18 83.8 8.22 79.8 0.000130
#> 10 -54.8 47.8 -45.3 38.3 -58.0 51.0 0.00108
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00246 1.00 0.121
#> 3 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.00169 1.00 0.0780
#> 5 0.000172 0.00178 1.00 0.0780
#> 6 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00147 1.00 0.0778
#> # … with 1,180 more rows
This is useful for inspecting residuals values and influence diagnostics values at the same time. However, hlm_augment()
lacks some of the functionality that hlm_influence()
and hlm_resid()
have. The delete
and approx
parameters available for hlm_influence()
are not available in hlm_augment()
, so the function will always use a one step approximation and delete all observations or groups instead of a selected few. The standardize
parameter from hlm_resid()
is also not included, so hlm_influence()
or hlm_resid()
should be used instead if this functionality is desired. For more information about available functionality in hlm_resid()
, see the hlm_resid
vignette.
hlm_augment()
is especially useful for inspecting residual values of observations with relatively high influence diagnostics values, or vice versa.
aug2 <- aug %>%
arrange(desc(cooksd))
#> # A tibble: 1,190 x 20
#> id mathgain mathkind sex minority ses housepov schoolid classid .resid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 110.
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 -157.
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 62.0
#> 4 664 -84 629 1 0 2.33 0.17 62 22 -88.7
#> 5 312 138 542 1 0 2.27 0.05 27 104 94.6
#> 6 754 218 368 0 1 -1.14 0.43 70 152 107.
#> 7 337 197 290 1 1 -0.46 0.214 28 203 60.7
#> 8 723 214 290 1 1 -0.67 0.209 68 244 59.8
#> 9 812 147 510 0 0 -0.47 0.367 75 42 87.2
#> 10 1146 11 376 0 1 0.62 0.126 102 44 -79.8
#> .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd mdffits covtrace
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 143. 4.04e+ 1 213. 117. 136. 0.0536 0.0527 0.0172
#> 2 47.0 0. -110 -177. 66.8 0.0265 0.0264 0.00406
#> 3 141. 0. 203 65.9 137. 0.0256 0.0250 0.0250
#> 4 4.67 -2.79e+ 1 -56.1 -81.9 -2.08 0.0245 0.0242 0.0125
#> 5 43.4 -2.66e-15 138 98.1 39.9 0.0221 0.0219 0.00988
#> 6 111. -6.66e-15 218 125. 93.1 0.0202 0.0200 0.00702
#> 7 136. 0. 197 62.3 135. 0.0194 0.0190 0.0198
#> 8 154. 0. 214 80.4 134. 0.0187 0.0183 0.0195
#> 9 59.8 2.78e+ 1 119. 109. 38.3 0.0148 0.0147 0.00797
#> 10 90.8 -2.60e+ 1 37.0 -91.1 102. 0.0132 0.0131 0.00825
#> covratio leverage.overall
#> <dbl> <dbl>
#> 1 1.02 0.112
#> 2 1.00 0.143
#> 3 1.03 0.147
#> 4 1.01 0.0869
#> 5 1.01 0.0919
#> 6 1.01 0.0896
#> 7 1.02 0.147
#> 8 1.02 0.155
#> 9 1.01 0.0722
#> 10 1.01 0.0962
#> # … with 1,180 more rows
We can sort by a particular column in order to inspect the values of other influence diagnostics and the residuals of influential observations.
Previously, only the individual one step approximation influence functions worked on lme
models fit using the nlme
package. However, hlm_influence()
can also be used on lme
objects, as can hlm_augment()
. This allows the user to calculate influence diagnostics for lme
models by fulling refitting the model using the approx = FALSE
argument.
In most cases, using a lme
object for hlm_influence()
or hlm_augment()
is identical to their usage with lmerMod
objects from lme4
. However, there are a few notable exceptions.
For both hlm_influence()
and hlm_augment()
, levels should be specified by names that appear in model@flist
. For the second level of a three level lmerMod
model, this usually looks like “level2:level3” where level2 and level3 are the names of the second and third level variables, respectively. However, for a lme
model, levels should be specified by names that appear in model$groups
. For example, we can obtain influence diagnostics for each classroom from the class
data set in the following way:
class.lme <- nlme::lme(mathgain ~ mathkind + sex + minority + ses + housepov, random = ~ 1|schoolid/classid, data = classroom)
hlm_influence(class.lme, level = "classid")
#> # A tibble: 312 x 6
#> classid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1/160 0.00224 0.00224 0.00990 1.01 0.121
#> 2 1/217 0.00133 0.00131 0.0238 1.02 0.0784
#> 3 2/197 0.00323 0.00320 0.0106 1.01 0.126
#> 4 2/211 0.00724 0.00719 0.0202 1.02 0.102
#> 5 2/307 0.00346 0.00343 0.0197 1.02 0.0752
#> 6 3/11 0.00116 0.00115 0.0141 1.01 0.102
#> 7 3/137 0.000873 0.000868 0.0131 1.01 0.0819
#> 8 3/145 0.00517 0.00515 0.0248 1.02 0.107
#> 9 3/228 0.000479 0.000478 0.00797 1.01 0.148
#> 10 4/48 0.00371 0.00367 0.0212 1.02 0.149
#> # … with 302 more rows
For the lmerMod
model, we specified level = classid:schoolid
. However, for lme
models, the name of the second level alone is sufficient. However, specifying names of specific groups to delete for lme
models is also slightly different. For example, we can obtain influence diagnostics for a model created when classes 160 and 217 are deleted for a lmerMod
model in the following way:
hlm_influence(class.lmer, level = "classid:schoolid", delete = c("160:1", "217:1"))
#> # A tibble: 1 x 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04
Obtaining influence diagnostics for a model created with the deletion of classes 160 and 217 from a lme
model is a bit different:
hlm_influence(class.lme, level = "classid", delete = c("1/160", "1/217"))
#> # A tibble: 1 x 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04
Note that both examples specify units for deletion in the same way they are specified in model@flist
(lmerMod
models) or model$groups
(lme
models).
Additionally, lmerMod
models require that the data
argument is passed into hlm_influence()
and hlm_augment()
when na.action = "na.exclude"
during the model fitting process. However, this is unnecessary for lme
models.
The HLMdiag
package has two different ways to calculate Cook’s distance. One of them, which is more exact, refits the model with each observation or group deleted from the model and calculates Cook’s distance from the resulting coefficient estimates and variance components estimates. The second is a one-step approximation. For more information about how the one step approximation works, we refer the reader to Christensen et.al (1992); Shi and Chen (2008); and Zewotir (2008). Other R packages also have functions that can calculate Cook’s distance. In this section, we highlight the differences between the ways of calculating Cook’s distance in HLMdiag
versus other methods in the car
and lme4
packages.
car
packageThe cooks_distance()
function from HLMdiag
calculates Cook’s distance through a full refit of the model using the following formula:
\(C_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
Notice that the difference in the change in the parameter estimates is scaled by the estimated covariance matrix of the original parameter estimates. We can calculate the Cook’s distance values in the HLMdiag
package by first using the case_delete()
function, followed by the cooks.distance()
function.
hlm_case <- HLMdiag:::case_delete.lmerMod(class.lmer)
hlm_cd_full <- HLMdiag:::cooks.distance.case_delete(hlm_case)
head(hlm_cd_full)
#> [1] 9.327238e-04 1.415243e-03 3.316859e-04 2.282399e-04 1.797497e-04
#> [6] 6.968432e-07
Conversely, the mdffits()
function from HLMdiag
calculates MDFFITS, which is a multivariate version of the DFFITS statistic. This is calculated as follows:
\(MDFFITS_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
Instead of scaling by the estimated covariance matrix of the original parameter estimates, calculations for MDFFITS are scaled by the estimated covariance matrix of the deletion estimates. For each deleted observation or group, the model is refit and the covariance structure re-estimated without unit i. We can calculate this similarly to how we calculated Cook’s distance, using the same case_delete
object.
hlm_mdffits <- mdffits(hlm_case)
head(hlm_mdffits)
#> [1] 9.304263e-04 1.412796e-03 3.302360e-04 2.278198e-04 1.793468e-04
#> [6] 6.942264e-07
These estimates are pretty similar to those produced by cooks_distance()
; the difference is due to the use of the covariance matrix of the deletion estimates, rather than the original estimates.
The car
package calculates Cook’s distance similarly to how the HLMdiag
package calculates MDFFITS. Instead of using the covariance matrix of the original parameter estimates like HLMdiag
’s cooks_distance()
function, the cooks.distance()
function from car
uses the estimated covariance matrix of the deletion estimates. However, the cooks_distance()
function from car is not identical to the mdffits
function from HLMdiag
; the car::cooks.distance()
function also scaled each observation, i, by dividing it by the estimated variance of the model calculated without observation i. Therefore, the formula used to calculate Cook’s distance in the car
package is as follows:
\(C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
We can calculate this by first using the influence()
function followed by the cooks.distance()
function.
library(car)
car_case <- influence(class.lmer)
car_cd <- cooks.distance(car_case)
head(car_cd)
#> [1] 1.270970e-06 1.928339e-06 4.493427e-07 3.102931e-07 2.441009e-07
#> [6] 9.411194e-10
These values initially seem fairly different from the MDFFITS and Cook’s distance values produced by HLMdiag
, due to the additional scaling by the inverse of the variance of each refitted model. We can adjust for this by multiplying the vector of Cook’s distance values from car
by the estimated variance from each refitted model.
sig.sq <- car_case[[4]][,1]
car_cd_adjusted <- car_cd * sig.sq
head(car_cd_adjusted)
#> 1 2 3 4 5 6
#> 9.304279e-04 1.412774e-03 3.302267e-04 2.278159e-04 1.793900e-04 6.918772e-07
Once the values from car
have been adjusted by sigma squared, they now appear very similar to the MDFFITS values from HLMdiag
. The plot below shows the difference between the MDFFITS estimates from HLMdiag
and the variance-adjusted Cook’s distance estimates from car
for all 1,190 observations in the classroom
data set.
While the difference between the estimates varies slightly due to differences in how the fixed effects and variance components are calculated for refit models in both packages, the difference in values tends to be very small.
lme4
packageSimilar to HLMdiag
, the lme4
package has two methods that calculate Cook’s distance. One of them, similar to the case_delete
method, conducts a full refit of models without each observation or group. The second is a quicker approximation.
lme4
approximationIn order to calculate the approximation of Cook’s distance values from lme4
, we use the cooks.distance()
function.
lme_cd_approx <- lme4:::cooks.distance.merMod(class.lmer)
head(lme_cd_approx)
#> 1 2 3 4 5 6
#> 0.050602009 0.080124360 0.012322170 0.011276222 0.008216468 0.000021657
However, this approximation is fairly different from the one produced from HLMdiag
.
hlm_cd_approx <- HLMdiag:::cooks.distance.lmerMod(class.lmer)
head(hlm_cd_approx)
#> [1] 1.060728e-03 1.433652e-03 3.358071e-04 2.249644e-04 1.726769e-04
#> [6] 8.069538e-07
The approximation from HLMdiag
is also closer to the full refit calculated by HLMdiag
than the lme4
approximation is. The plots below show the difference between the full refit from HLMdiag
and the HLMdiag
approximation (left) and the lme4
approximation (right).
The estimates produced by the lme4
approximation are never smaller than the values from the HLMdiag
full refit, but they tend to be further off the HLMdiag
full refit values than the HLMdiag
approximation values are. The difference between the full refit and the approximation from HLMdiag
tends to be very small (< 0.0005), while the difference between the full refit and the lme4
approximation values tends to be much larger.
lme4
full refitlme4
also has a function that performs a full refit of the models without each observation or group and calculates Cook’s distance values based off those refits. We can calculate this by using the influence
function from lme4
.
lme_case <- lme4:::influence.merMod(class.lmer, data = classroom)
cd_lme_full <- cooks.distance(lme_case)
lme4
’s cook’s distance function for objects created from the influence
function has a bug that prevents us from obtaining Cook’s distance values from the influence
object using lme4
; however, we can use the cooks.distance
function from the stats
package on the influence
object from lme4
.
The values obtained from lme4
match those from car
, meaning that lme4
is also scaling the estimates by dividing by the estimated variance of each refit model, in addition to using the estimated covariance matrix from the deletion estimates, rather than the original estimates. Therefore, the lme4
package is also using this formula to calculate Cook’s distance:
\(C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
The full refits from the lme4
and car
packages both choose to calculate what is MDFFITS in the HLMdiag
package, and additionally choose to scale by dividing by the variance from each refit model. Those choices explain almost all of the differences between Cook’s distance values from the three different packages; additional variation is due to differences in how the new models are fit and coefficients estimated.
##References Bates D, Maechler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. URL http://dx.doi.org/10.18637/jss.v056.i05
Christensen R, Pearson L, Johnson W (1992). “Case-Deletion Diagnostics for Mixed Models.” Technometrics, 34(1), 38–45.
Fox J, Weisburg S (2019). A {R} Companion to Applied Regression, Third Addition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Loy A, Hofmann H (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R. Journal of Statistical Software, 56(5), 1-28.
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2020). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-148, <URL: https://CRAN.R-project.org/package=nlme>.
Shi L, Chen G (2008). “Case Deletion Diagnostics in Multilevel Models.” Journal of Multivariate Analysis, 99(9), 1860–1877.
West, B., Welch, K. & Galecki, A. (2006) Linear Mixed Models: A Practical Guide Using Statistical Software. First Edition. Chapman Hall / CRC Press. ISBN 1584884800
Zewotir T (2008). “Multiple Cases Deletion Diagnostics for Linear Mixed Models.” Communications in Statistics – Theory and Methods, 37(7), 1071–1084.