The authors assessed correct model identification rates of Akaike's information criterion (AIC), corrected criterion (AICC), consistent AIC (CAIC), Hannan and Quinn's information criterion (HQIC), and Bayesian information criterion (BIC) for selecting among cross-classified random effects models. Performance of the default values for the 5 indices used by SAS PROC MIXED when estimating a 2-level cross-classified random effects model was compared with modifications to the sample size used in the AICC, CAIC, HQIC, and BIC formulations. The sample sizes explored included the number of level 1 units (N), the average number of classification units (m), and the number of nonempty classification cells (c). The authors also assessed performance of the χ²diff test for testing the difference in fit between 2 nested cross-classified random effects models. The χ²diff test exhibited a slightly inflated Type I error rate with high power. The modified information criteria performed better than the default values did. Pairing of N with the HQIC, BIC, and CAIC and of m with the AICC worked best. Results and suggestions for future research are discussed.
Keywords: cross-classified model; information criteria; model fit indices; multilevel model
ANALYSES OF LARGE-SCALE EDUCATIONAL DATASETS are frequently complicated by the clustering of individuals within common contexts. For example, in cross-sectional data, a dataset might consist of students clustered within classrooms or schools. Longitudinal data consists of repeated measures on (and thus clustered within) individuals who might be further clustered by schools. This clustering leads to violation of the assumption of independence that is made when estimating a multiple regression model using, for example, ordinary least squares estimation. Violating this assumption can lead to spurious inferences being made on the basis of tests of parameter estimates with concomitant inflated type I error rates (Hox, [
However, use of the conventional multilevel model only works for purely clustered data. Figure 1 contains a network graph depicting the pure clustering of students (level 1) within middle schools (level 2), with multiple middle schools feeding into each high school (level 3). The pureness of the clustering is evidenced in the figure by the fact that the lines connecting students to their middle and high schools do not cross. The three levels of the hierarchy are evident because, for each high school, there is a closed set of middle schools that feed solely into that high school. For example, any student in the dataset who attended middle schools MS1, MS2, or MS3 went on to attend HSI (see Figure 1). Similarly, any student in the dataset who attended middle schools MS4 or MS5 then attended HSII.
FIGURE 1 Network graph depicting pure clustering of students within MSs and of MSs within HSs. HS = high school; MS = middle school; STU = student.
Real-world data with two clustering variables (here, middle and high school) do not always fit into this pure kind of hierarchy. Instead, a dataset might entail what is termed a cross-classified structure, wherein one of the two higher level clustering units is not purely clustered within the other. In the context of the present example, the dataset consists of multiple students per (i.e., clustered within each) middle school and of multiple students per high school; however, middle schools might not be purely clustered within high schools (contrary to what is depicted in Figure 1). Similarly, high schools might not be purely clustered within middle schools. In other words, there might not be a closed set of middle schools affiliated with each high school (or vice versa). Thus, students (level 1) are cross-classified by middle school and by high school, and each clustering unit can be considered a level 2 cross-classification factor.
Figures 2 and 3 depict the same cross-classified dataset in which students are clustered within middle schools and students are also clustered within high schools but neither clustering unit is clustered within the other. Figures 2 and 3 differ solely in that the students are presented in different orders. In Figure 2, students are aligned by middle school. In Figure 3, students are aligned by high school. The cross-classification is evidenced in these figures by the crossing of the lines connecting students with one of the two higher level clustering units (see Beretvas, [
FIGURE 2 Network graph depicting cross-classification of students by MSs and by HSs with students lined up within MSs. HS = high school; MS = middle school; STU = student.
FIGURE 3 Network graph depicting cross-classification of students by HSs and by MSs with students lined up within HSs. HS = high school; MS = middle school; STU = student.
There are several possible examples of cross-classified data in educational research. This includes the example already given (middle and high schools) as well as the classic example of school and neighborhood (Garner & Raudenbush, [
Recent methodological studies (e.g., Luo & Kwok, [
The next few sections review the formulations of possible information criteria for use with the conventional multilevel model as well as summarizing research on how well the information criteria support the fit of the correct multilevel model. This discussion addresses sample size modifications to the information criteria that have been suggested and assessed in studies by Gurka ([
Multilevel modelers do not frequently use fit indices to assess or compare the fit of multilevel models and instead tend to use statistical significance test results for each parameter estimate of interest under the assumption that the model fits. Whittaker and Furlow ([
Although they are not commonly used by multilevel modeling researchers, fit indices are automatically produced in the output of commonly used statistical modeling software programs (e.g., SAS, MLwiN, Mplus, HLM, LISREL). Fit indices could prove useful given the questionable use (Goldstein, [
When comparing two nested multilevel models, researchers can use the chi-square difference test, χ²diff:

χ²diff = −2LL_R − (−2LL_F),    (1)

where LL_R and LL_F are the log likelihoods of the more restricted (nested) model and the full model, respectively, and the resulting statistic is referred to a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated in the two models.
Calculation of the log likelihood depends on the estimation procedure used. If the two models being compared differ only in their random effects' specification, then the deviance difference in Equation 1 can be tested using restricted (also known as residual) maximum likelihood (REML) estimates of the LLs (Verbeke & Molenberghs, [
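As a minimal sketch of the deviance difference test in Equation 1 (the deviance values and the df = 1 critical value below are illustrative assumptions, not figures from the study):

```python
def chi_square_diff(deviance_nested, deviance_full):
    """Chi-square difference statistic (Equation 1): the more restricted
    (nested) model's deviance (-2 log likelihood) minus the full model's
    deviance."""
    return deviance_nested - deviance_full

# Hypothetical deviances for two models differing by one parameter;
# 3.841 is the chi-square critical value for df = 1 at alpha = .05.
stat = chi_square_diff(2480.6, 2474.1)
reject = stat > 3.841
```

With these illustrative values the statistic is 6.5, so the null hypothesis of equal fit would be rejected and the full model retained.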
Information criteria can also be used to compare the fit of nonnested models. However, when comparing the fit of two nonnested linear mixed models, the χ
Information criteria are not statistics per se but are indices whose values can be compared. In general, and depending on the formulation (see Gurka, [
Akaike's information criterion (AIC; Akaike, [
AIC = −2LL + 2q,    (2)
where q represents the number of random and fixed effects parameters estimated in the model (when FIML estimation is used). Hurvich and Tsai ([
AICC = −2LL + 2q[N*/(N* − q − 1)].    (3)
The Bayesian information criterion (BIC; Schwarz, [
BIC = −2LL + q ln(N*).    (4)
Another consistent criterion is Bozdogan's consistent AIC (CAIC; Bozdogan, [
CAIC = −2LL + q[ln(N*) + 1].    (5)
Last, SAS software includes Hannan and Quinn's information criterion (HQIC; Hannan & Quinn, [
HQIC = −2LL + 2q ln[ln(N*)].    (6)
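The five criteria in Equations 2 through 6 differ only in their penalty term and can all be computed from the same three quantities: the deviance, the number of estimated parameters q, and the chosen sample-size value N*. A minimal sketch, assuming the standard formulations (the function name is ours; N* must exceed q + 1 for the AICC and exceed 1 for the HQIC's double logarithm):

```python
import math

def info_criteria(neg2ll, q, n_star):
    """Compute the five information criteria from the deviance
    (neg2ll = -2 log likelihood), the number of estimated fixed- and
    random-effects parameters (q), and the sample-size value N* (n_star)."""
    return {
        "AIC":  neg2ll + 2 * q,
        "AICC": neg2ll + 2 * q * n_star / (n_star - q - 1),
        "BIC":  neg2ll + q * math.log(n_star),
        "CAIC": neg2ll + q * (math.log(n_star) + 1),
        "HQIC": neg2ll + 2 * q * math.log(math.log(n_star)),
    }

# The model with the smallest criterion value is preferred.
ic = info_criteria(neg2ll=1000.0, q=5, n_star=100)
```

Note how the penalties order themselves for this illustrative input: the AIC penalizes least, and the CAIC (which adds q beyond the BIC penalty) penalizes most.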
Some methodological research has been conducted that has compared identification of the correct multilevel model using information criteria (e.g., Gurka, [
Gurka (2006) compared correct model identification using the AIC, AICC, BIC, and CAIC paired with N and m under FIML estimation. In addition, correct model identification rates for the four information criteria were compared when REML estimation was used with two different restricted log-likelihood functions and with the information criteria paired with N − q and m. Gurka compared the fit of three sets of three different models (the correctly specified, underparameterized, and overparameterized models). The sets were characterized by differences in their fixed effects, their random effects, or a combination of both. Gurka manipulated level 1 and level 2 variance values to provide conditions with different intraclass correlation (ρ
When comparing models that differed only in their fixed-effects, Gurka ([
Across information criteria, Gurka ([
Whittaker and Furlow's (2009) study extended Gurka's (2006) investigation in a number of ways. First, the authors included an assessment of the performance of the HQIC criterion. In addition, the authors evaluated the performance of the five information criteria (see Equations 2 through 6) with some models that were more complex than those used by Gurka (including models with random slopes and cross-level interactions). Last, Whittaker and Furlow (2009) also used multilevel modeling to model clustered data rather than repeated measures data, with larger within-level-two sample sizes. As did Gurka ([
Overall, Whittaker and Furlow (2009) found that the BIC and CAIC indices worked best in the scenarios that they evaluated. However, Whittaker and Furlow found that BIC
As did Gurka ([
In summary, results found in Gurka ([
As noted earlier, use of conventional multilevel models permits modeling of purely clustered data, such as the clustering of students (level 1) within middle schools (level 2) within high schools (level 3). However, use of this three-level model would only work if middle schools (MSs) were purely clustered within high schools (HSs) or vice versa. In practice, it is more likely that neither clustering variable is perfectly nested within the other. For scenarios in which there are two clustering variables (here, MS and HS) that are not clustered within each other, a cross-classified data structure results. In the current example, students are cross-classified by two level 2 classification factors (namely, MS and HS). The dependency resulting from the clustering of students within each of the two classifications must still be appropriately handled, using the CCREM rather than a conventional multilevel model. Use of the CCREM permits partitioning of the variability in the outcome of interest into the portion attributable to the individual student and to the effects of each of the cross-classification factors (here, MS and HS).
A brief review of the two-level CCREM is provided here although the reader is encouraged to review more detailed descriptions published elsewhere (e.g., see Hox, [
As with conventional multilevel models, an unconditional model is formulated that contains no predictors. Adopting Raudenbush and Bryk's (2002) levels formulation, at level 1, the outcome for student i, who attended MS j1 and HS j2, is modeled as follows:
Y_i(j1j2) = π_0(j1j2) + e_i(j1j2),    (7)
where it is assumed that the level 1 residual, e_i(j1j2), is normally distributed with a mean of zero and variance σ². At level 2, the unconditional CCREM would be the following:
π_0(j1j2) = θ_0 + b_0j1 + c_0j2,    (8)
where b_0j1 is the residual for MS j1 and c_0j2 is the residual for HS j2, each assumed normally distributed with a mean of zero and variances τ_b0 and τ_c0, respectively. Substituting Equation 8 into Equation 7 yields the combined model:
Y_i(j1j2) = θ_0 + b_0j1 + c_0j2 + e_i(j1j2).    (9)
Equation 9 clearly demonstrates the partitioning of the variability in the outcome, Y_i(j1j2), into the portions attributable to variability across students (σ²), MSs (τ_b0), and HSs (τ_c0). In conventional multilevel modeling, the intraclass correlation, ρ, describes the proportion of outcome variance attributable to the clustering; for the CCREM, a per-classification ρ can be computed, for example, for the MS classification:
ρ_MS = τ_b0 / (σ² + τ_b0 + τ_c0).    (10)
The ρ for the HS classification, ρ_HS, is obtained analogously, with τ_c0 in the numerator of Equation 10.
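The variance partition in Equations 9 and 10 can be made concrete with a short sketch (the function name is ours, and the variance components are illustrative values chosen so that each classification factor accounts for 15% of the total variance):

```python
def cc_intraclass_correlations(sigma2, tau_ms, tau_hs):
    """Per-classification intraclass correlation coefficients for the
    unconditional two-level CCREM (Equation 10): each classification
    factor's share of the total outcome variance."""
    total = sigma2 + tau_ms + tau_hs
    return tau_ms / total, tau_hs / total

# Illustrative variance components; the study's generating conditions
# include intercept ICC values of 0.15 and 0.30.
rho_ms, rho_hs = cc_intraclass_correlations(sigma2=0.70, tau_ms=0.15, tau_hs=0.15)
```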
The conditional CCREM includes predictors at any of the levels. For example, a level 1 predictor, X, could be added to the model such that Equation 7 becomes:
Y_i(j1j2) = π_0(j1j2) + π_1(j1j2)X_i(j1j2) + e_i(j1j2),    (11)
and the level 2 model (Equation 8) becomes:
π_0(j1j2) = θ_0 + b_0j1 + c_0j2
π_1(j1j2) = θ_1,    (12)
if the effect of X is assumed to be fixed across MSs and HSs (i.e., Equation 12 includes no middle nor high school residuals in the formulation of the coefficient for X, π_1(j1j2)). Equation 12 could be extended to include MS, M, and HS, H, predictors should there be substantial variability in the intercept term (i.e., if τ_b0 > 0 and τ_c0 > 0):
π_0(j1j2) = θ_0 + γ_01M_j1 + γ_02H_j2 + b_0j1 + c_0j2
π_1(j1j2) = θ_1,    (13)
where the γ coefficients represent the fixed effects of the MS and HS predictors on the intercept.
Alternatively, the effect of X could be modeled as randomly varying across both MSs and HSs through the addition of the b_1j1 and c_1j2 residuals to the formula for π_1(j1j2) as follows:
π_0(j1j2) = θ_0 + b_0j1 + c_0j2
π_1(j1j2) = θ_1 + b_1j1 + c_1j2.    (14)
The M and H predictors could also be added to the random-slopes model in Equation 14 to explain variability in, say, the intercept while modeling variability in the slope of X across MSs and HSs, resulting in the following level 2 model:
π_0(j1j2) = θ_0 + γ_01M_j1 + γ_02H_j2 + b_0j1 + c_0j2
π_1(j1j2) = θ_1 + b_1j1 + c_1j2,    (15)
where each classification's pair of residuals is assumed multivariate normally distributed with means of zero and a 2 × 2 covariance matrix:

[b_0j1, b_1j1]′ ~ N(0, T_b) and [c_0j2, c_1j2]′ ~ N(0, T_c),

with T_b and T_c each containing the relevant classification's intercept and slope residual variances (τ_b0 and τ_b1, or τ_c0 and τ_c1) and their covariance.
Additional combinations of predictors and random effects, and even additional levels, can be modeled, although these are not detailed further here. Parameters (and associated standard errors) of the CCREM can be estimated using several multilevel modeling software programs (including MLwiN and HLM) as well as more general statistical software programs (SAS, R, and Stata).
As part of a larger study, Meyers and Beretvas (2006) assessed the performance of the BIC and AIC for data generated to fit a CCREM. The authors estimated the correct CCREM and then a conventional multilevel model that ignored one of the classification factors and was thus underparameterized. Meyers and Beretvas ([
Meyers and Beretvas ([
When SAS PROC MIXED is used to estimate CCREMs[
Therefore, the present study assessed use of the AIC, AICC, BIC, CAIC, and HQIC for correct CCREM model identification. Each information criterion was assessed using its default formulation in SAS and (if different) in combination with N, m, and c as values for N*. The model fit of three CCREMs was compared: the correctly specified CCREM, an underparameterized CCREM (excluding one truly nonzero fixed effect), and an overparameterized CCREM (including an additional truly zero fixed effect). We conducted these three-model comparisons for two true models. One true model included a fixed level 1 predictor and the other included a level 1 predictor that varied randomly across both level 2 classifications. Several design conditions were manipulated that contribute to the degree of clustering, the sample size, and the degree of cross-classification.
We conducted a simulation study to assess correct CCREM model identification using the AIC, BIC, HQIC, CAIC, and AICC. The AICC, HQIC, BIC, and CAIC were each paired with three alternative values for N*. The N* alternatives included the total sample size, N; the maximum number of units of the cross-classification factors, m; and the number of nonempty cross-classification cells, c. The AIC is not a function of N, so it was not modified to include the different values of N*. In addition, we also assessed the default values (where N* = 1) used in SAS PROC MIXED for the CAIC, HQIC, and BIC. The N* used with each information criterion is identified here by a subscript representing the N*. Thus, we assessed the performance of the following 16 information criteria: AIC, AICC
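Of the three N* alternatives, c is the least familiar; it can be tallied directly from the two classification identifiers. A minimal sketch (the function name and the toy identifiers are ours):

```python
def nonempty_cells(ms_ids, hs_ids):
    """Number of nonempty cross-classification cells, c: the count of
    distinct (MS, HS) combinations containing at least one student."""
    return len(set(zip(ms_ids, hs_ids)))

# Five students: MS 1 sends students to HSs 1 and 2; MS 2 to HSs 2 and 3,
# yielding four nonempty cells.
c = nonempty_cells([1, 1, 2, 2, 2], [1, 2, 2, 2, 3])
```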
We assessed performance of the information criteria and χ
The level 1 equation used for the FixedX and RandomX models matches Equation 11, which includes a single level 1 predictor, X. The level 2 equation for the FixedX generating model included a MS predictor, M, in the equation for the intercept, as well as residual variability across middle and high schools, and was as follows:
π_0(j1j2) = θ_0 + γ_01M_j1 + b_0j1 + c_0j2
π_1(j1j2) = θ_1.    (16)
The same intercept, π_0(j1j2), equation as in Equation 16 was used for the RandomX generating model; however, the equation for the level 1 predictor's slope coefficient, π_1(j1j2), included residual terms for both MSs and HSs (i.e., the slope randomly varied across both classification factors):
π_0(j1j2) = θ_0 + γ_01M_j1 + b_0j1 + c_0j2
π_1(j1j2) = θ_1 + b_1j1 + c_1j2.    (17)
Parameter values used when generating FixedX and RandomX data included γ
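To make the FixedX generating model (Equation 16) concrete, the following sketch simulates one cross-classified dataset. All numeric values (the fixed effects, the variance components, the circulant two-HS feeder pattern, and the 64%/36% split) are illustrative stand-ins, not the study's exact generating values:

```python
import math
import random

random.seed(42)

# Illustrative design values loosely matching one study condition.
m = 25          # units per classification factor (number of MSs = number of HSs)
n_bar = 50      # students per middle school
theta0, theta1, gamma01 = 0.0, 0.5, 0.3
sigma2, tau_ms, tau_hs = 0.70, 0.15, 0.15   # intercept ICCs of .15 and .15

b0 = [random.gauss(0.0, math.sqrt(tau_ms)) for _ in range(m)]  # MS residuals
c0 = [random.gauss(0.0, math.sqrt(tau_hs)) for _ in range(m)]  # HS residuals
M = [random.gauss(0.0, 1.0) for _ in range(m)]                 # MS predictor

rows = []
for j1 in range(m):
    # Each MS feeds two HSs (a circulant stand-in for the Table 2 pattern),
    # with an unbalanced 64%/36% split of its students.
    feeder_hs = (j1, (j1 + 1) % m)
    for _ in range(n_bar):
        j2 = feeder_hs[0] if random.random() < 0.64 else feeder_hs[1]
        x = random.gauss(0.0, 1.0)
        y = (theta0 + gamma01 * M[j1] + b0[j1] + c0[j2]
             + theta1 * x + random.gauss(0.0, math.sqrt(sigma2)))
        rows.append((j1, j2, x, y))
```

Each generated row carries both classification identifiers (j1, j2), so the resulting dataset has the crossed, rather than nested, structure that the CCREM requires.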
Fit of the generated datasets was compared across three estimating models. The three estimating models included the correct model (Equation 16 or 17, depending on the condition), an underparameterized model, and an overparameterized model. The underparameterized model excluded one parameter as compared with the relevant correct model; the overparameterized model included one unnecessary parameter. More specifically, the underparameterized model did not include the MS predictor, M, and the overparameterized model included a fixed-effect coefficient for the HS predictor, H, of the intercept, with a true value of zero. Thus, for the FixedX-generated data, the models in Equations 12, 16, and 13 corresponded to the underparameterized, correct, and overparameterized models, respectively. For the RandomX-generated data, the relevant underparameterized, correct, and overparameterized models appear in Equations 14, 17, and 15, respectively.
These very basic models were chosen as a simple starting point for this assessment of the performance of the various information criteria. In addition, although information criteria are typically used when comparing nonnested models, they can also be used to choose between nested models. The use of the three nested models in the present study permitted assessment of the performance of the χ
We used SAS to generate data, estimate the models (SAS PROC MIXED), and summarize results. Given that the models being compared differed only in their fixed-effects specification, FIML estimation was required and thus was used. Correct model identification rates were compared across the three estimating models for both the FixedX and RandomX scenarios. The following conditions were manipulated: the per-classification intercept ρ
Table 1 lists design conditions and their values. We subsequently provide further details.
TABLE 1 Simulation Study Conditions
Condition | Intercept ICC: MS | Intercept ICC: HS | % to 1st HS | % to 2nd HS | % to 3rd HS | % to 4th HS | Units per factor (m) | Students per MS
1 | 0.15 | 0.15 | 64% | 36% | — | — | 50 | 25
2 | 0.15 | 0.30 | 64% | 36% | — | — | 50 | 25
3 | 0.30 | 0.15 | 64% | 36% | — | — | 50 | 25
4 | 0.15 | 0.15 | 50% | 50% | — | — | 50 | 25
5 | 0.15 | 0.30 | 50% | 50% | — | — | 50 | 25
6 | 0.30 | 0.15 | 50% | 50% | — | — | 50 | 25
7 | 0.15 | 0.15 | 64% | 36% | — | — | 25 | 50
8 | 0.15 | 0.30 | 64% | 36% | — | — | 25 | 50
9 | 0.30 | 0.15 | 64% | 36% | — | — | 25 | 50
10 | 0.15 | 0.15 | 50% | 50% | — | — | 25 | 50
11 | 0.15 | 0.30 | 50% | 50% | — | — | 25 | 50
12 | 0.30 | 0.15 | 50% | 50% | — | — | 25 | 50
13 | 0.15 | 0.15 | 64% | 12% | 12% | 12% | 50 | 25
14 | 0.15 | 0.30 | 64% | 12% | 12% | 12% | 50 | 25
15 | 0.30 | 0.15 | 64% | 12% | 12% | 12% | 50 | 25
16 | 0.15 | 0.15 | 25% | 25% | 25% | 25% | 50 | 25
17 | 0.15 | 0.30 | 25% | 25% | 25% | 25% | 50 | 25
18 | 0.30 | 0.15 | 25% | 25% | 25% | 25% | 50 | 25
19 | 0.15 | 0.15 | 64% | 12% | 12% | 12% | 25 | 50
20 | 0.15 | 0.30 | 64% | 12% | 12% | 12% | 25 | 50
21 | 0.30 | 0.15 | 64% | 12% | 12% | 12% | 25 | 50
22 | 0.15 | 0.15 | 25% | 25% | 25% | 25% | 25 | 50
23 | 0.15 | 0.30 | 25% | 25% | 25% | 25% | 25 | 50
24 | 0.30 | 0.15 | 25% | 25% | 25% | 25% | 25 | 50
25 | 0.15 | 0.15 | 64% | 36% | — | — | 50 | 50
26 | 0.15 | 0.30 | 64% | 36% | — | — | 50 | 50
27 | 0.30 | 0.15 | 64% | 36% | — | — | 50 | 50
28 | 0.15 | 0.15 | 50% | 50% | — | — | 50 | 50
29 | 0.15 | 0.30 | 50% | 50% | — | — | 50 | 50
30 | 0.30 | 0.15 | 50% | 50% | — | — | 50 | 50
31 | 0.15 | 0.15 | 64% | 36% | — | — | 25 | 25
32 | 0.15 | 0.30 | 64% | 36% | — | — | 25 | 25
33 | 0.30 | 0.15 | 64% | 36% | — | — | 25 | 25
34 | 0.15 | 0.15 | 50% | 50% | — | — | 25 | 25
35 | 0.15 | 0.30 | 50% | 50% | — | — | 25 | 25
36 | 0.30 | 0.15 | 50% | 50% | — | — | 25 | 25
37 | 0.15 | 0.15 | 64% | 12% | 12% | 12% | 50 | 50
38 | 0.15 | 0.30 | 64% | 12% | 12% | 12% | 50 | 50
39 | 0.30 | 0.15 | 64% | 12% | 12% | 12% | 50 | 50
40 | 0.15 | 0.15 | 25% | 25% | 25% | 25% | 50 | 50
41 | 0.15 | 0.30 | 25% | 25% | 25% | 25% | 50 | 50
42 | 0.30 | 0.15 | 25% | 25% | 25% | 25% | 50 | 50
43 | 0.15 | 0.15 | 64% | 12% | 12% | 12% | 25 | 25
44 | 0.15 | 0.30 | 64% | 12% | 12% | 12% | 25 | 25
45 | 0.30 | 0.15 | 64% | 12% | 12% | 12% | 25 | 25
46 | 0.15 | 0.15 | 25% | 25% | 25% | 25% | 25 | 25
47 | 0.15 | 0.30 | 25% | 25% | 25% | 25% | 25 | 25
48 | 0.30 | 0.15 | 25% | 25% | 25% | 25% | 25 | 25
Note. ICC = intercept intraunit correlation coefficient; % to kth HS = percentage of a MS's students attending its kth affiliated high school.
In a summary of four studies that assessed reasonable values for two-level models' ρ
The degree of cross-classification was operationalized using two factors, including the number of HSs attended by students at each MS and the distribution of the middle school students across the HSs. Luo and Kwok ([
TABLE 2 Cross-Classification Pattern Simulated in Two High Schools Conditions
High school Middle school HS1 HS2 HS3 HS4 HS5 ... HS HS MS1 × × MS2 × × MS3 × × MS4 × × MS5 × ... ... ... ... ... ... ... ... ... ... MS ... × × MS × ×
TABLE 3 Cross-Classification Pattern Simulated in Four High Schools Conditions
High school Middle school HS1 HS2 HS3 HS4 HS5 ... HS HS MS1 × × × × MS2 × × × × MS3 × × × × MS4 × × × × MS5 × × × ... ... ... ... ... ... ... ... ... ... MS ... × × MS × ... × ×
Two patterns of distributions of MSs feeding into each HS were investigated including a balanced and an unbalanced distribution. The distributional pattern was reflected in the percent of students attending each of the affiliated HSs (with unbalanced and balanced distributions operationalized respectively as 64%:36% and 50%:50% for the two cross-classifications conditions, and 64%:12%:12%:12% versus 25%:25%:25%:25% for the four cross-classifications conditions).
Two values (m = 25 and m = 50) were used for the number of units per cross-classification factor (i.e., the number of MSs and the number of HSs). The value of m was generated to be the same for both MSs and HSs in the present study. Each value of m was paired with two values (25 and 50) for the average number of students per MS. Meyers and Beretvas ([
Correct model identification rates were tallied across replications for each combination of conditions for the two generating models and for each of the 16 information criteria being investigated. Correct model identification rates were also gathered for FixedX and RandomX scenarios using the χ
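The tallying of correct model identifications described above reduces to checking, per replication, whether the correctly specified model attains the smallest criterion value. A minimal sketch (function name and the three hypothetical replications are ours):

```python
def correct_identification_rate(triples):
    """Proportion of replications in which the correctly specified model
    has the smallest criterion value. Each triple holds one replication's
    criterion values for the (underparameterized, correctly specified,
    overparameterized) models."""
    hits = sum(1 for under, correct, over in triples
               if correct < under and correct < over)
    return hits / len(triples)

# Three hypothetical replications: the correct model "wins" in two of them.
rate = correct_identification_rate([(510.2, 504.8, 506.1),
                                    (498.7, 500.3, 501.9),
                                    (620.0, 611.5, 613.2)])
```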
Table 4 contains the proportion of converged solutions (out of the 500 replications for each of the 48 combinations of conditions). The proportion of converged solutions is presented for each of the three models estimated for FixedX and for RandomX datasets. In addition, the overall proportion of replications for which converged solutions were obtained across all three estimated models is also presented in the last two columns of Table 4 for FixedX and RandomX datasets.
TABLE 4 Proportion of Converged Solutions (Out of 500), by Condition, Generating Model, and Estimating Model
Condition | FixedX: Under. | FixedX: Correct | FixedX: Over. | RandomX: Under. | RandomX: Correct | RandomX: Over. | Overall: FixedX | Overall: RandomX
1 | 97.8% | 78.6% | 78.8% | 95.0% | 76.8% | 76.8% | 76.2% | 73.6%
2 | 100.0% | 75.2% | 75.0% | 97.4% | 74.6% | 75.0% | 74.8% | 73.6%
3 | 99.6% | 98.8% | 98.8% | 96.8% | 95.8% | 95.8% | 98.6% | 95.0%
4 | 98.4% | 80.2% | 79.8% | 97.0% | 78.8% | 78.8% | 78.4% | 77.0%
5 | 100.0% | 76.8% | 77.8% | 96.4% | 74.0% | 74.0% | 76.8% | 73.6%
6 | 100.0% | 99.4% | 99.2% | 97.2% | 97.2% | 97.2% | 99.2% | 96.8%
7 | 99.0% | 82.4% | 82.0% | 97.2% | 83.2% | 82.8% | 80.6% | 81.0%
8 | 100.0% | 87.8% | 88.2% | 96.8% | 85.4% | 85.0% | 87.6% | 84.2%
9 | 100.0% | 100.0% | 99.8% | 98.0% | 97.8% | 97.8% | 99.8% | 97.8%
10 | 98.0% | 85.0% | 84.6% | 97.4% | 84.2% | 83.8% | 82.8% | 82.4%
11 | 100.0% | 90.2% | 90.6% | 98.4% | 86.6% | 86.6% | 90.2% | 86.4%
12 | 99.4% | 99.6% | 99.4% | 97.8% | 97.8% | 97.6% | 99.4% | 97.6%
13 | 99.0% | 83.0% | 82.8% | 96.2% | 78.6% | 78.6% | 81.4% | 76.0%
14 | 100.0% | 80.8% | 80.4% | 96.6% | 77.6% | 78.2% | 80.0% | 76.4%
15 | 99.8% | 100.0% | 100.0% | 98.0% | 97.8% | 97.8% | 99.8% | 97.6%
16 | 99.6% | 90.2% | 90.4% | 96.4% | 86.4% | 86.2% | 89.6% | 84.6%
17 | 100.0% | 91.2% | 91.2% | 99.2% | 89.0% | 89.0% | 91.2% | 88.6%
18 | 100.0% | 100.0% | 100.0% | 98.8% | 98.8% | 98.6% | 100.0% | 98.6%
19 | 99.0% | 90.4% | 91.0% | 98.6% | 88.4% | 87.8% | 89.2% | 87.2%
20 | 100.0% | 93.2% | 93.2% | 97.0% | 92.6% | 93.0% | 93.0% | 91.8%
21 | 100.0% | 100.0% | 100.0% | 97.6% | 98.0% | 98.0% | 100.0% | 97.6%
22 | 100.0% | 95.8% | 95.2% | 99.0% | 95.0% | 94.2% | 95.2% | 93.8%
23 | 100.0% | 98.8% | 98.8% | 99.6% | 99.0% | 99.0% | 98.8% | 98.8%
24 | 100.0% | 100.0% | 100.0% | 99.0% | 99.0% | 99.0% | 100.0% | 99.0%
25 | 100.0% | 83.8% | 84.0% | 99.4% | 85.8% | 86.0% | 83.2% | 84.8%
26 | 100.0% | 88.6% | 88.8% | 100.0% | 87.8% | 88.2% | 88.6% | 87.8%
27 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
28 | 100.0% | 87.4% | 87.4% | 100.0% | 87.8% | 88.2% | 86.6% | 87.8%
29 | 100.0% | 90.0% | 90.0% | 100.0% | 90.6% | 90.6% | 90.0% | 90.4%
30 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
31 | 92.8% | 72.8% | 70.8% | 75.6% | 62.0% | 61.6% | 66.4% | 55.4%
32 | 99.8% | 73.8% | 74.2% | 84.0% | 67.0% | 66.2% | 73.0% | 64.0%
33 | 96.6% | 96.8% | 95.8% | 85.2% | 83.2% | 82.4% | 95.0% | 81.4%
34 | 93.6% | 67.8% | 65.6% | 79.2% | 63.8% | 61.4% | 62.4% | 55.4%
35 | 100.0% | 78.8% | 78.6% | 85.4% | 64.4% | 64.2% | 78.0% | 62.2%
36 | 97.6% | 97.4% | 95.8% | 85.0% | 85.6% | 84.2% | 95.2% | 83.0%
37 | 100.0% | 92.2% | 92.0% | 100.0% | 92.6% | 92.4% | 92.0% | 92.4%
38 | 100.0% | 92.8% | 92.8% | 100.0% | 92.6% | 92.8% | 92.8% | 92.6%
39 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
40 | 100.0% | 97.4% | 97.4% | 100.0% | 98.0% | 97.8% | 97.2% | 97.8%
41 | 100.0% | 99.8% | 99.8% | 100.0% | 97.6% | 97.6% | 99.8% | 97.6%
42 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
43 | 92.6% | 75.2% | 74.2% | 78.8% | 65.4% | 64.6% | 70.0% | 58.0%
44 | 100.0% | 80.8% | 80.6% | 87.0% | 70.2% | 70.2% | 80.2% | 68.4%
45 | 97.2% | 97.8% | 97.0% | 85.2% | 84.6% | 83.8% | 96.0% | 82.6%
46 | 97.2% | 86.0% | 84.6% | 89.0% | 76.6% | 76.6% | 83.6% | 73.0%
47 | 100.0% | 87.8% | 87.8% | 91.6% | 79.0% | 78.4% | 87.4% | 77.6%
48 | 99.2% | 99.4% | 99.4% | 88.8% | 89.2% | 88.8% | 99.2% | 87.6%
Note. Under. = underparameterized model; Correct = correctly specified model; Over. = overparameterized model; Overall = proportion of replications in which all three estimated models converged.
The pattern of convergence rates was very similar across generating models. Convergence rates were consistently better for the FixedX generating model's conditions than the RandomX conditions in which two additional random effects' variances were generated and estimated. Substantially higher convergence rates were found for the underparameterized models versus the correct and overparameterized models. Insubstantial differences in convergence rates were found between the correct and overparameterized models estimated for both FixedX and RandomX datasets. The mean overall convergence rates across conditions for FixedX and RandomX data for the underparameterized model were 99.1% and 94.9%, respectively. The corresponding rates for the correct and overparameterized model were 90.1% and 89.9% for FixedX data and 86.6% and 86.4% for RandomX data, respectively.
Although some mention will be made of differences found across models, the focus in the remaining presentation of convergence rates will be on the results found in the last two columns of Table 4 summarizing rates when solutions converged for all three models estimated per replication. These results closely match those found for the correct and overparameterized models for both FixedX and RandomX data. Differences across conditions were not especially distinct when the underparameterized model was estimated.
Several factors seemed to have had particularly strong effects on convergence rates: the sample size and the value and pattern of intercept ρ
Convergence rates were also found to be a function of the intercept ρ
The cross-classification distribution and degree also affected convergence rates. Convergence was found to be better when there were more cross-classifications than when there were fewer. In other words, in conditions where each HS was fed by two MSs (see Table 2), the convergence rates were substantially lower than when each HS was fed by four MSs (see Table 3). In addition, the more balanced distributions (50%:50% versus 64%:36%, and 25%:25%:25%:25% versus 64%:12%:12%:12% for the two and four HS conditions, respectively) led to better convergence rates. However, the effect of the number of cross-classifications appeared to be stronger than the effect of the distribution of cross-classifications.
Fit of three (correct, under- and overparameterized) models was compared simultaneously for each dataset. Tables 5 and 6 contain the proportions of correct model identifications for 14 of the 16 information criteria for the FixedX and RandomX data, respectively. The pattern of results for FixedX and for RandomX data was very similar. Results for the default HQIC and BIC values used by SAS (here, termed HQIC
TABLE 5 Correct Model Identification Rates Using Different N*s Paired With Each Information Criterion for the FixedX Generating Model, by Condition
Version of AICC Version of HQIC Version of BIC Version of CAIC Condition AIC AICC AICC AICC HQIC HQIC HQIC BIC BIC BIC CAIC1 CAIC CAIC CAIC 1 82.94% 83.20% 90.03% 86.09% 94.23% 90.03% 90.55% 98.69% 94.23% 95.80% 65.62% 99.21% 96.33% 97.11% 2 81.02% 81.82% 89.04% 85.56% 94.92% 89.04% 90.64% 99.20% 94.92% 96.52% 64.17% 99.47% 96.79% 97.86% 3 78.91% 79.11% 88.03% 83.98% 94.52% 88.03% 90.26% 98.99% 94.52% 96.15% 62.27% 99.19% 96.96% 97.97% 4 82.14% 82.14% 88.27% 85.71% 93.62% 88.27% 90.31% 98.98% 93.62% 94.64% 67.60% 99.24% 95.92% 97.19% 5 82.29% 82.29% 91.41% 88.02% 95.57% 91.41% 93.49% 99.48% 95.57% 97.14% 63.80% 99.48% 97.40% 97.92% 6 80.85% 81.05% 88.31% 84.48% 92.54% 88.31% 89.32% 98.59% 92.54% 95.57% 64.32% 99.19% 96.57% 97.18% 7 81.14% 81.89% 92.80% 87.35% 92.80% 84.37% 87.35% 98.02% 90.57% 92.80% 62.28% 98.76% 94.05% 95.78% 8 82.88% 83.11% 94.98% 91.10% 94.98% 86.30% 91.10% 100.00% 92.92% 94.98% 66.44% 100.00% 95.43% 98.63% 9 81.16% 81.76% 93.39% 87.17% 92.39% 84.17% 87.17% 88.38% 89.98% 92.39% 63.53% 85.57% 92.18% 92.79% 10 81.88% 82.13% 93.48% 87.92% 93.48% 85.02% 87.92% 99.52% 90.58% 93.48% 68.12% 99.52% 95.41% 97.59% 11 82.04% 82.26% 94.90% 87.14% 94.90% 85.14% 87.14% 98.89% 92.02% 94.90% 68.51% 99.11% 95.57% 96.90% 12 79.68% 79.88% 91.95% 87.12% 91.55% 83.30% 86.92% 91.35% 88.73% 91.55% 66.00% 88.73% 91.15% 93.56% 13 82.06% 82.31% 89.93% 84.03% 95.09% 89.93% 93.37% 98.53% 95.09% 97.79% 66.09% 99.26% 97.54% 98.03% 14 83.00% 83.00% 89.00% 84.25% 94.00% 89.00% 91.75% 99.25% 94.25% 96.25% 66.50% 99.50% 96.00% 98.00% 15 82.16% 82.77% 89.38% 84.17% 94.79% 89.38% 91.98% 99.00% 94.99% 97.60% 66.93% 99.20% 96.99% 98.40% 16 78.80% 79.69% 86.61% 82.14% 94.20% 86.61% 90.40% 97.99% 94.20% 96.88% 62.28% 98.66% 96.43% 97.99% 17 83.11% 83.11% 88.60% 84.87% 95.40% 88.60% 92.76% 98.68% 95.40% 97.59% 64.69% 98.90% 96.93% 98.47% 18 83.00% 83.60% 89.20% 84.80% 93.60% 89.20% 91.40% 99.00% 93.60% 97.20% 69.80% 99.40% 97.00% 98.60% 19 81.39% 81.61% 94.17% 85.43% 94.17% 85.43% 
91.48% 98.21% 92.60% 95.96% 62.56% 99.10% 94.40% 97.53% 20 86.02% 86.02% 94.19% 88.39% 94.19% 88.39% 91.18% 99.14% 92.69% 95.91% 70.54% 99.36% 94.84% 98.07% 21 79.80% 80.20% 92.00% 83.40% 91.20% 83.40% 89.20% 90.40% 89.80% 92.20% 64.00% 88.80% 91.60% 91.80% 22 82.77% 82.98% 94.96% 85.71% 94.96% 85.71% 91.18% 98.53% 92.44% 96.43% 65.55% 99.16% 95.38% 97.90% 23 80.57% 81.17% 93.73% 85.83% 93.73% 85.83% 89.68% 98.99% 91.50% 95.95% 62.35% 99.60% 94.33% 97.57% 24 79.60% 79.60% 92.40% 82.60% 91.40% 82.60% 88.20% 91.00% 89.40% 92.80% 62.00% 89.80% 92.60% 92.20% 25 83.41% 83.89% 88.94% 86.78% 96.64% 88.94% 90.39% 98.80% 95.43% 97.12% 61.78% 99.28% 97.60% 97.60% 26 82.84% 83.07% 88.94% 86.01% 96.16% 88.94% 90.75% 99.55% 95.03% 96.84% 67.49% 100.00% 96.84% 97.97% 27 82.20% 82.00% 88.80% 86.80% 95.60% 88.80% 91.20% 99.00% 94.60% 96.60% 63.40% 99.00% 97.00% 97.80% 28 83.37% 83.83% 89.15% 87.07% 96.54% 89.15% 90.99% 99.08% 95.61% 97.00% 65.13% 99.54% 97.23% 97.92% 29 84.67% 84.44% 89.33% 87.56% 95.56% 89.33% 91.56% 99.11% 94.89% 96.22% 66.00% 99.33% 97.56% 98.22% 30 81.80% 82.20% 88.60% 85.60% 95.40% 88.60% 91.00% 99.40% 94.80% 96.20% 65.40% 99.80% 96.80% 97.60% 31 81.33% 81.63% 93.07% 85.84% 91.87% 83.13% 85.84% 98.49% 89.76% 93.07% 68.37% 98.80% 93.68% 95.48% 32 85.21% 86.03% 96.16% 91.78% 95.34% 89.86% 91.78% 99.45% 94.25% 96.16% 67.40% 99.18% 96.71% 97.53% 33 81.26% 81.90% 91.37% 87.16% 89.68% 84.63% 87.16% 88.63% 89.05% 90.53% 69.26% 85.90% 91.58% 91.58% 34 81.09% 81.41% 95.19% 88.78% 94.23% 85.90% 88.78% 98.72% 91.35% 95.19% 65.71% 99.68% 97.12% 97.76% 35 82.82% 82.82% 93.85% 88.97% 92.82% 85.90% 88.97% 97.44% 91.80% 93.85% 64.36% 97.95% 94.87% 94.87% 36 81.30% 81.93% 93.28% 88.03% 92.65% 85.29% 87.82% 90.97% 90.55% 92.86% 66.18% 88.66% 93.07% 92.86% 37 79.78% 80.00% 87.39% 81.96% 93.48% 87.39% 90.65% 99.35% 91.96% 95.65% 65.00% 99.57% 95.44% 98.04% 38 84.27% 84.91% 91.16% 86.42% 95.04% 91.16% 92.67% 99.35% 94.40% 97.41% 68.32% 99.35% 97.20% 98.28% 39 82.00% 82.40% 90.00% 
83.60% 94.00% 90.00% 91.80% 99.40% 93.40% 96.00% 62.80% 100.00% 95.80% 98.00% 40 80.86% 81.07% 88.48% 83.95% 92.80% 88.48% 91.15% 99.79% 92.39% 95.27% 63.37% 100.00% 94.86% 97.12% 41 82.16% 82.37% 88.38% 83.57% 95.79% 88.38% 93.59% 99.80% 95.59% 98.20% 66.53% 99.80% 97.60% 99.20% 42 80.20% 80.60% 87.00% 82.40% 93.40% 87.00% 89.80% 98.60% 92.20% 96.20% 62.00% 99.00% 95.80% 98.00% 43 78.86% 79.71% 94.86% 85.14% 94.00% 85.14% 90.00% 99.14% 91.71% 96.86% 62.00% 99.43% 95.71% 98.00% 44 78.80% 80.30% 92.77% 83.29% 92.02% 83.29% 88.28% 98.75% 89.78% 96.01% 61.35% 98.75% 94.51% 97.76% 45 80.63% 81.25% 91.04% 84.58% 89.38% 84.58% 86.88% 90.63% 87.50% 90.83% 66.25% 88.75% 90.83% 91.25% 46 84.45% 85.41% 95.46% 87.32% 93.78% 87.32% 91.39% 98.33% 92.82% 96.17% 69.62% 99.04% 95.46% 96.89% 47 84.67% 85.58% 95.65% 87.19% 94.74% 87.19% 92.45% 99.54% 93.59% 97.94% 68.88% 99.54% 96.80% 99.09% 48 79.84% 79.84% 91.13% 83.27% 89.92% 83.07% 88.31% 90.12% 88.91% 90.32% 64.92% 86.29% 90.73% 90.32%
TABLE 6 Correct Model Identification Rates Using Different N*s Paired With Each Information Criterion for the RandomX Generating Model, by Condition
                   Version of AICC         Version of HQIC       Version of BIC        Version of CAIC
Condition  AIC     AICC   AICC   AICC      HQIC   HQIC   HQIC    BIC    BIC    BIC     CAIC1  CAIC   CAIC   CAIC
1 80.16% 80.44% 90.22% 83.15% 94.02% 89.40% 92.66% 98.64% 94.02% 96.47% 60.87% 98.64% 96.20% 97.28%
2 83.15% 83.42% 90.49% 85.05% 95.65% 89.95% 92.66% 98.91% 95.65% 97.83% 65.49% 99.46% 97.01% 98.64%
3 81.90% 82.32% 88.84% 83.79% 94.11% 88.42% 90.95% 98.74% 94.11% 97.47% 65.47% 98.53% 96.84% 98.95%
4 81.04% 81.82% 90.91% 83.64% 93.25% 90.13% 92.21% 99.48% 93.25% 96.62% 62.86% 99.74% 96.62% 98.44%
5 82.34% 82.88% 91.30% 85.33% 94.84% 90.22% 92.39% 99.19% 94.84% 97.28% 69.02% 99.73% 97.01% 98.37%
6 80.58% 80.58% 88.22% 83.47% 92.56% 87.60% 90.50% 97.73% 92.56% 95.87% 62.19% 98.76% 95.04% 97.52%
7 79.26% 79.75% 95.80% 83.46% 93.33% 82.96% 89.38% 99.26% 90.12% 95.31% 62.96% 99.26% 94.57% 97.78%
8 83.61% 83.85% 97.15% 87.65% 95.72% 86.94% 92.40% 99.29% 92.87% 97.15% 65.80% 99.53% 96.44% 98.34%
9 83.23% 83.85% 91.62% 87.73% 91.82% 87.32% 90.80% 86.30% 91.41% 90.59% 64.62% 83.44% 91.21% 91.00%
10 81.31% 81.31% 97.09% 87.62% 94.90% 85.19% 90.05% 99.52% 91.02% 97.09% 64.81% 99.76% 95.63% 98.06%
11 82.18% 82.87% 96.30% 87.50% 94.21% 86.11% 91.44% 98.38% 91.90% 95.37% 63.66% 99.07% 94.91% 98.15%
12 80.33% 80.33% 92.62% 85.25% 91.39% 84.84% 88.93% 88.73% 89.75% 92.01% 64.75% 86.48% 91.80% 91.60%
13 83.42% 83.95% 91.84% 85.00% 95.53% 90.79% 93.16% 98.42% 95.53% 96.84% 70.26% 99.21% 96.32% 98.42%
14 85.86% 85.60% 92.15% 87.70% 95.81% 90.84% 93.72% 98.43% 95.81% 97.64% 70.42% 99.22% 96.86% 98.43%
15 84.02% 84.43% 90.16% 86.07% 94.88% 89.75% 92.62% 98.57% 94.88% 96.93% 65.37% 98.57% 96.72% 97.95%
16 82.03% 82.27% 91.25% 83.45% 95.27% 89.36% 93.85% 98.82% 95.27% 97.64% 66.67% 99.53% 97.40% 98.58%
17 81.72% 81.94% 88.71% 82.84% 93.91% 88.26% 91.20% 99.32% 93.91% 97.29% 65.69% 99.32% 96.39% 98.87%
18 80.12% 80.12% 86.82% 81.74% 92.09% 85.80% 89.45% 98.78% 92.09% 96.76% 62.48% 99.59% 95.54% 97.97%
19 84.17% 83.95% 98.39% 90.83% 96.79% 90.60% 93.58% 99.77% 94.50% 97.94% 65.60% 99.77% 96.79% 99.31%
20 78.43% 79.09% 96.73% 85.62% 92.38% 84.10% 88.67% 99.13% 89.98% 95.21% 62.09% 99.35% 94.12% 97.39%
21 81.15% 81.35% 91.80% 85.25% 90.37% 85.04% 87.30% 88.32% 88.32% 90.78% 64.75% 84.63% 90.57% 91.19%
22 82.52% 82.94% 98.08% 87.63% 94.03% 86.99% 91.05% 99.15% 91.90% 95.95% 62.90% 99.36% 95.31% 98.51%
23 79.96% 79.76% 96.96% 84.21% 93.73% 83.40% 88.87% 98.99% 90.89% 96.36% 61.74% 98.99% 94.94% 97.37%
24 80.81% 81.41% 93.33% 86.47% 91.52% 85.46% 88.28% 90.10% 89.50% 92.53% 66.87% 88.28% 91.92% 92.32%
25 83.02% 83.49% 90.57% 84.43% 95.52% 88.92% 92.45% 99.29% 94.58% 97.64% 64.39% 99.76% 97.41% 98.59%
26 84.51% 84.97% 91.34% 86.56% 96.58% 90.21% 92.94% 99.54% 96.36% 98.63% 66.97% 99.77% 98.41% 99.54%
27 82.20% 81.80% 88.40% 83.00% 93.40% 86.60% 90.20% 98.40% 92.20% 95.60% 63.80% 99.20% 95.20% 97.00%
28 80.18% 80.18% 88.16% 81.78% 94.31% 87.02% 91.12% 99.77% 92.94% 97.04% 63.55% 99.77% 96.13% 98.86%
29 82.30% 82.74% 91.37% 84.74% 96.24% 90.27% 92.26% 99.56% 96.02% 98.23% 63.50% 99.56% 98.01% 99.34%
30 81.80% 82.00% 89.00% 84.20% 93.40% 88.40% 90.80% 97.60% 93.00% 95.60% 64.60% 98.20% 94.80% 97.00%
31 80.51% 81.59% 96.39% 86.64% 92.78% 85.56% 89.89% 97.83% 91.34% 96.03% 61.73% 98.56% 96.03% 97.47%
32 80.94% 81.56% 95.63% 86.88% 93.13% 86.25% 89.69% 97.19% 90.94% 95.31% 61.56% 97.81% 94.38% 96.88%
33 84.03% 84.52% 94.10% 88.21% 92.14% 87.72% 89.44% 87.72% 90.42% 93.37% 64.13% 85.75% 93.61% 90.66%
34 83.76% 84.12% 96.75% 87.37% 94.22% 86.64% 91.34% 98.56% 91.70% 96.03% 66.79% 99.64% 95.67% 97.47%
35 79.42% 79.74% 95.50% 85.21% 93.25% 84.57% 91.32% 98.39% 92.28% 95.18% 64.95% 99.36% 94.53% 96.79%
36 79.52% 80.00% 92.29% 85.30% 89.88% 84.58% 86.75% 89.16% 89.16% 90.60% 63.61% 85.78% 90.36% 90.12%
37 84.20% 84.85% 91.99% 85.93% 96.32% 91.56% 94.37% 99.35% 96.32% 97.62% 66.67% 99.57% 97.40% 98.05%
38 81.43% 81.64% 92.44% 84.67% 97.62% 90.07% 94.17% 99.78% 96.76% 98.27% 63.50% 99.78% 98.06% 99.35%
39 82.40% 82.60% 89.40% 84.40% 95.20% 88.00% 91.80% 99.20% 95.00% 97.20% 66.60% 99.80% 97.00% 98.20%
40 83.64% 83.85% 91.82% 85.48% 95.91% 89.98% 93.87% 99.39% 95.30% 97.14% 65.85% 99.80% 96.93% 98.16%
41 84.43% 84.02% 91.39% 85.04% 95.90% 91.19% 93.44% 99.18% 95.29% 98.16% 68.03% 99.18% 98.16% 98.77%
42 81.00% 81.20% 91.40% 83.80% 95.00% 89.80% 92.40% 99.20% 94.60% 97.00% 63.20% 99.20% 96.80% 98.40%
43 79.31% 80.35% 95.52% 83.79% 92.41% 82.76% 87.93% 96.90% 88.62% 94.48% 60.00% 98.62% 94.14% 96.55%
44 83.33% 83.63% 96.49% 87.14% 94.15% 86.84% 89.77% 98.54% 91.23% 96.20% 66.96% 99.12% 96.20% 97.95%
45 80.63% 81.11% 93.71% 86.44% 92.98% 84.99% 90.07% 91.28% 90.80% 93.46% 60.53% 87.89% 92.98% 92.98%
46 83.29% 84.11% 95.07% 86.85% 92.33% 85.75% 90.41% 98.08% 90.69% 94.80% 64.66% 98.36% 94.52% 97.26%
47 81.96% 82.47% 96.39% 85.83% 93.30% 85.31% 88.92% 97.68% 90.72% 95.36% 68.81% 98.97% 94.59% 97.42%
48 82.42% 83.56% 93.84% 87.44% 92.47% 87.22% 90.41% 93.15% 91.78% 92.92% 70.09% 90.41% 92.92% 93.61%
The default CAIC value (here, CAIC
The performance of the AIC and AICC
Correct model identification rates for the modified versions of the AICC, HQIC, BIC and CAIC were also assessed. Of the three modifications to the AICC, the AICC
There also seemed to be a slight effect of the pattern and values of the intercept ρ
TABLE 7 Mean Correct Model Identification Rates for FixedX Data Summarized by Number of Cross-Classification Units and Intraunit Correlation Coefficient Generating Values, by Condition
Conditions          Version of AICC         Version of HQIC       Version of BIC        Version of CAIC
Units  ρIUCC    AIC    AICC   AICC   AICC   HQIC   HQIC   HQIC    BIC    BIC    BIC     CAIC1  CAIC   CAIC   CAIC
50  0.15,0.15  81.67% 82.02% 88.60% 84.72% 94.57% 88.60% 90.98% 98.90% 94.06% 96.27% 64.61% 99.34% 96.42% 97.63%
50  0.15,0.30  82.92% 83.13% 89.48% 85.78% 95.31% 89.48% 92.15% 99.30% 95.01% 97.02% 65.94% 99.48% 97.04% 98.24%
50  0.30,0.15  81.39% 81.72% 88.66% 84.48% 94.23% 88.66% 90.85% 99.00% 93.83% 96.44% 64.62% 99.35% 96.62% 97.94%
25  0.15,0.15  81.61% 82.10% 94.25% 86.69% 93.66% 85.25% 89.24% 98.62% 91.48% 95.00% 65.52% 99.19% 95.15% 97.12%
25  0.15,0.30  82.88% 83.41% 94.53% 87.96% 94.09% 86.49% 90.07% 99.03% 92.32% 95.71% 66.23% 99.19% 95.38% 97.55%
25  0.30,0.15  80.41% 80.80% 92.07% 85.42% 91.02% 83.88% 87.71% 90.18% 89.24% 91.68% 65.27% 87.81% 91.72% 92.04%
The modifications of the value used for N* in the HQIC formula (see Equation 6) worked substantially better than the default value of one used in SAS PROC MIXED (when used to estimate CCREMs) for which the HQIC
As mentioned earlier, the default BIC
The performance of four versions of the CAIC is also presented in Tables 5 and 6. The default CAIC
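The competing formulations above differ only in their penalty terms and in which sample-size value N* enters those penalties. As a minimal sketch (assuming the standard PROC MIXED formulations of each criterion; the function name and example numbers here are illustrative, not from the article), the criteria can be computed from a model's deviance as:

```python
import math

def info_criteria(neg2ll: float, q: int, n_star: float) -> dict:
    """Information criteria for a fitted mixed model.

    neg2ll : -2 * (maximized log-likelihood), i.e., the deviance
    q      : number of estimated parameters
    n_star : sample-size value plugged into the penalty, e.g., the number
             of level-1 units (N), the average number of classification
             units (m), or the number of nonempty cross-classified cells (c)
    """
    return {
        "AIC":  neg2ll + 2 * q,
        "AICC": neg2ll + 2 * q * n_star / (n_star - q - 1),  # requires n_star > q + 1
        "HQIC": neg2ll + 2 * q * math.log(math.log(n_star)),
        "BIC":  neg2ll + q * math.log(n_star),
        "CAIC": neg2ll + q * (math.log(n_star) + 1),
    }

# Same fit (deviance = 5000, q = 6 parameters) evaluated with N* = 1000
# level-1 units versus N* = 25 classification units: the consistent
# criteria (BIC, CAIC) penalize far more heavily under the larger N*.
with_N = info_criteria(5000.0, 6, 1000)
with_m = info_criteria(5000.0, 6, 25)
```

Note that the AIC ignores N* entirely, which is why only the AICC, HQIC, BIC, and CAIC have N, m, and c variants in Tables 5 and 6.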
Fit of the correct versus underparameterized (power) and correct versus overparameterized models (Type I error) was compared using the χ
TABLE 8 Percentage of Converged Solutions in Which a Type I Error Occurred When Selecting the Correctly Specified Model Over the Overparameterized Model Using the Deviance Statistic for the FixedX and RandomX Generating Models, by Condition
           Generating model
Condition  FixedX  RandomX
1   6.04%  6.25%
2   5.62%  4.35%
3   5.48%  6.32%
4   6.38%  7.01%
5   4.69%  5.98%
6   7.46%  7.44%
7   7.69%  7.41%
8   5.25%  4.99%
9   5.41%  5.11%
10  7.00%  5.83%
11  5.77%  5.79%
12  8.45%  5.74%
13  4.91%  4.74%
14  6.50%  4.19%
15  6.01%  5.53%
16  6.03%  4.73%
17  4.82%  6.32%
18  6.80%  8.11%
19  5.83%  3.21%
20  5.81%  8.28%
21  6.60%  7.17%
22  5.25%  6.40%
23  6.88%  6.48%
24  6.40%  5.45%
25  4.81%  5.66%
26  5.64%  4.78%
27  5.80%  8.00%
28  4.85%  7.06%
29  5.11%  4.20%
30  5.20%  7.40%
31  7.83%  5.78%
32  3.84%  6.56%
33  6.74%  6.39%
34  5.13%  5.78%
35  6.67%  5.79%
36  5.67%  5.30%
37  8.26%  3.90%
38  5.82%  3.67%
39  7.00%  5.80%
40  7.61%  4.91%
41  4.61%  4.71%
42  7.80%  5.40%
43  5.71%  6.90%
44  7.23%  5.26%
45  7.71%  5.33%
46  4.78%  6.03%
47  4.35%  6.44%
48  6.45%  6.39%
The χ
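In practice, the χ²diff comparison of two nested CCREMs reduces to a likelihood ratio test on their deviances. A minimal sketch for the single-parameter difference examined in this study (df = 1, for which the chi-square survival function has the closed form erfc(√(x/2)); the function name is illustrative):

```python
import math

def chi2diff_pvalue(deviance_reduced: float, deviance_full: float) -> float:
    """p-value for the chi-square difference (likelihood ratio) test of two
    nested models that differ by a single parameter (df = 1).

    For df = 1, P(chi-square > x) = erfc(sqrt(x / 2)).
    """
    x = deviance_reduced - deviance_full  # deviance drop from the added parameter
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# The df = 1 critical value at alpha = .05 is about 3.84: a deviance drop
# larger than that leads to retaining the additional parameter.
```

For nested models differing by more than one parameter, the same comparison would use the chi-square distribution with df equal to the difference in the number of estimated parameters.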
The present study was designed to extend the work of several researchers who have assessed the performance of information criteria in terms of their correct multilevel model identification rates (Gurka, [
Estimation of random effects' variances in multilevel models is generally quite challenging under many conditions. Estimation of CCREMs' random effects' variances is no exception and thus it was not surprising that in scenarios with larger true intercept ρ
When interpreting this result, it should be remembered that the correct (generating) model included a MS predictor that was omitted when estimating the underparameterized model. Omission of this predictor increases the MS variability which seemingly made the remaining MS and HS ρ
Another unexpected pattern was identified in the convergence rates (see Table 4). The number of cross-classifications per MS (two versus four) had a stronger effect than did the degree of cross-classification (balanced versus unbalanced). Under the balanced conditions, the average per-cross-classification sample size is larger than in the unbalanced conditions. For example, in the 50%:50% conditions, the per-cross-classification cell size (equaling 12.5 for the = 25 conditions) is considerably larger than in the 64%:12%:12%:12% conditions (equaling 6.5 for the = 25 conditions). Yet convergence rates were better for the unbalanced, four-cross-classifications-per-MS conditions than for the balanced, two-cross-classifications-per-MS conditions. This suggests that the fullness of the cross-classification table (compare Tables 2 and 3) improves convergence rates more than does the average cross-classification cell size. Future research should investigate additional patterns of sparseness and sample sizes to test this finding further.
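The per-cell arithmetic described above can be sketched as follows. This is a toy illustration assuming 25 students per middle school split exactly in proportion across feeder high schools; with whole-student allocation the article's averages differ slightly (e.g., 6.5 rather than 6.25 in the four-school case), and the function name is illustrative:

```python
def mean_cell_size(students_per_ms: int, feeder_shares: list) -> float:
    """Average number of students per nonempty middle school x high school
    cell when each middle school's students split across the feeder high
    schools in the given proportions."""
    cells = [students_per_ms * share for share in feeder_shares]
    return sum(cells) / len(cells)

# Balanced: two high schools per middle school (50%:50%)
balanced = mean_cell_size(25, [0.50, 0.50])                # 12.5 students per cell

# Unbalanced: four high schools per middle school (64%:12%:12%:12%)
unbalanced = mean_cell_size(25, [0.64, 0.12, 0.12, 0.12])  # about 6.25 per cell
```

The counterintuitive finding is that the four-school conditions, despite their smaller average cell size, converged more often, presumably because more cells of the cross-classification table were nonempty.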
The primary finding of this study was that the default values of the information criteria used by SAS PROC MIXED for best model fit should not be used. In particular, the default HQIC and BIC values (notated here as HQIC
Use of the default CAIC value (the CAIC
Correct model identification rates for the remaining two default information criteria (the AICC
Results from Whittaker and Furlow's (2009) study of these information criteria's performance in assessing the fit of conventional multilevel models matched those found here. They also found that pairing N with the BIC and CAIC worked better than using m, and that these two consistent information criteria worked better than the efficient HQIC, AICC, and AIC. In addition, similar to Whittaker and Furlow's results, the differences between using N and m were not very substantial. The same held in the present study: pairing the relevant information criteria with the number of nonempty cross-classified cells, c, worked reasonably well and not substantially differently from using m or N as the N*. Gurka's (2006) results, on the other hand, had supported the use of m over N for conventional multilevel models. It is known that m plays a more important part than N in power for multilevel designs (e.g., Raudenbush & Liu, [
Two primary factors were found to influence functioning of the modified information criteria including the number of units per classification factor, m, and the value and pattern of the intercept ρ
Performance of the χ
Estimation of CCREMs works better in scenarios where more variability is attributable to the classification factors and in scenarios where there are more cross-classifications. Note, however, that the present study only assessed performance of FIML estimation. Future research could explore this pattern of results when REML estimation is used to estimate the CCREMs. More importantly, use of MCMC estimation was not assessed in the present study and should be assessed especially for sparse cross-classification conditions. In addition, performance of the deviance information criterion (Spiegelhalter, Best, Carlin, & van der Linde, 2002) used with MCMC estimation should also be assessed.
The present study only looked at a small subset of particularly simple CCREMs, which was designed to provide a starting point for assessing how the information criteria function for CCREM model selection. Although information criteria are more typically used with nonnested models and with models that differ by more than a single parameter, the under- and overparameterized models compared in the present study differed from the correct model by only a single fixed effects parameter. Future research should explore performance of the information criteria in more authentic scenarios entailing more complex patterns in which incorrect models combine zero and nonzero true fixed and random effects. Additional scenarios should explore the use of these information criteria for differentiating between the fit of nonnested models.
Use of information criteria for identifying the better fitting model permits simultaneous assessment of the impact on fit of a set of parameters being added to (or removed from) a model. Despite the perceived lack of consensus about the validity of statistical significance testing associated with multilevel model parameter estimates, applied researchers tend to use the statistical significance results for deciding which parameters to keep in a model. Future research could explore the correspondence between inferences associated with specific parameters estimated in a model with inferences that would be made based on a comparison of models' information criteria.
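One way to see that correspondence: for nested models differing by a single parameter, comparing information criteria is equivalent to a deviance test with a criterion-specific critical value. A sketch under that single-parameter assumption (the helper name is illustrative, and the AICC is omitted because its threshold also depends on the number of parameters already in the model):

```python
import math

def ic_threshold(criterion: str, n_star: float) -> float:
    """Deviance drop needed for one extra parameter to improve the criterion.

    A model with one more parameter has a lower AIC iff the deviance falls
    by more than 2; a lower BIC iff it falls by more than ln(N*); a lower
    CAIC iff more than ln(N*) + 1; a lower HQIC iff more than 2 ln(ln(N*)).
    """
    return {
        "AIC":  2.0,
        "HQIC": 2.0 * math.log(math.log(n_star)),
        "BIC":  math.log(n_star),
        "CAIC": math.log(n_star) + 1.0,
    }[criterion]

# With N* = 1000 level-1 units, BIC demands a deviance drop of about 6.9,
# stricter than the 3.84 chi-square critical value at alpha = .05, whereas
# AIC's fixed threshold of 2 is more lenient than that test.
```

This makes explicit why choosing N* matters: the implied "significance threshold" of the consistent criteria grows with whichever sample size is plugged in.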
The present study also explored information criteria functioning only with two-level CCREMs involving two cross-classified factors and, as mentioned, entailed comparisons of relatively simple models. However, the results seem to provide a useful foundation for future research on this topic. Future research can look at more complex patterns of data structure, models, and differences among the models. Once further extensions to the present study have been accomplished, stronger recommendations can be made about which specific value of N* should be used with each of these information criteria. In the meantime, the results of this study support the recommendation that researchers not use the default information criteria reported when SAS PROC MIXED is used to estimate CCREMs. Instead, researchers should use the N*-modified formulations of the information criteria when choosing among CCREM models.
Last, it should be emphasized that although the example used here entailed an educational context, cross-classified data structures are encountered in many other fields. In medical research, for example, patients may be cross-classified by nurses and doctors (Rasbash & Browne, [
A previous version of this article was presented at the 2010 annual meeting of the American Educational Research Association in Denver, Colorado.
By S. Natasha Beretvas and Daniel L. Murphy
S. Natasha (Tasha) Beretvas is a professor of Quantitative Methods at the University of Texas at Austin and a faculty associate of the Population Research Center and the Meadows Center for Preventing Educational Risk. Her research focuses on evaluation of statistical models in educational and social science research with a focus on extensions to the conventional multilevel model to handle sources of data structure complexities.
Daniel L. Murphy currently serves as a Research Scientist in the Research & Innovation Network at Pearson, where his research program includes the use of growth measures, adaptive testing, and data visualization techniques to inform instructional decisions and interventions.