Sunday, April 08, 2007

Missing data--full information estimation

  • FIML (direct ML) is a direct method in the sense that model parameters and standard errors are estimated directly from the available data. Missing data points are not estimated, or imputed, and are essentially treated as values that were never intended to be sampled.
  • The EM algorithm is an indirect ML method: it provides an ML estimate of the covariance matrix and mean vector that can, in turn, be used as input for further modeling. It is indirect in the sense that the missing data are preprocessed before the ultimate SEM analysis is performed. In many cases, an EM covariance matrix should still produce parameter estimates that are nearly identical to those of FIML. However, standard errors will likely be incorrect because no single value of N is applicable to the entire EM covariance matrix. Some authors suggest bootstrapping to obtain standard error estimates when using an EM covariance matrix.
  • Amos handles missing data with direct ML (full information maximum likelihood), NOT EM. Both direct ML and EM are algorithms for computing ML estimates with missing data. Allison suggests that direct ML is better than EM.
  • When using direct ML (FIML) under MAR conditions, the cause of the missingness must be included in the substantive model. If a measured variable, say y, is related to the missingness but is not included in the ultimate model, then MAR does not hold. Currently, SEM software packages offer no default options for including auxiliary variables in the model, although the analyst can accomplish this with some extra effort.
  • It is quite easy to condition parameter estimates on auxiliary variables when using the EM algorithm. The initial EM analysis can be performed using a superset of variables, and the relevant covariance elements can be extracted for input into an SEM program. In this case, MAR is more likely to be satisfied because the covariance matrix used for the ultimate analysis has already been conditioned on the auxiliary variables. As such, these variables need not appear in the model to satisfy MAR.
  • Given that the mean of a linear composite is simply the sum of the item means that define the composite, the EM mean vector could be used to compute scale-score means.
    If relationships among constructs were of interest, the substantive analysis could be performed using direct ML estimation; individual items could serve as indicators in an SEM,
    or data could be analyzed at the scale-score level after computing scale scores for cases that have complete data.
  • John W. Graham (http://methodology.psu.edu/resources.html) will provide timely updates on missing-data applications and utilities as they evolve, along with step-by-step instructions on MI with NORM
  • Assuming data values are missing at random (MAR), the missing data mechanism is ignorable. With ignorable MAR data, there is no need to model the missing data mechanism as part of the estimation process.
  • In the SEM literature, the approach taken by SEM programs such as Amos is sometimes called full-information maximum likelihood (FIML).
    In keeping with terminology used in other statistical literature on missing data, it seems more natural to refer to FIML simply as ML (maximum likelihood).
  • To apply FIML, the data must be (1) multivariate normal and (2) missing at random (MAR).
    Recent research into the sensitivity of FIML to violations of these two requirements shows that the method is somewhat robust to mild deviations from them.
  • FIML holds special promise for longitudinal studies where missing data are the rule rather than the exception.
  • ML is a theory-based approach that uses all available data: it directly maximizes the likelihood of the observed data to obtain the maximum likelihood estimates of the parameters.
  • Once the model is specified, the user supplies an input matrix of raw data with missing values denoted by a special code. The ML procedure then computes parameter estimates on the basis of all available data, including the incomplete cases. Standard errors are derived from an observed or expected information matrix.
  • The expectation-maximization (EM) algorithm may also be used for maximum likelihood estimation; it is a method for obtaining maximum likelihood estimates in the presence of missing data. Amos also produces ML estimates with missing data, but it does so through direct ML rather than EM.
  • With these new software tools, fitting a model with incomplete data is operationally no more difficult than fitting a model with complete data. Because the programs are so easy to use, there is a widespread misconception that ML relieves the user of the onus of
    thinking about missing data issues, because all necessary adjustments for missing data are performed automatically. Researchers should be aware that these automatic adjustments are satisfactory only to the extent that the data and patterns of missing values satisfy
    the underlying assumptions, in particular the assumption of MAR.
  • Principles for applying ML and likelihood-based procedures to incomplete-data problems
    were first described by Rubin (1976). Little and Rubin (1987) provided an excellent overview of the theory of ML estimation, both with and without missing values.
  • Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques. Mahwah, NJ: Lawrence Erlbaum Associates.
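
A minimal Python sketch of the two routes described in the notes above, for the simplest possible case: a bivariate normal (x, y) in which x is always observed and y is sometimes missing. `casewise_loglik` illustrates the direct ML (FIML) idea, where each case contributes the likelihood of whatever it has observed and nothing is imputed. `em_step` and `em_estimate` illustrate the EM route to an ML mean vector and covariance matrix. The function names, the toy data, the starting values, and the fixed iteration count are all illustrative assumptions; this is not how Amos or any other SEM package is implemented.

```python
import math

def casewise_loglik(data, mu_x, mu_y, var_x, var_y, cov_xy):
    """Observed-data log-likelihood; data is a list of (x, y) with y possibly None."""
    det = var_x * var_y - cov_xy ** 2
    total = 0.0
    for x, y in data:
        dx = x - mu_x
        if y is None:
            # Incomplete case: only the marginal normal density of x contributes.
            total += -0.5 * (math.log(2 * math.pi * var_x) + dx * dx / var_x)
        else:
            # Complete case: full bivariate normal density.
            dy = y - mu_y
            quad = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
            total += -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)
    return total

def em_step(data, mu_x, mu_y, var_x, var_y, cov_xy):
    """One EM iteration for the mean vector and covariance matrix."""
    n = len(data)
    slope = cov_xy / var_x                    # regression of y on x
    resid_var = var_y - cov_xy ** 2 / var_x   # conditional variance of y given x
    sx = sxx = sy = syy = sxy = 0.0
    for x, y in data:
        if y is None:
            # E-step: expected y and y^2 given x under the current parameters.
            ey = mu_y + slope * (x - mu_x)
            eyy = ey * ey + resid_var
        else:
            ey, eyy = y, y * y
        sx += x
        sxx += x * x
        sy += ey
        syy += eyy
        sxy += x * ey
    # M-step: ML moments from the expected sufficient statistics.
    new_mu_x, new_mu_y = sx / n, sy / n
    return (new_mu_x, new_mu_y,
            sxx / n - new_mu_x ** 2,
            syy / n - new_mu_y ** 2,
            sxy / n - new_mu_x * new_mu_y)

def start_values(data):
    """Assumed starting point: available-case means and variances, zero covariance."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data if y is not None]
    mu_x = sum(xs) / len(xs)
    mu_y = sum(ys) / len(ys)
    var_x = sum((x - mu_x) ** 2 for x in xs) / len(xs)
    var_y = sum((y - mu_y) ** 2 for y in ys) / len(ys)
    return (mu_x, mu_y, var_x, var_y, 0.0)

def em_estimate(data, n_iter=500):
    """Run a fixed number of EM iterations (no convergence check in this sketch)."""
    params = start_values(data)
    for _ in range(n_iter):
        params = em_step(data, *params)
    return params

# Toy data (made up for illustration): y is missing for the two largest x values.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, None), (6.0, None)]
start = start_values(data)
est = em_estimate(data)
print("log-likelihood at start values:", casewise_loglik(data, *start))
print("log-likelihood at EM estimate: ", casewise_loglik(data, *est))
print("EM estimate of the mean of y:  ", est[1])
```

On this toy data the EM estimate of the mean of y comes out near 6.94, whereas the mean of y over the four complete cases is 5.0. The cases with missing y have the largest x values, and ML uses the x-y relationship to account for that. A direct ML routine would instead maximize `casewise_loglik` over the model parameters with a numerical optimizer, which is also where information-matrix standard errors would come from.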
