Introduction

Exploratory factor analysis (EFA) and principal component analysis (PCA) are techniques based mainly on the singular value decomposition of covariance matrices of multivariate data, e.g., a questionnaire or several measurements on a subject. In multivariate data, it is quite possible that, for some subjects, one or more of the variables are missing. One approach to deal with this is listwise deletion, i.e., removing all subjects with missing values and using only the fully observed part of the data. This leads to a loss of information and, worse, to biased estimates and conclusions when the amount of missing data is large or the data are not missing completely at random (MCAR) (Rubin, 1976). Especially when the amount of missing data is large, one may also face a singularity problem after removing the incomplete subjects: the number of items can exceed the number of fully observed subjects, so the estimated variance-covariance matrix may turn out to be non-positive-definite.

Another approach to estimating the covariance matrix of incomplete data is pairwise deletion, i.e., using all completely observed pairs. It suffers from the same drawbacks as listwise deletion. Full information maximum likelihood (FIML) is another established approach for incomplete data. FIML tries to use the information from all subjects as fully as possible by allowing subject-specific parameter dimensions (Enders & Bandalos, 2001). The singularity problem discussed earlier can cause difficulties for FIML as well. Since EFA is usually used in early stages of data collection, the available sample size is often small. As McNeish (2016) mentioned, according to Russell (2002), 39% of studies use a sample size of less than 100, and for 23% of them it is between 100 and 200. Adding missing data to such small samples confronts methods like listwise deletion, pairwise deletion, and FIML with difficulties, as the simulation study in this paper also shows. One can also use the EM algorithm (Dempster, Laird, & Rubin, 1977) to iteratively estimate the covariance matrix of the incomplete data.

Wold and Lyttkens (1969) proposed the nonlinear iterative partial least squares (NIPALS) procedure, which uses an alternating weighted least-squares algorithm and estimates the principal components one by one, sequentially. Gabriel and Zamir (1979) extended NIPALS to estimate the desired subspace directly rather than sequentially. Iterative principal component analysis (IPCA; Kiers, 1997) estimates the missing values and the principal components simultaneously. The main difference between IPCA and NIPALS is that NIPALS estimates the principal components regardless of the missing values, while IPCA also produces a single imputation for the missing values. However, a main problem with IPCA is overfitting: while it gives a small fitting error, its prediction is poor. This problem becomes more serious as the amount of missing data grows. Josse et al. (2009) and Ilin and Raiko (2010) proposed a regularized version of IPCA to overcome this problem. Here, a regularization parameter controls the overfitting: smaller values produce results similar to IPCA and larger values regularize more. However, the performance of this method depends on properly tuning this parameter.

Recently, authors like Josse, Husson, and Pagés (2011), Dray and Josse (2015), Lorenzo-Seva and Van Ginkel (2016), and McNeish (2016) have considered multiple imputation (MI) in the sense of Rubin (Rubin, 2004; Schafer, 1997; Carpenter & Kenward, 2012) to deal with the missing data problem in PCA and EFA. Rubin’s multiple imputation first imputes the data using, for example, a joint (Schafer, 1997) or conditional (Van Buuren, 2007) model, then in the second step performs the usual analysis on each completed (imputed) data set. The third and last step uses appropriate combination rules to combine the results from each imputed data set. An appropriate combination rule needs to respect the fact that the imputed data are, after all, unobserved. Thus, it needs to take into account that each missing value is replaced with several plausible values. The focus of the current paper is on this last approach, where the multiple imputation is done prior to the desired analysis, e.g., PCA or EFA.

In this paper, the above problem will be described and difficulties of using MI in case of factor analysis and PCA will be discussed. Possible solutions will be reviewed and an alternative simple solution will be presented. An extensive simulation study will evaluate the proposed method. Also, considering the “Divorce in Flanders” (Mortelmans et al., 2012) dataset as a case study, the application of the proposed method will be illustrated. The paper ends with concluding notes.

Using multiple imputation and exploratory factor analysis: a review

Consider p correlated random variables \(\mathbf {X_{i}^{\prime }}=\{X_{i1},\ldots ,X_{ip}\}\) with observed covariance matrix Cov(X), of which the population value is Σ. In general, the idea of principal component analysis is to find as few (uncorrelated) linear combinations of the elements of X as possible such that their variance becomes as large as possible (needed), subject to proper standardization. One can show (Johnson & Wichern, 1992) that if λ1 ≥ λ2 ≥ … ≥ λp ≥ 0 are the ordered eigenvalues of Σ, and (λi, ei) is the ith eigenvalue–eigenvector pair, then the ith principal component is given by:

$$ Y_{i}= {e_{i}}^{\prime} \mathbf{X},\; \text{Var}(Y_{i})=\lambda_{i}, \; \text{Cov}(Y_{i},Y_{k})= 0,\; i \neq k. $$
(1)

As Eq. 1 shows, PCA (or, from a wider perspective, exploratory factor analysis) is obtained via the singular value decomposition of the covariance (correlation) matrix. The eigenvalues of Σ are the roots of the characteristic polynomial |Σ − λI|, where |A| denotes the determinant of matrix A and I is the identity matrix of the same order as Σ. Obviously, there is no natural ordering among the roots of a polynomial. However, since λi in Eq. 1 represents a variance, it makes sense to order the eigenvalues in descending order. Problems arise when multiple imputation is used prior to exploratory factor analysis or PCA.
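
The mechanics can be sketched in a few lines. The covariance matrix below is purely illustrative; the point is the descending eigenvalue ordering and the uncorrelatedness in Eq. 1.

```python
import numpy as np

# A small hypothetical 3-variable covariance matrix (illustrative values only).
sigma = np.array([[4.0, 1.2, 0.6],
                  [1.2, 3.0, 0.8],
                  [0.6, 0.8, 2.0]])

# For symmetric matrices, eigh returns eigenvalues in ascending order,
# so we reverse to obtain the descending order that PCA requires.
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Var(Y_i) = lambda_i, and the loadings e_i define Y_i = e_i' X.
assert np.all(np.diff(eigvals) <= 0)        # lambda_1 >= ... >= lambda_p
# Components are uncorrelated: e_i' Sigma e_k = 0 for i != k.
assert abs(eigvecs[:, 0] @ sigma @ eigvecs[:, 1]) < 1e-10
```

The sum of the eigenvalues equals the trace of the covariance matrix, a fact used later when comparing eigenvalue sums across imputations.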

Consider \(X_{ij}^{(m)}\), the observed value for the ith subject (i = 1, …, N) on the jth variable (j = 1, …, p) in the mth imputed dataset. The eigenvector corresponding to the largest eigenvalue of Σ(m) = Cov(X(m)) gives the structure related to the first latent factor, but there is no guarantee that the eigenvector corresponding to the largest eigenvalue of Σ(k) = Cov(X(k)) (k ≠ m) is comparable with the one from the mth imputed dataset. In other words, averaging the eigenvectors (principal axes, factor loadings) using the order of the eigenvalues of the covariance matrix estimated from each imputed set is likely to lead to misleading or meaningless results.
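
A toy example makes the ordering problem concrete. The two diagonal matrices below stand in for covariance estimates from two imputed datasets whose leading eigenvalues are nearly tied; the "first" eigenvector then points along a different variable in each.

```python
import numpy as np

# Two hypothetical covariance estimates from different imputed datasets.
# Their two leading eigenvalues are nearly tied, so sampling noise can
# flip which latent direction comes out "first".
sigma_m = np.diag([2.00, 2.10, 0.50])   # imputation m: variable 2 leads
sigma_k = np.diag([2.10, 2.00, 0.50])   # imputation k: variable 1 leads

def leading_axis(sigma):
    vals, vecs = np.linalg.eigh(sigma)
    return vecs[:, np.argmax(vals)]     # eigenvector of the largest eigenvalue

e_m, e_k = leading_axis(sigma_m), leading_axis(sigma_k)
# The "first" loadings are orthogonal, so naively averaging them
# across imputations would be meaningless.
assert abs(float(e_m @ e_k)) < 1e-10
```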

Another difficulty of using MI prior to EFA is determining the number of factors. While it is necessary to determine a common number of factors across the imputed datasets, there is no guarantee that methods for determining the number of factors will propose the same decision for each and every imputed dataset.

To overcome these problems, Dray and Josse (2015) averaged the imputed values to obtain one single complete dataset. Others, like Josse et al. (2011) and Lorenzo-Seva and Van Ginkel (2016), proposed to first impute the data and then perform the PCA or factor analysis on each imputed dataset separately. After obtaining the eigenvectors (factor loadings), because of the ordering problem discussed above, one needs an intermediate step before the usual averaging. This step uses the class of generalized Procrustes rotations (Ten Berge, 1977) to make these matrices as similar as possible. After rotating the obtained factor loadings simultaneously, the next step is the usual averaging. McNeish (2016) simply generated one set of imputed data to avoid the consequences we have discussed.

One intermediate solution in place of averaging the imputed values (Dray & Josse, 2015) or averaging the factor loadings (Lorenzo-Seva & Van Ginkel, 2016) could be to estimate the covariance matrix from imputed sets of data using Rubin’s rules first, and then apply the PCA or exploratory factor analysis on this combined covariance matrix. This proposal will be discussed in the next section.

Using multiple imputation with factor analysis: a proposal

Consider \(X^{(obs)}\), a dataset with missing values, and X(1), …, X(M) the M imputed datasets with estimated covariance matrices \(\widehat {{\Sigma }^{(1)}},\ldots ,\widehat {{\Sigma }^{(M)}}\). Using Rubin’s rules (Rubin, 2004), the multiple-imputation estimate of Σ can be obtained as follows:

$$ \widetilde{\Sigma}= \frac{1}{M}\sum_{i = 1}^{M} \widehat{{\Sigma}^{(i)}}. $$
(2)

Having \(\widetilde {\Sigma }\), one may perform PCA or EFA directly on it; the problems of factor ordering and of determining the number of factors across imputations then vanish. Of course, this does not come for free: estimating the covariance matrix first and then performing EFA makes the direct use of Rubin’s combination rules for precision purposes impossible. Therefore, one may consider indirect or alternative solutions; we will return to this point in more detail. Note that this is not an issue if no precision estimates are needed, e.g., when principal components are merely calculated for descriptive purposes or, in general, when there is no need to make inferences about them or to select a sufficiently small subset of principal components.
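
The proposal of Eq. 2 can be sketched as follows. The "imputed datasets" here are only perturbed copies of one simulated dataset, standing in for output of an imputation routine such as mice; everything else is the pooled-covariance idea itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for M = 5 imputed datasets: perturbed copies of a
# common 4-variable dataset (a real analysis would take these from MI).
base = rng.standard_normal((100, 4))
imputed = [base + 0.05 * rng.standard_normal(base.shape) for _ in range(5)]

# Eq. 2: average the M covariance estimates ...
sigma_tilde = np.mean([np.cov(x, rowvar=False) for x in imputed], axis=0)

# ... then do a single eigendecomposition on the pooled matrix, so factor
# ordering and the number of factors are decided once, not per imputation.
vals, vecs = np.linalg.eigh(sigma_tilde)
vals, vecs = vals[::-1], vecs[:, ::-1]
assert np.all(vals > 0) and np.all(np.diff(vals) <= 0)
```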

As mentioned earlier, an important aspect of PCA or EFA is determining the number of factors or principal axes. While this is an important step, it is quite possible that different imputed datasets suggest different decisions; this problem is also reported in McNeish (2016). Although our proposed approach does not suffer from it, it remains an important problem to address when combining MI and EFA.

One popular criterion to determine the number of factors/PCs is the proportion of explained variance. The proportion of explained variance based on the first k factors, γ k , is:

$$ \gamma_{k} = \frac{\sum_{j = 1}^{k} \lambda_{j}}{\sum_{j = 1}^{p} \lambda_{j}}, $$
(3)

with λj as in Eq. 1. When using MI, one needs to ensure that the correct amount of information present in the data is used; this also holds when estimating γk. This can be done by constructing a confidence interval for γk using an estimate and variance obtained by properly taking the imputation process into account.

In general, even for exploratory factor analysis, it can be more informative to use an interval estimate rather than only a point estimate to decide on, for example, the number of factors, especially since the nice properties of such point estimates (unbiasedness, consistency, etc.) are only asymptotic. As Larsen and Warne (2010) also mentioned, another reason to use confidence intervals in exploratory factor analysis is the American Psychological Association’s emphasis on reporting CIs in its publication manual (APA, 2010). Larsen and Warne (2010) proposed CIs for the eigenvalues; here we extend their work and derive confidence intervals for the proportion of explained variance. These can be used to determine a common number of factors across imputed datasets while properly taking the imputation process into account. The idea is general, however, and can also be used for complete data or with other methods for analyzing incomplete data.

Consider Λ = (λ1, …, λ p ) and Δ a diagonal matrix with λ1 ≥ … ≥ λ p as its diagonal elements. For large samples, we have (Anderson, 1963; Johnson & Wichern, 1992; Larsen & Warne, 2010):

$$ \widehat{\Lambda}\sim N_{p} \left( \Lambda, \frac{2}{N}\Delta^{2}\right). $$
(4)

Consider \(\left (\widehat {\Lambda }_{i}, \text {cov}(\widehat {\Lambda }_{i})\right )\), the estimated eigenvalues and their covariance matrix from the ith imputed dataset (i = 1, …, M). Using Rubin’s rules (Rubin, 2004), the combined estimate of the eigenvalues and its covariance matrix \(\left (\widetilde {\Lambda }, \text {cov}(\widetilde {\Lambda })\right )\) can be obtained as follows:

$$ \left\{ \begin{array}{l} \widetilde{\Lambda}=\frac{1}{M}\sum_{i = 1}^{M} \widehat{\Lambda}_{i},\\ \text{cov}(\widetilde{\Lambda})= \frac{1}{M}\sum_{i = 1}^{M} \text{cov}(\widehat{\Lambda}_{i}) +\left( \frac{M + 1}{M}\right) \widehat{B},\\ \widehat{B}=\frac{1}{M-1}\sum_{i = 1}^{M} (\widehat{\Lambda}_{i}-\widetilde{\Lambda})(\widehat{\Lambda}_{i}-\widetilde{\Lambda})'. \end{array} \right. $$
(5)
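
Eq. 5 is the standard Rubin pooling, applied to eigenvalue vectors. A minimal sketch, with hypothetical function and argument names:

```python
import numpy as np

def pool_eigenvalues(lambda_hats, cov_hats):
    """Rubin's rules (Eq. 5) for M per-imputation eigenvalue vectors.

    lambda_hats: (M, p) array of estimated eigenvalues, one row per imputation.
    cov_hats:    (M, p, p) array of their within-imputation covariance matrices.
    """
    lambda_hats = np.asarray(lambda_hats)
    M = lambda_hats.shape[0]
    lambda_tilde = lambda_hats.mean(axis=0)            # pooled estimate
    within = np.mean(cov_hats, axis=0)                 # within-imputation part
    dev = lambda_hats - lambda_tilde
    between = dev.T @ dev / (M - 1)                    # B-hat
    total = within + (M + 1) / M * between             # total covariance
    return lambda_tilde, total
```

The (M + 1)/M factor on the between-imputation component is the usual finite-M correction in Rubin's variance formula.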

For the proportion of explained variance, consider the pair \((\sum _{j = 1}^{k} \widetilde {\lambda _{j}},\sum _{j = 1}^{p} \widetilde {\lambda _{j}})\). Using (4) and (5) leads to:

$$ \text{cov} \left( \begin{array}{c} \sum_{j = 1}^{k} \widetilde{\lambda_{j}} \\ \sum_{j = 1}^{p} \widetilde{\lambda_{j}} \end{array} \right) = \left( \begin{array}{cc} \sigma_{11} & \sigma_{12}\\ & \sigma_{22} \end{array} \right), $$
(6)

with:

$$ \left\{ \begin{array}{l} \sigma_{11}=A_{1} \text{cov}(\widetilde{\Lambda}) A_{1}^{\prime},\\ \sigma_{22}=A \text{cov}(\widetilde{\Lambda}) A^{\prime},\\ \sigma_{12}=A_{1} \text{cov}(\widetilde{\Lambda}) A^{\prime}, \end{array} \right. $$
(7)

where A is an all-ones row vector of size p and A1 is a row vector of size p with its first k elements equal to one and the remaining p − k elements equal to zero.

Now, equipped with an estimate of \((\sum _{j = 1}^{k} \widetilde {\lambda _{j}},\sum _{j = 1}^{p} \widetilde {\lambda _{j}})\) and its covariance matrix, one needs to construct a confidence interval for their ratio, which can be used to determine a common number of factors across imputations. For the ratio of two possibly correlated random variables, one can use Fieller’s theorem (Fieller, 1954). Fieller’s confidence interval for (3) can be calculated as follows:

$${C_{1}^{2}} =\frac{\sigma_{11}}{\left( \sum_{j = 1}^{k} \widetilde{\lambda_{j}}\right)^{2}},\;{C_{2}^{2}} =\frac{\sigma_{22}}{\left( \sum_{j = 1}^{p} \widetilde{\lambda_{j}}\right)^{2}},\;r= \frac{\sigma_{12}}{\sqrt{\sigma_{11} \sigma_{22}}}, $$
$$A= {C_{1}^{2}} + {C_{2}^{2}} - 2r C_{1}C_{2},\;B= z_{\alpha/2}^{2} {C_{1}^{2}} {C_{2}^{2}} (1-r^{2}), $$
$$\begin{array}{@{}rcl@{}} L&=& \widetilde{\gamma_{k}} \frac{1-z_{\alpha/2}^{2} r C_{1}C_{2} - z_{\alpha/2}\sqrt{A-B}}{1-z_{\alpha/2}^{2}{C_{2}^{2}}}, \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} U&=& \widetilde{\gamma_{k}} \frac{1-z_{\alpha/2}^{2} r C_{1}C_{2} + z_{\alpha/2}\sqrt{A-B}}{1-z_{\alpha/2}^{2}{C_{2}^{2}}} . \end{array} $$
(9)

Note that the confidence limits in Eqs. 8 and 9 are general and one can replace \(\widetilde {\Lambda }\) and \(\text {cov}(\widetilde {\Lambda })\) with an estimate and its covariance obtained from any other method, and all of the equations are still valid.
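
A direct transcription of Eqs. 8 and 9 is short enough to sketch. The function name and the example numbers below are hypothetical; the inputs are the numerator and denominator sums and their (co)variances from Eqs. 6–7, with \(C_1^2 = \sigma_{11}/(\sum^k \widetilde{\lambda}_j)^2\), \(C_2^2 = \sigma_{22}/(\sum^p \widetilde{\lambda}_j)^2\), and \(r = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}\).

```python
import math

def fieller_ci(num, den, s11, s22, s12, z=1.959964):
    """Fieller CI (Eqs. 8-9) for the ratio gamma_k = num / den, where
    (num, den) = (sum of first k pooled eigenvalues, sum of all p), and
    s11, s22, s12 are their variances and covariance as in Eqs. 6-7.
    z defaults to the 97.5% normal quantile (alpha = 0.05)."""
    g = num / den
    c1sq = s11 / num**2
    c2sq = s22 / den**2
    r = s12 / math.sqrt(s11 * s22)
    c1, c2 = math.sqrt(c1sq), math.sqrt(c2sq)
    a = c1sq + c2sq - 2 * r * c1 * c2
    b = z**2 * c1sq * c2sq * (1 - r**2)
    denom = 1 - z**2 * c2sq
    lower = g * (1 - z**2 * r * c1 * c2 - z * math.sqrt(a - b)) / denom
    upper = g * (1 - z**2 * r * c1 * c2 + z * math.sqrt(a - b)) / denom
    return lower, upper
```

For small relative variances the interval is tight around num/den, as expected from the asymptotics behind Eq. 4.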

The results in Eq. 4 are only valid for large samples. For small samples or where the normality assumption is violated, we propose a bootstrap confidence interval (Efron & Tibshirani, 1994). Based on Shao and Sitter (1996), we propose the following procedure to construct a bootstrap confidence interval for the proportion of explained variance for incomplete data using multiple imputation:

  1. Take a bootstrap sample (a sample with replacement of size N, the same size as the original data) from the incomplete data.

  2. Impute this incomplete sample only once using a predictive model.

  3. Estimate the covariance matrix of the imputed data, perform EFA, and compute the proportion of explained variance.

  4. Repeat steps 1–3, e.g., 1000 times.

  5. The 100(1 − α)% confidence interval follows from the α/2 and 1 − α/2 quantiles of the bootstrapped proportions of explained variance.
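
The steps above can be sketched as follows. To keep the sketch self-contained, the single imputation of step 2 is replaced by per-column mean imputation; the paper itself uses a proper predictive model (e.g., PMM via mice), and all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def explained_variance(data, k):
    vals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
    return vals[:k].sum() / vals.sum()

def single_impute(data):
    # Stand-in for "impute once using a predictive model" (step 2):
    # per-column mean imputation, purely to keep the sketch runnable.
    out = data.copy()
    col_means = np.nanmean(out, axis=0)
    idx = np.where(np.isnan(out))
    out[idx] = np.take(col_means, idx[1])
    return out

def bootstrap_ci(incomplete, k, n_boot=200, alpha=0.05):
    n = incomplete.shape[0]
    stats = []
    for _ in range(n_boot):
        sample = incomplete[rng.integers(0, n, size=n)]    # step 1
        completed = single_impute(sample)                  # step 2
        stats.append(explained_variance(completed, k))     # step 3
    # steps 4-5: repeat and take the alpha/2 and 1 - alpha/2 quantiles
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

Note that, following Shao and Sitter (1996), the resampling happens *before* the imputation, so the imputation uncertainty propagates into the interval.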

While the main use of the confidence intervals for the proportion of explained variance is to select the number of factors, they can be used for other purposes as well. It is well known (Harville, 1997) that \(\text {tr}(\sum _{m = 1}^{M} {\Sigma }_{m}) = \sum _{m = 1}^{M} \text {tr}({\Sigma }_{m})\), where tr denotes the trace of a matrix. Since the trace equals \(\sum _{j = 1}^{p} \lambda _{j}\), where λj (j = 1, …, p) are the eigenvalues, estimating \(\widetilde {\Sigma }\) using (2) leaves the sum of all eigenvalues unchanged. Unfortunately, however, Fan (1949) showed that for matrices A, B, and C = A + B, with eigenvalues αi, βi, and δi in descending order, respectively, for any k, 1 ≤ k ≤ p, we have:

$$ \sum\limits_{i = 1}^{k} \delta_{i} \leq \sum\limits_{i = 1}^{k} \alpha_{i} + \sum\limits_{i = 1}^{k} \beta_{i}. $$
(10)

This means that the proportion of explained variance obtained from the eigenvalues of \(\widetilde {\Sigma }\) in Eq. 2 is never larger than \(\widetilde {\gamma _{k}}=\sum _{j = 1}^{k} \widetilde {\lambda _{j}}/\sum _{j = 1}^{p} \widetilde {\lambda _{j}}\). If the proportion of explained variance obtained from \(\widetilde {\Sigma }\) falls far outside the computed CI, one needs to use our proposed method cautiously.
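
Fan's inequality (Eq. 10) is easy to check numerically; the random positive-semidefinite matrices below simply stand in for per-imputation covariance estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

def top_k_sum(mat, k):
    return np.sort(np.linalg.eigvalsh(mat))[::-1][:k].sum()

# Random symmetric PSD matrices standing in for per-imputation covariances.
X = rng.standard_normal((30, 5)); A = X.T @ X
Y = rng.standard_normal((30, 5)); B = Y.T @ Y

# Fan (1949), Eq. 10: the top-k eigenvalue sum is subadditive, which is
# why the pooled covariance can explain less variance with k factors.
for k in range(1, 6):
    assert top_k_sum(A + B, k) <= top_k_sum(A, k) + top_k_sum(B, k) + 1e-8
```

For k = p the two sides are equal (the trace identity above), so only the numerator of γk can shrink, never the denominator.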

This phenomenon indicates that, with \(\widetilde {\Sigma }\), one may either explain a much smaller proportion of variance with the same number of factors compared with averaging the factor loadings directly, or, for the given k, that the sets of selected factors differ across imputations, i.e., regardless of order, different sets of factors are selected in some of the imputed datasets. In either case, we suggest trying other approaches as a sensitivity analysis.

Note that the CI computed here is for the average of the per-imputation \(\widehat {\gamma }_{k}\)’s and not for \(\widetilde {\gamma }_{k}\). Therefore, using such CIs to determine the number of factors with our proposed approach is useful for studying validity, etc., but it is not necessary. It does become necessary when users want to pool the factor loadings directly, because then determining a common number of factors is an important issue.

The method proposed in this section, together with the Fieller and bootstrap confidence intervals, is implemented in the R package mifa (mifa, 2017); the package documentation gives explanations and examples of its use. One can use this package to impute the incomplete data using the fully conditional specification approach (Van Buuren, 2007), to estimate the covariance matrix from the imputed data, and to construct Fieller and bootstrap CIs for the proportion of explained variance for given numbers of factors. Information on how to install it is presented in the Appendix.

Simulations

To evaluate the proposed methodology, an extensive simulation was performed. The simulation consists of three main steps: (1) generating the data, (2) creating missing values, and (3) analyzing the data. These three steps were replicated 1000 times. Let us briefly go over each step.

To generate the data, a covariance matrix was first generated by solving the inverse eigenvalue problem, i.e., for a given set of eigenvalues, a covariance matrix was generated using the eigeninv package in R. The eigenvalue vector used for this simulation is:

$$\begin{array}{@{}rcl@{}} \Lambda=(50.0, 48.0, 45.0, 25.0, 20.0, 10.0, 5.0,\\ 5.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.1, 0.1). \end{array} $$
(11)

After generating the covariance matrix, its Cholesky decomposition was used to produce a set of correlated items. The sample size was set to N = 100 and the number of items to p = 15.
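
This generation step can be sketched as follows. The paper solves the inverse eigenvalue problem with the eigeninv R package; here we use one simple alternative construction (rotating the target spectrum by a random orthogonal matrix), which also yields a covariance matrix with exactly the eigenvalues of Eq. 11.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target eigenvalues from Eq. 11 (the simulation's spectrum).
target = np.array([50, 48, 45, 25, 20, 10, 5, 5, 1, 1,
                   0.5, 0.5, 0.5, 0.1, 0.1])

# One way to solve the inverse eigenvalue problem: Sigma = Q diag(target) Q'
# for a random orthogonal Q (the paper uses the eigeninv package instead).
Q, _ = np.linalg.qr(rng.standard_normal((15, 15)))
sigma = Q @ np.diag(target) @ Q.T

# The Cholesky factor turns iid standard normals into correlated items.
L = np.linalg.cholesky(sigma)
data = rng.standard_normal((100, 15)) @ L.T     # N = 100, p = 15

assert np.allclose(np.sort(np.linalg.eigvalsh(sigma)), np.sort(target))
```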

For each simulated dataset, two missing-data mechanisms from the extreme categories were applied: non-monotone missing completely at random (MCAR) and monotone missing not at random (MNAR). Consider Xij, the ith observation on the jth item. To create a non-monotone MCAR mechanism, Xij is set as missing if a random number generated from Uniform(0, 1) is smaller than a predefined pmiss,MCAR. For a small amount of missing data, pmiss,MCAR is set to 0.05; for a large amount, to 0.3. For monotone MNAR, pmiss,MNAR is computed as exp(α + βXij)/(1 + exp(α + βXij)) with β = 1, and α = −10 for a large amount of missing data or α = −13 for a small amount. If a random number generated from Uniform(0, 1) is smaller than pmiss,MNAR, then Xlj is set as missing for l = i, …, N. The average Pearson correlation (over 15 items and 1000 replications) between the data prior to deletion and the missing-data indicator is 0.041 with 95% confidence interval (−0.155, 0.234) for the small amount of missing data, and 0.033 (−0.162, 0.226) for the large amount. Note that a correlation between a metric and a categorical variable is typically somewhat lower; our intuition stems from Pearson correlations among continuous variables. Table 1 summarizes the proportion of the observed part of the sample over the 1000 replications.
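
The two deletion mechanisms can be sketched as below. The input data and function names are hypothetical; the logic follows the description above (cell-wise MCAR deletion, and value-dependent MNAR deletion that, once triggered in item j at row i, deletes rows i, …, N of that item).

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal((100, 15)) * 5      # stand-in for simulated items

# Non-monotone MCAR: each cell is deleted with a fixed probability p_miss.
def make_mcar(x, p_miss):
    out = x.copy()
    out[rng.uniform(size=x.shape) < p_miss] = np.nan
    return out

# Monotone MNAR: the deletion probability depends on the value itself via
# a logistic model; once X_ij triggers, X_lj is deleted for l = i, ..., N.
def make_mnar(x, alpha, beta=1.0):
    out = x.copy()
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    trigger = rng.uniform(size=x.shape) < p
    for j in range(x.shape[1]):
        hits = np.flatnonzero(trigger[:, j])
        if hits.size:
            out[hits[0]:, j] = np.nan
    return out

small = make_mcar(data, 0.05)
assert 0 < np.isnan(small).mean() < 0.15       # roughly 5% of cells missing
```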

Table 1 Summary (mean, median, and standard deviation (SD) over 1000 replications) of the proportion of the observed part of generated samples for different scenarios

The covariance matrix of incomplete data was estimated using the following methods:

  1. Complete: the complete data, before creating the missing values. This serves as the reference for evaluating the other methods.

  2. Listwise deletion: ignores every row with missing values and estimates the covariance matrix from the remaining rows. This can be done by setting use='complete.obs' in the base R function cov.

  3. Pairwise deletion: uses all completely observed pairs to estimate the covariance matrix. This can be done by setting use='pairwise.complete.obs' in the base R function cov.

  4. EM: uses the EM algorithm to compute the covariance matrix, via the function em.norm in the R package norm.

  5. FIML: uses full information maximum likelihood to compute the covariance matrix of the incomplete data, via the function corFiml in the R package psych.

  6. IPCA-reg: uses regularized iterative principal component analysis to impute the data and estimate the covariance matrix, via the imputePCA function in the R package missMDA.

  7. MI: our proposal; the imputation model is based on fully conditional specification (Van Buuren, 2007) as implemented in the R package mice, in the function of the same name.

Note that the maximum number of iterations for both the EM algorithm and regularized IPCA was set to 1000, and the regularization parameter in IPCA-reg was set to 1 (the default). For MI, ten imputations were used when the amount of missing data was small, and 30 when it was large. The imputation was done using predictive mean matching (PMM) (Little, 1988; Vink et al., 2014), with five iterations per imputation. Furthermore, 500 sub-samples were used to construct the bootstrap CIs.

To summarize the results, we considered three main aspects: (1) the number of times each method actually led to a positive-definite covariance estimate; (2) for the cases in which a positive-definite covariance was estimated, how it compares with the covariance estimated from the complete data; and (3) the proportion of explained variance in comparison with the one obtained from the complete data.

Table 2 shows the proportion of times each method led to a positive-definite (PD) covariance matrix. Note that in case of, e.g., singularity, methods like listwise deletion, pairwise deletion, and FIML lead to a non-PD covariance matrix.

Table 2 Proportion of times (out of 1000 replications) each method led to a positive definite covariance estimate

To compare the estimates obtained using each method with the one from the complete data, we use a measure based on the Mahalanobis distance (Mahalanobis, 1936), d(Scomp., Smiss.), defined as follows:

$$ d(S_{comp.},S_{miss.})=\sqrt{\delta^{T} \left\{\text{Var}[\text{vech}(S_{comp.})]\right\}^{-1}\delta}, $$
(12)

where δ = vech(Smiss. − Scomp.) and Var[vech(Scomp.)] = \(\frac{2}{N-1} H (S_{comp.} \otimes S_{comp.}) H^{\prime}\), with H the elimination matrix, ⊗ the Kronecker product, and vech the half-vectorized version of the covariance matrix, i.e., the diagonal and upper (or lower) triangular elements. Figure 1 shows this measure for covariance matrices estimated using the different methods, compared with the complete data. To see the effect of sample size, for small amounts of missing values with the MNAR mechanism the simulation was repeated for n = 100, 1000, and 5000. Figure 2 shows (12) for covariance matrices estimated using the different methods, compared with the complete data, for the replications in which such a covariance could be estimated. As one may see, for large samples most of the methods behave similarly, but for small samples, which are frequent in practice, selecting an appropriate method is important.
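
The distance of Eq. 12 can be sketched as follows. The function names are hypothetical, and we read the variance term as (2/(N − 1)) H(Scomp. ⊗ Scomp.)H′; if the intended scaling differs, only the overall scale of d changes, not the ranking of methods.

```python
import numpy as np

def vech(a):
    """Half-vectorization: stack the lower triangle (incl. diagonal)."""
    i, j = np.tril_indices(a.shape[0])
    return a[i, j]

def elimination_matrix(p):
    """H such that vech(A) = H @ vec(A) for a symmetric p x p matrix A."""
    i, j = np.tril_indices(p)
    H = np.zeros((i.size, p * p))
    H[np.arange(i.size), i * p + j] = 1.0
    return H

def cov_distance(s_comp, s_miss, n):
    """Mahalanobis-type distance of Eq. 12 between two covariance estimates.
    Scaling assumption: Var[vech(S_comp)] = (2 / (n - 1)) H (S (x) S) H'."""
    p = s_comp.shape[0]
    H = elimination_matrix(p)
    var = 2.0 / (n - 1) * H @ np.kron(s_comp, s_comp) @ H.T
    delta = vech(s_miss - s_comp)
    return float(np.sqrt(delta @ np.linalg.solve(var, delta)))
```

The distance is zero when the two estimates coincide and grows with any discrepancy, weighted by the sampling variability of the complete-data estimate.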

Fig. 1
figure 1

Boxplots of computed distance in Eq. 12 between the estimated covariances using complete data and different methods

Fig. 2
figure 2

Boxplots of the computed distance in Eq. 12 between the covariances estimated using the complete data and the different methods, for a small amount of MNAR missing data and n = 100, 1000, 5000

Furthermore, to compare the proportion of explained variance from the complete data with that from the other methods, their difference \(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\) is taken as the measure. The boxplots in Figs. 3, 4, 5 and 6 show this difference for covariances estimated from incomplete versus complete data, for the different methods and scenarios, and for all possible values of k (k = 1, …, 15).

Fig. 3
figure 3

Boxplots of the difference in the estimated proportion of explained variance between the different methods and the complete data (\(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\)) for the MCAR-low scenario

Fig. 4
figure 4

Boxplots of the difference in the estimated proportion of explained variance between the different methods and the complete data (\(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\)) for the MNAR-low scenario

Fig. 5
figure 5

Boxplots of the difference in the estimated proportion of explained variance between the different methods and the complete data (\(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\)) for the MCAR-high scenario. The methods not presented could not be computed in any of the 1000 replications

Fig. 6
figure 6

Boxplots of the difference in the estimated proportion of explained variance between the different methods and the complete data (\(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\)) for the MNAR-high scenario

In addition, the Fieller and bootstrap confidence intervals were computed for the MI method. Tables 3 and 4 show the average (over 1000 replications) of the estimated \(\widehat {\gamma }_{k}\) and its confidence interval for k = 1, …, 10.

Table 3 Small amount of missing data: the proportion of explained variance using averaged covariance, its confidence interval using Fieller’s method and bootstrap for \(k = 1,\dots ,10\) and the coverage (the proportion of times each CI includes the true proportion of explained variance)
Table 4 Large amount of missing data: the proportion of explained variance using averaged covariance, its confidence interval using Fieller’s method and bootstrap for \(k = 1,\dots ,10\) and the coverage (the proportion of times each CI includes the true proportion of explained variance)

As one may see in Table 2, methods like listwise deletion, pairwise deletion, and FIML can fail to estimate a positive-definite covariance matrix, and the rate of this failure increases with the amount of missing values. The EM algorithm, regularized IPCA, and MI always find an estimate, though for a larger amount of missing data this takes more time, i.e., more iterations or a larger number of imputations.

Figure 1 shows that when the missing data are MCAR and their amount is small, almost all of the methods provide acceptable results, though listwise and pairwise deletion are not as good. In general, the EM algorithm, regularized IPCA, and MI provide comparable results, though, as is observable in Fig. 1, MI is always at least as good as its competitors.

Looking at Figs. 3, 4, 5 and 6, when it comes to estimating the proportion of explained variance, again for the MCAR-small scenario all methods perform acceptably well, especially for k = 5 and k = 6, which are the desirable numbers of factors. However, when the mechanism becomes MNAR, or the amount of missing data increases, some bias relative to the γk estimated from the complete data can be observed. In that sense, MI has an overall better performance. Note that one could obtain better results from regularized IPCA by tuning the regularization parameter; here we used the default value of the imputePCA function in the R package missMDA. That is another reason to prefer MI over regularized IPCA in this case, since no tuning is needed for MI.

To compare our proposal with applying MI directly to the factor loadings, Fig. 7 shows the same boxplots for the proportion of explained variance obtained directly from the imputed data. When the amount of missingness is small, the results from both approaches are comparable, while for a large amount of missing data the results of directly applying MI are generally closer to the ones obtained from the complete data. Note, however, that obtaining the corresponding eigenvectors (factor loadings) depends on using the right order among them, a requirement from which our approach does not suffer.

Fig. 7
figure 7

Boxplots of difference of the estimated proportion of explained variance using MI directly with complete data (\(\widehat {\gamma }_{k_{miss.}}-\widehat {\gamma }_{k_{comp.}}\)) for different scenarios

Looking at Tables 3 and 4, except for k = 10 in the MCAR-large scenario, where the estimated γk is slightly outside the bootstrap CI, this quantity always lies within the estimated confidence interval. This suggests that the selected sets of factors are comparable across imputations and that our proposed method provides valid results.

It is also useful to see, out of 1000 replications, how many times the Fieller and bootstrap CIs contained the γk estimated from the complete data and the one obtained from \(\widetilde {\Sigma }\). Tables 5, 6, 7 and 8 show this information for our four scenarios. For a small amount of missing data, in both the MCAR and MNAR scenarios, the coverage of Fieller’s CI for the γk’s obtained from both the complete data and \(\widetilde {\Sigma }\) is more than 95%. The coverage becomes smaller when the amount of missing data is large, though when k is near 5 or 6 we have almost complete coverage for MCAR and at least 80% coverage for MNAR. In all four scenarios, the Fieller CIs perform better than the bootstrap CIs, which shows that even with N = 100 Fieller’s CI performs well. One may use a larger number of sub-samples to obtain better bootstrap results. Note that, although the Fieller CIs perform better, the bootstrap CIs also perform acceptably (sometimes even better than Fieller’s method; see, e.g., Table 8) when constructed for a reasonable number of factors (see Tables 5–8). Also, when the normality assumption does not hold, Fieller’s method faces difficulties; in such cases, having alternatives like the bootstrap is useful.

Table 5 MCAR mechanism–small amount of missing data: the proportion of times the estimated proportion of explained variance falls within Fieller and bootstrap confidence intervals
Table 6 MCAR mechanism–large amount of missing data: the proportion of times the estimated proportion of explained variance falls within Fieller and bootstrap confidence intervals
Table 7 MNAR mechanism–small amount of missing data: the proportion of times the estimated proportion of explained variance falls within Fieller and bootstrap confidence intervals
Table 8 MNAR mechanism–large amount of missing data: the proportion of times the estimated proportion of explained variance falls within Fieller and bootstrap confidence intervals

Divorce in Flanders

In order to illustrate the proposed methodology, we use the Divorce in Flanders (DiF) dataset (Mortelmans et al., 2012). DiF contains a sample of marriages registered in the Flemish Region of Belgium between 1971 and 2008, with an oversampling of dissolved marriages (2/3 dissolved and 1/3 intact marriages). As part of this study, the participants were asked to complete the validated Dutch version (Denissen et al., 2008) of the Big Five Inventory (BFI) (John & Srivastava, 1999). The validity of the BFI for the DiF data is investigated in Lovik et al. (2017).

The sample at hand consists of 9385 persons in 4472 families. In each family, the mother, father, stepmother, stepfather, and one child over 14 were asked to fill in the BFI. Note that, depending on the family roles present and the number of family members who agreed to participate, family size could vary between 1 and 5. Among these 9385 persons, 1218 had at least one non-response (out of 44 items). As our main purpose here is to illustrate the proposed method, and in order to avoid the problem of intra-family correlation, one person from each family was selected at random to form a sample of uncorrelated subjects. This yielded a random sample of size 4472, of which 515 subjects had at least one non-response.
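The subsampling step, drawing one subject uniformly at random per family, can be sketched as follows. This Python illustration uses hypothetical toy identifiers (the DiF data themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: subject indices with (hypothetical) family ids; several subjects
# may belong to the same family.
family_id = np.array([1, 1, 2, 2, 2, 3, 4, 4])
subjects = np.arange(len(family_id))

# Select one subject uniformly at random within each family.
chosen = [rng.choice(subjects[family_id == f]) for f in np.unique(family_id)]
```

The retained subjects are then treated as independent, which justifies the usual EFA covariance estimation on the subsample.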

This incomplete dataset was imputed using the fully conditional specification (FCS) of Van Buuren (2007), as implemented in the MICE package in R (Buuren & Groothuis-Oudshoorn, 2011), with the predictive mean matching (PMM) method. The imputation was done M = 25 times. The covariance matrix was estimated for each imputed dataset, and the exploratory factor analysis was performed on the averaged estimated covariance matrix as well as on each imputed set; the latter was used to construct a confidence interval for the proportion of explained variance. This can be done using the function mifa.cov in the R package mifa, available at (mifa, 2017), as mifa.cov(data.sub, n.factor = 5, M = 25, maxit.mi = 10, method.mi = 'pmm', alpha = 0.05, rep.boot = 1000, ci = TRUE), where data.sub is the selected incomplete sub-sample.
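The pipeline behind this call — impute M times, estimate a covariance matrix per imputation, average, then factor-analyze the single pooled matrix — can be sketched in Python. This is only a schematic stand-in: it replaces MICE's PMM with a crude random hot-deck imputation, whereas the actual analysis uses the R packages MICE and mifa:

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_once(X):
    """Crude stochastic stand-in for PMM: fill each missing cell with a
    randomly drawn observed value from the same column."""
    Xi = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(Xi[:, j])
        Xi[miss, j] = rng.choice(Xi[~miss, j], size=miss.sum())
    return Xi

# Toy incomplete data: 200 subjects, 8 items, roughly 10% missing cells.
X = rng.normal(size=(200, 8))
X[rng.random(X.shape) < 0.10] = np.nan

M = 25
covs = [np.cov(impute_once(X), rowvar=False) for _ in range(M)]
sigma_tilde = np.mean(covs, axis=0)   # averaged covariance across imputations

# EFA/PCA is then performed once on sigma_tilde; e.g., the proportion of
# variance explained by the first five components:
evals = np.sort(np.linalg.eigvalsh(sigma_tilde))[::-1]
gamma_5 = evals[:5].sum() / evals.sum()
```

The per-imputation covariance matrices (covs above) play the role of the per-imputation analyses used to build the confidence interval for the proportion of explained variance.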

The estimated factor loadings are presented in Table 9. Fieller's confidence interval for the proportion of explained variance using five factors is (0.428, 0.439), and the bootstrap CI is (0.429, 0.441). The estimated proportion of explained variance of the first five factors using the proposed methodology is 0.434, which falls within both intervals. This is consistent with the validity of the proposed methodology for this dataset, in line with Lovik et al. (2017).

Table 9 Factor loadings using oblimin rotation of DiF data using the estimated covariance matrix from multiply imputed data using M = 25 imputations

Conclusions

Nonresponse and missing values are at once major and common problems in data analysis, especially when it comes to survey data. Multiple imputation, which was first introduced to deal with nonresponse in surveys (Rubin, 2004), has become a key and effective tool for dealing with this problem. While MI is already a very commonly used approach for handling missing data in the medical sciences, its use in psychology is increasing as well. As Lorenzo-Seva and Van Ginkel (2016) mentioned, a Google search for the terms psychology "multiple imputation" produced about 131,000 hits; repeating the search now yields about 171,000. This shows the growing use of MI in psychology and psychometrics, and hence the necessity of developing frameworks for using MI in conjunction with methods commonly used in psychological research. However, when it comes to combining this methodology with techniques like exploratory factor analysis and principal component analysis, combining the results from different sets of imputed data becomes an issue, due to the problems of determining a common number of factors/principal components and then ordering them.

This problem is addressed in this article and a pragmatic solution is proposed, justified by theoretical discussion and reasoning. Our proposal is to first estimate a single covariance matrix of the incomplete variables by averaging across imputations, and then perform the EFA/PCA on this single matrix. The theoretical aspects of this methodology are studied. As an extension of the work of Larsen and Warne (2010), confidence intervals are proposed for the proportion of explained variance, which can be used to determine the common number of factors across imputations. Such confidence intervals can also be useful for assessing the validity of the proposed method.

The simulation results show that the proposed method performs comparably to alternative methodologies. To evaluate our proposal in a real situation, it was applied to an incomplete BFI dataset, with satisfactory results. The main advantages of the proposed methodology are that it is compatible with any imputation method, and that it is straightforward to implement with no extra programming effort, so it can be used within any desired statistical package or software. It is fast and practical. Also, the proposed confidence intervals for the proportion of explained variance can be used to determine the number of factors.

The ideas proposed in this article are also implemented in the R package mifa, available at mifa (2017), which makes them readily available for use in practice.

Other MI-based solutions for exploratory factor analysis of incomplete data come with their own pros and cons. Dray and Josse (2015) pool the imputed datasets themselves, whereas pooling should be done at the level of parameters; our approach solves this issue by working at the parameter level. McNeish (2016) considers only M = 1 imputed dataset. This does not comply with the main goal of multiple imputation, which is to account for the uncertainty introduced by replacing unobserved values with predictions based on the observed values, by replacing each missing value with several plausible candidates. With our proposal, it is straightforward to use M > 1 imputed datasets. Finally, the approach of Lorenzo-Seva and Van Ginkel (2016) does not suffer from any of the issues discussed above, but it is difficult to implement in practice: to the best of our knowledge, other than the stand-alone software FACTOR (Lorenzo-Seva & Ferrando, 2006), no publicly available implementation of this approach exists. Ideally, the results of this article and the R implementation of our proposal will encourage both research on and use of MI for EFA of incomplete data.