Measurement Invariance of Personal Well-Being Index (PWI-8) Across 26 Countries

This report examines the measurement invariance of the Personal Well-being Index with 8 items (PWI-8). University students (N = 5731) from 26 countries completed the measure in either paper-and-pencil or electronic mode. We examined the unidimensional structure of the PWI and performed a multi-group CFA to assess measurement invariance across the 26 countries, using the conventional approach and the alignment procedure. The findings provide evidence of configural and partial metric invariance, as well as partial scalar invariance across samples. The findings suggest that the PWI-8 can be used to examine correlates of life satisfaction across all included countries; however, raw scores cannot be compared across countries.


Introduction
Quality of life has become an important, well-researched topic over the last few years. More specifically, it has been examined in terms of well-being, which is often assessed at national levels for international comparisons. These international comparisons, however, require measures which have been shown to be invariant across different cultural groups and countries. The objective of this study is to examine the measurement invariance of just such a measure, the Personal Well-being Index (PWI), which is considered one of the most popular measures for evaluating subjective well-being (International Wellbeing Group 2013; Sirgy 2012).
Life satisfaction is the cognitive component of subjective well-being and has a general character (Diener 1984). According to Diener et al. (1985), life satisfaction is the effect of a judgmental process in which "a comparison of one's circumstances with what is thought to be appropriate standard" (p. 71) is made. Therefore, it refers to some standards of evaluation, which could be related to different life domains. The PWI decomposes life satisfaction into satisfaction with different domains (Cummins et al. 2003), namely: (1) standard of living, (2) personal health, (3) life achievements, (4) personal relationships, (5) personal safety, (6) community connectedness, (7) future security, and (8) religion and spirituality. The PWI has been used as an assessment of life satisfaction in child (Casas et al. 2012), adolescent and student (Tomyn et al. 2011), aging (Bricker-Katz et al. 2009; Forjaz et al. 2011), and clinical populations (e.g. Engel and Cummins 2011; Werner 2012). The scale is intended to be inclusive of all important life domains which could contribute to the general level of life satisfaction and to serve as a tool in cross-cultural comparisons of the relative importance of particular domains in life satisfaction (International Wellbeing Group 2013). The idea behind developing the PWI was to include the most important predictors of general life satisfaction. The domains were selected by an international team on the basis of several criteria: the selection should include only basic domains important for predicting "life satisfaction as a whole"; each domain should refer to broad aspects of life; and each domain needs to represent an indicator and not a causal variable of general life satisfaction (see International Wellbeing Group 2013).

Measurement Invariance as Means of Cross-Cultural Inquiry
Oishi (2010) pointed towards several important methodological and conceptual issues related to cross-cultural studies on subjective well-being: conceptual equivalence, translation issues, desirability of the concept, response style, item functioning, differences in self-presentations, memory bias, and validity criteria. For instance, single-item measures of subjective well-being like Cantril's ladder or general items on life satisfaction are less reliable than longer scales and do not allow for more in-depth examination of cross-cultural equivalence in terms of measurement equivalence (see Oishi 2010). As the PWI is a multi-item scale, it is particularly useful in cross-cultural research. Based on the work of an international group of well-being researchers, it uses items that are simple and easy to translate, which minimizes problems with conceptual equivalence and translation (see International Wellbeing Group 2006, 2013). Multi-group Confirmatory Factor Analysis (MGCFA) serves as a statistical tool for assessing the cross-cultural equivalence of a measure. Such analysis is fundamental for establishing the usefulness of any measure intended for cross-cultural research. Three levels of measurement invariance are most commonly used to establish whether a measure is equivalent: (a) Configural invariance indicates that the general factor structure of the measure is the same across different groups. At this level, the construct is measured similarly in different samples. (b) Metric invariance indicates that the factor loadings of items are similar (i.e., items load in the same way on the assumed factor) across groups. At this level, correlates and/or predictors of the measure may be compared across samples. (c) Scalar invariance indicates that item intercepts are equal across groups. At this level, means may be compared across samples (Davidov et al. 2014).
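The three levels can be written compactly for a one-factor model such as the one assumed for the PWI-8. The notation below is a conventional sketch and is not taken from the source: j indexes the eight items, g the country, τ the item intercept, λ the factor loading, η the latent well-being factor with mean κ, and ε the residual.

```latex
% One-factor measurement model for item j in group g (notation assumed)
x_{jg} = \tau_{jg} + \lambda_{jg}\,\eta_{g} + \varepsilon_{jg},
\qquad j = 1,\dots,8,\quad g = 1,\dots,26

% Configural invariance: the same single-factor pattern holds in every group;
% \lambda_{jg} and \tau_{jg} are estimated freely in each group.

% Metric invariance: equal loadings across groups
\lambda_{jg} = \lambda_{j} \quad \text{for all } g

% Scalar invariance: equal loadings and equal intercepts across groups
\lambda_{jg} = \lambda_{j}, \qquad \tau_{jg} = \tau_{j} \quad \text{for all } g

% Implied item means: only under scalar invariance do differences in
% observed means reflect differences in the latent means \kappa_g alone
E(x_{jg}) = \tau_{jg} + \lambda_{jg}\,\kappa_{g}
```

The last line shows why raw mean comparisons require scalar invariance: if τ differs between groups, observed means can differ even when the latent means are identical.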
Scalar invariance is rarely found in large cross-cultural comparisons (see Davidov et al. 2014 (Tucker et al. 2006). Most studies typically focus on comparisons between several national groups, rarely examining large representation of countries (see also Ponizovsky et al. 2013).

The Current Study
Despite the many cultural adaptations of this measure (see International Wellbeing Group 2013, for details) and its increasing popularity among cross-cultural researchers, to the best of our knowledge, there is little evidence that the PWI is invariant across different countries, with the exceptions of adolescent samples in Chile and Brazil (Sarriera et al. 2014) or general populations in Hong Kong and Australia (Lau et al. 2005). Sometimes the levels of PWI are compared without examination of measurement invariance, as in the case of Romania and Hungary (Baltatescu 2014). This report intends to fill this gap by examining the measurement invariance of the PWI in university student samples across 26 countries.
In the current study, we examine the measurement invariance of the PWI across countries from different regions of the world: Europe (10), Asia (10), Africa (2), and Latin America (4). Among them, there are the most affluent and developed countries, like the UK or Japan, and less affluent, agrarian societies like Iran or Kenya. In terms of cultural regions, we had representatives of all of Huntington's (1996) cultural groups (i.e., Western, Orthodox, Confucian, Japanese, Latin American, Hindu, Buddhist, Islamic, African, and Sinic), and in terms of religion we had countries representing all main world religions. This selection of countries is not exhaustive, but it allows for examining measurement invariance of the PWI across different languages and cultures. The aim of the study was to investigate the measurement invariance of the PWI across different countries and languages. Given the large number of countries compared, we expected to find support only for the metric level of invariance, as scalar invariance is rarely found in large cross-national comparisons.

Sample and Procedure
Data were collected in paper-and-pencil or online format between April 2014 and August 2015. The sample comprised 5,530 university students (42.4 % men; age M = 21.29, SD = 3.15, range 16 to 39). We excluded all participants above the age of 40 (1.7 % of total sample) from the analyses, as in most countries the respondents' age was in the 18-25 range, and rarely exceeded 30 years. We also asked students to indicate the socioeconomic status of their families on a 7-point Likert-type scale (from 1 = significantly below average to 7 = significantly over average). The students majored in different fields (e.g., social sciences, technical sciences, and medical sciences) and originated from 26 countries (see Table 1 for a sample breakdown). They were recruited for the study during their classes and participated on a voluntary basis. They completed the PWI as part of a broader project on entitlement and subjective well-being. The paper-and-pencil surveys were administered in small groups (n < 15). In countries where the survey was administered in a non-native language, a researcher assisted students and explained the meaning of particular words. This was the case in India, Iran, Kenya, Nepal, and South Africa; however, only in Iran and Nepal is English not an official language.

Measure
The Personal Well-being Index (PWI; Cummins et al. 2003; International Wellbeing Group 2013) measures satisfaction with different life domains: (1) standard of living, (2) health, (3) life achievements, (4) personal relationships, (5) personal safety, (6) community connectedness, (7) future security, and (8) religion and spirituality. Previous studies suggest that in different countries the relative importance of religiosity and spirituality varies as a function of cultural differences (Norris and Inglehart 2004) and that they both significantly contribute to subjective well-being (Casas et al. 2009; Piedmont and Friedman 2012). Therefore, we used one combined question about religiosity and spirituality (How much are you satisfied with your spirituality or religion?), as suggested by the manual for the PWI-8, although some researchers postulate two parallel versions of item 8 (Sarriera et al. 2014). Participants responded on an 11-point Likert-type scale (0 = not at all satisfied to 10 = totally satisfied). National versions of the scale were either authorised versions or were obtained by repeated back-translation procedures with bilingual researchers and with the participation of Robert Cummins (see Table 1 for information on language of administration).

Statistical Analyses
We started by using Mplus 7.4 to perform Confirmatory Factor Analysis (CFA) in order to test for a unidimensional structure of the PWI-8 in each country sample. Because the score distributions were not perfectly normal and Mardia's multivariate skewness and kurtosis statistics (presented in Supplementary Information, Table SI.3) were significant in all samples, we used the robust Satorra-Bentler χ² (Satorra and Bentler 1994; referred to as estimator MLM in Mplus). Because the country samples differed in gender distribution, we used weighting in all analyses to equalize the contribution of male and female respondents within each country to the model (the weights were calculated to achieve a target N = 100 for males and females in each group). The model fit was examined using the most common fit indices: the Chi square (χ²), the CFI (Comparative Fit Index), the RMSEA (root mean squared error of approximation), and the SRMR (standardized root mean squared residual). In larger samples (N > 200), practical fit indices (CFI, RMSEA, and SRMR) are preferred to the χ² as they are less sensitive to sample size (Chen 2007; Davidov et al. 2014). CFI values above .90 were considered as evidence of an acceptable model fit and those above .95 as evidence of a good fit. Because in smaller samples RMSEA tends to over-reject correct models (Hu and Bentler 1999), we used RMSEA values of .10 and .06 as thresholds for acceptable and good fit, respectively. For SRMR, .08 and .05 thresholds were used (Brown 2015). In samples where the fit of the one-factor theoretical model was outside the acceptable range and a pronounced and interpretable outlier was found among the modification indices, suggesting an error covariance, the latter was added and the model was retested. We aimed to introduce as few modifications as possible in order to achieve acceptable fit without over-complicating the model.
After establishing the measurement model for each country, we proceeded by conducting multi-group CFA (MGCFA) to test for configural, metric, and scalar measurement invariance. We tested the invariance based on modified measurement models using a conventional approach (Byrne 2012). We used ΔCFI and ΔRMSEA values of .010 and .015, respectively, as evidence of pronounced difference between nested models (Chen 2007; Cheung and Rensvold 2002). We looked for outliers among the modification indices and introduced them into the model one-by-one, until the difference in practical fit indices between the configural invariance and partial metric invariance models became small enough (ΔCFI ≤ .01, ΔRMSEA ≤ .015). The procedure was repeated for scalar invariance.
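The cut-offs and the gender weighting described above can be sketched in a few lines of Python. This is an illustration only, not the authors' Mplus procedure: the function names are hypothetical, treating the RMSEA and SRMR thresholds as inclusive is an assumption, and the weights are assumed to be simple ratios of the target N to the observed group size.

```python
def classify_fit(cfi, rmsea, srmr):
    """Label each practical fit index as 'good', 'acceptable', or 'poor'.

    Cut-offs follow the text: CFI above .95/.90, RMSEA .06/.10, SRMR .05/.08.
    Inclusive boundaries for RMSEA and SRMR are an assumption.
    """
    def higher_is_better(value, acceptable, good):
        if value > good:
            return "good"
        if value > acceptable:
            return "acceptable"
        return "poor"

    def lower_is_better(value, acceptable, good):
        if value <= good:
            return "good"
        if value <= acceptable:
            return "acceptable"
        return "poor"

    return {
        "CFI": higher_is_better(cfi, acceptable=0.90, good=0.95),
        "RMSEA": lower_is_better(rmsea, acceptable=0.10, good=0.06),
        "SRMR": lower_is_better(srmr, acceptable=0.08, good=0.05),
    }


def gender_weights(n_male, n_female, target=100):
    """Case weights that give each gender an effective N of `target`.

    Assumed form: weight = target / observed count within the country.
    """
    if n_male <= 0 or n_female <= 0:
        raise ValueError("both gender groups must be non-empty")
    return {"male": target / n_male, "female": target / n_female}


if __name__ == "__main__":
    # Hypothetical fit values for one country sample
    print(classify_fit(cfi=0.93, rmsea=0.08, srmr=0.04))
    # Hypothetical sample with 50 men and 200 women
    print(gender_weights(n_male=50, n_female=200))
```

The sketch reports each index separately because the text interprets the indices in combination without stating a formal combination rule.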
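The decision rule for comparing nested invariance models can be expressed as a small helper. This is a minimal sketch under the stated cut-offs (.010 for the change in CFI, .015 for the change in RMSEA); the function name and the returned dictionary layout are hypothetical, and the changes in fit are passed in directly rather than computed from model output.

```python
def compare_nested_models(delta_cfi, delta_rmsea,
                          cfi_cutoff=0.010, rmsea_cutoff=0.015):
    """Check whether a more constrained model fits about as well as the
    less constrained one, using absolute changes in CFI and RMSEA."""
    cfi_ok = abs(delta_cfi) <= cfi_cutoff
    rmsea_ok = abs(delta_rmsea) <= rmsea_cutoff
    return {
        "cfi_ok": cfi_ok,
        "rmsea_ok": rmsea_ok,
        # Both criteria must hold for the constraints to be retained
        "invariance_supported": cfi_ok and rmsea_ok,
    }


if __name__ == "__main__":
    # Illustrative values: a CFI change above the cut-off, a small RMSEA change
    print(compare_nested_models(delta_cfi=0.018, delta_rmsea=0.003))
```

When the check fails, the conventional approach described above frees the constraint with the largest modification index and repeats the comparison.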
A potential drawback of the manual approach is that each modification results in a different model, and the exact resulting list of non-invariant parameters is dependent on the sequence in which modifications are entered into the model. In case of a long sequence of modifications, the conventional approach (addressing the strongest modification index at each step) does not guarantee that the resulting model will be optimal (i.e., simplest, with the fewest number of non-invariant parameters). This problem is overcome by the alignment procedure (Asparouhov and Muthén 2014), which evaluates different combinations of non-equivalent parameters to find an optimal model. We tried to cross-validate our findings using the alignment procedure, based on the same modified measurement model. Finally, we tested the invariance of the PWI across genders. Because the sample sizes were not large enough to test the invariance across genders in each country separately, we tested a single-factor model in the combined sample with robust Chi square (MLR) and standard errors computed using the sandwich estimator for clustered samples to account for nonindependence of observations within countries.

Scale Structure Across Countries
The internal consistency (Cronbach's alpha) values of the PWI in each national sample are presented in Table 1. Cronbach's alpha values above .70, indicating good reliability (Lance et al. 2006), were found in all samples. To ensure unidimensionality, we conducted parallel analysis. In all 26 samples, only the first eigenvalue exceeded the one obtained for random data using parallel analysis, and all the items exhibited significant loadings on the single dimension. Table 2 presents the results of single-sample CFA analyses for the initial (theoretical) model. In most countries, the theoretical model showed acceptable fit, based on the combination of practical fit indices. The fit of the model was outside the acceptable range in Spain, Poland, South Korea, Hungary, Romania, Indonesia, and Panama.
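For readers unfamiliar with the reliability coefficient reported in Table 1, Cronbach's alpha can be computed from raw item scores as follows. This is a generic textbook sketch using only the Python standard library, not the software used in the study; the function name and the toy data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    if len(item_scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the summed scale score
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total score has zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Toy data: three respondents answering two items (not from the study)
    toy = [[1, 2], [2, 1], [3, 3]]
    print(round(cronbach_alpha(toy), 3))
```

For the PWI-8, each inner list would hold a respondent's eight item scores on the 0 to 10 scale.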
We explored the modification indices in countries with unacceptable and marginal fit and introduced additional covariances in cases where they were theoretically justified and a strong (Δχ² > 10) outlier was found among modification indices. The error covariance for items four and five (relationships and safety) was found in three Hispanic countries, in line with previous studies (Sarriera et al. 2014). The error covariance of items four and six (relationships and feeling part of community) was peculiar to two post-Communist Central European countries (Poland, Hungary). The other error covariances were explained by back-translation analysis. For instance, the error covariance of items five and seven was found in countries (Poland, Brazil) where local translations used the same word for "safety" and "security". In South Korea, items one, three, and seven, reflecting satisfaction with financial success, were associated. We added two error covariances to address this weak subdimension. The introduction of additional error covariances resulted in acceptable fit in all countries (shown in Table 3).

Measurement Invariance Analyses Across Countries
We proceeded by conducting invariance analyses. The multi-group model included modified measurement models for 10 countries and the theoretical model for the remaining 16 countries. The configural model showed good fit to the data, the fit of the metric model was acceptable, and the fit of the scalar model was poor (see Table 4). The Chi square differences were also significant (p < .001) between the three models. The difference in practical fit indices between the configural and metric models was very small for RMSEA (ΔRMSEA = .003), but above the recommended .01 threshold for the CFI (ΔCFI = .018), so we followed by establishing partial metric invariance. After six constraints for noninvariant loadings (listed in supplementary material) were relaxed, the difference in practical fit indices between the configural invariance and partial metric invariance models became small enough (ΔCFI = .01). We followed by establishing the partial scalar invariance. After relaxing 74 constraints for equal intercepts, the corresponding modification indices became non-significant at the p < .01 level and we stopped the procedure to reduce the risk of false positives. Even though the ΔCFI criterion was not reached (ΔCFI = .024), the ΔRMSEA and ΔSRMR were quite small (< .010), and practical fit indices (Table 4) were within acceptable limits. Because these indices are to be interpreted in combination (Brown 2015), we deemed the resulting partial scalar invariance model acceptable. The complete list of non-invariant parameters obtained from the final partial scalar invariance model is given in Supplementary Information (Table SI.1). The item mean and intercept estimates based on the resulting model are also given in Supplementary Information (Tables SI.4, SI.5).
The number of non-equivalent intercepts ranged from 6 to 12 per item. Some of the intercepts revealed meaningful patterns. For instance, non-equivalence of the intercept of item three (achieving in life) was more often found in Asian, collectivistic cultures. Non-equivalence of the intercept of item six (feeling part of your community) was typically found in Latin American countries, but not in post-Communist ones. The estimates of latent factor means and variances obtained from the model are presented in Table 5. The latent factor means were highly correlated with the observed means (r = .95).
[Table note, single-sample CFA: S-B χ² = Satorra-Bentler Chi square (df = 20); SCF = scaling correction factor for Satorra-Bentler χ²; CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual. *** p < .001; ** p < .01; * p < .05.]
[Table note, multi-group models: The list of error covariances included in the model is given in Table 2. The list of non-invariant loadings and intercepts is given in Table SI.1. SCF = scaling correction factor; CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual. Satorra-Bentler χ², all p < .001.]
To cross-validate the model, we performed the alignment procedure. The results are presented in Supplementary Information (Table SI.2). The alignment procedure has identified a smaller number of non-equivalent parameters, 1 loading and 37 intercepts, suggesting that the conventional approach is more conservative. Thirty-three of these intercepts were also identified as non-equivalent using the conventional approach. The latent factor means estimated using the alignment procedure were highly correlated with the observed means (r = .98) and with those obtained using the conventional approach (r = .96). These data indicate a fairly good convergence of the findings from the two procedures.

Invariance Analysis Across Gender
As our national samples differed in the gender distribution, we considered the possibility that some of the non-invariant parameters we found in the CFA analyses could be explained by gender. To gauge the contribution of gender to the measurement non-invariance, we performed a multi-group CFA for females and males in the combined sample, using country as a cluster variable to account for non-independence of observations within each country. The resulting fit indices and model comparison results are shown in Table 6. All three models showed good fit to the data. The difference between the nested models in terms of practical fit indices was below the thresholds suggested by Chen (2007), indicating that gender does not have any uniform effects on measurement invariance across countries.

Discussion
Our objective was to establish measurement invariance of the PWI-8 across 26 countries. We found that the PWI was unidimensional, with Cronbach's alphas indicating acceptable internal consistency in all countries. MGCFA confirmed that the basic construct structure of the PWI-8 is similar across groups. Although additional covariances between items can improve the fit in some countries, in most countries the fit of the theoretical model was very close to the acceptable range. The same was true for the configural and metric-invariant MGCFA models. Although the difference between these two nested models was significant in terms of Chi square difference, the difference in practical fit indices was small, suggesting that the comparison of effects (e.g., correlations) obtained using the PWI in different languages across countries should not be biased by non-equivalent loadings. However, the poor fit of the scalar invariance model indicates that raw scores cannot be meaningfully compared between countries. On the other hand, the partial scalar invariance model indicated acceptable fit, allowing for a meaningful comparison of latent means across groups. Taken together, these findings suggest that it is possible to examine the predictors and correlates of the subjective well-being phenomenon using the PWI-8 across countries. The disagreement between the practical fit indices (ΔCFI vs. ΔRMSEA and ΔSRMR) can be explained by the fact that these fit indices are associated with the number of parameters in the model in different ways, and existing cut-off criteria (Chen 2007; Cheung and Rensvold 2002) are all based on simulations for two groups, where the number of parameters is much smaller. We could not find any simulation studies investigating optimal cut-off points for practical fit indices with a large number of groups. Therefore, simulation studies investigating the effects of the number of groups on fit indices are necessary.
The results of analysis using an alignment approach based on the theoretical model suggest that despite the presence of some weak non-invariance of loadings and pronounced non-invariance of item intercepts in some countries, the observed mean scores are very similar to unbiased latent factor scores. Although removal of bias can slightly change the rank ordering of countries, these effects were only pronounced for a few countries (Kenya, Japan, Israel, and South Africa).
We also developed modified measurement models by introducing theoretically interpretable and strongly significant (p < .001) modification indices for some countries. However, in most cases these modification indices accounted for translation artefacts, which can be removed by improving certain translations of the instrument into other languages. Interestingly, in all countries where the English version was used instead of local languages (i.e., India, Iran, Kenya, Nepal, and South Africa), there was no need to introduce modifications and the unidimensional theoretical model was well fitted to the data.
The attempts to develop partial scalar invariance models using the older, manual approach and alignment approach resulted in different, although largely overlapping, sets of non-invariant parameters. Items referring to more objective realities, such as one's living standard and health, turned out to be more invariant, compared to the items referring to more subjective phenomena, such as one's spirituality or future security.
Finally, gender does not seem to contribute to non-invariance of loadings and intercepts in any uniform manner across countries. However, because our samples did not allow for the evaluation of gender invariance in each country separately, this analysis does not rule out the possibility of country-specific non-invariance associated with gender.

Limitations and Recommendation for Future Studies
The current report has several limitations: the use of student samples, the lack of several important cultures and countries (such as Chinese or American), and the overrepresentation of European countries. For practical reasons we used both online and paper-and-pencil surveys, and in some countries the questionnaire was distributed in English, rather than in a native language, which may have led to increased measurement error. Also, the student samples were not representative of their respective countries, which precludes us from interpreting the substantial differences in the mean score estimates. Finally, although we decided in favour of using a combined item for measuring satisfaction with religion and spirituality, these constructs are not interchangeable (Piedmont and Friedman 2012). As this solution happened to work well both in CFA and MGCFA, it could be used in cross-cultural comparisons; however, for further exploration of the importance of religion and spirituality as separate factors in shaping overall life satisfaction, two separate items should be used (see Casas et al. 2009; Sarriera et al. 2014).
In terms of specific recommendations, PWI researchers could use this tool in all of the countries included in the current study as an indicator of general life satisfaction, measured by a combination of domain-specific items. The small number of non-invariant loadings that we found suggests that satisfaction in these domains contributes more or less equally to overall life satisfaction. However, we have found some consistently repeating cultural differences, which could be further explored. For instance, in the South Korean sample we have found evidence in favour of an additional factor representing concern about financial success, suggesting that in this population life satisfaction might be somewhat affected by materialism. Differences in intercepts for some PWI items suggest that these items may have specific meaning in certain cultural contexts. For instance, a lower intercept for item three, reflecting satisfaction with achievement in life, was typically found in collectivistic countries, indicating that individuals in such countries are somewhat less likely to admit satisfaction with their individual achievements. As collectivistic countries are typically "face-saving" cultures (Bond 1991), life achievement could mean that individuals just fit into their social environment, contrary to individualistic countries, where life achievements would mean developing unique characteristics. As PWI statements are typically very general (as they are aimed to represent broad life domains; see International Wellbeing Group 2013), the cultural meaning of these broad statements could be affected by cultural context. Our study provides some suggestions of where this may be the case (e.g., life achievement, feeling part of community).

Conclusion
The current report provided information about the possibility of cross-cultural research among university students based on PWI-8 scores, providing evidence of partial metric invariance, which allows the cross-country comparison of effects but not of group or individual raw scores. We also compared the results of different approaches to establishing unbiased factor means across countries. This provides valuable information for the further development of subjective well-being research in different cultural contexts. As the main goal of the International Wellbeing Group is to explore the importance of satisfaction with particular domains in shaping overall life satisfaction, our findings indicate that this research goal could be realised successfully in cross-cultural research.