Selection bias was reduced by recontacting nonparticipants

Objective: One of the main goals of health examination surveys (HES) is to provide unbiased estimates of health indicators at the population level. We demonstrate how multiple imputation methods may help to reduce the selection bias if partial data on some non-participants are collected. Study Design and Setting: In the FINRISK 2007 study, a populationbased health study conducted in Finland, a random sample of 10 000 men and women aged 25–74 were invited to participate. The study included a questionnaire data collection and a health examination. 6255 individuals participated in the study. Out of 3745 non-participants, 473 returned a simplified questionnaire after a recontact. Both the participants and the nonparticipants were followed up for death and hospitalizations. The follow-up data allowed to check the assumptions on the missing data mechanism and tailored multiple imputation methods were used to handle the missing data. Results: Non-participation is a strong predictor for mortality in the fiveyear follow-up. However, the recontact response does not predict mortality or morbidity among the non-participants when adjusted for age and sex. The M AN US CR IP T AC CE PT ED ACCEPTED MANUSCRIPT 2 result suggests that the recontact respondents can be used as proxy for all non-participants. A comparison of raw estimates and estimates adjusted for selection bias reveals clear differences in the estimated population prevalences of smoking and heavy alcohol usage. Conclusion: All efforts to collect data on non-participants are likely to be useful even if the response rate for the recontact remains low. Statistical analysis of the recontact respondents provides an indication of the extent of the selection bias, even in studies where follow-up data are not available to check the assumptions.


Introduction
Health examination surveys (HES) provide objective information about the health and health behaviors of the general population. Such data facilitate evidence-based policy decision and they can be used to plan and evaluate health promotion activities and for research.
Selective non-participation poses a major threat to the population representativeness of HESs. Especially when the aim is to monitor information about the target population, the representativeness of the results is essential [1,2]. An estimate of a health indicator may suffer from selection bias if the decision of participating HES depends on a variable associated with the health indicator. The lower the participation rate, the more serious the problem might be [3].
The data on the participants alone offers no means to evaluate the impact of the non-participation. Sampling frames often contain information M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT 3 on the age, sex and area of residence also for non-participants. Earlier studies have reported that young men and persons living in big cities are overrepresented among the non-participants [4][5][6][7]. Studies where the HES data have been linked with data from administrative registries covering also the non-participants reveal that there typically are significant differences between the participants and the non-participants. It has been found that nonparticipants have lower education, more often live on social welfare and are more often unemployed than participants [8]. Non-participants are reported to use more alcohol than participants [9,10] and to have higher smoking and alcohol related mortality [11]. Non-participants of health surveys also more often use out-patient health care and have higher hospitalization rates than participants [4,6,7,12,13], have more psychotropic prescriptions [14], and have a higher mortality rate during the follow-up [15][16][17].
The sampling frame and administrative registries are not the only potential sources of data on the non-participants. The non-participants can be contacted again and asked to provide answers to a limited set of questions, so called non-response questionnaire. This kind of recontact data collection should not be mixed with reminders which are sent to make the person participate in the actual health examination. In the situation we consider, the sample can be divided into three non-overlapping participation groups: 1. the participants who both took part in the physical measurements and returned the questionnaire, 2. the non-participants of health examinations who after the recontact returned the non-response questionnaire and 3. the non-participants of health examinations who after the recontact did not return the non-response questionnaire.
Our objective is to provide unbiased estimators for health indicators, especially for the prevalence of daily smoking and heavy alcohol usage. A priori all three groups differ from each other and assumptions on the missing data mechanism are needed for unbiased estimation. We apply graphical models [18][19][20] to describe the assumptions on the missing data mechanism. The validity of the assumptions is evaluated using follow-up data on mortality and morbidity. The rationale is that any major differences in risk factors between the groups should reflect as a difference in total and cause specific mortality and morbidity. The missing data are handled with a multiple imputation (MI) method where the imputation model is built according to the assumed missing data mechanism. M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT 4 2 Data and methods

Data
The FINRISK 2007 study is a cross-sectional population-based HES including a self-administered questionnaire, physical measurements such as the blood pressure and anthropometric measurements, and the collection of biological samples. The study was conducted in five areas: the Provinces of North Karelia and Kuopio in Eastern Finland, Turku-Loimaa region in Southwestern Finland, cities of Helsinki and Vantaa in Southern Finland and Oulu province in Northern Finland. A random sample of 10 000 men and women aged 25-74 years was selected from the National Population Register. The sampling was stratified by age group (10 year intervals), sex and geographical region.
The data available for everyone selected for the sample included background variables Z (age, sex and geographical region) as well as follow-up data on mortality and morbidity. To obtain mortality and morbidity data T , the whole FINRISK 2007 sample (both participants and non-participants) was linked to the National Causes of Death Register and the National Hospital Discharge Register using the personal identification code. Causes of death and reasons for the hospitalization, classified using ICD-10, were obtained until the end of 2012. For the analysis, we have used the grouping of the causes of death and hospitalizations presented in the Appendix A.
Each selected person received a letter of invitation together with the survey questionnaire. The invetees were instructed to fill in the questionnaire at home and return it during the health examination. The participation rate for the health examination was 63% and three persons explicitly refused any further contact. The same questionnaire was sent again to those who did not participate in the health examination after a reminder and did not explicitly refuse. The recipients were asked to return the filled questionnaire by mail. The response rate for the recontact was 13%. Selection variable M 1 has value 1 if the person participated in the health examination and 0 otherwise. Selection variable M 2 has value 1 if a non-participant returned the filled questionnaire and 0 otherwise.
The questionnaire data X are available for participation groups 1 and 2. The questionnaire included a large number of questions on lifestyle and health behavior. Questions on alcohol usage and smoking were used to derive indicators (binary variables) for heavy alcohol usage and daily smoking which are the main health indicators of interest in our analysis. Auxiliary variables used in the analysis include self-reported marital status, level of education (high, middle, low), self-reported hypertension, recency of blood pressure The data on physical measurements Y are available only for the participants. These include measurements on body-mass index, systolic and diastolic blood pressure and total cholesterol.
The health indicators of interest include: • the prevalence of heavy alcohol use (self-reported questionnaire data), • the prevalence of daily smoking (self-reported questionnaire data), • the prevalence of obesity (measured body-mass index ≥ 30 kg/m 2 ), • the prevalence of high blood pressure (measured systolic blood pressure ≥ 140 mmHg), and • the prevalence of elevated total cholesterol (measured total cholesterol ≥ 5 mmol/l).

Causal model
Causal model with design [18] for the FINRISK 2007 study is presented in Figure 1. The graphical model describes the assumed causal relationships together with the study design and the hypothesized missing data mechanism. The invited cohort {i : m 1i = 1} is selected in random from the population strata defined by background variables Z i . The decision of participation M 1i depends on background variables Z i as well as variables X i and Y i to be measured in the questionnaire and in the health examination, respectively. Non-participants {i : M 1i = 0} make another decision M 2i whether or not to provide questionnaire data after the recontact. Physical measurements Y * are available for the participants {i : M 1i = 1}. Measurements X * on the questionnaire variables are available for the participants {i : M 1i = 1} (participation group 1) and for the non-participants of the health examination who after recontact returned the non-response questionnaire {i : M 1i = 0, M 2i = 1} (participation group 2). The questionnaire variables include self-reported health indicators as well as auxiliary variables. Measurements Z * on the background variables and the dates T * describe the effect of the survey variables on the decision responding in the recontact data collection. We study empirically whether these edges are necessary in the graph. If M 2i is independent on X i and Y i given background variables Z i , the edges can be removed. This would imply p(X i | M 2i = 1, M 1i = 0, Z i ) = p(X i | M 2i = 0, M 1i = 0, Z i ), which means that in the estimation of the averages of questionnaire variables, the participation group 2 can be used to represent all non-participants when conditioned on the background variables.
Assumption B concerns the impact of the physical health indicators on the decision participating in the health examination. In the graph, this impact is shown by the dash-dotted edge Y i → M 1i marked with 'B'. If it is assumed that this edge does not exist, i.e. the HES participation does not depend on the physical health indicators given the background variables and the questionnaire variables, the conditional distribution p(Y i | X i , Z i ) can be estimated from the data on the participants. This conditional distribution can then be used together with the distribution p(X i , Z i | M 1i = 0) estimated from the data on the recontact respondents to obtain an estimate for the marginal distribution of Y i for the non-participants. Differently from the assumption A, the validity of assumption B cannot be studied empirically with our current data.
Causal models with design for some alternative scenarios are presented in Appendix B.

Statistical methods for assumption checking
To check the validity of assumption A, we fit a statistical model that explains the risk of death and hospitalization by selection variables M 1 and M 2 . Background variables Z are included as covariates. If the regression coefficients for M 1 and M 2 differ from zero and the differences are statistically and practically significant, we conclude that the risks differ between the participation groups. It follows that the distribution of some risk factors must also differ between the participation groups. In the opposite case, we conclude that the risk factors do not differ between the participation groups when adjusted for the background variables. In particular, assumption A is validated if the regression coefficient for M 2 is close is zero. Theoretically, there exists a possibility that the differences in one risk factor are incidentally canceled by the differences in other risk factors such that the risks are the same in all the groups but this is considered unlikely in the practice.
All statistical analyses are carried out with R [21]. Deaths and hospitalizations are analysed separately. Logistic regression is used to model the death during the follow-up and the sampling weights are applied. Two mod-M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT 8 els are fitted: a model for all individuals in the sample and a separate model for non-participants {i : M 1i = 0} only. The predictors common for both models are sex, age and region. Both models also include recontact response indicator M 2 as a predictor. In addition, the model for all individuals includes participation indicator M 1 as a predictor.
Zero-inflated negative binomial regression model is used to model the number of the hospital visits during the follow-up. The predictors are the same as in the models for the deaths. R package pscl [22] is utilized.

Statistical methods for estimation of health indicators
In the estimation of the population averages of health indicators, MI [23,24] is used to handle the missing values of questionnaire variables X and physical measurement variables Y . In addition to missing data due to nonparticipation, there are occasional item non-response in questionnaire variables. The imputation method is called MI-MNAR where MNAR tells that the data are assumed to be missing not at random. The possible bias of Rubin's MI variance estimator for data that are collected with a complex sample design has been discussed by many authors [25][26][27][28]. Although unbiased estimation cannot be guaranteed theoretically, MI methods seem to work well in the practice [25,29]. In our analysis, the imputation model includes the stratification variables age, sex and region as explanatory variables. Interactions between sex and age and between sex and region are included. The sampling weights are applied when estimates and their variances are calculated from the imputed datasets.
The questionnaire variables daily smoking, heavy alcohol usage, marital status, level of education, self-reported hypertension, recency of blood pressure measurement and self-reported high cholesterol are imputed variableby-variable using fully conditional specification [24]. These variables added with age, sex and region are used as explanatory variables in the imputation models for each other. Interactions between sex and age and between sex and region are included in all imputation models. Assumption A is implemented by using participation indicator M 1 as an interaction term for the explanatory variables. This means that only the data from participation group 2 are used to estimate the imputation model for the missing questionnaire variables in participation group 3.
The physical measurement variables are imputed as follows: • obesity is explained by sex, age, region, level of education, marital status, heavy alcohol usage and daily smoking, • high blood pressure is explained by sex, age, region, level of education, marital status, heavy alcohol usage, daily smoking, obesity, recency of blood pressure measurement and self-reported hypertension, • elevated total cholesterol is explained by sex, age, region, level of education, marital status, heavy alcohol usage, daily smoking, obesity and self-reported high cholesterol.
From assumption B it follows that indicators M 1 and M 2 are not needed here.
Interactions between sex and age and between sex and region are included in all imputation models. The imputation model is a logistic regression model for binary variables and predictive mean matching for all other variables. R package mice [30] is utilized and 50 imputations are generated.

Alternative statistical methods for estimation of health indicators
For methodological comparisons, two alternative imputation methods called MI-MAR and MI-MAR (no recontact) are also applied. MI-MAR is an imputation method similar to MI-MNAR with an exception that participation indicator M 1 is not used as an interaction term in the imputation models. This means that the data are assumed to be missing at random and the data from both participation groups 1 and 2 are used to estimate the imputation model for the missing questionnaire variables in participation group 3. The comparison between MI-MNAR and MI-MAR provides an indication of the extent of selection bias that cannot be removed only by conditioning on the background variables Z . MI-MAR (no recontact) uses otherwise the same imputation method as MI-MAR but simulates the situation where non-participants are not recontacted. The comparison between MI-MNAR and MI-MAR provides an indication of the importance of the recontact.
To compare MI-MNAR with the MI-MAR methods, we also impute the deaths and hospital visits during the follow-up. As these are available for the full cohort, the imputation based estimates can be benchmarked against the real data [29,31]. In the imputation, it is assumed that the deaths and hospital visits are available for participation groups 1 and 2 but missing for participation group 3 and the imputation model is similar to the model used for the questionnaire variables.

M A N U S C R I P T
A C C E P T E D ACCEPTED MANUSCRIPT 10

Results
The basic demographics of the three participation groups are shown in Table 1. The non-participants without a recontact response have a lower mean age and a higher proportion of men than the participants. However, the age and sex distributions of the non-participants with the recontact response seem to resemble the participants rather than the non-participants without recontact response.
Logistic regression models for the death during the follow-up are presented in Table 2. In both models, the point estimate of the regression coefficient for the recontact response is practically zero, which means that the variable is not a predictor of death. On the contrary, participation to the physical measurements is a very strong predictor of death. The model indicates that the difference between the participants and the non-participants is equivalent to the age difference of 12.5 years. Region is not included in the final model because it did not predict death during the follow-up.
The zero-inflated negative binomial regression model for the number of hospital visits (excluding pregnancy related visits with ICD-10 codes O00-O99) is presented in Table 3. The results show that the non-participants have more hospital visits than the participants but the differences between non-participants with and without a recontact response are very small. The results shown in Tables 2 and 3 provide the empirical justification for the assumption A in the causal model in Figure 1. Table 4 provides a summary of the causes of death and the causes of hospitalization in the three participation groups. It can be seen that the total mortality rate for the non-participants in 5 year follow-up is more than twice the total mortality rate for the participants. Even larger differences are seen for specific causes of death. The ratio of the mortality rates for the nonparticipants versus the participants is over three for alcohol related causes and about four for smoking related causes. The results support the interpretation that heavy alcohol usage and smoking are more common among the non-participants than among the participants.
The non-participants also have more hospital visits than participants. The difference is clear in hospital visits due to alcohol related causes but there are differences also in hospital visits due to infections and smoking related causes.

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
14 Table 5 presents the health indicators estimated using the proposed MI-MNAR method for the missing data. The MNAR estimates of the population prevalences of heavy alcohol usage and smoking clearly differ from the estimates from the participants only. It can also be seen that simpler MI methods provide estimates that clearly differ from the MI-MNAR estimates. The estimated population prevalences of overweight (BMI ≥ 30 kg/m 2 ) , high blood pressure (systolic blood pressure ≥ 140 mmHg) and high cholesterol (total cholesterol ≥ 5.0 mmol/l) are practically same for all estimation methods and close to the estimates calculated from the participants only.
The last two columns of Table 5 provide the estimates for the number of deaths and the number of hospital visits. For these statistics, benchmark numbers from the full cohort are available. It can be seen that the point estimates from the full cohort locate outside the confidence intervals calculated by the MI-MAR methods. The same holds true also for the confidence intervals obtained from participants only. The MI-MNAR estimates are higher than the full cohort estimates but the MI-MNAR confidence intervals nevertheless contain the full cohort point estimates. The selection bias increases uncertainty, which leads to wider confidence intervals in the MI-MNAR approach.

M A N U S C R I P T
A C C E P T E D ACCEPTED MANUSCRIPT 16

Discussion
We have studied the methods for reducing selection bias with data from FINRISK 2007. In addition to questionnaire data and physical measurements data for the participants, the data included also follow-up data for both participants and non-participants and questionnaire data for the recontact respondents. These data components allowed us to develop a MI method tailored for the current study. The follow-up data, which will not be available for timely survey reporting, was used only for checking the assumptions on the missing data mechanism.
As expected, we found that non-participation is a strong predictor for mortality in the five-year follow-up. This result is consistent with other studies [15][16][17] and indicates that the data are MNAR. However, rather surprisingly, the recontact response did not predict mortality or morbidity among the non-participants when adjusted for age, sex and region. The result was utilized in the MI-MNAR method devised to reduce selection bias. The comparison of raw estimates and estimates adjusted for selection bias reveals clear differences in the estimated population prevalences of smoking and heavy alcohol usage.
The validity of the results depends on the validity of assumptions A and B concerning the non-response mechanism. The follow-up data supported assumption A but the FINRISK 2007 data could not be used to study the validity of assumption B. Thus, the prevalence estimates for heavy alcohol usage and daily smoking can be considered more plausible than the prevalence estimates for obesity, hypertension and elevated total cholesterol because the former estimates require only assumption A to hold whereas the latter estimates require both assumptions A and B to hold.
The strength of the proposed MI-MNAR approach lies in its ability to utilize the data on the recontact respondents according to assumption A. On the contrary, the MI-MAR method used a stronger assumption on the similarity of participants and non-participants conditioning on the background variables and failed to produce credible estimates for the prevalences of smoking and heavy alcohol usage. In addition, the MI-MAR approach underestimates the uncertainty in the estimates. We conclude that methods based on the MAR assumption for participation were insufficient for removing the selection bias and should not be used here. If assumption A was not valid or could not be verified, alternative models, such as repeated attempt selection model [32] or repeated attempt pattern mixture model [33], could have been applied.
MI was used for the missing data problem but it is not the only option available. An alternative approach would have been a combination of MI M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT 17 and inverse probability weighting (IPW) [25]. In this approach, MI is first applied to data containing participation groups 1 and 2. Individuals in these participation groups are then re-weighed to also represent the persons in participation group 3. Bayesian methods could have been used as well.
From our current study, we can conclude the following three main messages. First, non-participation is a serious problem in HESs. In FINRISK 2007, this is seen by comparing the mortality and morbidity rates between the participants and non-participants as well as comparing the estimates for population level health indicators with and without correcting for selective non-participation. Second, the follow-up data for the whole sample is useful in checking the essential assumptions on the missing data mechanism. In FINRISK 2007, the follow-up data revealed that there must be large differences in the health status and the risk factors between the participants and the non-participants. The follow-up data also revealed that the decision responding in the recontact data collection does not seem to depend on the physical measurements or questionnaire variables. Third, efforts to collect data on non-participants may turn out to be worthwhile even if the response rate for the recontact remains low. In FINRISK 2007, only 13% of the nonparticipants responded in the recontact data collection but these data have a central role when the health indicators are estimated for the whole sample.
Further analyses are needed to check if our empirical finding on the nonparticipants with and without a recontact response holds also in other cohorts from Finland and elsewhere. For instance, the Leiden 85-plus Study (n = 692) reached a conclusion that the non-participants with and without a recontact response differ by their health status [34].
Our results showed marked changes in the health indicators for heavy alcohol use and daily smoking when adjusted using the MI-MNAR method. On the other hand, only minor changes were observed for obesity, hypertension and elevated total cholesterol. This reflects the importance of additional information on the non-participants. The non-response questionnaire had specific questions on alcohol use and daily smoking, while there were no good proxy questions for obesity, hypertension and elevated total cholesterol to be used for non-participants. For obesity, we could have obtained better estimates if we had had self-reported height and weight for those responding to the non-response questionnaire. Unfortunately, this information was not asked.
Partial questionnaire information from some of the health examination survey non-participants does not remove the non-participation bias completely but it provides valuable information which can be used to estimate the magnitude of the non-participation bias. There is some evidence indicating that the prevalences of health behaviors, such as smoking [9,35] and M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT 18 alcohol use [9,10,36], and the prevalences of health conditions, such as mental disorders [9,37] and obesity [38,39], would be underestimated if based on survey participants only. Even in studies where follow-up data are not available to check the assumptions, the statistical analysis of the recontact respondents provides an indication of the extent of the selection bias.
Representative estimates for the levels of risk factors at the population are needed for the health policy decision making and the planning and evaluation of prevention programs. Therefore, any additional information which helps us to improve our population level estimates is needed.