Bayesian models for data missing not at random in health examination surveys

In epidemiological surveys, data missing not at random (MNAR) due to survey nonresponse may potentially lead to a bias in the risk factor estimates. We propose an approach based on Bayesian data augmentation and survival modelling to reduce the nonresponse bias. The approach requires additional information based on follow-up data. We present a case study of smoking prevalence using FINRISK data collected between 1972 and 2007 with a follow-up to the end of 2012 and compare it to other commonly applied missing at random (MAR) imputation approaches. A simulation experiment is carried out to study the validity of the approaches. Our approach appears to reduce the nonresponse bias substantially, where as MAR imputation was not successful in bias reduction.


Introduction
Population level estimates of risk factors are of major interest in epidemiology. Data on risk factors such as blood pressure, cholesterol level, body mass index, alcohol consumption and daily smoking are often collected in health examination surveys (HES). In a HES, the data on risk factors are gathered usually via both questionnaires and physical measurements. The trends of population level risk factors are monitored and they are valuable input for policy decisions.
Missing data by unit nonresponse occurs in a HES as invitees neither participate to physical measurements nor return a survey questionnaire. The decision about participation have been found to depend on the risk factors, such as smoking (Shahar et al., 1996), either directly or via a common cause such as health awareness. This may be deduced from the fact that the non-participants have a higher risk of death (Jousilahti et al., 2005;Harald et al., 2007;Karvanen et al., 2016). This dependence causes missing data to be classified as missing not at random (MNAR) (Rubin, 1976). Because the data are MNAR, the population level risk factors calculated from the participants' data are biased, and they usually give an overly healthy view of the population. Biased estimates of risk factor prevalence may seriously misinform decision makers. Instead of analysing only the participants data, the posterior distributions of risk factor levels of a whole sample, including non-participants, should be estimated. This requires external information and modelling assumptions.
In this paper, we demonstrate how follow-up data on endpoints associated with the risk factor of the interest provides external information that allows us to reduce the bias caused by selective non-participation. We propose a Bayesian method for the estimation of risk factor prevalence and the missing data mechanism, when the data are MNAR.
Our datasets origin from the FINRISK studies, which are national HES providing information about the health of Finns. We improve and extend an earlier work (Kopra et al., 2015) on the estimation of smoking prevalence from FINRISK data. The key improvements are: a fully Bayesian model is used, the survival model is more flexible, and informative prior is utilised instead of assumption of conditional independence (Kopra et al., 2015, Eq. (2)). Differently from Kopra et al. (2015) the study years 2002 and 2007 are included in the modelling.
Next section describes the data of the FINRISK studies and follow-up. Section 3 presents the Bayesian model and the priors that we apply to smoking prevalence estimation. Section 4 explains model fitting, and Section 5 provides a simulation study on the proposed approach. We evaluate alterna-tive methods in Section 6. In Section 7 we apply our approach to real data from the FINRISK studies and provide smoking prevalence estimates for both men and women. Section 8 discusses the results and methods presented.

Data description
Our HES data contain eight FINRISK studies conducted in selected geographical areas of Finland once in every five years in 1972(Laatikainen et al., 2003Harald et al., 2007). In each study year, persons were selected to the FINRISK studies in a random sampling stratified by region, gender and 10-year age group. Our data are restricted to the two regions (Northern Savonia and North Karelia) that have been included in all eight studies. In total, the data contain 52, 325 persons including 9, 928 persons with missing smoking indicator.
Each person selected to the study received a letter of invitation, in which he or she was asked to fill in a survey questionnaire and participate to physical measurements in the local survey site. If the person participated, the filled questionnaire was collected and the physical measures were taken. If the person did not participate, then risk factors are missing, but background variables, study year, age, gender, and region, are known from the sampling frame. Table 1 shows that the participation rates have dramatically decreased from 1972 to 2007. It can be also seen that women have participated more actively than men in all study years. We also know that person's age affects participation (Kopra et al., 2015).
Our HES data were linked together with follow-up data of all participants and non-participants. The follow-up data contains the exact dates and diagnoses (ICD codes) of hospitalisations and deaths. In Finland, this kind of follow-up data can be collected from administrative registers for both participants and non-participants. The follow-up period started at the time of study for each person and ended on 31st December 2012 for all FINRISK study years. Thus, the length of the follow-up period varies by study years.
It is well-known that smoking is a key risk factor for lung cancer and chronic obstructive pulmonary disease (COPD) (Doll & Hill, 1956;Mannino & Buist, 2007). Thus, we use lung cancer and COPD events together as an endpoint. Table 2 shows that non-participants have a higher rate of disease events than participants.
We limit in our analysis the age range to 25-64 years-old and select the subset of healthy persons with respect to our endpoint variables. The two exceptions are 1972 and 1977 studies, which have age ranges of 25-59 and 30-64 years-old, respectively.

Bayesian model
The modelling is based on the idea that although it is impossible to directly observe the smoking status of non-participants, we can obtain information on smoking indirectly from the follow-up data. More precisely, the modelling uses the observed incidence differences of the smoking-based diseases between participants and non-participants, which allows us to adjust the estimates of smoking prevalence. Full Bayesian approach is applied, and model fitting is executed using Markov chain Monte carlo (MCMC) methods (?).

Notation for the data
We introduce our model using causal models with design (Karvanen, 2015), and make a difference between measurements and underlying causal variables. The model is presented in Figure 1. For each person i = 1, . . . , N invited to the survey, we denote participation indicator by M i , which takes the value M i = 1 if person i participated, and value M i = 0 otherwise. Value M i = 0 indicates missing risk factor data. The indicator of self-reported daily smoking is denoted by Y i and the corresponding measurement by Y * i . Variable Y i takes value 1, if a person is a daily smoker, and 0 otherwise. Class Y i = 0 includes earlier smokers who quitted. The value of Y * i is known for the participants, then Y * i = Y i , but missing for the non-participants. We denote by vector X * i the variables age a i , gender g i , region r i and study year s i in the background data observed for all sample members. The values g i = 0 stand for men, and g i = 1 for women. The North Karelia region is denoted by r i = 0 and Northern Savonia by r i = 1.
We denote T i as the age at the day of diagnosis, which may also be the age at the time of death if a person dies without previous lung cancer or COPD diagnoses and the death is caused by either of the two diseases. If the person has not been diagnosed, the corresponding measurement T * i is missing, and T i is right-censored. Variable T cens,i is the age of the person i in the end of the year 2012, which is the end of our follow-up period, or the age of death for the person who have died before the end of the year 2012. The variable T obs,i is the minimum of T i and T cens,i , so T obs,i = min(T i , T cens,i ), and T * cens,i = T cens,i and T * obs,i = T obs,i .

Submodels
The joint model for data from a HES linked with follow-up consists of three submodels: 1. a participation model in which participation is explained by daily smoking and background variables (arrow X i → M i in Figure 1), 2. a risk factor model for daily smoking given the background variables 3. a survival model for the follow-up data given the daily smoking and background variables (arrows Y i → T i and X i → T i ).
These three submodels together form a joint model for the data, which we call Bayesian MNAR model, see Figure 1. The arrows X i → M i and Y i → M i correspond to the participation submodel, that can be written as P (M i = 1|X i , M i ). The arrow X i → Y i corresponds to the risk factor submodel (distribution P (Y i |X i )), and the arrows Y i → T i and X i → T i correspond to the survival model (distribution P (T i |Y i , X i )). All the submodels are fitted together because each of them contains the indicator of smoking, which has missing values to be imputed.

Participation model
g i , s i , a i and r i are part of X i and they stand for gender, study year, age, and region, respectively. Variable Y i stands for smoking. The roles of model and α 2 are explained below. Parameter α 0[g i ,s i ] is a regression coefficient (intercept) which varies over the levels of Figure 1: A graph representing the model and the dependencies between the variables of HES data and the follow-up data. Direct causal effects are represented as arrows. Measurement variables are denoted with asterisk, e.g. X * i , and are presented as filled circles. The causal variables do not have asterisk symbol (e.g. X i ), and they are drawn unfilled to indicate that they are not observed directly but via measurement variables. The measurement variables always have one participation indicator (m i or M i ) and one causal variable as their parent. The graph tells that X * i , T * obs,i and T * obs,i are collected for each member of the sample while Y * i is measured only for participants, and is missing for the non-participants. g i and s i , i = 1, . . . , N . The variable g i is binary and s i has eight possible values, which create the total of 16 intercept parameters. The parameters η [g i ,s i ] are gender-specific regression coefficients modelling how daily smoking affects participation in each year. We also take into account how the age of person affects participation; the gender-specific coefficients α 1[g i ,0] and α 1[g i ,1] model how age affects participation for non-smokers and smokers, respectively. The parameter α 2 describes the differences in participation between the two regions.

Risk factor model
Next, we need to model smoking indicator Y i by background variables X i = (g i , s i , a i , r i ). We use a logistic regression model where coefficients β 0 and β 1 vary between groups defined by combinations of gender g i , region r i and study year s i similarly as in (1). The year of birth s i − a i for person i is centered at its rounded population average 1938 in the model.

Survival model
To define a survival model for P (T i |X i , Y i ), a counting process notation is used. Let N i (t) stand for the count of disease diagnoses up to age t for person i. Let dN i (t) be the increment of the counting process over one-year time interval [t, t+1), and let t take discrete values 25, 26, . . . , 100. Now, we model dN i (t) with a piecewise constant hazard model assuming that for each oneyear time-period, the gender-specific hazard h 0,g (t) remains constant (g = 0, 1 stands for the gender). The model for follow-up data is where γ 1 and γ 2 model how smoking increases the hazard for men and women, respectively.

Prior distributions
For the participation model, we use informative prior distribution for the difference between smokers and non-smokers. An informative prior for η [g i ,s i ] is derived as follows. We consider a 45-year-old non-smoker, who participates with probability p = 0.7, and elicit the corresponding prior probability for a smoker, who is otherwise similar. We consider that there is a 15% chance that the participation prior probability p is less than 0.50, about 30% chance for less than 0.60, and 50% chance for less than 0.70. These considerations together with an assumption on a logistic distribution for η [g i ,s i ] lead to prior distribution The prior distributions for participation model coefficients α 0[g i ,s i ] and α 1[g i ,Y i ] , and risk factor model parameters β 0[g i ,r i ,s i ] and β 1[g i ,r i ,s i ] are normal distributions with mean µ = 0 and variance σ 2 = 1000 (uninformative priors).
Survival model parameters γ 1 and γ 2 are also a priori normally distributed with µ = 0 and σ 2 = 1000. Our prior distribution for baseline hazard h 0,g (t) is monotonically increasing with age where t = 26, 27, . . . , 100, where g stands for gender, 0 for men and 1 for women. This means that model assumes that risk of smoking-based diseases only increase with age. This assumption seems to be in agreement with our data.

Model fitting
As the number of model parameters (316) and missing values (9,928) is large, there are over 10,000 variables to sample at each iteration of the MCMC model fitting process. This creates a computational challenge for Bayesian model fitting. The Markov chains typically require thousands of iterations or more to obtain satisfactory convergence, which requires a lot of computing time.
To impute the missing values for smoking indicators, the data augmentation was applied (Tanner & Wong, 1987). A Bayesian MNAR model described in Figure 1 and Sections 3.3, 3.4 and 3.5 was used. The augmented data for smoking indicator Y i are drawn from fully conditional distribution P (Y i |M i = 0, X i , T i ), given its parent nodes (X i ) and child nodes (M i and T i ).
We used Just Another Gibbs Sampler -software (JAGS) (?), R (?) and rjags package (Plummer, 2015) to fit the model. Seven parallel MCMC chains were used. Each chain had 9000 burn-in iterations, 45900 actual iterations with thinning interval 75, which makes a total of 612 iterations per chain to be recorded. The time consumed for this model fitting using parallel chains was about 107 hours, which makes 7 seconds per each iteration and less than 0.7 milliseconds per each parameter for one iteration. The high absolute number of missing values (9, 928) explains the long running time. Because of high autocorrelations in the chains, we decided to use a long thinning interval and many iterations to reduce the autocorrelations in the saved iterations of the MCMC and larger sample of the posterior for the final trends. The convergence of the chains was examined both visually and using Brooks-GelmanR-diagnostics (Brooks & Gelman, 1998). All theR test statistics for model coefficients were below 1.01 which indicates convergence.

Simulation study
We carried out a simulation experiment to demonstrate the performance of the model. The simulated data allow us to compare the performance of the estimated prevalence of smoking with true prevalence, which is known with the simulated data but not with the HES data. The actual values of the background variables from the FINRISK study were used together with parameter estimates from a preliminary analysis. Thus, the simulation experiment had conditions similar to the real data, e.g. the smoking prevalence had a decreasing trend for men and an increasing trend for women. For both genders the participation was selective, and the participation rate had a decreasing trend.
The simulation was implemented using R language, and the data were simulated from the Bayesian MNAR model presented in Section 3. Because of the computational burden of model fitting, the model was fitted into a single simulated data set.
The model fitting for the Bayesian MNAR model with simulated data was implemented as described in Section 4. All theR-diagnostics were below 1.01 which indicates convergence. We inspected the posterior correla-tions between the model parameters and found strong correlations (≥ 0.9 or ≤ −0.9) between some of the parameters. In the risk factor model, the strongest correlations were observed between the parameters β 0 and β 1 . The median of these correlations was −0.434 and the range was [−0.905, 0.661]. Conditioning with M i likely causes these posterior correlations. On the participation model, the strongest posterior correlations were found between the α 0[g i ,s i ] and η [g i ,s i ] . For men, those eight correlations ranged between −0.972 and −0.899, and for women between −0.860 and −0.665. In the survival model, strong positive correlations occurred particularly between the hazards of consecutive years, which is natural to this type of models. The highest correlations for men were 0.920 and for women 0.952.
It can be seen from Figure 2 that with an exception of women in 1972, the true prevalence is located inside the credible interval and there is no indication of systematic bias in the posterior mean. In contrast, the prevalence calculated using only the participants systematically underestimates the true prevalence. As the same family of models was used both to simulate the data and to fit the parameters, the Bayesian MNAR approach is expected to perform very well. However, the experiment demonstrates that the model can be estimated from the data.

Alternative methods
In addition to the Bayesian MNAR approach, we considered two alternative modelling approaches and the complete case analysis. Both modelling approaches utilize the MAR assumption and the complete case analysis uses data on participants only. In terms of Figure 1, the MAR assumption omits the arrow Y i → M i , which means that participation is not selective with respect to smoking.
First, the Bayesian MAR approach differs from the Bayesian MNAR approach such that the entire survival model (4) is omitted, and the regression coefficient η [g i ,s i ] is fixed to zero in participation model (1).
We also used the frequentist MAR model which was implemented using the mice package in R (van Buuren & Groothuis-Oudshoorn, 2011). The missing smoking indicator was imputed using a logistic regression model that had full interactions between year of birth, gender, region and year, and full interactions between gender, event indicator, and age at the event/censoring. The year of birth and the age at event/censoring were used as linear covariates and the other variables were categorical.
We fitted these alternative approaches to data simulated from the MNAR model which is selective with respect to smoking. The trend estimates are presented in Appendix A in Table 4. We calculated root mean square errors (RMSE) for the Bayesian MNAR model and each of the alternative approaches. The RMSE was calculated using formula where y is the estimated smoking prevalence (%) and y true is the true smoking prevalence from simulation. The RMSE was calculated over the regions, the genders and the study years, which gives one RMSE value for each approach. The Bayesian MNAR approach had the smallest RMSE, 1.65. The Bayesian MAR and the frequentist MAR methods have very similar RMSE with each other, 3.34 and 3.37, respectively. These two methods were slightly more accurate than the complete case approach with RMSE 3.48.

Application to FINRISK data
The trends of daily smoking for the FINRISK data estimated using the Bayesian MNAR model are reported in Figure 3 and compared to participant trends, which are often reported in HES. The difference between the smoking prevalence estimates of the complete case (participants) and the Bayesian MNAR approaches is the highest for the study years 1977years , 1982years , and 1987 and Table 3 in Appendix A). Starting from 1992, the complete case trends are within the 95% credible interval of the Bayesian MNAR model, but they are systematically below the posterior mean. The proportion of missing data is higher for the later study years than the earlier ones, which makes the credible intervals wider.
The values of the region variable r i for the study years 1972 and 1977 were missing for non-participants. We used a single imputation with fixed probabilities P (r i = 1|s i = 1972) = 0.495 and P (r i = 1|s i = 1977) = 0.493 as in Kopra et al. (2015). We decided to use single imputation a prior to model fitting because the regions seemed to be rather similar with respect to smoking prevalence, and the participation rates were high.
We executed sensitivity analysis to find out if model can be fitted with even less informative prior. We tried prior distributions for η with s parameter set to (2.05/2) −1 , which corresponds to doubling the prior variance. We found out that the Markov chains do not converge. More precisely, convergence problems were found with η-parameters for which theRs ranged  Table 4 in Appendix A. between 1.097 and 1.828 after 45900 iterations and 9000 burn-in. Thus, it appears that vague prior distributions are not applicable.

Discussion
We have proposed the Bayesian MNAR modelling approach to reduce the non-participation bias and applied the approach to the FINRISK studies. In a simulation experiment, we compared our approach to the Bayesian MAR approach, to the complete case analysis, and the frequentist MAR imputation. The latter two were easier to use than the Bayesian MNAR approach but did not substantially reduce the non-participation bias. The proposed approach appears to reduce the non-participation bias. Trends by the Bayesian and the frequentist MAR approaches are essentially the same as participants' trends. Thus, the MAR approaches do not reduce the bias of risk factor levels by much. Although there may be ways to improve the imputation model (?), the MAR imputation do not account for selective non-participation.
The information about non-participants' risk factor levels comes from the survival data. The risk factor of interest must be a strong predictor of the survival outcome. The estimation of survival model parameters requires that a sufficient number of events have been recorded. This implies that the length of follow-up must be long enough and if the event is rarely observed, the number of persons must be high. The information obtained from the survival model may be insufficient by itself and need to be supported by an informative prior on the selection mechanism.
In many countries, it is not technically or legally possible to link the HES data of non-participants to follow-up data. The requirement on the availability of survival data for both participants and non-participants is a major limitation for the proposed approach.
The Bayesian model fitting is often a computational challenge, particularly when the amount of missing data is large. The memory management and computation time were issues in our case study, due to a large amount of missing data. In our first attempts, we tried to save all the imputations, which filled the RAM memory of the computer (16 GB) quite rapidly. We later realised that it is possible to calculate sufficient (summary) statistics, e.g. count of smokers and non-smokers by gender, year and region, and store only them. We also reduced the time required per one iteration by coarsening the continuous covariates (the age) of the survival model and summing over discrete or discretised covariates, and by modelling the number of events in each risk group using Poisson distribution.
Another challenge was the posterior correlations caused by the model structure and non-random missingness. One possibility to alleviate this problem could be to develop a custom MCMC algorithm with blocked updating of the parameters with high posterior correlations (Haario et al., 2001). However, this often requires custom programming and is therefore much more laborious than an application based on JAGS, which we have applied. Thus, if one needs to use data with more missing data than in our case study or multiple variables with MNAR missingness, we recommend using some specialized MCMC algorithms, or possibly the iterated importance sampling algorithm (Celeux et al., 2006).
Additional fully observed covariates can be added into the model without excessive increase in computational burden. If more variables with missing data are imputed, the model fitting is slowed down proportional to increase of absolute amount of missing values. This is because at each MCMC iteration all the missing values need to be imputed.
Selective non-participation in HES is an important problem that may have implications to the decisions on health policy. Our solution is not simple to implement but the reduction of selection bias makes it worth of the effort. Table 4: Participant trends and model-based posterior trends with 95% credible intervals for the FINRISK data.

Northern Karelia
North Savonia year method men women men women