Optimal selection of individuals for repeated covariate measurements in follow-up studies

Repeated covariate measurements bring important information on the time-varying risk factors in long epidemiological follow-up studies. However, due to budget limitations, it may be possible to carry out the repeated measurements only for a subset of the cohort. We study cost-efficient alternatives for the simple random sampling in the selection of the individuals to be remeasured. The proposed selection criteria are based on forms of the D-optimality. The selection methods are compared with the simulation studies and illustrated with the data from the East–West study carried out in Finland from 1959 to 1999. The results indicate that cost savings can be achieved if the selection is focused on the individuals with high expected risk of the event and, on the other hand, on those with extreme covariate values in the previous measurements.


Introduction
Many epidemiological follow-up studies include covariates, such as blood pressure, cholesterol and weight, that may vary over the time. If only the baseline measurement of these covariates is used, the analysis may suffer from the regression dilution problem, 1,2 i.e. the measurement made a long time ago does not anymore predict the disease risk. To avoid this problem, we need to carry out repeated measurements on the time-varying covariates. Conducting these measurements in a large population sample is expensive, and due to budget limitations, we may not be able to select the entire cohort for the new measurement, but only a subset of it. In this article, we study how this subset should be selected in order to estimate the effect of the covariate on survival time as accurately and precisely as possible. We propose the subset to be selected so that a Fisher information based optimality criterion will be optimized.
The planning of cost-efficient study designs has led to the development of multi-stage (or sequential) designs. In multi-stage studies, the next stage is constructed on the basis of the information obtained from earlier stages under given budget limitations. Multi-stage designs allow us to allocate the sample size optimally between the stages. 3,4 Even further, optimality criteria developed originally for design of experiments 5,6 can also be applied in observational multi-stage studies. Karvanen et al. explore optimal ways to select a small subset of individuals for expensive genotyping in follow-up studies. 7 Mehtälä et al. consider optimal designs for the measurement times of a binary continuous-time Markov process. 8 The research question outlined above is explored here in a simplified setup with one time-varying covariate and two measurement times using simulated and real data. Several authors have considered general approaches for the joint modeling of survival data and longitudinal covariate data. [9][10][11][12] The real data come from the Finnish cohorts of the Seven Countries Study 13 carried out in Finland from 1959 to 1999. The objective of The Seven Countries Study was to investigate variation in cardiovascular disease and related risk factors levels. In our example, diastolic blood pressure is the time-varying covariate of interest, and the survival time is considered to be the age at the time of death. With these data, we compare the results corresponding to different selection methods to the situation where every individual is selected for the second measurement.
As we are selecting only a subset of individuals for the new measurement, the majority of individuals does not have this measurement, and thus the handling of missing data plays an important role in the modeling. We study here two alternative ways to deal with the missing data: multiple imputation and a likelihood-based approach with numerical integration. 14 The underlying model is described in Section 2 and the selection criteria are presented in detail in Section 3. In Section 4, we discuss an algorithm, which is used in finding optimal or nearly optimal designs with respect to the chosen criterion. Statistical analysis of data collected during the whole follow-up, is described in Section 5, with an emphasis on the handling of missing data. Simulation studies comparing different selection methods are presented in Section 6, followed with results obtained using the real data in Section 7. Finally, Section 8 concludes the paper.

Survival model
We consider a follow-up study with a predetermined length, where all individuals have a baseline measurement of a single time-varying covariate. The outcome variable is survival time, which is censored at the end of the follow-up. The second measurement of the covariate will be taken halfway through the follow-up. Assume, that we cannot afford to remeasure the entire cohort, and therefore have to select a subset of individuals for the second measurement. Here, we study cost-efficient alternatives for the simple random sampling in this selection, which is conducted just before the time of the second measurement. The interest lies in the utilization of the baseline measurement information. In addition, the age of an individual may also affect the selection as we are operating with the age as the time axis in our Weibull proportional hazards model.
We start by specifying the survival model, which we need later in defining the selection criteria. Let Y 1 = (t 1 , δ 1 ) denote the survival information at the time of the second measurement, where continuously measured time t 1 is censored if the individual is still alive. The status indicator gives δ 1 = 1 for the event and δ 1 = 0 for censoring. For the second part of the follow-up we use similar notation Y 2 = (t 2 , δ 2 ).
Under the proportional hazards model with a time-varying covariate x(t) the hazard function has the form where parameter β describes the relation between the covariate and survival time. We continue by assuming the Weibull distribution for the survival times. The baseline hazard function is parameterized with shape parameter a and scale parameter b as Now, assume a piecewise constant covariate for an individual j: when the hazard can be written as for the first part of the follow-up and as for the second part of the follow-up. The whole data for N individuals are denoted by (X 0 , X 1 , Y 1 , Y 2 ). In general form, the survival function and the density function are S(t|x(t)) = S 0 (t) exp(βx(t)) and f (t|x(t)) = λ(t|x(t))S(t|x(t)), where S 0 (t) is the baseline survival function. Because we are operating with age as the time axis and individuals enter the follow-up at different ages, we have to deal with truncated distributions. Denote the survival time (age) at entering the follow-up by t 0 . The likelihood function of parameters θ = (β, a, b) for an individual j has the form where the first two factors of the product correspond to information from the first half of the follow-up and the last two factors correspond to information from the second half of the follow-up. The survival functions in denominators are needed to scale distributions, because we do not assume the follow-up to start from the time origin. An individual has contribution to the part of the likelihood concerning the second part of the follow-up only if he or she has not had an event before the second measurement.

Selection criteria
Our main question is, how to select a subset of individuals for the second measurement if only n individuals can be selected. We would like to select a subset which allows parameter β to be estimated as accurately and precisely as possible. Applying the principles of optimal experimental design, the selection criteria are based on the Fisher information matrix of parameters θ = (β, a, b) where at the time of the selection X 0 and Y 1 are observed and X 1 and Y 2 are unknown. By factorizing the logarithmic joint distribution of X 0 , X 1 , Y 1 and and the information matrix becomes Above, the terms log p(Y 1 |X 0 ) and log p(X 1 |X 0 , Y 1 ) do not include parameters θ and cancel out. The likelihood contribution to p(Y 2 |X 1 , Y 1 ) comes only from the individuals, who have not had an event before the time of the second measurement, i.e. individuals with δ 1 = 0. In practice, we replace the first term of (1) by observed information J Y 1 |X 0 (θ), and get a mixture of the observed and expected information matrices, which has the elements where the first term consists of the information from the first period of the follow-up from all individuals k = 1, . . . , N . The second term includes the information from the second measurement, thus the sum is only over the selected subset of individuals k = 1, . . . , n. As the selection of individuals is carried out just before the time of the second measurement, the variables X 1 and Y 2 are not observed for anyone and the expectation above can be calculated by Monte Carlo integration. 15 The two selection criteria we will apply, are well known criteria from the theory of optimal designs. Methods we will apply in the selection are based on forms of the D-criterion. The D-optimal design is obtained by maximizing the determinant of the information matrix det(Ψ X,Y (θ)). This is equivalent to minimizing det(Ψ X,Y (θ) −1 ).
The second method we are using is the D β -criterion. D β -optimal design is obtained by minimizing the D β -criterion, which we define as a diagonal element corresponding to β of the 'covariance matrix' Ψ X,Y (θ) −1 . In other words, we want to minimize Var(β). This criterion is a special case of D soptimality, where we are interested in a subset of s parameters. 5 In our case this subset consists of the parameter β.
The calculation of optimality criteria requires the second order partial derivatives of log p(y 1 |x 0 ) and log p(y 2 |x 1 , y 1 ). First, considering log p(y 1 |x 0 ), we calculate the second order partial derivatives of For log p(y 2 |x 1 , y 1 ) corresponding formulae are obtained by replacing f (t 1 |x 0 ), S(t 1 |x 0 ) and S(t 0 |x 0 ) by f (t 2 |x 1 ), S(t 2 |x 1 ) and S(t 1 |x 1 ), respectively.
From these formulae, it cannot be directly seen what kind of individuals would be included in the D β -optimal or D-optimal subsets. Intuitively, we could expect that individuals with extreme covariate values or with high risk of the event would be important for the estimation. In a case of first-order linear regression models extreme selection of covariate values is optimal, 16 but this does not hold e.g. for quadratic models or binary responses. Extreme selection means selecting individuals with highest and lowest covariate values. Although this method could be applied also in our case, there is no guarantee that it would be optimal in any sense, because we consider a survival model instead of a linear regression model. On the other hand, in survival models, events contain more information than censorings. Intuitively, this means that individuals, who are likely to have an event during the follow-up should be selected. Results with simulated and real data in Sections 6 and 7 show, how these intuitive principles will be balanced according to D β -optimal and D-optimal selections.

Finding optimal designs
As the optimal designs of nonlinear models depends on the parameters, we need initial estimates for parameters a, b and β, to use optimality criteria in practice. These can be obtained from the data already collected during the follow-up before the time of second measurement and/or from previous studies. For study cohorts of reasonable size, it is not computationally possible to explore all possible subsets of individuals to be selected for the new measurement but heuristic methods are needed. We use a greedy method, 17 where the individuals are selected sequentially one by one to find the optimal subset. This method is also known as the sequential search 18 and is a simplified special case of the Fedorov-Wynn algorithm 19 . The jth individual is selected so that the selection criterion is optimized on the condition that j − 1 individuals have already been selected.
Denote the set of individuals already selected by S. Using D β -optimality, the next individual j / ∈ S is selected so that Var(β) obtained from the ap- is minimized. To find the D-optimal design, the selection is carried out so is maximized. In a case in which there are more than one individual that optimizes the criterion at issue, the selection between them is done randomly. The selected individual is added to the set S and the procedure continues until the set S has reached the predetermined size.
In general, the selection problem is NP-hard and the greedy method produces only a suboptimal solution. However, the empirical results 7 indicate that the gain from more complicated heuristics may not be large in this kind of design problem.

Likelihood-based approach with numerical integration
After the follow-up study we have data with a large amount of missing data, as only a subset is selected for the second measurement. When that subset is selected using D β -optimality or D-optimality, the missingness of that measurement is clearly not missing completely at random (MCAR). However, as the selection depends only on the observed data (X 0 , Y 1 ), the missing data are missing at random (MAR). Handling missing data plays an important role in the analysis, and for that there are several methods, which can be used. In this section we present a likelihood-based approach and Section 5.2 considers a multiple imputation approach for carrying out the analysis. Next, we will use the following indexing of individuals. Individuals j = 1, . . . , n are measured both at the baseline and at the halfway of the followup and individuals j = n + 1, . . . , N have the baseline measurement only. We also divide individuals without second measurement into two groups so that individuals j = n + 1, . . . , n have not had an event before the time of the second measurement and thus could have been selected and individuals j = n + 1, . . . , N have already had an event and were not candidates for that measurement.
The analysis can be carried out with the likelihood-based approach for incomplete data. 20 This requires also the specification of the distributions of the covariate, namely p(x 0 ) and p(x 1 |x 0 , y 1 ). The parameters associated only with the covariate process are denoted by ψ. Utilizing the assumption p(y 2 |x 0 , x 1 , y 1 ) = p(y 2 |x 1 , y 1 ), the likelihood becomes Missing data are treated here as integrals with respect to the missing variable over the support of it. In simple settings with only one covariate, the integrals in the likelihood function can be calculated by numerical integration, which we apply in this paper. If the covariates are categorical variables, the integration would simplify to summation and direct numerical maximization would be feasible and straightforward. In the general setting, methods such as EM-algorithm 21 or Bayesian data augmentation 22 could be used.

Multiple imputation
Another method we apply in handling missing data is multiple imputation. 23 Now, we do not have to model the marginal distribution p(x 0j ) as in the previous approach, but only the distribution p(x 1j |x 0j , t 2j , δ 2j ) (the imputation model) has to be specified. It is known that when the missing data are in a covariate of the analysis model, the outcome variable should be used in imputation model. 24 In our case, this means that in addition to baseline covariate measurement, the survival information must also be used to predict the missing covariate value of the second measurement.
We use a linear imputation model of the form proposed by White and Royston 25 x where ε j ∼ N (0, σ 2 ) and the survival information is included in the predictors using the status indicator δ 2 and the cumulative baseline hazard function H 0 (t 2 ). For normally distributed x 1 this imputation model is approximately valid when covariate effects and cumulative incidence are small. 25 White and Royston also discuss the estimation of H 0 (t) in the case of the semiparametric Cox model. 25 We assume instead a Weibull proportional hazards model, when we have The multiple imputation, using the above imputation model, is carried out in a standard way. 23 6 Simulation study 6

.1 Description of the simulation study
Simulation studies were carried out in different settings to explore the performance of the proposed selection methods, namely D β -optimality and Doptimality. Simulated data were made to resemble our real data of Section 7 apart from a few exceptions. We considered a follow-up setting of 20 years with 1500 individuals. A time-varying piecewise constant covariate had the baseline measurement x 0 in the beginning and second measurement x 1 in the half-way of the follow-up. First, the baseline values x 0 of the covariate were generated from the normal distribution with mean µ 0 = 0 and variance σ 2 = 0.02 and the values of second measurement were drawn from the normal distribution conditioned on the baseline measurement with mean µ 1 = 0.5x 0 and variance σ 2 ε = 0.015, which leads to the correlation of 0.5 between x 0 and x 1 . Then, the ages of individuals were generated from the uniform distribution. We created datasets with a narrow age range from 35 to 44 years and a wide age range from 25 to 64 at the baseline to see whether this has an effect on the results. The age and the covariate of an individual were assumed to be independent.
Survival times were simulated from the Weibull distribution depending on the time-varying values of the covariate through the proportional hazards model with regression coefficient β = 3. We chose to use such a large value of β compared to real data, so that the possible effect of covariate stands out in the selection. The parameters of the Weibull distribution were set to shape a = 5.7 and scale b = 28000 (time scale in days), which roughly equal those estimated from the real data. The survival times of individuals who were alive at the end of the follow-up were censored. The follow-up was conducted at the same calendar time with all individuals, but at each measurement time the ages are not the same for the individuals.
With both narrow and wide age range, 1000 datasets were generated. The numbers of events were on average 101 and 187 during the first half of the follow-up and 222 and 288 during the second half of the follow-up, in datasets with narrow and wide age range, respectively.

Selection of individuals
The selection of individuals for the second measurement was carried out after ten years of follow-up by simple random sampling (SRS), D β -optimality selection and D-optimality selection. Two different sizes, n = 600 and n = 200, of subsets selected for the second measurement were considered. In the case where n = 600, we selected on average 43% with narrow age range and 46% with wide age range of the individuals, who were alive at that moment. The corresponding numbers in the case where n = 200 are 14% and 15%, respectively.
Before presenting the results obtained with different selection methods, we examine what kind of individuals are selected according to D β -criterion and D-criterion, in other words, who are most important to remeasure. Figure 1 shows the selection order for any n up to 600, when the selection is based on the D β -optimality. The order is, of course, irrelevant when we analyze the data. The selection seems to prefer extreme baseline covariate values, but also the age of an individual has an effect. With the wide age range, the age is the more important factor in the selection than with the narrow age range.
From Figure 2 we see that using the D-optimality leads to rather different kind of selection compared to D β -optimality. With the narrow age range, during the first few dozens of selection rounds, the D-optimal selection is focused on individuals with high covariate values and high age. However, the age has a greater effect on selection than with D β -optimality, especially when we consider the wide age range.

Analyses with different designs
The aim of the selections illustrated in Figures 1 and 2 is to find the subset which would lead to as reliable as possible estimation of parameter β in our Weibull proportional hazards model. All analyses were carried out with the R statistical software. 26 With multiple imputation the weibreg function from the eha package 27 was used to fit the Weibull model. When applying the likelihood-based approach with numerical integration, the likelihood was optimized using the optim function with the BFGS algorithm and the standard errors were evaluated using the hessian function from the numDeriv package 28 . Integrals were calculated numerically with the integrate function. In Weibull proportional hazards models, especially with small samples, the profile likelihood could be considered instead of the maximum likelihood, which we are using, to improve the estimation. 29 Estimation results using numerical integration in handling missing second measurement data are presented in Table 1. Bias seems to be negligible in all the designs. From the standard errors, obtained from the inverse of the numerically differentiated Hessian matrix, and standard deviations of estimates, it can be seen that with the optimal selections the estimation is more precise than with the simple random sampling (SRS). The differences in the standard errors and standard deviations between the full cohort and the designs with n = 600, are smaller than one would expect simply considering the difference in the number of observations. Although the D β -optimality is defined to provide the minimal variance forβ there are no clear differences between the standard errors of the D β -optimal and D-optimal designs. Table 2 shows the results when multiple imputation is used in dealing with missing data. We see that there is virtually no bias. Again we see that although only a subset is selected for the second measurement, results are rather good compared to the case where everyone is measured twice.
In Tables 1 and 2 the standard errors and deviations seem to be quite consistent when n = 600, especially with the wide age range. Nevertheless, when we compare the results of multiple imputation and numerical integration with Table 1: Simulation results for different designs using numerical integration. Bias is the mean bias of estimatesβ, SD is the standard deviation of estimatesβ and Mdn(SE) is the median of the estimated standard errors of β from 1000 simulation runs for both narrow and wide age ranges.
Design narrow age range in the case where n = 200, we see that multiple imputation produces clearly greater standard errors and deviations. Furthermore, in this setting the D-optimality seems to perform better than D β -optimality. This is, however, a problem related to multiple imputation, since theoretically the D β -optimal design cannot give larger standard error forβ than the D-optimal design. The assumptions of the imputation model (2) do not hold, because the covariate effect or cumulative incidence cannot be considered small. As a summary of the simulation study we can say that optimal selections improve the estimation and that the two design criteria seem to lead virtually to same improvement.

Results for the East-West study
Data from the Finnish cohorts of the Seven Countries Study are used. The Seven Countries Study was initiated in the late 1950s to study variation in cardiovascular disease and related risk factors levels. 13 The Finnish cohorts (N = 1711) included all men born between 1900 and 1919 in two geographically defined areas, one in Eastern and the other in South-Western Finland, from which comes the name East-West study. The baseline survey was conducted in 1959 and re-examinations in 1964, 1969, 1974, 1984, 1989 , 1994 and 1999. 30,31 The cohorts were followed-up for mortality until the end of 2010. In these analyses data on re-examination from 1964 and 1974, and information on the age of death are used. We use only a part of the data so that we consider the measurement of the year 1964 as the baseline measurement, 1974 as the year of the second measurement and 1984 as the end of the follow-up, when censoring is carried out. As a time-varying covariate, we use diastolic blood pressure, which is logarithmic and centered in the analyses. Survival time is considered to be the age at the time of death. After removing individuals who were already dead before the measurement of the year 1964 and who do not have observations of diastolic blood pressure, we have 1501 eligible individuals for our study. In this setting, we have 354 events in the first half and 424 events in the second half of the follow-up.
The selection order with D β -optimality and D-optimality can be seen in Figure 3. D β -optimality seems again to prefer extreme covariate values and at a fixed value of covariate, older individuals are selected first. In Doptimality, the age is clearly more important factor in the selection. All in all, the selection with the East-West data looks very similar compared to one with the simulated data, which could have been expected, because the simulated data were made to resemble these real data.
It turned out that the numerical integration did not work well with the real data. Different initial values of the optimization function led to different estimates, which may have arisen from invalid distributional assumptions. Thus, only the estimation results using multiple imputation for missing data are presented. Table 3 shows that there are some differences in the estimates, especially with D-optimal design when n = 200. Relative to the standard errors, the D β -optimal seems to be the best. By fitting the model separately for younger and older age groups, we see that the same model does not hold for younger and older individuals. This explains the problems with the D-optimal design, since it consists mainly of individuals with high age. It is worth noticing, that the gain from the optimal designs is free of charge once the selection procedure has been implemented. Comparing the designs with n = 600, we approximate that reducing the standard error of the SRS-design to the level of the D β -optimal design would require about 189 more individuals to be remeasured. That is, the efficiency of the SRS-design is 75% in this comparison. This kind of amount of measurements would create notable additional costs in epidemiological studies, like the East-West study. In the case where n = 200, approximately 23 more individuals are required in the SRS-design to reach the standard error of the D β -design, which means that the SRS has here the efficiency of 90%.

Conclusion
Limited resources lead us to investigate the cost-effectiveness of study designs. In this paper, we considered the selection of individuals for the often costly re-examination of a time-varying covariate in a follow-up study. Two different Fisher information based optimality criteria, D β -optimality and Doptimality, were applied and compared to the SRS using simulated data and real epidemiological follow-up data. The selections carried out according to the optimality criteria indicate that individuals with extreme baseline covariate values and high age would be most important to remeasure. The criteria balance differently between these characteristics: D β -optimality stresses the extremity, whereas D-optimality prefers old individuals. With both criteria, age is the more important factor in the selection when the age range of individuals is wide than when the age range is narrow. Results from the analyses with different designs show that, when handling missing data with multiple imputation or numerical integration, the precision is usually better for the optimal selections than for the SRS. No clear differences between the two optimal selections were observed. Numerical integration looks better than multiple imputation according to the simulation results, but from the real data we have learned that it may be sensitive to the model assumptions.
When the proportion of the missing data is large, the estimation may be sensitive to the model misspecification. 32 However, it should be emphasized that the model used for the optimal selection may be different from the model used in the analysis. After the second measurement it is possible to check the validity of the distributional assumptions and change the model if needed.
The future work will consider more complicated designs where several repeated measurements will be carried out on several covariates. The current work forms a basis for these extensions.