Register data in sample allocations for small-area estimation

ABSTRACT The inadequate control of sample sizes in surveys using stratified sampling and area estimation may occur when the overall sample size is small or auxiliary information is insufficiently used. Very small sample sizes are possible for some areas. The proposed allocation based on multi-objective optimization uses a small-area model and estimation method and semi-collected empirical data annually collected empirical data. The assessment of its performance at the area and at the population levels is based on design-based sample simulations. Five previously developed allocations serve as references. The model-based estimator is more accurate than the design-based Horvitz–Thompson estimator and the model-assisted regression estimator. Two trade-off issues are between accuracy and bias and between the area- and the population-level qualities of estimates. If the survey uses model-based estimation, the sampling design should incorporate the underlying model and the estimation method.


Introduction
Sample surveys provide estimates of the various parameters not only for the population of interest but also for subpopulations, referred to as "areas" here. Stratified sampling is a common design, where strata and areas coincide. How are area sample sizes controlled to provide satisfactory area and population estimates? The small overall sample size or an insufficient use of auxiliary information may lead to the fact that the areas are not defined at the planning stage of the survey. The consequence is that the area sample sizes cannot be controlled. Nonresponse as one cause of randomness is beyond the scope of the study. The lack of control can lead to small or even to null sample sizes for some areas. They are regarded as small because the area-specific samples are small enough to hinder direct estimates of adequate precision (Rao and Molina, 2015). Various model-assisted or model-based small-area estimation techniques, which are hard to implement, have been designed to solve this problem (Pfeffermann, 2013). The World Bank uses the software PovMap for producing business statistics. Burgard et al. (2014) have used various estimators and studied the performances of small-area point and accuracy estimates under different sampling designs.
We estimate the area and the population totals of the variable of interest under different sampling designs. The variable measures some quantity in business. Because the overall sample size is small and the population contains small areas, model-based estimation yields moderately accurate area estimates. The "borrowing strength" principle implies that sample information provides a higher estimation power for small areas. Two auxiliary variables correlated with the variable of interest serve as predictors. The selected model contains area-specific effects because the variable of interest is likely to vary from one area to another. We shall compare the main estimation method, which is model-based, to the design-based Horvitz-Thompson estimator and to the model-assisted regression estimator, on the basis of model-free allocations. The model-based estimators have lower variances, but may be biased. The design-based estimators are design-unbiased, but their variances are large for small areas with small sample size. The second motivation for using these three estimators is to clarify the trade-off between accuracy and bias.
Our allocation method, called "three-term Pareto method," also uses the model and the estimation method as auxiliary information at the planning stage. It is based on multi-objective optimization, on the model-based empirical best linear unbiased predictor estimator for obtaining the area and population total estimates of the variable of interest, and on the mean squared error (MSE) estimator. We shall compare this method with five reference methods displaying various optimization criteria and using auxiliary information. The method called "Molefe and Clark" also uses an area model. We introduce model-related allocations in Section 2 and four model-free allocations in Section 3: "equal," "Costa," "nonlinear programming," and modified "box constraint." A fixed small overall sample size is a common restriction. We explain the numerical basis of some allocations in section 4.2.
We simulate the allocation-specific random samples from a population containing real register data, by using stratified simple random sampling without replacement. Because the variable of interest is unknown and the between-area variation of each auxiliary variable in the population is too small to support allocation, the allocation-specific sampling design, except for equal allocation, is based on previous register data, called "proxy data." The relative root mean square error and the absolute relative bias measure the accuracy and the bias of an estimator in design-based simulations. They are sample-based approximations of the design mean squared error and of the design bias. The primary measure is the relative root mean square error, but we also compute the absolute relative bias for design-based estimates.
The area-specific relative biases reflect the validity of the model in each area. There is a first trade-off between the quality of area estimates and the quality of population estimate, and a second one between accuracy and bias.
The results support the sampling strategy, in which not only auxiliary information but also the model and the estimation method should be fixed early, in the design phase of the survey. The proposed allocation uses all information available before choosing the allocation method, avoiding fixed priorities for the importance of estimation at the area and the population levels.

Model-related allocations
In the model-based estimation, the area parameter and often the population parameter estimates result from the statistical model and from the chosen estimator. The proposed allocation (section 2.2) is based on the model and the estimator introduced in section 2.1 and on auxiliary information. Keto and Pahkinen (2010) used this model and this estimator to describe the relationships between area and sample sizes, estimation results, and area characteristics. One reference allocation (section 2.3) is based on a different area model and on a composite estimator, and uses auxiliary information. These two allocations are "model-related allocations." We introduce them in sections 2.2 and 2.3. Table 1 shows the summary details ofthese allocations.

Model and model-based area total estimator
The area total estimator of the variable of interest is based on the unit-level linear mixed model (Battese et al., 1988): where x dk is the vector of auxiliary information for unit k in area d, D is the total number of areas, N d is the size (total number of units) of area d, β is the vector of fixed regression parameters, the area-specific effects v d are : sample sizes minimize the sum of squares S ¼ ðACV ðŶ Eblup Þ À Min pop Þ 2 þ ðMACV À Min are Þ 2 , based on the approximate coefficients of variation according to Eq. (13), at the area and population level. The register of proxy data is used.

Molefe and
Clark where q is an adjustable constant (0 ≤ q ≤ 2), ρ is the common intra-area correlation, and σ d is the area-specific standard deviation obtained from the proxy variable y*. Area distributed as Nð0;σ 2 v Þ, independently of the random errors e dk , which are distributed as Nð0; σ 2 e Þ. The first value of the vector x dk is one, and the vector β contains the intercept term β 0 . Eq (1) is applicable when unit-level values are available for the variables x.
The expected value for the unit k in area d is Eðy dk Þ ¼ x 0 dk β, and the total variance is decomposed into the variance σ 2 v between areas and the variance σ 2 e within areas. The common intra-area correlation (Meza and Lahiri, 2005) measures the relative variation of the variable of interest between the areas. Before the area parameters, we estimate the model parameters and the area effects from the sample data. We denoteσ 2 v andσ 2 e the estimated variance components, andv d the area effects of the empirical best linear unbiased predictor. The estimateβ of β is obtained using the generalized least-squares method.
The empirical best linear unbiased predictor (Eblup) for the area total Y d is the sum of n d sampled y-values and the sum of predicted y-values for (N d À n d ) non-sampled units: where s d and s d denote the sampled and the non-sampled units, and the vectors x dk and β are defined as in Eq (1). The design mean squared error for the estimator in Eq (4) is the sum of the variance and the squared bias. The Prasad-Rao prediction mean squared error estimator (Rao and Molina, 2015) for finite populations is (6) where the terms g 1d , g 2d , g 3d , and g 4d are functions of the variance components The terms g 1d and g 2d include the shrinkage factor The matrix X contains the sampled values of the auxiliary variables, and the vectors x d and x Ã d contain the area-specific means for the sampled and the non-sampled x-values. The variance-covariance matrix V = V(y) has a block diagonal form, with the blocks V d defined as (Meza and Lahiri, 2005): where ρ is defined in Eq.
(3), I n d is the n d Â n d identity matrix, and J n d is the n d Â n d matrix, whose all entries are equal to 1. The term g 3d contains the asymptotic variances Vðσ 2 v Þ and Vðσ 2 e Þ, and the asymptotic covariance Covðσ 2 e ;σ 2 v Þ. If these parameters are estimated by restricted maximum likelihood, the estimator in Eq. (6) is approximately unbiased (Nissinen, 2009). The area-specific mean squared error estimates are obtained when the variance parameter estimates are inserted into Eq. (7). Nissinen (2009) states that the term g 1d contributes for 85-90% of the estimated mean squared error, that the proportion of g 4d is seldom over 1%, that the proportion of g 2d is between 4% and 6%, and that the proportion of g 3d is between 6% and 10%. We obtained similar percentages in our simulations. The high proportion of g 1d indicates that the variation in the area estimates is mostly related to the uncertainty about the area effects (Nissinen, 2009).
The proposed allocation in Eq. (17) uses three terms of the mean squared error estimator in Eq. (6). The term g 2d is excluded because of its small proportion of the estimated mean squared error and because it involves complex matrix operations and auxiliary variables, whose values depend on the sample.

2.2.
Model-based three-term Pareto method allocation using multiobjective optimization A sample allocation is often based on the solution of an optimization problem subject to given restrictions. It is related to the sample design, the variance, the mean squared error, and the coefficient of variation of the estimator.
Our allocation uses the approximation of the mean squared error (amse) in Eq. (6): Eq. (10) contains the fixed area sizes N d , the area sample sizes n d to be found by optimization, and the unknown variance and covariance parameters. Their values are estimated through sample simulations drawn from the register of proxy data (section 1), on the basis of auxiliary variables. The estimates of the variance and covariance parameters depend on the sample. The means of their sample estimates are inserted into Eq. (10). The sum of the area-specific approximations in Eq. (10).
is an approximation for the mean squared error estimator of the population total estimatorŶ Eblup ¼ P D d¼1Ŷ d; Eblup . The design-based direct estimatorŶ d ¼ N d y d ( y d is the sample mean) is the estimator for the area total Y d andŶ ¼ P d N d y d is the estimator for the population total Y. The coefficients of variation (CV) of these estimators are In the model-based estimation, the mean squared error replaces the variance, and in accordance with the design-based estimation, the approximate coefficient of variation (ACV) for the area total estimatesŶ d; Eblup and the population total estimateŶ Eblup are: where Y d and Y are obtained from the variable of interest in the proxy data. We denote this variable "y*." This allocation should provide the optimal accuracy both on area and population levels. This is the reason why the optimal area sample sizes result from a multi-objective optimization, yielding the minimal approximate population coefficient of variation and the minimal mean of approximate coefficients of variation over areas. For multi-objective optimization, there may exist several solutions, called Pareto optimal solutions, where none of the objectives can be improved without impairing another one (Miettinen, 1999). In this case, the Pareto optimal solutions are such that smaller values for the approximate population coefficient of variation cannot be obtained without letting the mean of the approximate coefficient of variation over areas increase, and conversely. For two objectives, the Pareto front consisting of all optimal solutions is a curve in the two-dimensional objective space. Then all solutions on the Pareto front are candidates for the final solution, in the absence of information on preference. A multi-objective optimization problem is solved either by approximating the whole Pareto front or by identifying a preferred solution from the Pareto front. In the first alternative, a set of Pareto optimal solutions is generated through optimization. It approximates the whole set, which can be infinite, of Pareto optimal solutions. In the second alternative, we take account of information on preference in the optimization and identify a Pareto-optimal solution as close as possible to this information. We develop both alternatives. The functions to be optimized are too complicated to yield closed-form solutions, so that a nonlinear numerical optimization method is mandatory. The area sample sizes are the variables in the multi-objective optimization subject to the constraints To approximate the Pareto front, we use the ε-constraint method (Miettinen, 1999), where one objective is minimized while the other one is converted into a constraint with a fixed upper bound ε. The solutions on the Pareto front are then obtained by solving the resulting single objective optimization problems, where we use different values for the upper bound ε. If the resulting single objective problems are not convex, then the globally optimal solutions may be intractable and we resort to an appropriate single objective optimization method. If the solutions are only locally Pareto optimal, they are Pareto optimal in some neighborhood of the solution. We use the ε-constraint method also in the nonlinear programming allocation (section 3.3), because it corresponds to a multiobjective minimization of the overall sample size n, of the coefficient of variation for each area, and of the coefficient of variation for the whole population. This problem includes D + 2 objective functions. Figure 1 shows an example of the approximated Pareto front, where the approximate population coefficient of variation is minimal under the constraints of 48 upper bounds for the approximate mean coefficient of variation over areas, corresponding to 48 Pareto optimal solutions (denoted by the star symbols). Each solution represents an allocation with corresponding area sample sizes. The Pareto front allows the selection of the allocation. It shows the trade-offs between the two objectives.
The second alternative is to use preference information for identifying the preferred trade-off, without computing all Pareto optimal solutions. We have used the method of global criterion (Miettinen, 1999). The principle is to minimize the distance to the vector whose components are the optimal values for each objective. First we compute the minimum of the approximate population coefficient of variation in Eq. (13), subject to the constraints of Eq. (14). The mean approximate coefficient of variation over the areas is ignored in this optimization. Second, we compute the minimal mean over the areas: Figure 1. The approximated Pareto front minimizing the mean of approximate coefficients of variation over areas and of the approximate population coefficient of variation. The label "Optimum" denotes the Pareto optimal solution.
subject to the constraints of Eq. (14), while ignoring the approximate population coefficient of variation. The resulting area sample sizes in these two optimizations are only by-products. These two minima form the ideal objective vector and are denoted Min pop ¼ minðACVðŶ Eblup ÞÞ; Min are ¼ minðMACVÞ; subject to constraints of Eq. (14).
We set the initial values on the area sample sizes n_dn d and minimize the sum of squares subject to the constraints of Eq. (14). We obtain the preferred area sample sizes. The solution of Eq. (17) is a trade-off between the estimation efficiencies at the area and at the population levels. Figure 1 shows the solution obtained by using this allocation, which belongs to the Pareto front and is the closest to the objective vector. The dotted lines indicate the values of the vector constituting the objective. The accuracy at the population level improves to the detriment of accuracy at the area level. The optimal allocation corresponding to the three-term Pareto method allocation has a minimal distance to the objective vector. We use the Excel Solver with the option "generalized reduced gradient nonlinear" to provide the full Pareto optimal solutions to the single objective optimization problems.

Model-assisted Molefe and Clark´s allocation
Molefe and Clark (2015) have developed an allocation based on a composite estimator for estimating the area-specific means of the variable of interest. A simple random sample of n d units is selected from each stratum d = 1, . . ., D, defined by small areas and containing N d units. The relative size of the area d is combines a synthetic estimator b Y dðsynÞ ¼β 0 X d , whereβ is the coefficient in the regression Eq. (18) and X d the vector of area-specific means of auxiliary variables, and a direct estimator whereβ and X d are the same as in the estimator in Eq. (18), and y d and x d are the sample means of the variable of interest and of auxiliary variables in the area d. The coefficients ϕ d minimize the design mean squared error of the estimator in Eq. (18). Under the conditions given by Molefe and Clark (2015), the approximate design-based mean squared error estimator of Eq. (18) is where v dðsynÞ is the sampling variance of the synthetic estimator b Y dðsynÞ . The denoting the approximate design-based expectation ofβ. Molefe and Clark (2015) assume a two-level linear model ξ, conditional on the values of the auxiliary variables x, with uncorrelated stratum random effects u d and unit residuals ε i : where i refers to the unit i in the stratum d. This model implies that the areaspecific variance of the variable of interest according to Eq.
and holds for all population units. The covariance of the y-values between the two units i and j ≠ i is cov ðy i ; y j Þ ¼ ρ d σ 2 d for units in the same stratum and zero otherwise, where is the intra-class correlation in the area d. Molefe and Clark (2015) assume that the areas have a common intra-class correlation ρ d ¼ ρ for all d. The ratio of the between-area variation to the total variation of y is constant. After computing the optimal weight ϕ d in Eq. (19), we obtain the approximate optimal anticipated mean squared error: The criterion F using anticipated mean squared errors of the small-area mean and the overall mean estimators for the model-assisted allocation has the approximative form: The optimal area sample sizes minimize Eq. (23) subject to P D d¼1 n d ¼ n, and the solution is consistent with Longford (2006). The weight N q d reflects the inferential priority for area d, with 0 ≤ q ≤ 2 and N ðqÞ þ ¼ The quantity G is a relative priority coefficient at the population level. When G is null, we focus on area-level estimation. The larger G, the less important the area-level estimation. The values of q and G depend on these priorities.
When the population estimation has no priority (G = 0) and the cost of the survey are fixed, the minimization of Eq. (23) with respect to n d has the unique solution In Eq. (23) and (24), both the intra-class correlation ρ and the area-specific standard deviation σ d of the variable of interest y are unknown. We replace the intra-class correlation ρ by the adjusted homogeneity coefficient obtained from the proxy variable of interest y*: where MSW is the mean sum of squares of areas, provided by a one-way analysis of variance between the areas in the proxy population, and S 2 y Ã is the variance of y*. We replace the parameter σ d by the standard deviation of the proxy variable y* in the area d.
The reason for both replacements is the link between y and y*. The allocation favors large areas with large variances of y*: the higher the value of the constant q, the more likely the occurrence of negative sample sizes for small areas with small variances. Also, if the population estimate has a strictly positive priority G, then F in Eq. (23) must be minimized numerically; theoretical values of q and G are out of reach.

Model-free reference allocations
One of the model-free reference allocations, equal allocation, uses only numberbased information. Others use both number-based and parameter-based information on the variable of interest, which is unknown and is replaced by a proxy variable y*. It can be the same variable obtained from an earlier research of the same subject. An auxiliary variable correlated with the variable of interest also can serve as a proxy variable if its area characteristics are available. We introduce these model-free allocations in sections 3.1-3.4. Table 2 shows the summary details of these allocations.

Equal allocation
In equal allocation, the sample size is The expression of this allocation in Eq. (26) includes neither the area-specific characteristics nor the between-area variation. It may perform well at the area level, but may lead to poor estimates for very large areas and for the population size. The total sample size n should be an integer multiple of the total number of areas D. The minimal overall sample size n = 2D allows the unbiased estimation of area-specific sampling variances.

The Costa allocation
Costa et al. (2004) introduce the convex combination of proportional and equal allocations, where 0 ≤ k ≤ 1. The equal allocation corresponds to k = 0 and the proportional allocation to k = 1. The equal allocation at the area level and the proportional allocation at the population level perform satisfactorily. The choice of k depends on the wished qualities of estimates at each level. The design coefficient of where N d is the size of the area d counted in statistical units, S 2 y;d is the variance of y on the area d and Y d the total of y on d, and the sample size n d is defined according to Eq. (27). The area-specific coefficients of variation C d depend on k, because the area-specific totals, the variances, and the area sizes are fixed.
The optimal value for k minimizes the difference subject to the constraints 0 k 1; The idea of this solution is to obtain at least moderately accurate area estimates for the areas and for the population size. We use the area statistics of the proxy variable y*, instead of the unknown variable of interest, and Excel Solver with the option "generalized reduced gradient nonlinear." We insert the optimal value of k from Eq. (29) into Eq. (27) to compute the area-specific sample sizes, rounded to the closest integer.

Allocation using nonlinear programming
The allocation for the design-based direct estimation of area-specific and population means (Choudhry et al., 2012) uses nonlinear programming and the area-specific and population coefficients of variation for the variable of interest: The criterion is the minimization of the overall sample size n ¼ P D d¼1 n d , subject to the fixed upper limits for the coefficients of variation in Eq. (31) and n d ≥ 2. This allocation favors areas with a high coefficient of variation, regardless of the area size N d . Many combinations of upper limits may lead to the same minimum overall sample size. This allocation is also applicable for the total estimatorsŶ d ¼ N d y d andŶ ¼ P D d¼1 N d y d , because CVðŶ d Þ ¼ CVð y d Þ and CVðŶÞ ¼ CVð yÞ under stratified simple random sampling.
Our allocation by nonlinear programming is based on finding the upper limits, which lead to the fixed overall sample size n. We use the area and population statistics of the proxy variable y*, and Excel Solver with the option "generalized reduced gradient nonlinear." 3.4. The allocation using box constraints Tschuprow (1923) and Neyman (1934) introduced the allocation for minimizing the variance for the population total estimatorŶ ¼ P D d¼1 N d y d under stratified simple random sampling. The minimization of Eq. (32) subject to n ¼ P D d¼1 n d leads to the Neyman allocation where the area-specific standard deviations S y;d of the variable of interest or in its absence, of a proxy variable, and the total number of units must be available. This allocation favors large areas with high variation and can lead to area sample sizes n d < 2 or even to over-allocation n d >N d . When n d < 2, the unbiased estimation of the sample variance is impossible. The box-constraint optimal allocation avoids these difficulties, by allowing the control of the sample sizes or of the sampling fractions and the design weights. The allocation minimizes Eq. (32) subject to constraints X D d¼1 n d n; where L d is the lower limit and U d is the upper limit for the sample size of the domain d. The limits are adjusted according to the desired accuracies for the area total estimates, but the choices affect the precision of the population total estimate. The lower limit is L d = 2 and the upper limit U d =N d . We call this allocation "box constraint" (BCO). We use an R program (Gabler, Ganninger, and Münnich, 2012) and the R software (http://www.R-project. org) to compute the sample sizes. Longford (2006) introduces inferential priorities for the areas and for the population. He uses those constraints for deriving sample size allocation schemes for direct, composite, and empirical Bayes estimators. Molefe and Clark´s (2015) reference allocation uses the allocation idea of Longford for a composite estimator, but Longford´s other solutions are not applicable here. Falorsi and Righi (2008) propose a sampling strategy for multi-variate and multi-domain estimation guaranteeing a pre-defined precision for the domain estimators when the overall sample size is small. The point is to collect the sample data by using a multi-stage sampling design based on a balanced sampling technique and on generalized regression. This solution can be extended with indirect small-area estimators, but we cannot apply it, because variables of interest are too many.

Design-based estimation methods for model-free allocations
We apply the three estimation methods to model-free allocations. The design-based Horvitz-Thompson method and the model-assisted generalized regression method use survey weights, which are the inverses of the inclusion probabilities.
The finite population U is composed of D non-overlapping domains or areas, with N d units in each, and P D d¼1 N d ¼ N. A probability sample is drawn from U for estimating the area totals Y d ¼ P N d k¼1 y dk , where y dk is the variable of interest for unit k in area d.
The Horvitz-Thompson estimator for the area total Y d iŝ where n d is the sample size for area d, π dk is the inclusion probability of unit k in area d, and w dk ¼ π À1 dk is the sampling weight for the same unit. The model-assisted generalized regression estimator for the area total Y d iŝ where the predicted valueŷ dk ¼ x 0 dkβ þv d is based on Eq. (1), and π dk is the inclusion probability (Lehtonen et al., 2003). The first part of Eq. (37) is the predicted value for Y d when the assisting model is applied. The predicted valuesŷ dk can be computed, because the unit-level values of the auxiliary variables x are available. The second term protects against model mis-specification (Lehtonen et al., 2003).

Application: Finnish business registers
The estimated parameters are the area and the population totals of the variable of interest, and the overall sample size n is fixed at 216 individuals.

Business registers for sampling and allocations
A national Finnish register of block apartments for sale constitutes the data set. This register is maintained by the private company Alma Mediapartners Ltd. Its customers are real estate agencies. They deposit all the appropriate information about the apartments in this register as soon as they receive an assignment from the owners. The population for sample simulations consists of 21,025 sampling units, which are block apartments for sale, selected from the register. In October 2015, it concerned 18 Finnish provinces, which are treated as areas. The smallest area contains 160 units and the largest one 6,813 units. The variable of interest y measures the price (in thousands €) of the apartment and two auxiliary variables measure the size (in m 2 ) and age (in years) of the apartment.
All allocations except equal allocation are based on the proxy variable y*, which is the price of an apartment in the register of April 2015. This proxy register contains 22,230 apartments for sale in 18 provinces, and the variables are the same as in the sample population. Table A1 in the Appendix contains the sizes N d of the areas, population summary statistics for the variable of interest y, and statistics on the differences between y and y*. The area characteristics of these variables have a wide range. The differences between area sizes, area totals, and area means are mostly negative, in contrast to the differences in area standard deviations and coefficients of variation. This indicates a slight increase in the variation of the prices from April to October 2015.
Table A2 in the Appendix shows the population statistics for the auxiliary variables and correlations between the variables. The betweenarea variations of the auxiliary variables are very small (1.7% for size and 3.9% for age of total variation, according to a one-way analysis of variance), which means that the allocations cannot be based on the present auxiliary variables. The province of Uusimaa (near Helsinki) is a dominating area, because it contains the largest number of apartments (32.4% of the population) and the price level there is the highest among all provinces. The variable of interest has a strong positive correlation with the size of the apartment except for one small area, and a negative correlation with the age of apartment except for the largest area. The auxiliary variables are not correlated to one another. The area-specific changes between the correlations (Table A3 in Appendix) are small, except between auxiliary variables for some areas.
Considering the reported changes in the variables between the business registers in April and October 2015, we consider the structures of these registers to be sufficiently similar. This justifies our using the register of April 2015 as the population, which provides the data for computing the allocation-specific sample sizes.

Allocations
The small overall sample size (n = 216, sampling ratio 1.0%) is a key feature in our procedure. The proxy variable y* replaces the variable of interest in the model-free allocations using area parameters. The implementation of the Excel Solver with the option "nonlinear generalized reduced gradient" yielded a weight of 0.3528 for k used in the Costa allocation. We use the same Excel option for solving the area sample sizes in the nonlinear programming allocation. The selected limit of 19.01% for the coefficient of variation for areas and the 8.00% limit for the population size lead to an overall sample size 216. The adjusted homogeneity coefficient of 0.1697 computed with the proxy variable y* replaces the unknown intra-class correlation in the Molefe and Clark allocation. The low value 0.25 for the constant q and zero for the quantity G in this allocation avoid the concentration of sampling units in a single area (here the province of Uusimaa). The three-term Pareto method allocation is based on simulations and multi-objective optimization. We estimated the unknown variance and covariance parameters in Eq. (7) using the 1,500 simulated simple random samples drawn from the proxy data register, before running the actual allocation-specific simulations. The minimum value of 3.74% for the approximate population coefficient of variation and the minimum value of 22.33% for the mean approximate coefficient of variation over the areas result from the first optimization in Eq. (16). The solution of the optimization criteria in Eq. (17) yields the area sample sizes.
The area sample sizes (Table 3) vary much between the allocations. The largest area, the province of Uusimaa, dominates in two allocations. For the box-constraint allocation, this area contributes for almost 60% of the overall sample size. Four of the smallest areas have a sample size of 2, which allow the computation of standard errors for the area total estimates in the design-based estimation. The other allocations contain no very small area-specific sample sizes. The structures of the four other allocations have common features. The three-term Pareto method allocation favors the smallest areas and one larger area (the province of Kymenlaakso). It favors less one area (the province of North Karelia). The sample sizes for the Costa allocation are concordant with the area sizes. The nonlinear programming allocation favors areas with a high coefficient of variation, which is characteristic of this allocation.

Comparison of the allocations
The results are based on design-based simulation experiments. For each allocation, we simulated r = 1,500 independent stratified simple random samples and estimated the area totals, variance parameters, mean-squared error approximations, and the allocation-specific quality measures (relative root mean square error and absolute relative bias), using the SAS software (www.sas.com/en_us/home.html) or the IBM SPSS software (www.ibm.com/analytics/data-science/predictive-analysis/spss-statisticalsoftware). We computed design-based Horvitz-Thompson and modelassisted regression estimates for the model-free allocations and modelbased empirical best linear unbiased predictorestimates than those estimates for every allocation. We compare the allocations, combined with estimators, on the basis of the accuracy and bias, which we measure with the relative root mean square error and absolute relative bias. We compute these quantities, in percent, as sample-based approximations of the expressions in Eq. (5).
The area-specific relative root mean square error and the absolute relative bias in percent are whereŶ di is the design-or the model-based estimate of the area total Y d for the simulated sample i = 1,. . ., r. Their means over D areas, in percent, are: The sumŶ i ¼ P D d¼1Ŷ di is the estimate for the population total in the sample i = 1,. . ., r. The relative root mean square error for the population total, in percent, is where Y is the true value of the population total, and the corresponding absolute relative bias, in percent, is We evaluate two measures of quality: the mean over the areas and the mean at the population level. Tables 8 and 9 in the Appendix show the values for these measures at the area and at the population levels. Figure 2 shows the means of area-specific relative root mean square errors and population relative root mean square errors for each combination of allocation and estimation method. The model-based estimation by empirical best linear unbiased predictor leads to more accurate area estimates than those obtained from the design-based estimation (Horvitz-Thompson and generalized regression), whatever the estimation method among the three ones applied to whichever of the four model-free allocations. The population values among these allocations are the lowest ones for the model-assisted regression estimate. The relative root mean square errors obtained with the equal allocation contrast from those obtained with the box-constraint allocation. The equal allocation has the lowest mean over areas (12.3%) and the highest population value (12.2%) for the estimation by empirical best linear unbiased predictor. The boxconstraint allocation performs satisfactorily at the population level, as expected (between 5.0 and 5.6%, depending on the estimation method), but poorly at the area level (mean between 22.3% and 40.6%). The highest mean is obtained for the model-assisted regression estimation, in contrast with other model-free allocations. At the population level, the smallest value is for the Molefe and Clark allocation (5.1%). The allocations provided either by the three-term Pareto method, the Costa method, or by nonlinear programming are good trade-offs, provided the criterion is accurate enough at both the area and at the population levels. No allocation has an optimal accuracy at both levels at the same time. Figure 1 shows the trade-offs for the area and for the population levels, in the shape of the approximated Pareto front of the bi-objective optimization..
In Figure 3, the distributions of the area-specific relative root mean square errors for each allocation show the relative variation of the area total estimates and the presence of randomness in the simulated samples. The modelfree allocations are more accurate with the model-based estimation. Randomness is the smallest in the three-term Pareto method allocation (lowest median and range of values without outliers). The nonlinear programming allocation has the smallest area as an outlier. The means over the areas of these three allocations are close to each other (Figure 2), although they come from different area-specific distributions. The equal allocation has the lowest median, but a narrow range of variation, and a single outlier (23.4%) for the largest area, which is the province of Uusimaa. This is a difficulty inherent in this allocation. The area estimates in the box-constraint allocation are the least accurate, regardless of the estimation method. The model-assisted regression estimation is the least accurate. The empirical best linear unbiased predictor estimates of the four areas, where the sample size is 2 in the box-constraint allocation, have high relative root mean square errors, except for the province of Ostrobothnia (14.4%, close to the median). The model-based estimation then can produce at least moderately accurate estimates for a single area, in spite of a small sample size. Table A5 in the Appendix shows the simulation biases for the designbased estimates. As expected, these estimates are almost unbiased. The areaspecific biases of the Horvitz-Thompson and of the regression estimates are under 2%, except for three areas in the box-constraint allocation. The areaspecific bias distributions for each allocation (Figure 4) demonstrate the similarity between accuracy and bias in the case of the estimation by empirical best linear unbiased predictor. As for the distributions of the relative root mean square errors, the model-based three-term Pareto method allocation has the narrowest range and is the only allocation with biases under 10%. In the distribution of the equal allocation, the upper quartile is under 4%, but four outliers appear, including the largest area (almost 15%). The distributions of the Costa and of the nonlinear programming allocations are similar, ranging to over 15%. Molefe and Clark´s and the box-constraint allocations are the most dispersed. The contrast between the equal and the box-constraint allocations is similar for the biases and for the relative root mean square errors. The three-term Pareto method, the Costa, and the nonlinear programming allocations with moderately low biases on both levels are satisfactory trade-offs. The population estimate is almost unbiased for Molefe and Clark´s allocation (1.2%), but most of the area estimates are seriously biased, regardless of the sample size. Five areas have important biases for most of the allocations, which indicates that the model is not up the task. Table 4 presents the allocation-specific means over the areas, the population values, and their sums, for the relative root mean square errors and the relative biases. The aggregate relative root mean square errors are the lowest for the empirical best linear unbiased predictor estimates, except for the equal and the box-constraints allocations. The Horvitz-Thompson estimates are less accurate. The model-assisted regression estimates are more accurate than the Horvitz-Thompson ones, except for the box-constraint allocation, which is high (45.6%). The Horvitz-Thompson and the regression estimates are almost unbiased for the model-free allocations, but the box-constraint allocation is an exception. For the empirical best linear unbiased predictor estimates, the three-term Pareto method, Molefe and Clark´s, Costa´s, and the nonlinear programming allocations have the smallest aggregate biases, which are close to each other; the box-constraint allocation has the largest aggregate bias.
We evaluate the allocations by integrating the aggregate values for the relative root mean square error and the absolute relative bias. The modelassisted regression estimates of Costa´s, of the nonlinear programming, and of the equal allocations have the smallest values (24.5%, 25.6%, and 27.1%). The three-term Pareto method allocation has the second smallest value (28.1%), which includes a high aggregate bias. The aggregate values indicate that the estimation by model-assisted regression performs the best for the three model-free allocations, although the result is not supported by the areaspecific relative root mean square errors (Table A4 in Appendix). The box constraint and the equal allocations are extreme, in the sense that they are either strongly or not at all associated with the area sizes. These solutions lead to satisfactory estimates only at either the population or the area. The three-term Pareto method, Costa´s, and the nonlinear programming allocations take both the between-area and the within-area variations into account. They perform well at both levels, when the model is included. The three-term Pareto method and Costa´s allocations do not use fixed priorities or limits for the area-level and the population-level estimations, unlike the nonlinear programming and Molefe and Clark´s allocations.
For small areas, the model-based estimation produces area estimates of moderate accuracy, despite a small sample size (provinces of North Karelia and Ostrobothnia). Large sample sizes, however, do not guarantee high accuracy (provinces of Satakunta, Kymenlaakso, and Kainuu). The accuracy of the area estimates seems to be related to the area-specific means and to the coefficients of variation of the variables. Large deviations from the corresponding the population statistics may bias the estimation of the area totals. The skewness of the variable of interest usually confuses the empirical best linear unbiased predictor estimation, as the important biases for some areas indicate.
We examined the validity of the unit-level linear mixed model in Eq. (1) by testing the null hypothesis that the error terms v d and e dk are normally distributed.
We computed the transformed residuals ðy dk Àτ d y d Þ À ðx dk Àτ d x d Þ 0 β, whereτ d ¼ 1 À ð1 Àγ d Þ 1 2 and the factorγ d is defined in Eq. (8) (Rao and Molina, 2015). Under the null hypothesis, the residuals are approximately identically and independently distributed as Table 4. Means over areas, population values, and aggregate values for quality measures (in percent), by allocation. Estimation methods for model-free allocations: 1 = Horvitz-Thompson, 2 = regression estimation, and 3 = empirical best linear unbiased predictor. Nð0;σ 2 e Þ. We applied the test to a simple random sample, of n = 5,000 individuals, selected from the population. We took σ 2 v = 1,570 and σ 2 e = 17,550. The Shapiro-Wilks test yielded a p-value of 0.00, leading us to reject the null hypothesis. We also computed the allocation-specific means for the variance parameters, and the regression coefficients and the area effects of the area total estimator in Eq. (4), for the simulated samples. The means for Molefe and Clark´s and the box-constraint allocations differ from those for the other allocations. Our model has deficiencies when its parameters are estimated by generalized least-squares or by restricted maximum likelihood. It is possible, before the estimation phase, to make the distribution of the variable of interest more symmetric by an algebraic transformation such as the lognormal method, but we have not done that.

Conclusion
We compared six allocation methods in stratified sampling, when applying the model-based estimation and the design-based estimation for obtaining area and population estimates. The fixed and small total sample size is a common restriction. Our three-term Pareto method allocation uses auxiliary information, the model, and an estimation method. Accuracy at both the area and at the population levels are optimized, which requires multi-objective optimization techniques. We chose the reference allocations on the basis of the variety of information, which the allocations use: model and estimator, optimization criteria, fixed limits or priorities, and auxiliary information. The allocation-specific area sample sizes are various. The sample is concentrated on the largest area for two allocations, a situation which may lead to inaccurate and biased estimates for small areas.
We computed the area sample sizes for five allocations using the previous register data, because the auxiliary variables are insufficient to support the allocations. The distance between apartment and town center has a predictive power, but it is not available.
We applied design-and model-based estimations and evaluated the allocations in terms of accuracy and bias obtained from design-based sample simulations. We confirm that, in this survey framework, the model-based estimates are more accurate than the design-based estimates. The "borrowing strength" principle may be significant in surveys where some areas have too small sample sizes to allow direct estimates of satisfactory quality. The model-free allocations have similar performance structures at different levels, regardless of the estimation method.
The studied allocations have all pros and cons, depending on the estimation level (area and population). Considering the aggregate values, the empirical best linear unbiased predictor estimates for the three-term Pareto method, the Costa, and the nonlinear programming allocations are the most accurate. The randomness associated with the area estimates is best controlled in the three-term Pareto method allocation, from the viewpoint of the area-specific distributions of relative root mean square errors.
The bias results for the empirical best linear unbiased predictor estimates demonstrate that the allocations have very different performances. The threeterm Pareto method and the Costa allocations perform better, with respect to aggregate values and area-specific distributions.
By considering accuracy and bias, we showed that the Costa, the nonlinear programming, and the equal allocations under model-assisted regression estimation perform the best, and that the three-term Pareto method allocation performs almost as well. This comes from the fact that the design-based estimates are almost unbiased, but that many of these estimates are inaccurate. The model-based estimation suffers from an important bias, leading to try methods likely to improve accuracy and reduce bias. The appropriate software is also necessary.
Getting a well-performing allocation is not an easy task; it is very casespecific and depends on the objectives of a survey and on the availability of auxiliary information. Accurate estimates, both at the area and at the population levels, are made obtainable by multi-objective optimization. The model and the estimation method have become part of the sampling design.
The first trade-off is between the quality of the area estimates and the quality of population estimates. We showed the impossibility of obtaining maximum quality at both levels simultaneously. The fixed priorities or limits at the area and at the population levels, which some allocations use, do not guarantee the maximum quality.
The second trade-off is between accuracy and bias of the estimates. Modelbased estimators are usually more accurate than design-based estimators when the sample size is small, but model-based estimators may be importantly biased. The sample allocation affects accuracy and bias, but the increment of the area sample size does not correct the bias entirely. This trade-off appears commonly in the literature, but the discussion has seldom concerned the priorities of these measures.  Table A2. Population summary statistics of the auxiliary variables "size" (m 2 ) and "age" (years) and correlations between variables in the business register in October 2015.  Table A4. Relative root mean square errors (in percent) for areas and population, by allocation. Estimation methods for model-free allocations: 1 = Horvitz-Thompson, 2 = regression estimation, 3 = empirical best linear unbiased predictor (EBLUP).

Model-related
Model-free