Study design in causal models

The causal assumptions, the study design and the data are the elements required for scientific inference in empirical research. The research is adequately communicated only if all of these elements and their relations are described precisely. Causal models with design describe the study design and the missing data mechanism together with the causal structure and allow the direct application of causal calculus in the estimation of the causal effects. The flow of the study is visualized by ordering the nodes of the causal diagram in two dimensions by their causal order and the time of the observation. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph. Causal models with design offer a systematic and unifying view of scientific inference and increase the clarity and speed of communication. Examples of causal models for a case-control study, a nested case-control study, a clinical trial and a two-stage case-cohort study are presented.


Introduction
Causal models are commonly used to describe the true or hypothesized causal relationships between a set of variables. The model is typically presented as a directed acyclic graph, where the nodes represent the variables and the edges represent the causal relationships so that the arrow shows the direction of the effect. A graphical model serves as a tool for visualizing and discussing causal relationships, but even more importantly, it is a mathematically well-defined object from which causal conclusions can be drawn in a systematic way. Causal calculus (Pearl, 1995, 2009) can be used to estimate causal effects from observational data provided that the study has been carefully designed (Rubin, 2008).
Causal models are not sufficient for the estimation of causal effects without the data. After specifying the causal model and the objectives of the study, the first questions of the researcher should be 'How should the data be collected?' and 'How should the data collection be taken into account in the analysis?' (Heckman, 1979; Rosenbaum & Rubin, 1983). In many fields of science, the data are not obtained as a simple random sample of the population. The pressure of cost-efficiency leads to complex study designs where the expensive measurements are made only for a carefully selected subset of individuals (Reilly, 1996; Van Gestel et al., 2000; McNamee, 2002; Kulathinal et al., 2007; Langholz, 2007; Karvanen et al., 2009a). It is therefore crucial to take the study design into account in the estimation of causal effects. The increased complexity of study designs also emphasizes the need for accurate and efficient reporting (von Elm et al., 2007; Moher et al., 2010; Schulz et al., 2010).
An introduction to causal models with design is given through an example in Section 2. The formal definition of the concept is then presented in Section 3. In Section 4, it is shown how the causal effects can be estimated from a case-control study. Examples describing a clinical trial, a nested case-control study and a two-stage case-cohort study as causal models with design are provided in Section 5. Finally, the benefits, the limitations and the implications of the proposed concept are discussed in Section 6.

Scand J Statist 42

Fig. 1. Graphical models for the example on the causal effect of smoking on lung cancer.

Introductory example
Pearl (2009) considers an example where the causal effect of smoking $X$ on lung cancer $Y$ is studied. It is assumed that the causal effect is mediated through the tar deposits in the lungs $Z$. In addition, there might be an unknown confounder $U$, which has a causal effect on both $X$ and $Y$, but not on $Z$. Figure 1A illustrates the causal model.
In numerical calculations, Pearl implicitly assumes that the data are obtained as a simple random sample from the population. This assumption is made explicit in Fig. 1B. Variable $m_{\Omega i}$, where subscript $i$ indexes the individuals, represents an indicator for a finite, well-defined, closed population $\Omega = \{1, \ldots, N\}$. It is defined as $m_{\Omega i} = 1$ for $i \in \Omega$ and $m_{\Omega i} = 0$ for $i \notin \Omega$. Variable $m_{1i}$ represents the sampling. This indicator variable has a value of 1 if the individual was selected to the sample and 0 otherwise. The arrow from $m_{\Omega i}$ to $m_{1i}$ describes the fact that the sample is selected from the population; that is, $m_{1i} = 1$ implies $m_{\Omega i} = 1$. The value of $m_{1i}$ can be determined by the researcher, which is shown in the graph by using diamond symbols for the nodes.
Variables $X_i$, $Z_i$ and $Y_i$ are related to the underlying population and are not directly observed, which is shown in the visualization with the open circles. Instead, the variables $X_i^*$, $Z_i^*$ and $Y_i^*$ are measured from the sample. Because these variables are observed, they are shown as filled circles. The value of $Y_i^*$ is $Y_i$ if the individual belongs to the sample, that is, if $m_{1i} = 1$; otherwise, $Y_i^*$ is not available. This is described in the graph by arrows from $m_{1i}$ and $Y_i$ to $Y_i^*$. In other words, the causal assumptions, the study design and the data are all presented in the same graph where the causal effects are defined consistently regardless of the type of the variable.
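The relation between a causal variable, its selection indicator and the resulting data node can be sketched in a few lines of Python. This is only an illustration of the simple random sampling set-up described above; the function names, the toy population and the use of `None` as the missing-value marker are assumptions made for the example and do not come from the paper.

```python
import random

NA = None  # assumed marker for "not available"


def data_node(causal_value, selected):
    """Value of a data node: the causal value if selected, otherwise NA."""
    return causal_value if selected == 1 else NA


def simple_random_sample(population_size, sample_size, seed=0):
    """Return the selection indicators m_1i of a simple random sample."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(population_size), sample_size))
    return [1 if i in chosen else 0 for i in range(population_size)]


# Hypothetical population of N = 10 individuals with a binary outcome Y_i.
Y = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
m1 = simple_random_sample(len(Y), sample_size=4)

# The observed data Y*_i equal Y_i inside the sample and NA outside it.
Y_star = [data_node(y, m) for y, m in zip(Y, m1)]
```

Here the indicator `m1` is chosen by the researcher, which corresponds to a node of the 'determined' type, while `Y_star` is a deterministic function of `Y` and `m1`.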
Instead of simple random sampling, case-control designs are often used in epidemiology to study rare diseases. Figure 1C represents a case-control design where the selection for the risk factor measurement is made on the basis of the lung cancer status. In practice, for instance, 1000 lung cancer cases and 1000 non-cases are selected. The lung cancer status $Y_i^*$ is determined for the sample $\{i : m_{1i} = 1\}$. Smoking $X_i^*$ and tar deposits $Z_i^*$ are measured for the case-control set $\{i : m_{2i} = 1\}$. In the graph, there are arrows from $m_{1i}$ and from $Y_i^*$ to $m_{2i}$, which indicate that the selection for the case-control set depends on the measured lung cancer status.
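A minimal sketch of such outcome-dependent selection is given below. It assumes a hypothetical first-stage sample in which the measured status is available for 950 individuals; the function name, the sample sizes and the prevalence are illustrative choices, not values from the paper.

```python
import random


def case_control_selection(y_star, n_cases, n_controls, seed=0):
    """Return m_2i: pick n_cases with Y* = 1 and n_controls with Y* = 0.

    Individuals whose status is missing (None) can never be selected,
    because the selection depends on the measured status only.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(y_star) if y == 1]
    controls = [i for i, y in enumerate(y_star) if y == 0]
    chosen = set(rng.sample(cases, n_cases)) | set(rng.sample(controls, n_controls))
    return [1 if i in chosen else 0 for i in range(len(y_star))]


# Hypothetical measured status: 50 cases, 900 non-cases, 50 not in the sample.
y_star = [1] * 50 + [0] * 900 + [None] * 50
m2 = case_control_selection(y_star, n_cases=20, n_controls=20)

selected = [y for y, m in zip(y_star, m2) if m == 1]
share_of_cases_selected = sum(selected) / len(selected)  # fixed by the design
share_of_cases_observed = 50 / 950                       # first-stage sample
```

The share of cases in the case-control set is fixed by the design and no longer reflects the share in the first-stage sample, which is why the selection mechanism has to enter the analysis.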
It is well known that the study design must be taken into account in the analysis of the data from the case-control design. This means that although Fig. 1A presents the causal model for both situations (b) and (c), the analysis of the case-control study (c) differs from the analysis of the simple random sample (b). This difference is made explicit by combining the study design with the causal model. As these causal models with design are causal models, the actual estimation of causal effects can be carried out by applying the rules of causal calculus as demonstrated in Section 4.

Causal models with design
The formal definition of causal models with design relies on the definition of causal models as presented by Pearl (2009) and the missing-data concept presented by Rubin (1976). The definition of causal models is extended to reflect the elements of inference: the causal assumptions, the study design and the data. The immediate benefit is that the methods of causal calculus are directly applicable to questions related to the study design and estimation. Graphical models with an explicit sampling or selection mechanism have been used earlier by Cooper (2000), Geneletti et al. (2009), Didelez et al. (2010) and Bareinboim & Pearl (2012b).
Causal model and probabilistic causal model are defined by Pearl (2009) as follows.
Definition 1 (causal model, Pearl 7.1.1). A causal model is a triple $M = \langle U, V, F \rangle$, where (i) $U$ is a set of background variables (also called exogenous) that are determined by factors outside the model; (ii) $V$ is a set $\{V_1, V_2, \ldots, V_n\}$ of variables, called endogenous, that are determined by variables in the model, that is, variables in $U \cup V$; and (iii) $F$ is a set of functions $\{f_1, f_2, \ldots, f_n\}$ such that each $f_i$ is a mapping from (the respective domains of) $U_i \cup PA_i$ to $V_i$, where $U_i \subseteq U$ and $PA_i \subseteq V \setminus V_i$, and the entire set $F$ forms a mapping from $U$ to $V$. In other words, each $f_i$ in $v_i = f_i(pa_i, u_i)$, $i = 1, \ldots, n$, assigns a value to $V_i$ that depends on (the values of) a select set of variables in $V \cup U$, and the entire set $F$ has a unique solution $V(u)$.
Definition 2 (probabilistic causal model, Pearl 7.1.6). A probabilistic causal model is a pair $\langle M, P(u) \rangle$, where $M$ is a causal model and $P(u)$ is a probability function defined over the domain of $U$.
The causal diagram $G(M)$ of a causal model $M$ is a directed graph where each node corresponds to a variable and the directed edges point from members of $PA_i$ and $U_i$ towards $V_i$.

A causal model with design can be defined as an extension of the probabilistic causal model presented by Pearl where the notation for selection and missing data follows the lines of Rubin (1976):
Definition 3 (causal model with design). A causal model with design is a probabilistic causal model that fulfils the following conditions:
(1) Each node in the causal diagram is either a causal node, a selection node or a data node. Each node has an information-type attribute with possible values: 'observed', 'not observed', 'determined and known' and 'determined and unknown'.
(2) Each selection node represents a binary variable with the possible values 1 and 0. There is always a unique selection node $M_\Omega$ (population node), which is an ancestor of all other selection nodes and has a value of $M_\Omega = 1$.
(3) Each data node has two parents, one causal node and one selection node. A causal node cannot be a parent for more than one data node. For a data node $X^*$ with parents causal node $X$ and selection node $M$, it holds that
$$X^* = \begin{cases} X, & \text{if } M = 1, \\ \mathrm{NA}, & \text{if } M = 0, \end{cases}$$
where NA represents a missing value.
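The structural conditions (2) and (3) of Definition 3 can be checked mechanically once a diagram is stored as a list of edges. The sketch below is one possible way to do this; the representation (a dictionary of node types and a list of parent-child pairs), the function names and the node labels are assumptions made for illustration, and the information-type attribute of condition (1) is not modelled.

```python
def parents(edges, node):
    """Return the set of parents of `node` in a list of (parent, child) edges."""
    return {a for a, b in edges if b == node}


def ancestors(edges, node):
    """Return all ancestors of `node` by following edges backwards."""
    found, stack = set(), [node]
    while stack:
        for p in parents(edges, stack.pop()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


def check_design(node_type, edges, population_node):
    """Return a list of violations of conditions (2) and (3) of Definition 3."""
    problems = []
    data_children = {}
    for node, kind in node_type.items():
        if kind == "selection" and node != population_node:
            if population_node not in ancestors(edges, node):
                problems.append(f"{node}: population node is not an ancestor")
        if kind == "data":
            pars = parents(edges, node)
            if sorted(node_type[p] for p in pars) != ["causal", "selection"]:
                problems.append(f"{node}: needs one causal and one selection parent")
            for p in pars:
                if node_type[p] == "causal":
                    data_children[p] = data_children.get(p, 0) + 1
    for node, count in data_children.items():
        if count > 1:
            problems.append(f"{node}: parent of more than one data node")
    return problems


# The simple random sampling design of Fig. 1B as described in the text.
NODE_TYPE = {
    "U": "causal", "X": "causal", "Z": "causal", "Y": "causal",
    "m_pop": "selection", "m1": "selection",
    "X*": "data", "Z*": "data", "Y*": "data",
}
EDGES = [
    ("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y"), ("m_pop", "m1"),
    ("X", "X*"), ("m1", "X*"), ("Z", "Z*"), ("m1", "Z*"),
    ("Y", "Y*"), ("m1", "Y*"),
]
VIOLATIONS = check_design(NODE_TYPE, EDGES, "m_pop")
```

For the graph above the list of violations is empty; removing, for example, the edge from `m1` to `Y*` makes the check report that `Y*` lacks a selection parent.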
In the first item of definition 3, the node types are named, and the possible values of information-type attributes are listed. The information-type attribute of the variable with the possible values of 'observed', 'not observed', 'determined and known' and 'determined and unknown' describes the knowledge of the researcher. In visualizations, these types are presented as a filled circle, an open circle, a filled diamond and an open diamond, respectively. In an observational set-up, a causal variable $X$ is not observed as such; only the corresponding measurement $X^*$ is observed. In an experimental set-up, the values of some causal variables can be determined by the researcher. Usually, causal variables determined by the researcher are known, but in principle, they can also be unknown if the information on the values set for the variable has been lost after the execution of the experiment. The data are by definition always observed. A selection variable can have all four information types. The value of a selection variable is determined when sampling or other selections are applied to the population. The selection variable can be 'determined and known' or 'determined and unknown'. The latter type, 'determined and unknown', may occur, for instance, when the sample is drawn from an administrative register with personal identifiers, but these are later removed from the data, and the researcher is not able to tell which individuals of the population are present in the sample. When the missing data can be identified as an empty record, the selection variable is observed. If the missing individuals are not identified at all, as is the case in left truncation for instance, the selection variable is not observed.
In the second item of definition 3, the role of the population and the selection variables is specified. Causal assumptions are always made with respect to some finite population known as the study source in epidemiology (Miettinen, 2011). There is always only one population node. If there is more than one conceptual population, the population can be defined as the union of the conceptual populations. The conceptual population, for instance, a geographical area, becomes a causal variable in the model. If the causal mechanisms differ by area, the model contains arrows from the area to the causal nodes where the functions $f_j$ differ between the conceptual populations. This allows defining models where some causal relationships are similar across the areas and some are different. The selection probabilities for the sampling may also differ by area, which is shown in the model by an arrow from the area to the selection node.
The members of the population can be a priori known or unknown. In the former case, the researcher has a unique identifier, for instance, the social security number, available for each member of the population before the study. In the latter case, the researcher identifies the members of the population only when they enter the study. A selection node $M$ induces the subpopulation $\{i \in \Omega \mid M_i = 1\}$, which consists of the selected individuals. The causal effects are typically estimated for the population $\Omega$, but, for instance, in epidemiological cohort studies, the effects are often estimated only for the cohort $\{i \in \Omega \mid M_i = 1\}$, also known as the study base (Miettinen, 2011).
In the third item of definition 3, the relations of the causal variables, the selection variables and the data are specified. The value of random variable $X_i$ is measured only if the individual $i$ is selected to be measured, which is indicated by the selection variable $M_i$. This means that the measured value $X_i^*$ is a random variable that depends on the variables $X_i$ and $M_i$ in a deterministic way. The definition of a univariate random variable is extended so that in addition to a value on the real axis, a random variable may also have a special value 'NA', which indicates missing data. With this definition, all elements of scientific inference can be expressed as random variables and their causal relationships. If a data node or a selection node has a directed path to a causal node, the measurement or the selection has a causal effect on the underlying causal variable. This may be the case, for instance, in health examination studies where participation in the study may increase awareness of a healthy lifestyle and consequently also have an impact on later measurements of health indicators.
In a causal model, the causal effects define a partial ordering between the variables. In addition to this causal time, the time of observation can be linked to each variable in a causal model with design. The causal time and the observational time together define the relative location of each node in a visualization where the causal time is presented on the horizontal axis and the observational time on the vertical axis. To make the visualization more informative, the stages of the study can be used as labels for the vertical axis as is done in the examples of Sections 2 and 5.
Measurement error can be added to a causal model with design by introducing two causal variables: the original variable $X_i$ and the version with measurement error $\tilde{X}_i$. In the graph, there is an arrow from $X_i$ to $\tilde{X}_i$. Both $X_i$ and $\tilde{X}_i$ are unobserved, and only $\tilde{X}_i^*$ is observed for the sample. Variable $X_i$ is usually unobserved unless some kind of benchmark measurements without measurement error are carried out for a subsample. If two variables $X_i$ and $Y_i$ have correlated measurement errors, an explicit unobserved causal variable $U$ is needed to describe the structure of the measurement error. In the graph, there are arrows from $U$ to $\tilde{X}_i$ and $\tilde{Y}_i$, and $\tilde{X}_i^*$ and $\tilde{Y}_i^*$ are observed in the sample.
In causal models with design, sampling and non-response are formally treated in a similar way; the only difference is the type of the selection node, which is 'determined' for sampling and 'observed' for non-response. Some conclusions on the type of missing-data mechanism (Rubin, 1976) can be made directly from the causal model with design. Let $M$ be the selection variable for the measurement $Y^*$ of causal variable $Y$. If there is no (undirected) path from $Y$ to $M$ except through $Y^*$, the data on $Y$ are missing completely at random (MCAR), more precisely, everywhere MCAR (Seaman et al., 2013). If there is an arrow from $Y$ to $M$, the data are missing not at random (MNAR). The traditional MCAR/missing at random/MNAR classification concerns the data as a whole, whereas causal models with design provide a description of the missingness mechanism variable by variable.
Many recent theoretical results on missing data and selection bias in causal inference can be applied to causal models with design. As these results are not defined directly for causal models with design but for other extensions of causal models, transformations are applied as the first step. Mohan et al. (2013) consider estimation when data are MNAR and derive conditions a 'missingness graph' should satisfy to ensure the existence of a consistent estimator for a given probabilistic relation. In order to utilize these results, a causal model with design can be collapsed to a missingness graph by removing the intermediate selection nodes, that is, selection nodes that are not parents of a data node. Formally, this can be defined as follows.
Definition 4 (collapse to a missingness graph). Missingness graph $H$ is a collapse of causal model with design $M$ with causal diagram $G(M)$ if (i) the set of nodes in $H$ consists of the causal nodes of $M$, the data nodes of $M$ and such selection nodes of $M$ that are parents of some data node; and (ii) there exists an edge from node $X$ to node $Y$ in $H$ if there exists an edge from $X$ to $Y$ in $G(M)$ or if $X$ is a causal node and $Y$ is a selection node and there exists a directed path from $X$ to $Y$ in $G(M)$.
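Definition 4 translates almost directly into code. The following sketch reads condition (ii) literally, so any directed path from a causal node to a retained selection node produces an edge; the graph representation, the function names and the node labels are assumptions made for this example, and the case-control design of Fig. 1C is entered as it is described in Section 2.

```python
def descendants(edges, node):
    """Return all nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found


def collapse_to_missingness_graph(node_type, edges):
    """Collapse a causal model with design as in Definition 4."""
    # (i) keep causal nodes, data nodes and selection nodes with a data child
    data_parents = {a for a, b in edges
                    if node_type[a] == "selection" and node_type[b] == "data"}
    kept = {n for n, kind in node_type.items()
            if kind in ("causal", "data") or n in data_parents}
    # (ii) keep original edges between retained nodes ...
    new_edges = {(a, b) for a, b in edges if a in kept and b in kept}
    # ... and add causal -> selection edges for every directed path
    for a in kept:
        if node_type[a] == "causal":
            new_edges |= {(a, b) for b in descendants(edges, a) if b in data_parents}
    return kept, new_edges


# Case-control design of Fig. 1C as described in the text.
NODE_TYPE = {
    "U": "causal", "X": "causal", "Z": "causal", "Y": "causal",
    "m_pop": "selection", "m1": "selection", "m2": "selection",
    "X*": "data", "Z*": "data", "Y*": "data",
}
EDGES = [
    ("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y"),
    ("m_pop", "m1"), ("m1", "Y*"), ("Y", "Y*"),
    ("m1", "m2"), ("Y*", "m2"),
    ("m2", "X*"), ("X", "X*"), ("m2", "Z*"), ("Z", "Z*"),
]
H_NODES, H_EDGES = collapse_to_missingness_graph(NODE_TYPE, EDGES)
```

In this example the population node is dropped because it is not a parent of any data node, and the collapsed graph contains a direct edge from `Y` to `m2`, which makes the outcome-dependent selection visible.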
The results and algorithms by Bareinboim & Pearl (2012b) can be used to mitigate and sometimes to eliminate the selection bias caused by preferential data collection. The results are applicable in the important special case where a single selection node (often marked by $S$) is the parent for all data nodes. In order to apply these results, a causal model with design is first collapsed to a missingness graph, and then the data nodes are removed. The transformed graph contains the selection node $S$ and all causal nodes. The results by Didelez et al. (2010), Geneletti et al. (2009) and Cooper (2000) can also be applied to the same transformed graph.
Bareinboim & Pearl (2013a, 2013b) consider theoretical conditions for the transfer of experimental results from one or several populations to other populations. Causal models with design have only one population, but the transportability results can be used between the conceptual populations. The application of the results and the algorithms by Bareinboim & Pearl (2013a, 2013b) requires that the causal model with design has been collapsed to a selection diagram as follows.
Definition 5 (collapse to a selection diagram for transportability). Selection diagram $H_S$ is a collapse of causal model with design $M$ with respect to a set of selection variables $S$ if (i) the conceptual populations of $M$ are identified by the variables of $S$; (ii) the set of nodes in $H_S$ consists of the causal nodes of $M$; and (iii) there exists an edge from node $X$ to node $Y$ in $H_S$ if there exists an edge from $X$ to $Y$ in $G(M)$ and $Y$ does not belong to $S$.
Other recent developments that can be applied to causal models with design include the results on the testability of counterfactuals (Shpitser & Pearl, 2007) and z-identifiability of surrogate experiments (Bareinboim & Pearl, 2012a).

Estimation of causal effects
The following steps are required to estimate causal effects using causal models with design:
(1) Specify the causal model.
(2) Check the identifiability of the causal effect in the causal model using the results by Tian & Pearl (2002), Shpitser & Pearl (2006a, 2006b) and Bareinboim & Pearl (2012a). If the effect can be identified, use the rules of causal calculus (Pearl, 1995, 2009) to express the causal effect in terms of observed probabilities.
[...]
Causal models with design allow the estimation of causal effects in complex designs using only the rules of causal calculus and the likelihood. This requires, however, that the causal effect can be expressed in terms of observed probabilities (step 2) and the parameters of the likelihood can be estimated (step 5). Even if a causal effect is not identifiable in a general non-parametric form, it may still be identifiable under a specific parametric model. For example, an instrumental variable may help to identify a causal effect in a linear model, but not in a nonlinear model (Pearl, 2009), and the average causal effect in clinical trials with non-compliance can be identified under specific assumptions (Angrist et al., 1996). Even if a causal effect is identifiable in the general non-parametric form, it may not be estimable from the collected data. A well-known example is the MNAR situation where a variable has a causal effect on its selection variable and the estimation is not possible in general without strong assumptions on the selection mechanism (Little & Rubin, 2002).
As an example of the estimation procedure, the smoking and lung cancer example of Section 2 is considered again. The causal model is specified in Fig. 1A (step 1). The goal is to estimate the causal effect $p(y \mid do(X = x))$, where the do-operator represents action or intervention. The result (step 2) is obtained by applying the following three rules of causal calculus (Pearl, 1995, 2009):
(1) Insertion and deletion of observations: $p(y \mid do(x), z, w) = p(y \mid do(x), w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)$ in the graph $G_{\overline{X}}$.
(2) Exchange of action and observation: $p(y \mid do(x), do(z), w) = p(y \mid do(x), z, w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)$ in the graph $G_{\overline{X}\underline{Z}}$.
(3) Insertion and deletion of actions: $p(y \mid do(x), do(z), w) = p(y \mid do(x), w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)$ in the graph $G_{\overline{X}\,\overline{Z(W)}}$, where $Z(W)$ is the set of the $Z$-nodes that are not ancestors of any $W$-node in the graph $G_{\overline{X}}$.
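The graphical conditions in these rules are d-separation statements in graphs from which some edges have been removed. As a hedged illustration, the sketch below tests d-separation with the moralization approach (ancestral subgraph, marrying of parents, removal of the conditioning set) and applies it to the graph of Fig. 1A; the helper names and the edge-list representation are assumptions of this example, and no claim is made that this is how the cited algorithms are implemented.

```python
from itertools import combinations


def remove_incoming(edges, nodes):
    """Graph with the edges pointing into `nodes` removed."""
    return [(a, b) for a, b in edges if b not in nodes]


def remove_outgoing(edges, nodes):
    """Graph with the edges emerging from `nodes` removed."""
    return [(a, b) for a, b in edges if a not in nodes]


def ancestral_set(edges, nodes):
    """The given nodes together with all of their ancestors."""
    result = set(nodes)
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            if b in result and a not in result:
                result.add(a)
                changed = True
    return result


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated by zs (moralization criterion)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestral_set(edges, xs | ys | zs)
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    links = {frozenset(e) for e in sub}          # drop edge directions
    for child in keep:                           # marry parents of each child
        pars = sorted({a for a, b in sub if b == child})
        links |= {frozenset(pair) for pair in combinations(pars, 2)}
    reached, frontier = set(xs), list(xs)
    while frontier:                              # search avoiding zs
        node = frontier.pop()
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in zs and other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return not (reached & ys)


# Causal model of Fig. 1A: U -> X, U -> Y, X -> Z, Z -> Y.
FIG_1A = [("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y")]

# Rule 2 for p(z | do(x)) = p(z | x): Z and X separated once X's outgoing edges are cut.
RULE2_X = d_separated(remove_outgoing(FIG_1A, {"X"}), {"Z"}, {"X"}, set())
# Rule 2 for p(y | x, do(z)) = p(y | x, z): Y and Z separated given X once Z's outgoing edges are cut.
RULE2_Z = d_separated(remove_outgoing(FIG_1A, {"Z"}), {"Y"}, {"Z"}, {"X"})
```

Both conditions hold for the graph of Fig. 1A, whereas the same separations fail in the unmodified graph because of the edges from `X` to `Z` and from `Z` to `Y`.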
Here, $G_{\overline{X}}$ represents a graph where the incoming edges of the set of nodes $X$ are removed, $G_{\underline{X}}$ represents a graph where the outgoing edges of the set of nodes $X$ are removed and $G_{\overline{X}\underline{Z}}$ represents a graph where the incoming edges of the $X$-nodes and the outgoing edges of the $Z$-nodes are removed. The rules of causal calculus are sufficient for deriving all identifiable causal effects from observational data (Huang & Valtorta, 2006; Shpitser & Pearl, 2006b) and experimental data (Bareinboim & Pearl, 2012a) for a given population. Alternatively, the backdoor and front-door criteria (Pearl, 2009) and the moralization (Lauritzen et al., 1990) can also be used to derive formulas for the causal effects. Algorithms for the automated application of causal calculus have been developed (Tian & Pearl, 2002). For the causal model of Fig. 1A, the result is the front-door formula (Pearl, 2009)
$$p(y \mid do(X = x)) = \sum_{z} p(z \mid X = x) \sum_{x'} p(y \mid X = x', Z = z)\, p(x'). \tag{1}$$
Next, consider the case-control design of Fig. 1C (step 3). To estimate the causal effects, the model parameters must be estimated from the data collected according to this design. The likelihood can be factorized according to the graphical model
$$p(m_\Omega, m_1, m_2, Z, X, Y, U, Z^*, X^*, Y^* \mid \theta, \phi),$$
where $\theta$ represents the model parameters, $\phi$ represents parameters related to the design and the vector notation, such as $m_1 = (m_{11}, \ldots, m_{1N})^T$, refers to the variables for all individuals $\{1, \ldots, N\}$ in the population. The distributions are defined with respect to the first argument unless otherwise specified. The likelihood of the observed data is obtained as an integral over the unknown variables $Z$, $X$, $Y$ and $U$ (step 4)
$$\cdots\; p(m_{1i} = 0 \mid m_{\Omega i}, \phi). \tag{2}$$
As the selection $m_1$ is random sampling from the population, the term $p(m_{1i} = 0 \mid m_{\Omega i}, \phi)$ may be ignored in the estimation of $\theta$. The selection $m_2$ depends on the response $Y$, and the term $p(m_{2i} = 0 \mid m_{1i} = 1, Y_i, \phi)$ must not be ignored. Note also that although $X$ is not a parent of $Y$ in the causal model, the likelihood (2) has the term $p(Y = 1 \mid X = x, Z = z)$.
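For the model of Fig. 1A, the front-door adjustment of Pearl (2009) expresses $p(y \mid do(x))$ as $\sum_z p(z \mid x) \sum_{x'} p(y \mid x', z)\, p(x')$, which involves only the observed variables. The sketch below checks this numerically for binary variables: it builds the observational distribution from a structural model with a hidden confounder and compares the front-door expression with the effect computed directly from the structural model. All probabilities are made-up illustrative values, not the numbers used by Pearl, and the function names are assumptions of this example.

```python
from itertools import product

# Hypothetical structural probabilities; all variables are binary.
P_U1 = 0.3                                   # p(U = 1)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}              # p(X = 1 | u)
P_Z1_GIVEN_X = {0: 0.1, 1: 0.9}              # p(Z = 1 | x)
P_Y1_GIVEN_ZU = {(0, 0): 0.1, (0, 1): 0.5,   # p(Y = 1 | z, u)
                 (1, 0): 0.4, (1, 1): 0.9}


def bern(p1, value):
    """Probability of `value` for a binary variable with p(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def observed_joint():
    """Observational p(x, z, y), obtained by summing out the hidden U."""
    joint = {}
    for x, z, y in product((0, 1), repeat=3):
        joint[(x, z, y)] = sum(
            bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x)
            * bern(P_Z1_GIVEN_X[x], z) * bern(P_Y1_GIVEN_ZU[(z, u)], y)
            for u in (0, 1))
    return joint


def front_door(joint, x, y=1):
    """Front-door expression for p(y | do(x)) from observed probabilities."""
    p_x = {a: sum(v for (xx, _, _), v in joint.items() if xx == a)
           for a in (0, 1)}
    p_xz = {(a, b): joint[(a, b, 0)] + joint[(a, b, 1)]
            for a, b in product((0, 1), repeat=2)}
    total = 0.0
    for z in (0, 1):
        p_z_given_x = p_xz[(x, z)] / p_x[x]
        inner = sum(joint[(x2, z, y)] / p_xz[(x2, z)] * p_x[x2] for x2 in (0, 1))
        total += p_z_given_x * inner
    return total


def true_effect(x, y=1):
    """p(y | do(x)) computed from the structural model, using the hidden U."""
    return sum(
        bern(P_Z1_GIVEN_X[x], z)
        * sum(bern(P_U1, u) * bern(P_Y1_GIVEN_ZU[(z, u)], y) for u in (0, 1))
        for z in (0, 1))


JOINT = observed_joint()
EFFECTS = {x: front_door(JOINT, x) for x in (0, 1)}
```

The two computations agree up to floating-point error even though `front_door` never uses the distribution of the confounder, which is the point of expressing the causal effect in terms of observed probabilities.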
In step 5, the likelihood must be written in a parametric form. Finding a good parametrization, that is, finding a good statistical model, is purely a statistical problem. Causal considerations are not needed in the model selection or in the parameter estimation, and the vast literature on these topics is directly applicable. It follows from (1) that the probabilities $p(x)$, $p(z \mid X = x)$ and $p(y \mid X = x, Z = z)$ are needed to estimate $p(y \mid do(X = x))$. The same probabilities are also components in the likelihood (2), and it is therefore natural to parametrize them. For simplicity, Pearl (2009) assumes that the variables $X$, $Z$ and $Y$ have possible values of 0 and 1. The observed probabilities mentioned earlier can now be parametrized as follows: [...] With this parametrization, the causal effect of smoking on the risk of lung cancer given by (1) can be written as [...] These equations link the model parameters $\theta = (\theta_X, \theta_Z, \theta_Y, \theta_{ZX}, \theta_{YX}, \theta_{YZ}, \theta_{YZX})$ to the causal effects. The dependence of the selection probability on $Y$ may be parametrized as [...]
As the variables are binary, the data collected according to the case-control design can be presented in the form of frequencies given in Table 1. The size of the population is $N = N_{11} + N_{10} + N_{01} + N_{00}$, where $N_{11}$ is the number of cases selected, $N_{10}$ is the number of non-cases selected, $N_{01}$ is the number of cases not selected and $N_{00}$ is the number of non-cases not selected. In other words, it is assumed that the lung cancer prevalence in the population is known. The log-likelihood derived from the likelihood (2) becomes [...] where [...] represents summation over the corresponding marginal and [...] is a shorthand notation for the marginal probability of $Y$. The maximum likelihood estimates of $\theta$ can be obtained by numerical optimization of the log-likelihood. Naturally, a Bayesian analysis may be carried out as well.
For a numerical illustration, consider a case-control study where 1000 lung cancer cases and 1000 controls are selected for the covariate measurements. The parameters $\theta$ are set according to the (unrealistic) population probabilities used by Pearl (2009, p. 84). The expected frequencies are shown in Table 1. The causal effects estimated from these frequencies are similar to the causal effects estimated from the whole population by Pearl (2009, p. 84). The differences in the third decimal are due to the rounding of the expected frequencies in Table 1 to the nearest integer.
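The role of the known prevalence can be illustrated with a simple plug-in calculation, which is a swapped-in simplification and not the maximum likelihood procedure of the paper. Assuming binary variables, expected case-control frequencies and a known $p(Y = 1)$, the population distribution $p(x, z, y)$ is recovered by reweighting the within-case and within-control proportions. The joint distribution below is hypothetical and is not taken from Pearl (2009) or from Table 1.

```python
# Hypothetical population distribution p(x, z, y); keys are (x, z, y).
JOINT = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.25, (1, 1, 1): 0.15,
}


def expected_case_control_counts(joint, n_cases, n_controls):
    """Expected frequencies when cases and controls are sampled separately."""
    p_y = {y: sum(v for (_, _, yy), v in joint.items() if yy == y)
           for y in (0, 1)}
    size = {1: n_cases, 0: n_controls}
    return {cell: size[cell[2]] * v / p_y[cell[2]] for cell, v in joint.items()}


def reweighted_joint(counts, prevalence):
    """Recover p(x, z, y) as p(x, z | y) p(y), with p(Y = 1) assumed known."""
    p_y = {1: prevalence, 0: 1.0 - prevalence}
    n_y = {y: sum(c for (_, _, yy), c in counts.items() if yy == y)
           for y in (0, 1)}
    return {cell: c / n_y[cell[2]] * p_y[cell[2]] for cell, c in counts.items()}


COUNTS = expected_case_control_counts(JOINT, n_cases=1000, n_controls=1000)

# Ignoring the design: the share of cases in the data is fixed at one half.
NAIVE_PREVALENCE = (sum(c for (_, _, y), c in COUNTS.items() if y == 1)
                    / sum(COUNTS.values()))

# Taking the design into account through the known prevalence p(Y = 1) = 0.30.
ESTIMATE = reweighted_joint(COUNTS, prevalence=0.30)
```

With unrounded expected frequencies the reweighted distribution reproduces the population distribution exactly, so any quantity that is a function of $p(x, z, y)$ can then be evaluated from it; with rounded or sampled frequencies small differences would remain.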

Examples with complex study design
The examples presented in this section aim to demonstrate how causal models with design can describe the essential features of complex experimental and observational studies in a precise and illustrative way. The examples are from medicine and epidemiology, where complex study designs are commonly used. The first example is based on a real study, and causal models with design are used to make conclusions on the identifiability of various causal effects from data MNAR. The two other examples describe imaginary but realistic scenarios. Causal graphs with design remove the ambiguity related to the common names of study designs such as retrospective study, prospective study, cohort study, case-control study and two-stage study (Vandenbroucke, 1991;Knol et al., 2008). The process of the data collection can be seen directly from the causal graph with design.
Fig. 2. A causal model with design for the two-stage case-cohort design used in the MORGAM Project (Evans et al., 2005; Kulathinal et al., 2007). The sampling frame $\{i : M_{0i} = 1\}$ is conditioned on the health status $Y_{0i}$ at the beginning of the study, and this dependence must be taken into account when estimates for the population $\{i : m_{\Omega i} = 1\}$ are required. At the first stage of the study, a random sample $\{i : m_{1i} = 1\}$ is selected. The decision to participate $M_{1i}$ may depend on classic risk factors and current health status. Classic risk factors $X_i$ and current health status $Y_{0i}$ are measured at the beginning of the study for the cohort members $\{i : M_{1i} = 1\}$. Blood samples taken at the baseline are frozen to be used later. After a follow-up period of 10 years or more, the selection for the second stage is made on the basis of the measurements $X_i^*$ and $Y_i^*$. All disease cases and an age-stratified random subset of the cohort are selected in the case-cohort set $\{i : m_{2i} = 1\}$ for which genetic factors $Z_i$ are measured. Non-response $M_{2i}$ occurs because of missing or contaminated samples or other technical reasons.
For the estimation of causal effects, the procedure presented in Section 4 is applicable. Causal models with design are also useful in the estimation of predictive models when the study design and the missing-data mechanism must be taken into account in the analysis. The likelihood factorized according to the causal model with design offers a natural starting point for the parameter estimation in both the frequentist and Bayesian approaches. The idea is to write first the full likelihood for the data, the design and the latent variables and then see which parts of the likelihood are not needed in the estimation of the parameters of interest. The likelihood functions for the examples of this section are given in the online Supporting Information. Figure 2 illustrates a causal model with design for the two-stage case-cohort design used in the MORGAM Project (Evans et al., 2005; Kulathinal et al., 2007).
The project aims to estimate the impact of classic and genetic risk factors on the risk of cardiovascular diseases. Currently, 15 cohorts from six countries have participated in the genetic component of the project. Most of the cohorts are selected as random samples of the underlying population of a certain age range, typically 24-65 years, although there is variation between the cohorts. Over 50,000 individuals have been examined for the classic risk factors and followed up for mortality and disease endpoints. Because of the cost of genotyping, genes have been measured only for a subset of each cohort. Over 10,000 individuals have been genotyped in the case-cohort setting.
The causal assumptions are described using four variables: genetic risk factors $Z_i$, classic risk factors $X_i$ and health status at baseline $Y_{0i}$ and at the end of the follow-up $Y_i$. Here, classic risk factors are understood to include the actual risk factors such as smoking, cholesterol and blood pressure as well as all relevant background variables measured at baseline. The internal causal structure between these variables is not specified because it is not needed in the following considerations. From the graph, it can be read that genes may affect the disease risk directly and via classic risk factors. Classic risk factors measured at baseline may be affected by the health status at baseline. The following conclusions can be made using causal calculus: [...] that conditioning on the health status at baseline does not change the situation qualitatively. From result (9), it follows that the cohort data can be used to estimate the predictive model $p(Y_i \mid X_i = x_i, Y_{0i} = y_{0i})$ for the healthy population. This kind of conditioning on the health status is commonly used in epidemiology and has been applied also in the MORGAM Project (e.g. in Asplund et al., 2009). To estimate the causal effects of classic risk factors in the population, the missingness mechanism must be taken into account because (5) contains the distribution $p(Z = z, Y_0 = y_0)$, which is potentially different for participants and non-participants. Similarly, the term $p(Z = z \mid Y_0 = y_0)$ in (6) implies that the missingness mechanism must be taken into account also when the causal effects of classic risk factors are estimated for the healthy population. From result (10), it follows that the case-cohort data can be used to estimate the causal effect of genetic risk factors for the healthy population on the condition of classic risk factors.
As the case-cohort selection $M_{2i}$ depends on $Y$, the case-cohort selection should be taken into account in the estimation (Kulathinal & Arjas, 2006; Kulathinal et al., 2007). The data on health status at baseline $Y_{0i}$ include information on non-fatal cardiovascular events before baseline. Restricting the analysis to the individuals healthy at baseline, that is, removing individuals with prior non-fatal events, discards a considerable amount of potentially useful data. In the MORGAM Project, several attempts have been made to use these so-called baseline cases. In the work of Karvanen et al. (2009b), baseline cases are analysed separately. The joint analysis of baseline cases and follow-up cases requires compensation for the left truncation, which can be carried out using non-parametric imputation (Karvanen et al., 2010) or conditional likelihood (Saarela et al., 2009). These works, however, do not take non-participation into account.

Figure 3 shows how the experimental set-up of a clinical trial can be described in a causal model with design. The treatment in the clinical trial is a causal variable determined by the researcher by means of randomization. In the graph, this is presented by causal node $T'_i$, which has the type 'determined and known'. The example also demonstrates the compliance problem encountered in clinical trials: the actual treatment may differ from the allocated treatment if the participant does not follow the instructions given. In the graph, there is an arrow from $T'_i$ to the actual treatment $T_i$, and $T'_i$ affects outcome $Y_i$ only through $T_i$. In the intention-to-treat analysis, the observed outcome $Y_i$ is explained by the intended treatment $T'_i$ using all included participants in the trial $\{i : m_{2i} = 1\}$. In the per-protocol analysis, only the compliant participants with $T'_i = T_i$ are included.

Fig. 3. A causal model with design for a clinical trial. A sample $\{i : m_{1i} = 1\}$ is selected for screening from the population $\{i : m_i = 1\}$. The inclusion for the trial $m_{2i}$ is based on the screening variable $Z_i$. At the baseline, covariate $X_i$ is measured for the trial participants, and a randomized decision on the treatment $T'_i$ is made. The actual treatment $T_i$ during the treatment period may differ from the intended treatment $T'_i$ because of non-compliance. The outcome $Y_i$ depends on the covariate $X_i$ and the treatment $T_i$. At the end of the treatment period, measurements for the observed outcome $Y_i$ and the observed treatment $T_i$ are made.

Figure 4 illustrates a situation where there is a dependence structure between the selection variables of the individuals in the sample. In a nested case-control design, the controls are selected by considering the individuals at risk at the time (age or calendar time) of the disease event. A control may later become a case, which creates a complicated dependence structure between the selection probabilities (Saarela et al., 2012). Consequently, the selection probability for individual $i$ depends on the covariates and outcomes of all other individuals. In the graphical presentation drawn for individual $i$, the case-control selection node $m_{2i}$ has incoming arrows from $X_i$, $Y_i$, $X_j$ and $Y_j$, where index $j$ is used to refer to all other individuals.

Fig. 4. A causal model with design for a nested case-control study. The idea of the case-control design is to select the individuals for the measurement of the expensive risk factor $Z_i$ on the basis of the outcome $Y_i$ and the inexpensive risk factor $X_i$. At the first stage, a sample $\{i : m_{1i} = 1\}$ is selected from the population $\{i : m_i = 1\}$, and variables $X_i$ and $Y_i$ are measured. The selection of cases and controls $m_{2i}$ depends not only on the measurements of individual $i$, $X_i$ and $Y_i$, but also on the outcome $Y_j$ and the covariate $X_j$ of all other individuals in the sample. Each individual has a similar causal graph, which has been omitted in the figure. The non-response $M_{2i}$ reflects the fact that the measurement $Z_i$ may not be available for all individuals selected in the case-control set.
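The risk-set selection described for the nested case-control design can be illustrated with a minimal sketch. It is not the sampling scheme of any particular study: the function name `sample_risk_set_controls`, the toy follow-up data and the choice of simple random sampling without replacement within each risk set are assumptions made for illustration, and no matching on covariates is attempted.

```python
import random

def sample_risk_set_controls(event_times, is_case, n_controls=1, seed=0):
    """Draw controls for each case from the individuals still at risk.

    event_times[i]: end of follow-up of individual i (event or censoring time).
    is_case[i]: True if the follow-up of individual i ended in a disease event.
    Returns a dict mapping each case index to the list of its control indices.
    (Hypothetical helper; names and sampling rule are illustrative assumptions.)
    """
    rng = random.Random(seed)
    selected = {}
    for i, (t_i, case_i) in enumerate(zip(event_times, is_case)):
        if not case_i:
            continue
        # At risk at time t_i: every other individual whose follow-up extends
        # beyond t_i. Future cases are eligible, so a control may later
        # become a case.
        risk_set = [j for j, t_j in enumerate(event_times) if j != i and t_j > t_i]
        k = min(n_controls, len(risk_set))
        selected[i] = rng.sample(risk_set, k)
    return selected

# Toy data for five individuals (assumed values).
event_times = [2.0, 5.0, 3.5, 7.0, 6.0]
is_case = [True, False, True, False, True]
controls = sample_risk_set_controls(event_times, is_case, n_controls=2)

# Selection indicator: 1 if individual i is a case or was drawn as a control.
m2 = [
    int(is_case[i] or any(i in drawn for drawn in controls.values()))
    for i in range(len(event_times))
]
```

In this sketch the value of `m2[j]` for a non-case `j` is determined by the event times and case statuses of the other individuals, which is the dependence that the incoming arrows from $X_j$ and $Y_j$ express in the graph.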

Discussion
Causal models with design offer a systematic and unifying view of scientific inference. They present the causal assumptions, the study design and the data collection in a way that accounts for the complexity encountered in real-world problems. The examples in Section 5 demonstrate how the concept can be used to describe medical studies with multiple stages. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph, as was demonstrated with the MORGAM Project. Despite the complex design, the estimation of the causal effects can be carried out in a systematic way via causal calculus, as illustrated in Section 4.
Causal models with design present the population and the selection as intrinsic parts of the model. Selection nodes may have both incoming and outgoing connections to other nodes. A distinction is made between a random variable and its measured value. Combined with the selection, this allows the description of various sampling and missing-data set-ups in terms of causal effects.
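The distinction between a variable and its measured value can be made concrete with a small sketch. The helper `measured`, the names `x_star` and `z_star`, and the toy two-stage design are assumptions introduced for illustration only; the measured value equals the underlying value when the selection indicator is 1 and is missing by design otherwise.

```python
from typing import List, Optional, Sequence

def measured(values: Sequence[float], selected: Sequence[int]) -> List[Optional[float]]:
    """Return the measured value for each individual: the value itself when
    the selection indicator is 1, and None (missing by design) otherwise."""
    if len(values) != len(selected):
        raise ValueError("values and selected must have the same length")
    return [v if s == 1 else None for v, s in zip(values, selected)]

# Hypothetical two-stage design for five individuals in the population.
x = [1.2, 0.7, 2.5, 1.9, 0.3]  # inexpensive covariate
z = [0.0, 1.0, 1.0, 0.0, 1.0]  # expensive covariate
m1 = [1, 1, 1, 0, 1]           # first-stage selection
m2 = [1, 0, 1, 0, 0]           # second-stage selection, only among m1 == 1

# The second stage is nested in the first.
assert all(second <= first for first, second in zip(m1, m2))

x_star = measured(x, m1)  # [1.2, 0.7, 2.5, None, 0.3]
z_star = measured(z, m2)  # [0.0, None, 1.0, None, None]
```

Different sampling and missing-data set-ups then differ only in how the indicators `m1` and `m2` are generated, for example completely at random or depending on an outcome.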
The limitations of causal models with design are in many ways similar to the limitations of causal models in general. The presentation of causal assumptions in the form of a graphical model has the benefit that many problems can be solved without specifying the parameters of the model. On the other hand, the explicit parametric definition of the functional relationships is still the only decisive presentation of the model. Certain causal effects may be identifiable only under specific parametric assumptions such as linearity of the effect.
The implications of the concept are twofold. First, it ties together causality and study design and opens new possibilities for the practical application of graphical models. Second, it shows the key elements of the study in a compact visual format and thus increases the clarity and speed of communication. High standards of design, analysis and communication of scientific studies will significantly reduce the time and effort needed for the synthesis of scientific knowledge.