Causal Effect Identification from Multiple Incomplete Data Sources: A General Search-based Approach

Causal effect identification considers whether an interventional probability distribution can be uniquely determined without parametric assumptions from measured source distributions and structural knowledge on the generating system. While complete graphical criteria and procedures exist for many identification problems, there are still challenging but important extensions that have not been considered in the literature. To tackle these new settings, we present a search algorithm directly over the rules of do-calculus. Due to generality of do-calculus, the search is capable of taking more advanced data-generating mechanisms into account along with an arbitrary type of both observational and experimental source distributions. The search is enhanced via a heuristic and search space reduction techniques. The approach, called do-search, is provably sound, and it is complete with respect to identifiability problems that have been shown to be completely characterized by do-calculus. When extended with additional rules, the search is capable of handling missing data problems as well. With the versatile search, we are able to approach new problems such as combined transportability and selection bias, or multiple sources of selection bias. We perform a systematic analysis of bivariate missing data problems and study causal inference under case-control design. We also present the R package dosearch that provides an interface for a C++ implementation of the search.


Introduction
In many fields of science, a primary interest is determining causal effects, that is, distributions P (Y | do(X), Z), where variables Y are observed, variables X are intervened upon (forced to values irrespective of their natural causes) and variables Z are conditioned on (Pearl 2009).In this paper, instead of placing various parametric restrictions based on background knowledge, we are interested in the question of identifiability: can the causal effect be uniquely determined from the distributions (data) we have and a graph representing our structural knowledge on the generating causal system.
In the most basic setting we are identifying causal effects from a single observational input distribution, corresponding to passively observed data.To solve such problems more generally than what is possible with the back-door adjustment (Spirtes, Glymour, and Scheines 1993;Pearl 2009;Greenland, Robins, and Pearl 1999), Pearl (1995) introduced do-calculus, a set of three rules that together with probability theory enable the manipulation of interventional distributions.Shpitser and Pearl (2006b) and Huang and Valtorta (2006b) showed that docalculus is complete by presenting polynomial-time algorithms whose each step can be seen as a rule of do-calculus or as an operation based on basic probability theory.The algorithms have a high practical value because the rules of do-calculus do not by themselves provide an indication on the order in which the rules should be applied.The algorithms save us from manual application of do-calculus, which is a tedious task in all but the simplest problems.
Since then many extensions of the basic identifiability problem have appeared.In identifiability using surrogate experiments (Bareinboim and Pearl 2012a), or z-identifiability, an experimental distribution is available in addition to the observed probability distribution.For data observed in the presence of selection bias, both algorithmic and graphical identifiability results have been derived (Bareinboim and Tian 2015;Correa, Tian, and Bareinboim 2018).More generally, the presence of missing data necessitates the representation of the missingness mechanism, which poses additional challenges (Mohan, Pearl, and Tian 2013;Shpitser, Mohan, and Pearl 2015;Bhattacharya, Nabi, Shpitser, and Robins 2019).Another dimension of complexity is the number of available data sources.Identification from a mixture of observational and interventional distributions that originate from multiple conceptual domains is known as transportability for which complete solutions exist in a specific setting (Bareinboim and Pearl 2014).
While completeness has been accomplished for a number of basic identifiability problems, there are still many challenging but important extensions to the identifiability problem that have not been studied so far.Table 1 recaps the current state of the art identifiability results; it also describes generalizations that we aim to investigate in this paper.To find solutions to the more complicated identifiability problems, we present a unified approach to the identification of observational and interventional causal queries by constructing a search algorithm that directly applies the rules of do-calculus.We impose no restrictions on the number or type of known input distributions: we thus provide a solution to problems for which no other algorithmic solutions exist (Row 8 in Table 1).We also extend to identifiability under missing data together with mechanisms related to selection bias and transportability (Row 11 in Table 1).
The following introductory example does not fall under any of the previously solved special cases of Table 1, thus necessitating our approach.Consider human resource management analyzing the remuneration policy in a company.The graph of Figure 1 shows the key variables.The salary (Y ) of an employee consist of a base salary (not modeled explicitly) and a bonus (B).The base salary depends on education (E) and performance (X) of the employee.The level of performance is evaluated by the supervisor of the employee and it is one of the factors that affects the level of bonus.The performance depends on the education as well as the attitude (A) of the employee and the team.The attitude may also have a direct effect on the bonus.
We are interested in estimating the total causal effect of performance on the salary, i.e., quantifying how changes in performance affect the salary.The salary (Y ), bonus (B), education (E) and performance (X) of the employees are available from the registry of the human resource department.Data on attitude (A), bonus (B) and performance (X) have been collected in an anonymized survey that cannot be linked to the registry on the personal level.The question of interest is the identifiability of P (Y | do(X)) from the data sources P (Y, B, E, X) and P (A, B, X).Using the machinery developed in this paper, we can identify the causal effect with the formula P (Y | do(X)) = where the conditional distributions can be directly determined from the available data sources (see Section 6 for further details).
To combat the inherent computational complexity of the search-based approach, we derive rules and techniques that avoid unnecessary computational steps.We are able to detect trivial queries where non-identifiability can be determined directly from the inputs.We also present a search heuristic that considerably speeds up the search in cases where the effect is indeed identifiable.The approach, called do-search, is provably sound and it retains the completeness in the cases previously proven to be solved by the rules of do-calculus.We can easily scale up to the problem sizes commonly reported in the literature.The R package dosearch (R Core Team 2021; Tikka, Hyttinen, and Karvanen 2020) provides an implementation of dosearch and is available from the Comprehensive R Archive Network (CRAN) at https:// CRAN.R-project.org/package=dosearch.Other available software for causal effect identifiability problems are only applicable to a subset of the problems presented in Table 1.The causaleffect R package (Tikka and Karvanen 2017a) can only be used for problems on Rows 1-3 and 5-7 of the table, providing implementations of the relevant algorithms.For the problem on Row 1, the generalized adjustment criterion (Perković, Textor, Kalisch, and Maathuis 2015) can be applied using the dagitty package (Textor, van der Zander, Gilthorpe, Liśkiewicz, and Ellison 2016) or the pcalg package (Kalisch, Mächler, Colombo, Maathuis, and Bühlmann 2012) in R. The generalized back-door criterion (Maathuis and Colombo 2015) is also available in the pcalg R package.The standard back-door criterion is available in the Python package DoWhy 9 Missing data recoverability P (V) 10 Missing data recoverability P (V) Arbitrary do-search with missing data (No) Table 1: Solved and unsolved problems in causal effect identification.Bold-italic denotes the previously unsolved problems for which do-search can now be used.Input P (V) stands for passively observed joint distribution of all variables.Input P (V * ) is the joint distribution with missing data (see Section 4).The variable sets present in the same distribution are disjoint.Input P (V \ B | do(B)) stands for an experiment where all variables are measured and input P (A | do(B)) stands for an experiment where only a subset A ⊂ V of the variables is measured.Notation {•} denotes a set of inputs enumerated by the index i.The assumptions of nested experiments (NE), entire distributions (ED) and nested experiments in different domains (NEDD) are explained in Section 2. Assumptions related to surrogate outcomes (SO) can be found in (Tikka and Karvanen 2019).Input P (V | S) means the joint distribution under selection bias.The last column gives the name of an algorithm that can be used to solve the problem if one exists and whether it (or a theorem when no algorithm is provided) provides a complete solution to the problem, or whether the completeness status is not known.An algorithm is complete if it returns a correct formula precisely when the target query is identifiable.Problems 1-7 are special cases of Problem 8 and Problems 1-10 are special cases of Problem 11. (Sharma, Kiciman et al. 2019).Algorithms implemented in causaleffect run in polynomial time and can outperform dosearch in their respective restricted problem settings especially with larger graphs.For a comprehensive performance comparison between causaleffect and various adjustment criteria, see (van der Zander, Liśkiewicz, and Textor 2019).
The paper is structured as follows.Section 2 formulates our general identification problem and explains the scenarios in Table 1 and previous research in detail.Section 3 presents the search algorithm, including the rules we use, search space reduction techniques, heuristics and theoretical properties.Section 4 shows how the search can be extended to problems that involve missing data.Section 5 demonstrates how the search can be used in R via the dosearch package.Efficacy of the search is assessed via simulations.Section 6 shows a number of new problems for which we can find solutions by using the search including a real-world application.These problems include combined transportability and selection bias, multiple sources of selection bias, and causal effect identification from arbitrary (experimental) distributions.This section also includes a systematic analysis of missing data problems and case-control designs.Section 7 discusses the merits and limitations of the approach.Section 8 offers concluding remarks.

The general causal effect identification problem
Our presentation is based on Structural Causal Models (SCM) and the language of directed graphs.We assume the reader to be familiar with these concepts and refer them to detailed works on these topics for extended discussion and descriptions, such as (Pearl 2009) and (Koller and Friedman 2009).
Following the standard set-up of do-calculus (Pearl 1995), we assume that the causal structure can be represented by a semi-Markovian causal graph G over a set of vertices V (see Fig 2(a) for example).The directed edges correspond to direct causal relations between the variables (relative to V); directed edges do not form any cycles.Confounding of any two observed variables in V by some unobserved common cause is represented by a bidirected edge between the variables.This graphical representation allows us to deal with any causal structures where some variables are unmeasured.We assume a positive distribution over the variables (Huang and Valtorta 2006a) ensuring that all considered causal effects and conditional distributions are well-defined.
In a non-parametric setting, the problem of expressing a causal quantity of interest in terms of available information has been described in various ways depending on the context.When available data are affected by selection bias or missing data, a typical goal is to "recover" a joint or marginal distribution.If data are available from multiple conceptual domains, a distribution is "transported" from the source domains, from which a combination of both observational and experimental data are available, to a target domain.The aforementioned settings can be expressed in the SCM framework by equipping the graph of the model with special vertices.However, on a fundamental level these problems are simply variations of the original identifiability problem of causal effects and as such, our goal is to represent them as a single generalized identifiability problem.Formally, identifiability can be defined as follows (Pearl 2009;Shpitser and Pearl 2008).
Definition 1 (Identifiability).Let M be a set of models with a description T and two objects φ and θ computable from each model.Then φ is identifiable from θ in T if φ is uniquely computable from θ in any model M ∈ M. In other words, all models in M which agree on θ also agree on φ.
In the simplest case, the description T refers to the graph induced by causal model, θ is the joint distribution of the observed variables P (V) and the query φ is a causal effect P (Y | do(X)).On the other hand, proving non-identifiability of φ from θ can be obtained by describing two models The general form for a causal identifiability problem that we consider in this paper is formulated as follows.
Input: A set of input distributions of the form and a semi-Markovian causal graph G over V.
Task: Output a formula for the query P (Y | do(X), Z) over the input distributions, or decide that it is not identifiable.
Here A i , B i , C i are disjoint subsets of V for all i, and X, Y, Z are disjoint subsets of V.The causal graph G may contain vertices which describe mechanisms related to transportability and selection bias.In the following sections we explain several important special cases of this problem definition, some that have been considered in the literature and some which have not been.

Previously considered scenarios as special cases
We restate the concepts of transportability and selection bias under the causal inference framework, and show that identifiability in the scenarios of Rows 1-7 of Table 1 falls under the general form on Row 8. We return to problems that involve missing data on Rows 9-11 later in Section 4.

Causal Effect Identifiability
Input is restricted to a passive observational distribution P (V).The target is either a causal effect P (Y | do(X)) for Row 1 of Table 1 or a conditional causal effect P (Y | do(X), Z) for Row 2 of Table 1 (Shpitser and Pearl 2006b,a).

z-identifiability
Similarly to ordinary causal effect identification, the input consists of the passive observational distribution P (V) but also of experimental distributions known as surrogate experiments intervening on a set B (Bareinboim and Pearl 2012a)

Surrogate Outcome Identifiability
Surrogate outcomes generalize the notion of surrogate experiments from z-identifiability.For surrogate outcomes, the assumption of nested experiments still holds, but the assumption of entire distributions can be dropped.Some less strict assumptions (SO) still apply (Tikka and Karvanen 2019).The idea of surrogate outcomes is that data from previous experiments are available, but the target Y was at most only partially measured in these experiments and the experiments do not have to be disjoint from X.

Transportability
The problem of incorporating data from multiple causal domains is known as transportability (Bareinboim and Pearl 2013).Formally, the goal is to identify a query in a target domain π * using data from source domains π 1 , . . ., π n .The domains are represented in the causal graph using a special set of transportability nodes T which is partitioned into disjoint subsets T 1 , . . ., T n corresponding to each domain π i .The causal graph contains an extra edge T ij → V j whenever a functional discrepancy in f V j or in P (u V j ) exists between the target domain π * and the source domain π i .The discrepancy is active if T ij = 1 and inactive otherwise.A distribution associated with a domain π i is of the form P (A | do(B), C, T i = 1, T −i = 0), where T −i denotes the other subsets of the partition of T except T i .In other words, only the discrepancies between the π i and π * are active.A distribution corresponding to the target domain has no active discrepancies meaning that it is of the form P (A | do(B), C, T = 0).Any variable is conditionally independent from inactive transportability nodes since their respective edges vanish.Furthermore, since transportability nodes set to 0 vanish, we can assume any present transportability node to have the value 1.Thus an input distribution from a domain π i takes the form P (A | do(B), C, T i ).In the specific case of mz-transportability, the assumptions of entire distributions (ED) and nested experiments in different domains (NEDD) apply, which means that

Selection Bias Recoverability
Selection bias can be seen as a special case of missing data, where the mechanism responsible for the preferential selection is represented in the causal graph by a special sink vertex S (Bareinboim and Pearl 2012b).Typical input for the recoverability problem is P (V | S = 1), the joint distribution observed under selection bias.Just as in the case of transportability nodes, selection bias nodes only appear when the mechanism has been enabled.Thus we may assume that the input is of form P (V | S).More generally, we can consider input distributions of the form P (A | do(B), C, S).

New scenarios as special cases
The following settings are special cases of the general identifiability problem of Row 8 in Table 1 that do not fall under any of the problems of Rows 1-7.They serve as interesting additions to the cases considered in the literature.Concrete examples on these new scenarios are presented in Section 6. Section 4 extends the general problem of Row 8 in Table 1 to the general problem with missing data on Row 11 while also showcasing the special cases of Rows 9 and 10.

Multiple Data Sources with Partially Overlapping Variable Sets
The scenario where only subsets of variables are ever observed together has been extensively considered in the causal discovery literature (Danks, Glymour, and Tillman 2009;Tillman and Spirtes 2011;Triantafillou, Tsamardinos, and Tollis 2010), but not in the context of causal effect identification.In the basic setting the input consists of passively observed distributions P (A i ) such that A i ⊂ V. We may also observe experimental distributions P (A i | do(B i )) (Hyttinen, Eberhardt, and Hoyer 2012;Triantafillou and Tsamardinos 2015) or even conditionals P (A i | do(B i ), C i ).Our approach sets no limitations for the number or types of input distributions.

Combining Transportability and Selection Bias
To the best of our knowledge, the frameworks of transportability and selection bias have not been considered simultaneously.The combination of these scenarios fits into the general problem formulation.For example, we may have access to two observational distributions originating from different source domains, but affected by the same biasing mechanism: where T 1 and T 2 are the transportability nodes corresponding to the two source domains and S is the selection bias node.

Recovering from Multiple Sources of Selection Bias
In recent literature on selection bias as a causal inference problem, the focus has been on settings where only a single selection bias node is present (e.g., Bareinboim, Tian, and Pearl 2014;Correa and Bareinboim 2017;Correa et al. 2018).However, multiple sources of selection bias are typical in longitudinal studies where dropout occurs at different stages of the study.Our approach is applicable for an arbitrary number of selection bias mechanisms and input distributions affected by arbitrary combinations of these mechanisms.In other words, if S is the set of all selection bias nodes present in the graph, the inputs can take the form P (A | do(B), C, S ), where S is an arbitrary subset of S.

A search based approach for causal effect identification
The key to identification of causal effects is that interventional expressions can be manipulated using the rules of do-calculus.We present these rules for augmented graphs where an additional intervention variable I X such that I X → X is added to the induced graph for each variable X (Spirtes et al. 1993;Pearl 2009;Lauritzen 2000 means that nodes Y and intervention nodes I Z of Z are d-separated given X, Z and W in a graph where edges incoming to (intervened) X have been removed (Hyttinen, Eberhardt, and Järvisalo 2015;Dawid 2002).The three rules of do-calculus (Pearl 1995) can be expressed as follows: The rules are often referred to as insertion/deletion of observations, exchange of actions and observations, and insertion/deletion of actions respectively.Each rule of do-calculus is Algorithm 1 An outline of a search for causal effect identification.Input: Target Q = P (Y | do(X), W), a semi-Markovian graph G and a set of known input distributions P = {P 1 , . . ., P n }.Output: A formula for Q or NA if the effect is not identifiable.
1: for each P i ∈ P do 2: Derive new distributions from P i such that: • The required d-separation criteria are satisfied by G.
• Any possible additional input required must also be in P.

3:
Add the new identified distributions to P.

4:
If Q was derived, return a formula for it.5: Return NA.
only applicable if the accompanying d-separation criterion (on the right-hand side) holds in the underlying graph.In addition to these rules, most derivations require basic probability calculus.Do-calculus directly motivates a forwards search over its rules.The outline of this type of search is given in Algorithm 1.The algorithm derives new identifiable distributions based on what has been given as the input or identified in the previous steps.For each identified distribution every rule of do-calculus and standard probability manipulations of marginalization and conditioning are applied in succession, until the target distribution is found, or no new distributions can be found to be identifiable.A preliminary version of this kind of search is used by Hyttinen et al. (2015) as a part of an algorithmic solution to causal effect identifiability when the underlying graph is unavailable.
The formulas produced by Algorithm 1 correspond to short derivations and unnecessarily complicated expressions are avoided.Also, only distributions guaranteed to be identifiable are derived and used during the search.Formulas for intermediary queries that were identified during the search are also available as a result.Alternatively, one could also start with the target and search towards the input distributions; a search in this direction will spend time deriving a number expressions that are inevitably non-identifiable based on the input.A depth-first search would produce unnecessarily complicated expressions.The search can easily derive for example the back-door criterion in the graph of Figure 2(a) as shown by the derivation in Figure 2 From P (X, Y, Z) the search first derives the marginal P (Z) and the conditional P (Y | X, Z).
The two terms can be combined via the product rule of probability calculus to get P (Y, Z | do(X)) and finally the target is However, it is not straightforward to make a search over do-calculus computationally feasible.The search space in Figure 2(c) shows only the parts that resulted in the identifying formula: for example all passively observed marginals and conditionals over V can be derived from the input P (V).Especially in a non-identifiable case a naive search may go through a huge space before it can return the non-identifiable verdict.The choice of rules is also not obvious: a redundant rule may make the search faster or slower; false non-identifiability may be concluded if a necessary rule is missing.Also the order in which the rules are applied can have a large impact on the performance of the search.In the following sections we will provide non-trivial Figure 2: The back-door criterion holds in the example graph (a) for Z.The augmented graph (b) includes the intervention node I X for X explicitly.The labels M, C, P, R2 and R3 in the derivation of (c) refer to marginalization, conditioning, product rule and Rules 2 and 3 of do-calculus respectively (see Table 2).The required Table 2: The rules used to manipulate input distributions of the form P (Y | do(X), W).
The output distribution is identified if the input has been previously identified and if the corresponding d-separation Criteria 1 hold in the graph (for Rules 1±, 2± and 3±) or if the additional input has also been identified (Rules 6±).The sets Y, X and W are disjoint.The role of the set Z depends on the rule being applied (see Table 3).
solutions to these challenges.

Rules
Table 2 lists the full set of rules used to manipulate distributions during the search, generalizing the work by Hyttinen et al. (2015).

Do-calculus
Rules 1±, 2± and 3± correspond to the rules of do-calculus such that Rules 1+, 2+ and 3+ are used to add conditional variables and interventions and Rules 1−, 2−, 3− are used to remove them.Each rule is only valid if the corresponding d-separation criterion given in the beginning of Section 3 holds.

Probability theory
Rule 4 performs marginalization over Z ⊂ Y, and produces a summation at the formula level: Similarly, Rule 5 conditions on a subset Z ⊂ Y to obtain the following formula: .
Rules 6+ and 6− perform multiplication using the chain rule of probability which requires two known distributions.When Rule 6+ is applied, the distribution P (Y | do(X), W) is known and we check whether P (Z | do(X), W \ Z) is known as well.For Rule 6−, the roles of the distributions are reversed.In the case of Rule 6+, Z is a subset of W and we obtain The two version of the chain rule are needed: it may be the case that when expanding

Improving the efficacy of the search
In this section, we present various techniques that improved the efficiency of the search.These findings are implemented in a search algorithm in Section 3.3.

Term expansion
Term expansion refers to the process of deriving new distributions from an input distribution using the rules of Table 2.By term we mean a single identified distribution.A term is considered expanded if the rules of Table 2 have been applied to it in every possible way when the term is in the role of the input.Note that an expanded distribution may still take the role of an additional input when another term is being expanded.Consider the step of expanding the input term in Table 2 to all possible outputs with any rule.This can be done by enumerating every non-empty subset Z of V, and applying the rule with regard to it.Table 3 outlines the requirements for Z for each rule of the search.
Table 3: The conditions for the enumerated subset Z for applying the rules of Table 2 to a term P (Y | do(X), W).For Rules 6+ and 6−, the conditions specify valid variables of the second required term.
similar requirements as Rule 1+.Well-defined subsets for using Rule 3− are the same as for rule 2−.For Rules 4 and 5, the only requirement is that Z is a proper subset of Y.When the chain rule is applied with Rule 6+, we require that the variables of the second product term is observed in the first term.When applied in reverse with Rule 6−, the variables of the second term must not be present in the first term.

Termination conditions
Additionally, Table 3 lists the termination condition for each rule: if it is satisfied by the current term to be expanded we know that the rule cannot be applied to it.The following simple lemma shows that when any of the termination conditions hold, no new distributions can be derived from it using the respective rule, which allows the search to directly proceed to the next rule.
Lemma 1.Let G be a semi-Markovian graph and let Y, X and W be disjoint subsets of V. Then all of the following are true: Proof.For (i), the set W is empty so the application of Rule 1− using any subset Z would result in P (Y | do(X), W \ Z) = P (Y | do(X), W) which is already identified.For (ii), the set W is empty so no observation can be exchanged for an action using the second rule of do-calculus.For (iii), the set X is empty so no action can be exchanged for an observation Figure 3: A graph for the example where all rules of Table 2 are required for identifying the target quantity.
using the second rule of do-calculus.For (iv), the set X is empty so the application of Rule 3− using any subset Z would result in , W) which is already identified.For (v) and (vi), the set Y only has a single vertex, so it cannot have a non-empty subset.For (vii), the set W is empty so no subset Z ⊂ W can exist for the second input.

Rule necessity
The Rule 1 of do-calculus can be omitted as shown by Huang and Valtorta (2006b, Lemma 4).Instead of inserting an observation using Rule 1, we can insert an intervention and then exchange it for an observation.Similarly, an observation can be removed by first exchanging it for an intervention and then deleting the intervention.It follows that Rules 1+ and 1− of Table 2 are unnecessary for the search.The following example shows that the remaining rules of Table 2 are all necessary.In the graph of Figure 3, the causal effect P (Y, X 1 | do(X 2 ), W ) can be identified from the inputs ) when all rules are available, but not when any individual rule is omitted.This can be verified by running the search algorithm presented at the beginning of Section 3 or the more advanced algorithm of Section 3.3 with each rule switched off individually.

Early detection of non-identifiable instances
Worst-case performance of the search can be improved by detecting non-identifiable quantities directly based on the set of inputs before launching the search.The following theorem provides a sufficient criterion for non-identifiability.
Theorem 1.Let G be a semi-Markovian graph, let Q = P (Y | do(X), W) and let Proof.Since Y ⊆ n i=1 A i , there exists a variable Y j ∈ Y such that none of the sets A i contain it.No rule of Table 2 outputs a distribution P (Y | do(X ), W ) such that some member of Y would not already exist on the left-hand side of the input or additional input of the rule.Thus there is no sequence of rules that when applied to the available inputs P would result in a distribution of the form P (Y j , • | do(•), •).Thus there is no such sequence for P (Y | do(X), W).
In other words, Theorem 1 can be used to verify that the entire set Y of a target distribution P (Y | •) cannot be constructed from the inputs.If this is the case, the target quantity is not identifiable.

Heuristics
During the search, we always expand one term at a time through the rules and store the newly identified distributions.In order for the search to perform fast, we need to decide which branches are the most promising and should therefore be expanded first.We can do this by defining a proximity function relating the source terms and the target query, and by always expanding the closest term first.
Our suggestion here is motivated by the way an educated person might apply do-calculus in a manual derivation.Our chosen proximity function h links the target distribution P t = P (A t | do(B t ), C t ) and a source distribution P s = P (A s | do(B s ), C s ) in the following way: Each input distribution and terms derived using the search are assigned into a priority queue, where the priority is determined by the value given by h.Distributions closer to the target are prioritized over other terms.
The weight 10 for the term |A t ∩ A s | indicates that having the correct response variables is considered as the first priority.Having the correct intervention is considered as the second priority (weight 5) and having the correct condition as the third priority (weight 3).The remaining terms in h penalize variables that are in the target distribution but not in the source distribution or vice versa.Again, variables that are intervened on are considered to be more important than conditioning variables.

The search algorithm
We take Algorithm 1 as our starting point and compile the results of Section 3.2 into a new search algorithm called do-search.This algorithm is capable of solving generalized identifiability problems (Row 8 in Table 1) while streamlining the search process through a heuristic search order and elimination of redundant rules and subsets.The pseudo-code for do-search is shown in Algorithm 2.
The algorithm begins by checking whether the query can be solved trivially without performing the search.This can happen if the target Q is a member of the set of inputs or if Theorem 1 applies.Next, we note that each input distribution in the set P is marked as unexpanded at the beginning of the search.Distributions in P are expanded one at a time by applying every rule of let P be the unexpanded distribution closest to the target: P = argmax let M be the set of rules of Table 2, without Rules 1±, such that the termination conditions of Table 3 do not hold with respect to P .

7:
let P * be the set of all distributions derived from P using the rules in M 8: for each new candidate distribution P * ∈ P * , do 9: if P * is already in P, then continue 10: if the validity conditions of if Derive a formula F for Q by backtracking. 15: return F 16: Add P * to P, add P * to U 17: Mark P as expanded: remove P from U 18: return NA The iteration over the unexpanded distributions U proceeds as follows (Lines 4-5).Each input distribution and terms derived from it are assigned into a priority queue, where the priority is determined by the value given by the proximity function h.Distributions closest to the target are expanded first.In the implementation, only the actual memory addresses of the distribution objects are placed into the queue.The set P is implemented as a hash table that serves as a container for all input distributions and those derived from them.Each new distribution is assigned a unique index that also serves as the hash function for this table.The distribution objects contained in the table are represented uniquely by three integers corresponding to the sets A, B and C of the general form P (A | do(B), C).A distribution object also contains additional auxiliary information such as which rule was used to derive it, whether it is expanded or not and from which distribution it was obtained.This information is used to construct the derivation if the target is found to be identifiable.
Multiple distributions can share the same value of the proximity function h.In the case that multiple candidates share the maximal value, the one that was derived the earliest takes precedence.When the unexpanded distribution currently closest to the target is determined, the rules of Table 2 are applied sequentially for all valid subsets dictated by Table 3.When rules two and three of do-calculus are considered, the necessary d-separation criteria is checked from G (Line 12).For the chain rule, the presence of the required second input is also verified.The reverse lookup is implemented by using another hash table, where the hash function is based on the unique representation of each distribution object.The values contained in the table are the indices of the derived distributions.The same hash table is also used to ensure that we do not attempt to derive distributions again that have been previously found to be identifiable from the inputs.
We construct a set M of applicable rules for each unexpanded distribution P using the termination conditions of Table 3 (Line 6).If all the necessary conditions have been found to hold for an applicable rule and a subset, the newly derived distribution P * is added to the set of known distributions and placed into the priority queue as an unexpanded distribution.When the applicable rules and subsets have been exhausted for the current distribution P , the term is marked as expanded and removed from the queue (Line 17).If the target distribution is found at any point (Line 13), a formula is returned for it in terms of the original inputs.Alternatively, we can also continue deriving distributions to obtain different search paths to the target that can possibly produce different formulas for it.If instead we exhaust the set of unexpanded distributions by emptying the queue, the target is deemed non-identifiable by the search (Line 18).
We keep track of the rules that were used to derive each new distribution in the search.This allows us to construct a directed graph of the derivation where each root node is a member of the original input set P and their descendants are the distributions derived from them during the search.Each edge represents a manipulation of the parent node(s) to obtain the child node.For an identifiable target quantity, the formula F is obtained by backtracking the chain of manipulations recursively until the roots are reached (Line 14).The derivation of the example in the beginning of Section 3 depicted in Figure 2(c) can be efficiently found by applying this procedure.
We assess the worst case complexity of do-search in terms of the input graph size, which is the primary determining factor of the search time.Checks for separation and termination and validation conditions all run in polynomial time with respect to the number of vertices, but the number of distributions grows very rapidly.In a hypothetical absolute worst case scenario, the target is not identifiable, but every other distribution is.Supposing a graph with d vertices, we can determine how many distributions would have to be derived by the search in order to exhaust the search space.For a distribution of the form P (A | do(B), C), any variable in V can either belong to the sets A, B or C or not be present in the distribution, meaning that there are 4 d ways to categorize every variable.However, we must account for the fact that there must always be at least one variable in the set A. There are 3 d ways to assign all the variables in such a way that no variable is a member of set A. Thus the total number of distributions considered by do-search grows as O(4 d ).See the simulations of Section 5.1 for running time performance in more realistic settings.

Soundness and completeness properties
We are ready to establish some key theoretical properties of do-search.The first theorem considers the correctness of the search.
Theorem 2 (Soundness).do-search always terminates: if it returns an expression for the target Q, it is correct, if it returns NA then Q is not identifiable with respect to the rules of do-calculus and standard probability manipulations (in Table 2).
Proof.Each new distribution is derived by using only well-defined manipulations as outlined by Table 3 and by ensuring that the required separation criteria hold in G when rules of do-calculus are concerned.It follows that if the search terminates and returns a formula for the target distribution, it was reached from the set input distributions through a sequence of valid manipulations.If do-search terminates as a result of Theorem 1, we are done.Suppose now that Theorem 1 does not apply.By definition, do-search enumerates every rule of Table 2 for every well-defined subset of Table 3.By Lemma 1, no distributions are left out by applying the termination criteria of Table 3.We know that if Rules 1± of Table 3 are omitted, the distributions generated by these rules can be obtained by a combination of Rules 2± and 3±.Furthermore, the order in which the distributions are expanded does not matter, as every possible manipulation is carried out nonetheless.The search will eventually terminate, since distributions that have already been derived are not added again to the set of unexpanded distributions and there are only finitely many ways to apply the rules of Table 2.
The following theorem provides a completeness result in connection to existing identifiability results.Since do-calculus has been shown to be complete with respect to (conditional) causal effect identifiability, z-identifiability, g-identifiability and transportability, it follows that dosearch is complete for these problems as well.
Theorem 3 (Completeness).If do-search returns NA in the settings in Rows 1-4 and 6 in Table 1, then the query is non-identifiable.
Proof.Do-calculus has been shown to be complete in these settings.The rules of probability calculus encode what is used in the algorithms as can be seen for example from the proofs of Theorem 7 and Lemmas 4-8 of Shpitser and Pearl (2006b).
It is not known whether the rules implemented in do-search are sufficient for other more general identifiability problems since it is conceivable that some additional rules might exist that would be required to achieve completeness.One such generalization is the inclusion of missing data in the causal model, which we present in Section 4.However, if one were to show that do-calculus (or any other set of rules included in do-search) is complete for some special case of the generalized identifiability problem, then do-search would be complete for this problem as well.In the following sections we will use the term "identifiable by do-search" to refer to causal queries that can be identified by do-search.

Extension to missing data problems
The SCM framework can be extended to describe missing data mechanisms.For each variable V i , two special vertices are added to the causal graph.The vertex V * i is the observed proxy variable which is linked to the true variable V i via the missingness mechanism (Little and Rubin 1986;Mohan et al. 2013): where NA denotes a missing value and R V i is called the response indicator (of V i ).In other words, the variable V * i that is actually observed matches the true value V i if it is not missing (R V i = 1).We note that in this formulation, each true variable has its own response indicator, meaning that we do not consider shared indicators between variables or multiple indicators for a single variable.Figure 11 in Section 6.4 depicts some examples of graphs containing missing data mechanisms.Furthermore, if there is no missingness associated with a given variable V i meaning that it is fully observed, the corresponding response indicator R V i always has the value 1.The omission of a proxy variable and a response indicators of a specific variable from a graph encodes the assumption that the variable in question if fully observed.Note that intervention nodes are added for true variables and response indicators but not for proxy variables.On a symbolic level one could intervene on proxy variables, however we are only interested in interventions that keep Equation 2 intact.
The observed vertices of the causal diagram can be partitioned into three categories where V t is the set of true variables, V * is the set of proxy variables and V r is the set of response indicators.For any subset Z ⊂ V we define the same partition via The definition of the response indicator connects a proxy variable and a true variable.Typically this connection is only of interest for those variables V i that have missing data, meaning that P (R V i = 1) < 1.Furthermore, we often utilize a proxy variable corresponding to a specific true variable and conversely, a true variable corresponding to a specific proxy variable.We define this correspondence explicitly in the following way Similarly, given a set Z we often require the set of the response indicators that define the missingness mechanism for the true variables that are member of Z.This set is defined as follows It is important to note the difference between the sets Z r and R Z ; the first set denotes the set of response indicators that are members of Z while the second gives the corresponding response indicators for the true variables that are member of Z.
Our method is also capable of processing queries when the causal graph contains missing data mechanisms where the sets A i , B i and C i of some of the input distributions may be restricted to contain observed variables in V * ∪ V r .An active response indicator Similarly, for sets of response indicators R 1 Z denotes that all indicators in the set are active.Proxy variables are not explicitly shown in graphs for clarity.
Determining identifiability is challenging under missing data.As evidence of this, even some non-interventional queries require the application of do-calculus (Mohan and Pearl 2018).Furthermore, the rules used in the search of Table 2 are no longer sufficient and deriving the desired quantity necessitates the use of additional rules that stem from the definition of the proxy variables and the response indicator.Each new true variable also has a higher impact on computational complexity, since the corresponding response indicator and proxy variable are always added to the graph as well.Table 4 extends the set of rules of Table 2 to missing data problems by providing manipulations related to the missingness mechanism.Rules 7± and 8± perform conditioning using the chain rule.These rules are necessary in the case that set Y contains missing data mechanisms that have been enabled and thus cannot be marginalized over by using Rule 5.  Set X may only contain true variables and response indicators.The roles of the sets Z and R Z depend on the rule being applied (see Table 5).
Rules 9± are used to enable response indicators, which then facilitates the use of Rules 10±.These last two rules exchange proxy variables to their true counterparts when the corresponding response indicators are enabled.For example, under the conditions specified in Table 5, Rule 9+ can be applied on P (Y, X * | R X ) to first obtain P (Y, X * | R 1 X ) by enabling R X .Then, Rule 10+ can applied to this distribution to obtain P (Y, X | R 1 X ) by exchanging X * for X.
Similarly to Table 3, Table 5 outlines the valid subsets Z for applying the extended rules of Table 4.A major difference to the original validity and termination conditions is the addition of the missing data condition that outlines the additional requirements that must be satisfied when missingness mechanisms are present.For the rules that are shared by Tables 2 and 4, the missing data condition ensures that a true variable and its proxy counterpart never appear in the same term at the same time.For example, we cannot add an intervention on X to P (X * ).It also ensures that we do not carry out summation over enabled response indicators in the case of rules 4 and 5.When applying Rules 9±, the condition also ensures that we do not attempt to enable a response indicator that is already enabled.For Rules 10±, the conditions guarantee that a proxy can only be exchanged to its true counterpart if its corresponding response indicator is enabled and present in the input term.
Additional termination conditions also apply to the new rules and their correctness is easily verified.
Lemma 2. Let G be a semi-Markovian graph and let Y, X and W be disjoint subsets of V. Then all of the following are true: Proof.For (i), the set Y only has a single vertex, so it cannot have a non-empty subset.For (ii), the set W is empty so no subset Z ⊂ W can exist for the second input.For (iii), and the set W r is empty so no assignment to value 1 can be performed.Similarly for (iv), the set Y r is empty so no assignment to value 1 can be performed.For (v) the set of active response indicators R a is empty, so no transformation from proxy variables to true variables via the missingness mechanism in Equation 2 can take place.
The task of selecting a suitable heuristic becomes more difficult when missing data are involved with the identifiability problem.The heuristic approach of Section 3.2 is no longer directly applicable due to the relation between proxy variables, response indicators and true variables.The proximity function considers X and X * as entirely different variables despite their connection and does not prefer the inclusion of response indicators.If the heuristic is applied as such, the search path will often involve a large number of manipulations which in turn leads to complicated expressions.For these reasons we do not apply a heuristic to missing data problems, but expand terms in the order in which they were identified.The improvements described in Section 3.2 still apply.
It is straightforward to adapt do-search to the new extended set of rules.In the pseudocode shown in Algorithm 2, we simply replace all references to Tables 2 and 3 by references to Tables 4 and Tables 5, respectively.When the validity condition is checked, we also verify that the missing data condition holds.Lemma 2 guarantees the correctness of the new termination criteria.Theorem 1 is also valid when the sets A i are replaced by , since it may be possible to exchange some proxy variable to a true variable that is present in the set Y of the target P (Y | do(X), W).

The dosearch package
We implemented do-search (Algorithm 2) in C++ and constructed an R interface using the Rcpp package (Eddelbuettel and Francois 2011).This interface is provided by the R package dosearch.Calling the search from R is straightforward via the primary function that carries the name of package.
dosearch(data, query, graph, transportability The required inputs of the function are data, query and graph.Parameter data is used to encode the set P of known input distributions of Algorithm 2 as a character string, where each distribution is separated by a new line.For example, if we have access to a set of distributions P = {P (W ), P (Y | X), P (Z | do(X), W )}, we would write R> data <-" + P(W) + P(Y|X) + P(Z|do(X),W) + " The do(•)-operator can either precede or succeed conditioning variables, but it must appear only once in a given term, meaning that expressions such as P(Y|do(A),B,do(C)) are not allowed, but should instead be given as P(Y|B,do(A,C)) or P(Y|do(A,C),B).If variable sets are desired, each member of the set has to be included explicitly.
Parameter query is used to describe the target Q of Algorithm 2 as a character string, similarly as the data.If we are interested in identifying P (Y | do(X), W ) we would write R> query <-"P(Y|do(X),W)" Instead of describing distributions via text, it is also possible to use the following structure that encodes the role of each variable via a numeric vector: Given a distribution of the form P (A | do(B), C) and a variable V , a value 0 means that V ∈ A, value 1 means that V ∈ B and value 2 means that V ∈ C.This format can also be used to input data as a list of numeric vectors: Finally, graph encodes the semi-Markovian graph G of the causal model as a character string with each edge on its own line.A directed edge from X to Y is given as X -> Y and a bidirected edge between X and Y is given as X <-> Y. Intervention nodes should not be given explicitly, since they are added automatically after calling dosearch.Furthermore, only vertices with incoming or outgoing edges should be included in graph.A variable with no connected edges can still appear in the input distributions and is automatically added to the graph as well.As an example, we can encode the graph of Figure 2(a) with an added bidirected edge between X and Y as follows Alternatively, one may use igraph graphs (Csardi and Nepusz 2006) in the syntax of the causaleffect package or DAGs created using the dagitty package.R> library("igraph") R> graph <-graph.formula(X-+ Y, Z -+ X, Z -+ Y, X -+ Y, Y -+ X) R> graph <-set.edge.attribute(graph,"description", 4:5, "U") R> library("dagitty") R> graph <-dagitty("dag{X -> Y; Z -> X; Z -> Y; X <-> Y}") The next two optional parameters of dosearch, transportability and selection_bias, are used to denote those vertices of G that should be understood as either transportability nodes or selection bias nodes, respectively.Providing these parameters may increase search performance in relevant problems.Both of these parameters should be given as character strings, where individual variables are separated by a comma, for example transportability = "S,T".Parameter missing_data, as the name suggests, is used to define missingness mechanisms 2 as a character string, where individual mechanisms are separated by a comma.In order to describe that R X is the response indicator of X we would write R_X : X, which also implicitly defines that X* is the proxy variable of X. Proxy variables do not need to be manually described in graph as they are automatically constructed based on the missing_data argument.
The list control can be used to set various additional parameters that are not directly related to the identifiability problem itself, but more so to the output of the search and other auxiliary details, such as benchmarking and obtaining derivations such as Figure 2(c).One such control parameter determines whether to use the search heuristic or not (heuristic = FALSE by default).Documentation of the dosearch package contains detailed information on the full list of control parameters.
The return object of dosearch is a list with three components by default.The first component, identifiability, is a logical value that takes the value TRUE when the target distribution described by query is identifiable from the inputs of data.The second component, formula, is a character string describing the target distribution in terms of the inputs in L A T E X syntax if the target is identifiable.Otherwise this component is just an empty character string.The third component call contains the arguments of the original function call.

Simulations
Here we report the results of a simulation study to assess the running time performance of dosearch and the impact of the search space reduction techniques as well as the search heuristic outlined in Section 3.2.
Our synthetic simulation scenario consisted of 1000 semi-Markovian causal graphs of 10 vertices that were generated at random by first generating a random topological order of the vertices followed by a random lower triangular adjacency matrices for both directed and bidirected edges.Graphs without a directed path from X to Y were discarded.We sampled sequentially input distributions of the form P (A | do(B), C) at random by generating disjoint subsets such that A is always non-empty.This was continued until the target quantity P (Y | do(X)) was found to be identifiable by the search.Then for each graph, we recorded the search times for the specific set of inputs that first resulted in the query to be identified and for the last set such that the target was non-identifiable.In other words, each graph generates two simulation instances, one for an identifiable query and one for a non-identifiable query.This setting directly corresponds to the setting of partially overlapping experimental data sets discussed in Section 2.2 for which no other algorithmic solutions exist.
To understand the impact of the search heuristic and the various improvements, we compare four different search configurations: the basic do-search without the search heuristic or improvements1 , one that only uses the search heuristic, one that only uses the improvements of Section 3.2 and one that uses them both.
Figure 4 shows the search times of the configurations compared to the basic configuration for identifiable instances.Most importantly, a vast majority of instances (96%) are solved faster than the basic configuration when both heuristics and improvements are used.The average search time with both heuristics and improvements enabled was 31.5 seconds and 80.2 seconds for the basic configuration.The search heuristic provides the greatest benefit for these instances as can be seen from Figure 4(b).Using a heuristic can sometimes hinder performance by leading the search astray and by causing additional computational steps through the evaluation of the proximity function.For example, there is a small number of instances where the search is over ten times slower than the basic configuration when using a heuristic.Fortunately, there are several instances in the opposite direction, where the heuristic provides over one hundred fold reduction in search time.Curiously, even using the improvements sometimes results in slower search times.This is most likely due to the elimination of Rule 1 of do-calculus, since it may be the case that the basic search is able to use this rule to reach the target distribution faster.More importantly, Figure 4 that the improvements clearly benefit the search.Furthermore, the benefit tends to increase as the instances get harder.
Figure 5 shows the search times of the configurations for non-identifiable instances.Relying only on a search heuristic provides no benefit here, as expected.The improvements to the search are most valuable for these instances, and in this scenario every non-identifiable instance was solved faster than baseline using the improvements, and when applied with the heuristic only one non-identifiable instance was slower than baseline.The average search time with both heuristic and improvements enabled was 134.6 seconds and 182.5 seconds for the basic configuration.The almost zero second instances are a result of Theorem 1 when no search has to be performed in order to determine the instance to be non-identifiable.The benefit of the improvements tends to increase as the instances get harder also for these instances.
Finally we examined the average run time performance of do-search, with all improvements and heuristics enabled.We replicated the previously described simulation scenario with the same number of instances (1000) for graphs with up to 10 vertices.Figure 6 shows the boxplots of search times on a log-scale for graphs of different size, including both identifiable and non-identifiable instances.Note that for every graph size there are a number of easily solvable instances that show up as outliers in this plot.Instances with graphs of 10 vertices are solved routinely in under 100 seconds.In this plot, the running times increase exponentially with increasing graph size (or number of variables).

New causal effect identification results
We present a number of results for various identifiability problems to showcase the versatility of do-search with the accompanying R code for some specific examples.

Multiple data sources with partially overlapping variable sets
Earlier generalizations of the identifiability problem assume nested experiments or entire distributions with the exception of surrogate outcome identifiability (Tikka and Karvanen 2019) which also has its own intricate set of assumptions regarding the available distributions.
None of these assumptions are needed in do-search and it can be used to solve identifiability problems from completely arbitrary collections of input distributions.
We showcase identifiability from multiple data sources by three examples.The first example is the human resource management problem presented in the introduction and shown in Figure 1.The question of interest was the identifiability of P (Y | do(X)) from the data sources P (Y, B, E, X) and P (A, B, X).The answer can be provided with the following lines of R code: The result means that the causal effect is identifiable and the returned formula is By running an additional line of code R> dosearch(data, "p(y,b,e,x,a)", graph, control = list(heuristic = TRUE)) The query p(y,b,e,x,a) is non-identifiable.
we learn that the joint distribution P (Y, B, E, X, A) is not identifiable.This rules out the possibility to solve the problem with causaleffect which requires the joint distribution as an input.
In the second example we consider identifiability of P (Y 1 , Y 2 | do(X 1 , X 2 )) in the graph of Figure 7(a) from P (V), The target quantity is identifiable and do-search produces the following formula for it In the third example we consider identifiability of Again, the target quantity is identifiable and do-search outputs the following formula This example shows that a heuristic approach can also help us to find shorter formulas.If we run do-search again without the heuristic in this instance, the output formula is instead . We Figure 7: Graphs for the examples on identifiability problems combining both observational and experimental distributions.Some interesting examples are shown in Figure 11.Graphs (a) and (b) are the unique graphs that fulfill the conditions specified in Examples 2 and 3, respectively.Graph (c) is the graph with the smallest number of edges where marginals P (X) and P (Y ) can be identified but the joint distribution P (X, Y ) or causal effect P (Y | do(X)) cannot be identified by do-search.
In Graph (d), P (X), P (Y ), P (X, Y ) and P (Y | do(X)) are not identifiable by do-search but the conditional distribution P (Y | X) can be identified as follows In Equation 3, the numerator resembles the joint distribution  In Equation 4, the summation also goes over the cases where X * = NA and the distribution of Y must be estimated also on the condition that X is not observed.

Causal inference under case-control design
Case-control design (Breslow 1996) is commonly used in epidemiology to study risk factors of rare diseases.In the basic setup, a fixed number of disease cases and a fixed number of controls are selected for the risk factor measurements.When the disease is rare, this design leads to substantial savings in the sample size compared to simple random sampling.(5)  In typical applications response Y is binary but in the non-parametric formula of Equation 5response can be discrete or continuous.A more complicated example is shown in Figure 12(b) where the causal effect of risk factor X on disease endpoint Y fulfills the front-door criterion (Pearl 1995) with respect to mediator Z and the data are collected from a case-control design where the selection depends Y and there is occasional item non-response in X and Z.We observe data P (Y * , X * , Z * , R Y , R X , R Z ) and know the marginal distribution P (Y ) from other sources.Applying do-search we obtain the result \frac{\left(p(y)p(x,z|r_x = 1,y,r_y = 1,r_z = 1)\right)} {\sum_{y} \left(p(y)p(x,z|r_x = 1,y,r_y = 1,r_z = 1)\right)}

Discussion
The presented algorithm, do-search, removes the need for manual application of do-calculus, which is time-consuming and prone to errors.Systematic analyses such as the one in Section 6.4 are practically unreachable with manual application of do-calculus.Superiority of do-search over a simple forwards breadth-first search was attained through a combination of a search heuristic and a reduction of the search space.Some further approaches were attempted but later discarded as non-beneficial.These include caching separation criteria that hold in the graph after they are first evaluated, pre-computing valid subsets for each subset size and enumerating subsets in an order of increasing cardinality.
As the simulations showed, our intuitive heuristic yielded significant improvements in search performance.The proximity function defined in Section 3.2 uses only the information contained in the distributions themselves.One approach could be to also take the structure of the graph into account in the proximity function.Further study is needed for finding a heuristic that performs well when missing data mechanisms are present in the graph.
The scalability of do-search is limited due to vast search space of possibly identified causal effects.Currently, algorithms with polynomial complexity currently exist only for the simpler problems (see Table 1).However, based on the simulation results, do-search solves identifiability problems in graphs of ten vertices in under two minutes on average.By our observation, graphs typically analyzed in literature related to identifiability problems have fewer vertices.
The theoretical computational complexity of the general form of the causal identifiability problem defined in Section 2 remains an important and interesting question.
The search could also be used to obtain formulas that are in some sense simpler than those produced by existing identifiability algorithms.A simplification algorithm by Tikka and Karvanen (2017b) functions as a post-processing step after the identifying formula has already been obtained by the ID algorithm.Given a measure of simplicity, the search heuristic could be adjusted to find simple formulas directly without resorting to separate simplification procedures.In some specific scenarios, such as the standard causal effect identifiability problem, an approach known as pruning (Tikka and Karvanen 2018) could be incorporated into the search.Pruning refers to the removal of vertices from the graph, that are not required for determining identifiability.
Finally we note that identifiability has also been studied under the assumption that the functional relationships depicted by the causal model are linear (Angrist, Imbens, and Rubin 1996;van der Zander and Liskiewicz 2016;Chen, Kumor, and Bareinboim 2017) or nonparametric with additive error terms (Peters, Mooij, Janzing, and Schölkopf 2014;Peña and Bendtsen 2017) and when the causal graph is not completely known (Maathuis, Kalisch, and Bühlmann 2009;Entner, Hoyer, and Spirtes 2013;Hyttinen et al. 2015;Perković et al. 2015;Malinsky and Spirtes 2017;Jaber, Zhang, and Bareinboim 2018).Extending the search in these directions is an interesting line of future research.

Conclusion
We presented do-search: a do-calculus based search capable of solving identifiability problems for which no known solutions exist.This contribution is especially useful for researchers working in the field of causal inference to confirm theoretical results or to find counterexamples to identifiability claims.In practical terms, the search can also provide solutions to complicated problems such as combining transportability and selection bias, recovering from multiple bias sources or identifying causal quantities in the presence of missing data that cannot be solved by any other existing method.The R package dosearch providing an implementation of do-search is available on CRAN.

Figure 1 :
Figure 1: Graph for the example on human resource management in a company.

Figure 4 :Figure 5 :
Figure4shows the search times of the configurations compared to the basic configuration for identifiable instances.Most importantly, a vast majority of instances (96%) are solved faster than the basic configuration when both heuristics and improvements are used.The average search time with both heuristics and improvements enabled was 31.5 seconds and 80.2 seconds for the basic configuration.The search heuristic provides the greatest benefit for these instances as can be seen from Figure4(b).Using a heuristic can sometimes hinder performance by leading the search astray and by causing additional computational steps through the evaluation of the proximity function.For example, there is a small number of instances where the search is over ten times slower than the basic configuration when using a heuristic.Fortunately, there are several instances in the opposite direction, where the heuristic provides over one hundred fold reduction in search time.Curiously, even using the improvements sometimes results in slower search times.This is most likely due to the elimination of Rule 1 of do-calculus, since it may be the case that the basic search is able to use this rule to reach the target distribution faster.More importantly, Figure4(c) shows

Figure 6 :
Figure6: Boxplots of search times for both identifiable and non-identifiable instances in graphs of n = 4, . . ., 10 vertices.The vertical axis uses a logarithmic scaling.Instances where the search time was less than 10 −4 seconds were omitted for clarity.
data, query, graph, control = list(heuristic = TRUE)) \sum_{b,a}\left(p(a)\left(p(b|x,a)\sum_{e} \left(p(e)p(y|x,b,e)\right)\right)\right) R X are not independent.The denominator is the marginal of this pseudo joint distribution.In Graph (e), P (X), P (Y ) and P (X, Y ) are not identifiable by do-search but P (Y | X) and P (Y | do(X)) are identifiable and can be both estimated with 5 Graphs where there is neither X → Y nor Y → X.

Figure 10 :
Figure10: Venn diagrams indicating the number of graphs were different distributions can be identified do-search.The intersection of P (X) and P (Y | X) shows the number of graphs were P (X, Y ) can be identified.The total number of possible graphs is 3072 in both cases.
Figure12(a)shows the missingness graph for a situation where the inclusion to the study (indicator R Y ) depends on the disease endpoint Y .The risk factors X are measured for the subset R Y = 1 but occasionally the values are missing (indicator R X ).It is immediately seen that neither the causal effect P (Y | do(X)) nor conditional distribution P (Y | X) can be identified because of the edge Y → R Y .However, if the prevalence of the disease in the population, i.e., the marginal distribution P (Y ), is known, the causal effect P (Y | do(X)) can be identified.The result is provided by do-searchP (Y | do(X)) = P (Y )P (X | Y, R Y = 1, R X = 1) Y P (Y )P (X | Y , R Y = 1, R X = 1).

Figure 11 :
Figure 11: Missingness graphs used as example cases.Proxy variables are omitted for clarity.

Figure 12 :
Figure 12: Missingness graph for the case-control examples.
(Lee et al. 2019)ability, the input does not contain the passive observational distribution P (V) but consists instead of surrogate experiments on the sets {B i }(Lee et al. 2019)without the assumption of nested experiments.The assumption of entire distributions holds.
Table3tells us that when an observation Z is added using Rule 1+, it cannot be contained in any of the sets Y, X or W since they are already present in the term.Only observations that are present can be removed, which is why Z has to a subset of W when applying Rule 1−.We may skip the application of this rule if the set of observations is empty for the current term.The exchange of observations to experiments using Rule 2+ has similar requirements for set Z as Rule 1−.Exchanging experiments to observations using Rule 2− works in a similar fashion.Only experiments that are present can be exchanged which means that Z ⊆ X.This rule can be skipped if the set of experiments is empty.New experiments are added using Rule 3+ with Table 2 in every possible way.Target Q = P (Y | do(X), W), a semi-Markovian graph G and a set of known distributions P = {P 1 , . . ., P n }.Output: A formula F for Q in terms of P or NA 1: if Q ∈ P, return Q 2:if target is non-identifiable by Theorem 1, then return NA 3: let U be the set of unexpanded distributions, initially U := P 4: while U = ∅, do 5: Table 3 are not satisfied by P * , then continue 11: if an additional input is required that is not in P, then continue 12: if Rule 2± or 3± of Table 2 is applied and the corresponding d-separation Criterion 1 is not satisfied by G, then continue 13:

Table 4 :
Extended set of rules for missing data problems used to manipulate input distributions of the form P (Y | do(X), W).Rules 1±, 2±, 3±, 4, 5 and 6± are the same as in Table2.
For Rules 7± and 8±, the additional input has also been identified.The sets Y, X and W are disjoint.Sets Y and W may contain true variables, proxy variables and response indicators.

Table 5 :
The conditions for the enumerated subset Z for applying the rules of Table4to a term in the input column.
Here T = Y ∪ X ∪ W and the sets Y, X and W are those present in the input term P (Y | do(X), W).Active response indicators of the input are denoted by R a .For Rules 6±, 7± and 8±, the conditions specify valid variables of the second required term.Validity conditions for Rules 1±, 2±, 3±, 4, 5 and 6± are the same as in Table3.(i)If|Y| = 1, then Rules 7± ofTable 4 cannot be used.(ii) If W = ∅, then Rule 8− of Table 4 cannot be used.(iii) If W r = ∅ then Rule 9+ of Table 4 cannot be used.(iv) If Y r = ∅ then Rule 9− of Table 4 cannot be used.(v) If R a = ∅, then Rules 10± of Table 4 cannot be used.
can run these examples in R by writing

Combining transportability and selection bias
Bhattacharya et al. (2019)st(heuristic = TRUE))Input distributions that originate from multiple sources while being simultaneously affected by selection bias can be considered with do-search.This kind of problem cannot be solved with algorithms RC or TR mz of Table1.As an example we consider one source domain and a target domain with two input data sets: a biased distribution P (X, Y, Z | S) from the target domain and an unbiased experimental distribution P (Y, Z | do(X), T ) from the source domain.We evaluate the query P (Y | do(X)) in the graph of Figure8using these inputs.In the figure transportability node T is depicted as a gray square and selection bias node S is depicted as an open double circle.The query is identifiable and do-search outputs the as shown byBhattacharya et al. (2019)(Row 10 of Table1).It follows from the results ofBhattacharya et al. (2019)that the rules of Table4are not complete for missing data problems.Differently from the theorem byMohan et al. (2013)and the algorithms byShpitser et al. (2015)andBhattacharya et al. (2019), do-search can however address missing data problems where we consider identification of a marginal or a conditional distribution.In addition, do-search can address missing data problems with multiple input distributions.The queries P (X, Y ), P (X), P (Y ), P (Y | X) and P (Y | do(X)) were evaluated using dosearch in these 6144 graphs with the input distribution P (X * , Y * , R X , R Y ).The results are summarized by Venn diagrams in Figure10.The results are also available as a data set bivariate_missingness in the R package dosearch.Using this data set we are able to showcase examples on non-identifiability and find interesting special cases by direct evaluation of all possible bivariate missingness graphs.The first example relates non-identifiability to the of the number of edges present in the graph.
Example 1.Let K denote the number of edges in a bivariate missingness graph that does not have edge Y → X.The joint distribution P (X, Y ) is not identifiable by do-search if K > 5, marginal distribution P (X) is not identifiable by do-search if K > 9, marginal distribution P (Y ) and conditional distribution P (Y | X) are not identifiable by do-search if K > 8.The next example specifies the graph with the largest number of edges where both the joint distribution of X and Y and the causal effect of X on Y can be identified.Example 2. The graph in Figure11(a) is the only bivariate missingness graph that (i) has edge X → Y , (ii) has five edges, and (iii) allows for the identification of P (X, Y ) and P (Y | do(X)) by do-search.The third example specifies the graph with the largest number of edges where the marginal distributions are identifiable while the joint distribution and the causal effect of X on Y are non-identifiable.Example 3. The graph in Figure11(b) is the only bivariate missingness graph that (i) has five edges, and (ii) allows for the identification of P (X) and P (Y ), and (iii) does not allow for the identification of P (X, Y ) or P (Y | do(X)) by do-search.No bivariate missingness graph that has more than five edges fulfills the conditions (ii) and (iii).