diff --git a/paper/paper.md b/paper/paper.md index 190a5fa..eb07ebd 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -1,7 +1,8 @@ --- affiliations: - index: 1 - name: Harvard T.H. Chan School of Public Health, U.S.A. + name: Department of Biostatistics, Harvard T.H. Chan School of Public + Health authors: - affiliation: 1 corresponding: true @@ -11,90 +12,100 @@ authors: name: Nima S. Hejazi orcid: 0000-0002-7127-2789 bibliography: paper.bib -date: 2024-11-13 +date: 2024-11-20 tags: - Julia - statistics - causal inference - tables -title: "CausalTables.jl: Simulating and storing data for statistical +title: "`CausalTables.jl`: Simulating and storing data for statistical causal inference in Julia" toc-title: Table of contents --- # Summary -Estimating the strength of causal relationships between variables is a -problem of prime importance across scientific disciplines -- and one for -which many competing statistical methods are being developed. -CausalTables.jl provides tools to evaluate and compare statistical -causal inference methods in Julia. The package provides two main -functionalities. First, it implements a `CausalTable` interface for -storing data with partially-labeled causal structure in a -Tables.jl-compatible format. Second it provides a -`StructuralCausalModel` for randomly generating data with a given causal -structure, as well as computing ground truth parameters. When used -together, both functionalities allow users to more easily use and -benchmark the growing number of methods for causal inference in Julia. +Estimating the strength of causal relationships between variables is an +important problem across many scientific disciplines---and one for which +several frameworks have been developed. A variety of statistical methods +have been developed to estimate and obtain inference about causal +quantities, yet few tools readily support the comparison of candidate +approaches. 
`CausalTables.jl` provides tools to evaluate and compare +statistical causal inference methods in Julia. The package provides two +main functionalities. Firstly, it implements a `CausalTable` interface +for storing data with partially-labeled causal structure in a +`Tables.jl`-compatible format. Secondly, it introduces a +`StructuralCausalModel` for randomly generating data with a +user-specified causal structure while also supporting computing ground +truth parameters under the given experiment. Together, these +functionalities expand the Julia ecosystem by supporting the use and +benchmarking of the growing number of causal inference methods. # Statement of need -The field of causal inference helps scientists and decision-makers -understand cause-and-effect relationships between variables in data -[@hernan2020causal]. As interest in this field has grown across -disciplines, so too has the development of software tools for estimating -causal effects. While Julia packages for causal inference have begun to -emerge -- including TMLE.jl [@TMLE.jl] and CausalELM.jl [@CausalELM.jl] -for estimation and CausalInference.jl for graph discovery [@Schauer2024] --- the ecosystem is still in its infancy. Because new methods for causal -inference in various settings are being developed at a rapid pace, it is -important to have tools that make it easy to evaluate and compare their -performance. The goal of CausalTables.jl is to provide such a tool in -Julia. - -Currently, those attempting to benchmark causal inference methods in -Julia face two major challenges. First, packages often have inconsistent -interfaces. For example, some packages might require the user to provide -treatment and response variables as individual vectors, while others -might require the entire dataset in a Tables.jl-compliant format, with -treatment and response labeled via strings or symbols. 
CausalTables.jl -provides a `CausalTable` interface that simplifies packaging data and -auxiliary causal information together. - -The second major challenge is that benchmarking often requires -simulating data from a known Structural Causal Model (SCM) -[@pearl2009causality] and comparing estimated effects to some ground -truth. An SCM is a statistical model, typically defined as a sequence of -random draws, with each draw depending on the previous ones. Both the -potential outcomes and graph-based frameworks of causality can be -represented using SCMs. CausalTables.jl provides a simple way for users -to define their own SCM, draw random data from it, and compute or -approximate the true values of several common causal effect parameters. - -By addressing these two major challenges, CausalTables.jl helps simplify -and accelerate the development of tools for statistical causal inference -in Julia. The `CausalTable` interface extends Tables.jl, the most common -interface for accessing tabular data in Julia [@quinn2024tables]. The -SCM framework works in conjunction with Distributions.jl, the premier -Julia package for working with random variables -[@JSSv098i16; @Distributions.jl-2019]. By integrating seamlessly with -other common packages in the Julia ecosystem, CausalTables.jl ensures -both compatibility and ease of use for statisticians and students alike. +The quantitative science of causal inference has emerged over the past +three decades as a set of formalisms for studying cause-and-effect +relationships between variables from observed data +[@pearl2009causality; @hernan2020causal]; causal inference techniques +have helped applied scientists and decision-makers better understand +important phenomena in fields ranging from health and medicine to +politics and economics. As interest in causal inference has grown across +many disciplines, so too has the development of software tools for +estimating causal effects. 
While Julia packages for causal inference +have begun to emerge---including, for estimation, `TMLE.jl` [@TMLE.jl] +and `CausalELM.jl` [@CausalELM.jl], and, for causal discovery, +`CausalInference.jl` [@Schauer2024]---the ecosystem is still in its +infancy. New methods for causal inference are being developed at a rapid +pace, underscoring the need for tools designed to support and simplify +the evaluation and comparison of their performance. `CausalTables.jl` +aims to provide such a tool for the Julia language. Currently, attempts +to benchmark causal inference methods in Julia face two major +challenges. + +First, packages often have inconsistent APIs. For example, some packages +require the user to provide treatment and response variables as +individual vectors, while others require the entire dataset in a +`Tables.jl`-compliant format, with treatment and response variables +labeled via strings or symbols. `CausalTables.jl` provides a +`CausalTable` interface that simplifies packaging the data and auxiliary +causal knowledge together. Second, benchmarking of methods requires +simulating data for numerical experiments from a Structural Causal Model +(SCM) [@pearl2009causality] so as to compare candidate estimators to an +underlying ground truth (encoded via interventions on the SCM). An SCM +defines causal structure by envisaging a data-generating process as +random draws from a sequence of non-parametric structural equations, +with each draw depending on realizations from draws preceding it +temporally. `CausalTables.jl` provides a simple, user-friendly way to +define an SCM, sample data randomly from it, and compute or approximate +the underlying true values (determined by the SCM) of several common +causal effect parameters. + +By addressing these two major challenges, `CausalTables.jl` simplifies +and accelerates the development of tools for statistical causal +inference in Julia. 
The `CausalTable` interface extends `Tables.jl`, the +most common interface for accessing tabular data in Julia +[@quinn2024tables]. The SCM framework operates in conjunction with +`Distributions.jl`, the premier Julia package for working with random +variables [@JSSv098i16; @Distributions.jl-2019]. By integrating +seamlessly with other commonly used packages in the Julia ecosystem, +`CausalTables.jl` ensures both compatibility and ease of use for +statisticians and applied scientists alike. # Instructional use cases -The standard causal inference problem is to estimate the effect of a +A standard causal inference problem is to estimate the effect of a treatment variable $A$ on a response variable $Y$ in the presence of confounders $W$. One can benchmark causal inference methods in two ways: either by imposing a causal structure on an existing dataset, or by drawing new data randomly from a programmatically-defined SCM. Wrapping an existing dataset with causal structure is easy. The -`CausalTable` constructor creates a Tables-compliant data structure -coupled with causal information about its data. Calling convenience -functions on this object allows users to perform common causal data -processing tasks. For instance, the `responseparents` function can be -used to select only variables upstream from the response. +`CausalTable` constructor creates a `Tables.jl`-compliant data +structure, coupled with causal structure about the data-generating +process. Calling convenience functions on this object allows users to +perform data processing tasks common in causal inference. For instance, +the `responseparents` function can be used to select only variables +upstream from the response. :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -127,9 +138,14 @@ responseparents(ct_wrap) ::: :::: + + Simulating causal data for different settings is slightly more involved. 
-In the remainder of this section, we will present two examples use cases -of how CausalTables.jl can be used as a benchmarking tool. +In the remainder of this section, we will present two example use cases +of how `CausalTables.jl` can be used as a benchmarking tool. ## Example 1: Average Treatment Effect @@ -142,12 +158,13 @@ the following: `\begin{align*} W &\sim Beta(2, 4) \\ A &\sim Bernoulli(W) \\ Y &\sim Normal(A + W, 1) -\end{align*}`{=tex} To compute the ground truth ATE, we can define the -SCM above in CausalTables.jl by defining the sequence of random -variables to be drawn using the `@dgp` macro. Then, we create a -`StructuralCausalModel` object which labels the steps we want to -consider as treatment, response, and confounders, and randomly draw -datasets from it using the `rand` function. +\end{align*}`{=tex} To compute the ground truth ATE via +`CausalTables.jl`, we define the SCM above by enumerating the sequence +of random variables to be drawn using the `@dgp` macro. Then, we create +a `StructuralCausalModel` object which labels the steps we want to +consider as treatment, response, and confounders. Finally, we randomly +draw datasets from the newly instantiated `StructuralCausalModel` using +the `rand` function. ::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -169,10 +186,15 @@ ct = rand(scm, 500) # randomly draw from the SCM ``` ::: -At a high level, CausalTables.jl provides functions to approximate -ground truth values of common causal estimands such as the ATE (using -`ate`), along with their efficiency bound -- the lowest possible -variance achievable by a causal estimator of the given quantity. 
+Not only does `CausalTables.jl` provide functions to approximate ground +truth values of common causal estimands such as the ATE (using `ate`), +it also allows estimation of the corresponding efficiency bound---the +asymptotic variance lower bound---for a class of estimators (those that +are regular and asymptotically linear) commonly used in causal +inference; this is a critical component as it facilitates the comparison +of candidate estimators not only in terms of their bias (average +distance from the ground truth) but also their efficiency. Below, we +demonstrate these for the example SCM given above. :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -180,17 +202,22 @@ ate(scm) # average treatment effect ``` ::: {.cell-output .cell-output-display execution_count="1"} - (μ = 0.998, eff_bound = 2.002) + (μ = 0.999, eff_bound = 1.998) ::: :::: -CausalTables.jl also provides a low-level interface allowing users to + + +`CausalTables.jl` also provides a low-level interface allowing users to (1) apply common interventions to the treatment variable in a `CausalTable`, and (2) compute ground-truth conditional densities and -functions of them (mean, variance, et cetera) typically used to -construct causal estimators. For example, the code below computes the -difference in conditional means of $Y$ under treatment versus no -treatment. +functions of these (e.g., mean, variance), which typically arise as +nuisance parameters in the construction of estimators in causal +inference. For example, below, we compute the difference in the +conditional mean of $Y$ under treatment versus no treatment, the +average of which is the ATE. :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -204,11 +231,11 @@ outcome_reg = mean(conmean(scm, ct_treated, :Y) .- conmean(scm, ct_untreated, :Y ::: :::: -The above represents the ground-truth plug-in estimate of the individual -treatment effect (outcome regression) for each unit in the dataset. 
-Alternatively, one can also compute an inverse-probability weighted -(IPW) estimate with ground-truth weights using the `propensity` -function: +The above recovers an estimate of the ground-truth via plug-in estimates +based on the outcome regression (i.e., the conditional expectation of +the outcome, given treatment and covariates). Alternatively, one can +also compute an inverse probability weighted (IPW) estimate with +ground-truth weights using the `propensity` function: :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -218,13 +245,13 @@ ipw = mean(y .* (2 * a .- 1) ./ propensity(scm, ct, :A)) ``` ::: {.cell-output .cell-output-display execution_count="1"} - 0.923 + 1.129 ::: :::: Finally, as an alternative, one can randomly generate a new -counterfactual response value for each unit in a CausalTable under a -given intervention using `draw_counterfactual`, and compute the ATE +counterfactual response value for each unit in a `CausalTable` under a +given intervention using `draw_counterfactual`, and then compute the ATE directly: :::: {.cell execution_count="1"} @@ -235,15 +262,15 @@ mean(y_treated .- y_untreated) ``` ::: {.cell-output .cell-output-display execution_count="1"} - 1.017 + 1.056 ::: :::: ## Example 2: Modified Treatment Policies -CausalTables.jl can be used for more than just binary treatments; it +`CausalTables.jl` is not limited to settings with binary treatments; it also supports more exotic estimands. Consider the following SCM, in -which the treatment $A$ takes on a continuous value. +which the treatment $A$ is continuous-valued. ::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -260,17 +287,20 @@ scm = StructuralCausalModel(dgp; ``` ::: -In the continuous treatment setting, we are often interested in the -effect of a *modified treatment policy* (MTP), which poses the question: -"how would the counterfactual outcome change had we applied some -intervention $d$ to the existing treatment?" [@Haneuse2013]. 
In this -setting, rather than estimating an ATE, we estimate an *average policy -effect* (APE) -- the difference between $Y$ under the natural treatment -versus the treatment upon which we have intervened. A common example is -the additive MTP $d(a) = a + 1$; when the relationship between $A$ and -$Y$ is linear, this is equivalent to fitting a linear regression, but -using CausalTables.jl, we can obtain approximate the ground truth even -when the relationship is nonlinear. +In the continuous treatment setting, a common causal estimand is the +effect of a *modified treatment policy* (MTP), which corresponds to the +question: "how would the counterfactual outcome change had an +intervention $d(a, w; \delta)$ been applied to the observed treatment +$a$?" [@Haneuse2013]. In this setting, rather than estimating an ATE, we +estimate an *average policy effect* (APE)---the difference between $Y$ +under the natural treatment value $a$ and the intervened-upon treatment +that results from the MTP $d(a, w; \delta)$. A common example is the +additive MTP $d(a, w; \delta) = a + \delta$; when the relationship +between $A$ and $Y$ is known to be linear, this is equivalent to the +slope in a linear regression model, but, using `CausalTables.jl`, we can +approximate the ground truth APE even when the relationship is +nonlinear. We demonstrate this below for an MTP indexed by the choice +$\delta = 1$. :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -278,16 +308,18 @@ ape(scm, additive_mtp(1)) # average policy effect ``` ::: {.cell-output .cell-output-display execution_count="1"} - (μ = 2.500, eff_bound = 5.242) + (μ = 2.502, eff_bound = 5.252) ::: :::: -One strategy for estimating an APE is using a parametric outcome -regression (such as a linear model) that predicts $Y$ under the modified -treatment policy $A = d(a)$ and computes its average difference from the -naturally observed $Y$. 
We can use CausalTables.jl to see how well this -procedure would work if we knew the true value of the outcome regression -using the `intervene` and `conmean` functions: +One strategy for estimating an APE is to assume a parametric outcome +regression model (such as the general linear model with a specified +functional form), using this to predict the outcome $Y$ under the +modified treatment policy $d(A, W; +\delta): A \to A + \delta$; the average difference of these predictions +from the observed $Y$ yields the APE. We can use `CausalTables.jl` to +see how well this procedure would work if we knew the true value of the +outcome regression using the `intervene` and `conmean` functions: :::: {.cell execution_count="1"} ``` {.julia .cell-code} @@ -297,28 +329,28 @@ outcome_reg = mean(conmean(scm, ct_intervened, :Y) .- responsematrix(ct)) ``` ::: {.cell-output .cell-output-display execution_count="1"} - 2.330 + 2.502 ::: :::: -# Closing Remarks +# Closing remarks -The flexibility of CausalTables.jl allows users to easily extract -ground-truth values for any relevant aspect of a data generating +The flexibility of `CausalTables.jl` allows users to easily extract +ground truth values for any relevant aspect of a data-generating process. This supports benchmarking causal estimators of virtually any estimand that fits in the SCM framework, not just those mentioned in these two examples. While the package includes high-level functions to approximate several common causal estimands, users can also write their own interventions and use low-level functions such as `intervene`, `draw_counterfactual`, and `condensity` to approximate the ground truth -of custom causal estimands as well. CausalTables.jl will serve as a -useful tool for statisticians and other scientists seeking to evaluate -various causal inference methods using simulation studies. +of new causal estimands. 
Thus, `CausalTables.jl` serves as a useful tool +for scientists seeking to evaluate and benchmark various causal +inference methods in simulation studies. # Acknowledgements Salvador Balkus acknowledges support from the National Institute of -Environmental Health Sciences (award no.\~T32 ES007142) and the National -Science Foundation (award no.\~DGE 2140743). +Environmental Health Sciences (award no. T32 ES007142) and the National +Science Foundation (award no. DGE 2140743). # References {#references .unnumbered} diff --git a/src/CausalTables.jl b/src/CausalTables.jl index e377448..31d6af0 100644 --- a/src/CausalTables.jl +++ b/src/CausalTables.jl @@ -35,7 +35,7 @@ export nrow, ncol, subset, select, reject export replace, treatment, response, confounders, data export treatmentmatrix, responsematrix, confoundersmatrix export treatmentnames, responsenames, confoundernames -export treatmentparents, responseparents +export treatmentparents, responseparents, parents export adjacency_matrix, dependency_matrix # network_summary.jl diff --git a/src/causal_table.jl b/src/causal_table.jl index e27dc9c..3ed8e76 100644 --- a/src/causal_table.jl +++ b/src/causal_table.jl @@ -290,7 +290,7 @@ responsematrix(o::CausalTable) = Tables.matrix(response(o)) """ treatmentparents(o::CausalTable) -Selects the confounders from the given `CausalTable` object. +Selects all variables besides those in `o.treatment` and `o.response` from the given `CausalTable` object. # Arguments - `o::CausalTable`: The `CausalTable` object from which to extract the parent variables of the treatment. @@ -303,7 +303,7 @@ treatmentparents(o::CausalTable) = reject(o, union(o.treatment, o.response)) """ responseparents(o::CausalTable) -Selects the treatment and confounders from the given `CausalTable` object. +Selects all variables besides those in `o.response` from the given `CausalTable` object. # Arguments - `o::CausalTable`: The `CausalTable` object from which to extract the parent variables of the response. 
@@ -313,6 +313,30 @@ A new `CausalTable` containing only the confounders and treatment. """ responseparents(o::CausalTable) = reject(o, o.response) +""" + parents(o::CausalTable, symbol) + +Selects the variables that precede `symbol` causally from the CausalTable `o`. For instance, if `symbol` is in `o.response`, this function will return a CausalTable containing the symbols in `o.treatment` and `o.confounders`. + +Warning: If `symbol` is in `o.confounders`, then this function will return a CausalTable containing an empty `data` attribute. + +# Arguments +- `o::CausalTable`: The `CausalTable` object from which to extract the parent variables of `symbol`. +- `symbol`: The variable for which to extract the parent variables. + +# Returns +A new `CausalTable` containing only the parents of `symbol` +""" +function parents(o::CausalTable, symbol) + if symbol in o.treatment + return(treatmentparents(o)) + elseif symbol in o.response + return(responseparents(o)) + else + return(replace(o; data = (;))) + end +end + # Other getters """ data(o::CausalTable) diff --git a/test/runtests.jl b/test/runtests.jl index 4c52b6f..f10790a 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -77,6 +77,8 @@ end @test Tables.columnnames(CausalTables.response(coltbl2)) == (:Y, :U) @test Tables.columnnames(CausalTables.confounders(coltbl2)) == (:Z, :T) @test Tables.columnnames(CausalTables.treatmentparents(coltbl2)) == (:Z, :T) + @test Tables.columnnames(CausalTables.parents(coltbl2, :X)) == (:Z, :T) + @test Tables.columnnames(CausalTables.parents(coltbl2, :Z)) == () @test Tables.columnnames(CausalTables.responseparents(coltbl2)) == (:X, :Z, :S, :T) # Other convenience
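
The dispatch logic that the new `parents` function adds in `src/causal_table.jl` can be sketched in isolation. The block below is a minimal, self-contained illustration and not the package's implementation: `MiniCausalTable`, `drop_columns`, `mini_treatmentparents`, `mini_responseparents`, and `mini_parents` are hypothetical stand-ins introduced here, built only on Base `NamedTuple`s so that it runs without `CausalTables.jl` installed. It assumes that `reject` in the real package behaves like a column drop.

```julia
# Hypothetical stand-in for `CausalTable` (illustration only; the real type
# in CausalTables.jl carries more fields and validation).
struct MiniCausalTable
    data::NamedTuple
    treatment::Vector{Symbol}
    response::Vector{Symbol}
end

# Drop the named columns from the table. This plays the role that `reject`
# plays in the diff above (assumed behavior, not the package's code).
function drop_columns(o::MiniCausalTable, names)
    keep = Tuple(k for k in keys(o.data) if !(k in names))
    return MiniCausalTable(NamedTuple{keep}(o.data), o.treatment, o.response)
end

# Same shape as `treatmentparents` and `responseparents` in the diff.
mini_treatmentparents(o) = drop_columns(o, union(o.treatment, o.response))
mini_responseparents(o) = drop_columns(o, o.response)

# Same three-way branch as the new `parents` function.
function mini_parents(o::MiniCausalTable, symbol::Symbol)
    if symbol in o.treatment
        return mini_treatmentparents(o)
    elseif symbol in o.response
        return mini_responseparents(o)
    else
        # Anything else (a confounder, or a name not in the table) yields no columns.
        return MiniCausalTable((;), o.treatment, o.response)
    end
end

tbl = MiniCausalTable((X = [1, 2], Z = [0.1, 0.2], Y = [3, 4]), [:X], [:Y])

x_parents = keys(mini_parents(tbl, :X).data)   # columns preceding the treatment
y_parents = keys(mini_parents(tbl, :Y).data)   # columns preceding the response
z_parents = keys(mini_parents(tbl, :Z).data)   # confounder: empty table
typo_parents = keys(mini_parents(tbl, :Q).data) # unknown name: also empty
```

In this sketch, a symbol that appears nowhere in the table takes the same `else` branch as a confounder and silently returns an empty table. The `else` branch in the diff has the same shape, so the docstring warning about confounders would appear to apply to misspelled names as well.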