Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to analyse Regression Kink Analysis Designs #264

Merged
merged 29 commits into from
Nov 9, 2023
Merged
Show file tree
Hide file tree
Changes from 28 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
15ef4ec
first stab at regression kink design
drbenvincent Oct 20, 2023
81ea342
first draft of kink design notebook done
drbenvincent Oct 20, 2023
e8905f3
add notebook to readthedocs
drbenvincent Oct 20, 2023
6b7e0e9
update regression kink design notebook
drbenvincent Nov 2, 2023
b4436c9
add entry into the glossary
drbenvincent Nov 2, 2023
64c0b99
update glossary entry
drbenvincent Nov 2, 2023
1aaa028
add link to glossary term in notebook
drbenvincent Nov 2, 2023
7847ede
fix typo in glossary
drbenvincent Nov 2, 2023
8ce77a1
add docstring to RegressionKink class
drbenvincent Nov 2, 2023
9cf6b51
add integration tests for regression kink design
drbenvincent Nov 2, 2023
6e47576
update UML diagram
drbenvincent Nov 2, 2023
5e099c2
update summary method to make this work + re-run example notebook
drbenvincent Nov 2, 2023
cc07b94
add comment to explain betas
drbenvincent Nov 6, 2023
dca2844
add comment about beta parameters for the quadratic example
drbenvincent Nov 6, 2023
e3c37be
remove redundant figure and re-run notebook
drbenvincent Nov 6, 2023
2f89a97
clarification about using same dataset in the basis function example
drbenvincent Nov 6, 2023
3ca5f87
show 2 decimal places in final az.plot_posterior + re-run notebook
drbenvincent Nov 6, 2023
59a8b3e
add another example to demonstrate the bandwidth parameter + add note…
drbenvincent Nov 6, 2023
0b70c9e
change title of example notebook
drbenvincent Nov 6, 2023
732707f
add regression kink design to README.md and index.rst
drbenvincent Nov 6, 2023
7e982c0
remove some commented out lines + remove parens
drbenvincent Nov 7, 2023
1ba0333
simplify the summary method
drbenvincent Nov 7, 2023
dab61dc
add input validation for bandwidth and epsilon parameters
drbenvincent Nov 7, 2023
2ad1ba3
test exceptions are raised for the new input validation
drbenvincent Nov 7, 2023
80537c6
make gradient change code more modular
drbenvincent Nov 7, 2023
b26a375
Merge branch 'main' into kink
drbenvincent Nov 7, 2023
255ea83
update UML diagrams
drbenvincent Nov 7, 2023
232944b
add tests for cp.pymc_experiments.RegressionKink._eval_gradient_change
drbenvincent Nov 7, 2023
4386f6b
remove kwarg being overwritten
drbenvincent Nov 8, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ repos:
exclude_types: [svg]
- id: check-yaml
- id: check-added-large-files
args: ['--maxkb=1500']
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.1.4
hooks:
Expand Down
19 changes: 19 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,25 @@ Regression discontinuity designs are used when treatment is applied to units acc

> The data, model fit, and counterfactual are plotted (top). Frequentist analysis shows the causal impact with the blue shaded region, but this is not shown in the Bayesian analysis to avoid a cluttered chart. Instead, the Bayesian analysis shows shaded Bayesian credible regions of the model fits. The Frequentist analysis visualises the point estimate of the causal impact, but the Bayesian analysis also plots the posterior distribution of the regression discontinuity effect (bottom).

### Regression kink designs

Regression discontinuity designs are used when treatment is applied to units according to a cutoff on a running variable, which is typically not time. By looking for the presence of a discontinuity at the precise point of the treatment cutoff then we can make causal claims about the potential impact of the treatment.

| Running variable | Outcome |
|-----------|-----------|
| $x_0$ | $y_0$ |
| $x_1$ | $y_0$ |
| $\ldots$ | $\ldots$ |
| $x_{N-1}$ | $y_{N-1}$ |
| $x_N$ | $y_N$ |


| Frequentist | Bayesian |
|--|--|
| coming soon | ![](docs/source/_static/regression_kink_pymc.svg) |

> The data and model fit. The Bayesian analysis shows the posterior mean with credible intervals (shaded regions). We also report the Bayesian $R^2$ on the data along with the posterior mean and credible intervals of the change in gradient at the kink point.

### Interrupted time series

Interrupted time series analysis is appropriate when you have a time series of observations which undergo treatment at a particular point in time. This kind of analysis has no control group and looks for the presence of a change in the outcome measure at or soon after the treatment time. Multiple predictors can be included.
Expand Down
230 changes: 222 additions & 8 deletions causalpy/pymc_experiments.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
"""

import warnings
from typing import Optional, Union
from typing import Union

import arviz as az
import matplotlib.pyplot as plt
Expand Down Expand Up @@ -328,7 +328,7 @@
fontsize=LEGEND_FONT_SIZE,
)

return (fig, ax)
return fig, ax

Check warning on line 331 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L331

Added line #L331 was not covered by tests

def summary(self) -> None:
"""
Expand Down Expand Up @@ -424,7 +424,7 @@
ax[0].plot(
self.datapost.index, self.post_X, "-", c=[0.8, 0.8, 0.8], zorder=1
)
return (fig, ax)
return fig, ax

Check warning on line 427 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L427

Added line #L427 was not covered by tests


class DifferenceInDifferences(ExperimentalDesign):
Expand Down Expand Up @@ -794,7 +794,7 @@
model=None,
running_variable_name: str = "x",
epsilon: float = 0.001,
bandwidth: Optional[float] = None,
bandwidth: float = np.inf,
**kwargs,
):
super().__init__(model=model, **kwargs)
Expand All @@ -807,7 +807,7 @@
self.bandwidth = bandwidth
self._input_validation()

if self.bandwidth is not None:
if self.bandwidth is not np.inf:
fmin = self.treatment_threshold - self.bandwidth
fmax = self.treatment_threshold + self.bandwidth
filtered_data = self.data.query(f"{fmin} <= x <= {fmax}")
Expand Down Expand Up @@ -836,7 +836,7 @@
self.score = self.model.score(X=self.X, y=self.y)

# get the model predictions of the observed data
if self.bandwidth is not None:
if self.bandwidth is not np.inf:
xi = np.linspace(fmin, fmax, 200)
else:
xi = np.linspace(
Expand Down Expand Up @@ -903,7 +903,7 @@
self.data,
x=self.running_variable_name,
y=self.outcome_variable_name,
c="k", # hue="treated",
c="k",
ax=ax,
)

Expand Down Expand Up @@ -939,7 +939,7 @@
labels=labels,
fontsize=LEGEND_FONT_SIZE,
)
return (fig, ax)
return fig, ax

Check warning on line 942 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L942

Added line #L942 was not covered by tests

def summary(self) -> None:
"""
Expand All @@ -957,6 +957,220 @@
self.print_coefficients()


class RegressionKink(ExperimentalDesign):
"""
A class to analyse sharp regression kink experiments.

:param data:
A pandas dataframe
:param formula:
A statistical model formula
:param kink_point:
A scalar threshold value at which there is a change in the first derivative of
the assignment function
:param model:
A PyMC model
:param running_variable_name:
The name of the predictor variable that the kink_point is based upon
:param epsilon:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you suggested not having the notebook as a place for explaining the theory of kink designs, but i feel like the differences between tweaking at least one of epsilon/bandwidth parameters could be mentioned or shown.

It's fine if tweaking them isn't needed for your example, but it'd be good to hint at why they are there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an example to demonstrate use of the bandwidth parameter. And I've added an admonition box to explain what epsilon does.

A small scalar value which determines how far above and below the kink point to
evaluate the causal impact.
:param bandwidth:
Data outside of the bandwidth (relative to the discontinuity) is not used to fit
the model.
"""

def __init__(
self,
data: pd.DataFrame,
formula: str,
kink_point: float,
model=None,
running_variable_name: str = "x",
epsilon: float = 0.001,
bandwidth: float = np.inf,
**kwargs,
):
super().__init__(model=model, **kwargs)
self.expt_type = "Regression Kink"
self.data = data
self.formula = formula
self.running_variable_name = running_variable_name
self.kink_point = kink_point
self.epsilon = epsilon
self.bandwidth = bandwidth
self._input_validation()

if self.bandwidth is not np.inf:
fmin = self.kink_point - self.bandwidth
fmax = self.kink_point + self.bandwidth
filtered_data = self.data.query(f"{fmin} <= x <= {fmax}")
if len(filtered_data) <= 10:
warnings.warn(

Check warning on line 1009 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1009

Added line #L1009 was not covered by tests
f"Choice of bandwidth parameter has lead to only {len(filtered_data)} remaining datapoints. Consider increasing the bandwidth parameter.", # noqa: E501
UserWarning,
)
y, X = dmatrices(formula, filtered_data)
else:
y, X = dmatrices(formula, self.data)

self._y_design_info = y.design_info
self._x_design_info = X.design_info
self.labels = X.design_info.column_names
self.y, self.X = np.asarray(y), np.asarray(X)
self.outcome_variable_name = y.design_info.column_names[0]

COORDS = {"coeffs": self.labels, "obs_indx": np.arange(self.X.shape[0])}
self.model.fit(X=self.X, y=self.y, coords=COORDS)

# score the goodness of fit to all data
self.score = self.model.score(X=self.X, y=self.y)

# get the model predictions of the observed data
if self.bandwidth is not np.inf:
xi = np.linspace(fmin, fmax, 200)
else:
xi = np.linspace(
np.min(self.data[self.running_variable_name]),
np.max(self.data[self.running_variable_name]),
200,
)
self.x_pred = pd.DataFrame(
{self.running_variable_name: xi, "treated": self._is_treated(xi)}
)
(new_x,) = build_design_matrices([self._x_design_info], self.x_pred)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there. is a reason but why is the output wrapped in braces?and then immediately into an array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_design_matrices returns both the x and y design matrices as a tuple, so we are just grabbing the x matrix here. And by default these are as dataframes. So you can either provide the return_type='matrix' argument to build_design_matrices, or manually do it yourself with np.asarray as I did here.

self.pred = self.model.predict(X=np.asarray(new_x))

# evaluate gradient change around kink point
mu_kink_left, mu_kink, mu_kink_right = self._probe_kink_point()
self.gradient_change = self._eval_gradient_change(
mu_kink_left, mu_kink, mu_kink_right, epsilon
)

@staticmethod
def _eval_gradient_change(mu_kink_left, mu_kink, mu_kink_right, epsilon):
"""Evaluate the gradient change at the kink point.
It works by evaluating the model below the kink point, at the kink point,
and above the kink point.
This is a static method for ease of testing.
"""
gradient_left = (mu_kink - mu_kink_left) / epsilon
gradient_right = (mu_kink_right - mu_kink) / epsilon
gradient_change = gradient_right - gradient_left
return gradient_change

def _probe_kink_point(self):
# Create a dataframe to evaluate predicted outcome at the kink point and either
# side
x_predict = pd.DataFrame(
{
self.running_variable_name: np.array(
[
self.kink_point - self.epsilon,
self.kink_point,
self.kink_point + self.epsilon,
]
),
"treated": np.array([0, 1, 1]),
}
)
(new_x,) = build_design_matrices([self._x_design_info], x_predict)
predicted = self.model.predict(X=np.asarray(new_x))
# extract predicted mu values
mu_kink_left = predicted["posterior_predictive"].sel(obs_ind=0)["mu"]
mu_kink = predicted["posterior_predictive"].sel(obs_ind=1)["mu"]
mu_kink_right = predicted["posterior_predictive"].sel(obs_ind=2)["mu"]
return mu_kink_left, mu_kink, mu_kink_right

def _input_validation(self):
"""Validate the input data and model formula for correctness"""
if "treated" not in self.formula:
raise FormulaException(

Check warning on line 1088 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1088

Added line #L1088 was not covered by tests
"A predictor called `treated` should be in the formula"
)

if _is_variable_dummy_coded(self.data["treated"]) is False:
raise DataException(

Check warning on line 1093 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1093

Added line #L1093 was not covered by tests
"""The treated variable should be dummy coded. Consisting of 0's and 1's only.""" # noqa: E501
)

if self.bandwidth <= 0:
raise ValueError("The bandwidth must be greater than zero.")

if self.epsilon <= 0:
raise ValueError("Epsilon must be greater than zero.")

def _is_treated(self, x):
"""Returns ``True`` if `x` is greater than or equal to the treatment threshold.""" # noqa: E501
return np.greater_equal(x, self.kink_point)

def plot(self):
"""
Plot the results
"""
fig, ax = plt.subplots()

Check warning on line 1111 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1111

Added line #L1111 was not covered by tests
# Plot raw data
sns.scatterplot(

Check warning on line 1113 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1113

Added line #L1113 was not covered by tests
self.data,
x=self.running_variable_name,
y=self.outcome_variable_name,
c="k", # hue="treated",
drbenvincent marked this conversation as resolved.
Show resolved Hide resolved
ax=ax,
)

# Plot model fit to data
h_line, h_patch = plot_xY(

Check warning on line 1122 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1122

Added line #L1122 was not covered by tests
self.x_pred[self.running_variable_name],
self.pred["posterior_predictive"].mu,
ax=ax,
plot_hdi_kwargs={"color": "C1"},
)
handles = [(h_line, h_patch)]
labels = ["Posterior mean"]

Check warning on line 1129 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1128-L1129

Added lines #L1128 - L1129 were not covered by tests

# create strings to compose title
title_info = f"{self.score.r2:.3f} (std = {self.score.r2_std:.3f})"
r2 = f"Bayesian $R^2$ on all data = {title_info}"
percentiles = self.gradient_change.quantile([0.03, 1 - 0.03]).values
ci = r"$CI_{94\%}$" + f"[{percentiles[0]:.2f}, {percentiles[1]:.2f}]"
grad_change = f"""

Check warning on line 1136 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1132-L1136

Added lines #L1132 - L1136 were not covered by tests
Change in gradient = {self.gradient_change.mean():.2f},
"""
ax.set(title=r2 + "\n" + grad_change + ci)

Check warning on line 1139 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1139

Added line #L1139 was not covered by tests
# Intervention line
ax.axvline(

Check warning on line 1141 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1141

Added line #L1141 was not covered by tests
x=self.kink_point,
ls="-",
lw=3,
color="r",
label="treatment threshold",
)
ax.legend(

Check warning on line 1148 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1148

Added line #L1148 was not covered by tests
handles=(h_tuple for h_tuple in handles),
labels=labels,
fontsize=LEGEND_FONT_SIZE,
)
return fig, ax

Check warning on line 1153 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1153

Added line #L1153 was not covered by tests

def summary(self) -> None:
"""
Print text output summarising the results
"""

print(

Check warning on line 1160 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1160

Added line #L1160 was not covered by tests
f"""
{self.expt_type:=^80}
Formula: {self.formula}
Running variable: {self.running_variable_name}
Kink point on running variable: {self.kink_point}

Results:
Change in slope at kink point = {self.gradient_change.mean():.2f}
"""
)
self.print_coefficients()

Check warning on line 1171 in causalpy/pymc_experiments.py

View check run for this annotation

Codecov / codecov/patch

causalpy/pymc_experiments.py#L1171

Added line #L1171 was not covered by tests


class PrePostNEGD(ExperimentalDesign):
"""
A class to analyse data from pretest/posttest designs
Expand Down
Loading