-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new multivariate interrupted time series functionality #357
Comments
Hello! This is great and could solve one of my measurement needs. Recently, Facebook impressions dropped due to a budget change, leading to a simultaneous drop in traffic from Facebook, direct, shopping, and brand search. There’s little doubt that Facebook impressions are the influencing factor, but I’m still figuring out the best way to measure and prove causality. Also, since spend and impressions are continuous with diminishing returns, I wanted to know if this model works with continuous data. Thank you! |
Thanks, good to know this might be of use @charlesdtdb. Can I just check, by continuous data, you mean that impressions are thought of as a graded/continuous treatment or causal driver of your other observations? My initial thoughts are that that problem would warrant a novel model. I have some ideas. Happy to either talk through to see if you'd like to create a custom model and submit a PR. Though if this is high priority/value, then it could be useful to consider a consulting engagement. |
Thanks for sharing @drbenvincent. This would be quite useful for analysing the impact of brand campaigns on a product portfolio by specifying a list of model formulas within the InterruptedTimeSeries function. Since each formula would represent a separate time series model for each product's sales, this would allow capturing the intervention's effect on all products simultaneously while accounting for potential correlations between their sales. Building on a limitation that you already pointed out, the model's performance can be sensitive to the choice of priors when dealing with covariance. Basically, inappropriate priors can introduce bias and distort the results in situations where the sales of one product influences the other. The model formula would also differ across categories like for a Bank vs. a CPG brand. Would love to explore this further and will share any other concerns if I come across any. |
Hi @KK8627. What could be pretty cool is if you were able to either share a dataset OR code up a simple simulated dataset with numpy for example. If you were able to keep the example simple, but also capture the core statistical nature of the datasets you deal with, then that could form part of the new documentation. It would also mean that the example could be grounded in a realistic case, even though it's simulated. If so, feel free to drop in some numpy code here. |
Certainly. Here you go. This would be different from my real-world data but might give an idea.
|
Thanks @KK8627. So I was thinking of something more like this which focusses on the covariance structure between sales of products. Note: this just simulates the pre-intervention period and the post-intervention counterfactual - it hasn't added on any causal impact of the intervention yet. This example also doesn't have any other predictor variables which influence product sales, though a more rich example could do (e.g. trend or seasonality components). mu = [0, 1, 2]
cov = [[1, 0.5, 0.3],
[0.5, 1, 0.4],
[0.3, 0.4, 1]]
num_points = 52*2
data = np.random.multivariate_normal(mu, cov, num_points)
df = pd.DataFrame(data, columns=['Product 1', 'Product 2', 'Product 3']) I would have guessed that the covariance structure would have been a more important feature. I got a bit confused by your example - |
Thanks for the catch @drbenvincent. Apologies, yes all variables were product sales (now edited column headers above). It seems that my brain is always looking to separate these. Also added seasonality and marketing variables for predictors to my original dummy dataset. For clarity, I was trying to create a time-series example with Granger causality where y (or sales of P3) impacts x1 (Sales of P1 but not P2). On further thought, I also think that it is unlikely that a product's sales within the same portfolio would directly influence another product's sales within the same time step (unless they're competing products). Since the intention is to showcase the PyMC's ability to handle causal relationships, this might not be the most realistic scenario for our specific use case. Your suggestion to focus on the covariance structure through the simulated data is a better fit for this use case. It would allow us to isolate the impact of covariance. Here's your example with an addition of the seasonality and marketing effects. Looking forward to your thoughts.
|
@drbenvincent - I also wanted to explain what I am trying to achieve out of this (I should have probably started with this note..lol). What I am looking to analyse is the impact of larger product purchases from a portfolio and its impact (beneficial or detrimental) on a smaller products within the same portfolio, and hence I assigned the Granger causality to only one of the two sales variables in the original dummy dataset but it could technically impact both. Here are some situational examples for context:
I'm not sure if anyone can create an ideal dataset matching all the situations above, but I do believe that the methodology can be applied to product bundling solutions, offering complementary services, managing price differentials or offering product substitutes as a business solution coming out of the model results. Please feel free to disagree and I look forward to your thoughts. Thanks again! |
Thanks @KK8627, this is very useful context. Having worked on related kinds of things on client projects I have some thoughts.
Very happy to consider the latter, but I think that would be a different GitHub issue which can exclusively focus on that. Let me which of the two options above sound more in line with your particular goals. |
hi @drbenvincent , thanks for your reply. I d love to chat and try to build a custom model. I am not a Bayesian expert, I am might be short of expertise to be able to do the PR request but it could be a good way for me to learn. Impressions are indeed being considered as a continuous treatment, meaning I am examining how different levels of impressions (rather than just the presence or absence) affect your outcome variables ( website traffic through other channels) |
Thanks @drbenvincent. It's most certainly a graded intervention (route 2) . I think we can consider route 1 in case of situations like limited time offers...but for most situations, I agree with the recommendation to take the main product sales as the primary target and then analyze reductions and lag in the other products. Certainly happy to move this to a new issue or as your see fit. This is a very helpful conversation and your inputs are greatly appreciated. Thanks again! |
Proposal
Typically, interrupted time series (ITS) designs are univariate in that there is a single outcome variable. An existing example in the docs examines the causal impact of the onset of covid upon the univariate outcome measure of excess deaths.
However, there are plenty of scenarios where we might want to consider multivariate outcomes. Examples might include:
The simplest approach would be to run multiple ITS models to assess the causal impact of the intervention upon each outcome independently. However, the main limitation of this is that it assumes the outcome measures are independent and it will fail to capture the joint effects. We also have to run multiple analyses rather than just one.
Boundary conditions of when to use this approach
The multivariate outcome approach is particularly useful when it is not appropriate to think about more complex causal structure between the different outcome measures. The example of a marketing intervention and assessing the impact upon the sales of multiple products is perhaps a good one where multivariate ITS could be useful. This would be appropriate if the intervention has widespread impacts upon the sales of multiple products, but there isn't (or we want to remain agnostic) some complex causal chain of influence between multiple products.
However, if the outcomes are less similar, then it might be possible to think up more complex structural models to describe the causal relationships between the various measures. For example, if the intervention is a public health measure.
General approach
The schematic shows roughly what I'm proposing. We have a number of outcome time series and a single point where an intervention takes place (though this could be relaxed). We can use all available current ITS modelling approaches by using the model formula approach. We'd them simply generate a counterfactual prediction for each time series.
We could either model the variance with individual sigma parameters, or we could use a covariance matrix.
API
We'd stick relatively closely to the current API. The main difference would be that we specify a list of model formulas:
There would be no requirement to have the RHS of the formulas the same. And we may want to pass in a list of models also.
EDIT: If we did want a single model to be applied to all time series, we could perhaps use
formulae
. @tomicapretto mentions...Extensions
LinearRegression
class. And that could include vector autoregression models where each series is influenced by past values of other series.I'd be very interested to hear if there is any interest in this kind of functionality, or if you think it doesn't add much.
The text was updated successfully, but these errors were encountered: