
Using SHAP on unseen data to understand model's predictions #2571

Open

ETTAN93 opened this issue Oct 24, 2024 · 0 comments
Labels: question (Further information is requested), triage (Issue waiting for triaging)
Assume I have a model that is initialized as follows:

model_estimator = LightGBMModel(
    lags=None,
    lags_past_covariates=[-3,-2,-1],
    lags_future_covariates=[-3,-2,-1],
    output_chunk_length=3
)
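For reference, here is a minimal sketch (plain Python, not the Darts API; the helper name is hypothetical) of which timesteps a lag configuration like the one above consumes for a forecast at time t:

```python
# Hypothetical helper illustrating the lag window implied by the
# configuration above (lags_past_covariates=[-3, -2, -1]): a forecast
# at integer time t reads the three preceding covariate steps.
def lag_window(t, lags=(-3, -2, -1)):
    """Timesteps consumed by the given relative lags for a forecast at t."""
    return [t + lag for lag in lags]

print(lag_window(10))  # [7, 8, 9]
```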

Question 1:
I then create the ShapExplainer object by fitting it to the training set:

shap_explain = ShapExplainer(model_estimator)
explanations = shap_explain.summary_plot()

Assuming I now want to use the ShapExplainer to explain data in the unseen test set, what should be passed as the foreground series?
I tried not providing the foreground series, since the target lags are None, but that does not seem to be possible. In that case, should the foreground series, past covariates, and future covariates be the same series passed to the model.predict function, i.e. should the foreground_series end at the first prediction timestamp t=0?

shap_explain.explain(
    foreground_series = hf_data_dict['target_hf'][: test_set_start_date],
    foreground_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: end_date],
    foreground_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: end_date],
    horizons = [3]
)

Or should the foreground_series start from the test_set_start_date?

shap_explain.explain(
    foreground_series = hf_data_dict['target_hf'][test_set_start_date: ],
    foreground_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: end_date],
    foreground_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: end_date],
    horizons = [3]
)
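Independently of which option Darts expects, note that label-based slicing on a pandas DatetimeIndex is inclusive of both endpoints, so the two slicings above overlap at the boundary timestamp. A small sketch with toy data (the series and variable names are hypothetical stand-ins for hf_data_dict entries):

```python
import pandas as pd

# Toy daily series standing in for hf_data_dict['target_hf'].
idx = pd.date_range("2023-02-26", periods=6, freq="D")
target = pd.Series(range(6), index=idx)

test_set_start_date = pd.Timestamp("2023-03-01")

# Option A: series ends at the first prediction timestamp.
opt_a = target[:test_set_start_date]
# Option B: series starts at the test set start.
opt_b = target[test_set_start_date:]

# Label-based slicing on a DatetimeIndex includes both endpoints,
# so 2023-03-01 is the last row of opt_a and the first row of opt_b.
print(opt_a.index[-1], opt_b.index[0])
```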

Question 2:
I tried fitting the SHAP explainer on different background series (train vs. test set), but I get back exactly the same SHAP results. For example:

split_date = pd.to_datetime('2023-02-28 23:59:00')
test_set_start_date = pd.to_datetime('2023-03-01 00:00:00')

#shap explainer trained on train set but used to explain test set
shap_explainer = ShapExplainer(lgbm_model)

df1 = shap_explainer.explain(
    foreground_series = hf_data_dict['target_hf'][: split_date],
    foreground_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: ],
    foreground_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: ],
    horizons = [3]
)

#shap explainer trained on test set and used to explain test set
shap_explainer_test = ShapExplainer(
    lgbm_model,
    background_series = hf_data_dict['target_hf'][: split_date],
    background_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: ],
    background_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: ]
)

df2 = shap_explainer_test.explain(
    foreground_series = hf_data_dict['target_hf'][:split_date],
    foreground_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: ],
    foreground_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: ],
    horizons = [3]
)

#shap explainer trained on test set but changing foreground series to same as past and future cov
shap_explainer_test = ShapExplainer(
    lgbm_model,
    background_series = hf_data_dict['target_hf'][test_set_start_date: ],
    background_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: ],
    background_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: ]
)

df3 = shap_explainer_test.explain(
    foreground_series = hf_data_dict['target_hf'][:split_date],
    foreground_past_covariates = hf_data_dict['past_cov_hf'][test_set_start_date: ],
    foreground_future_covariates = hf_data_dict['future_cov_hf'][test_set_start_date: ],
    horizons = [3]
)

All three dataframes return exactly the same values, which seems very odd to me, since each ShapExplainer is built with a different background series. The base_values are also identical. Is there a bug somewhere in the implementation, or am I using the function incorrectly?
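For what it's worth, in SHAP the base value is the expected model output over the background sample, so two genuinely different backgrounds should normally shift the base_values. A toy sketch of that dependence (plain Python with a made-up model, nothing Darts-specific):

```python
# Sketch of why identical base_values across backgrounds are suspicious:
# SHAP's base value is E[f(X)] over the background sample, so a different
# background should normally give a different base value.
def f(x):
    """Toy model standing in for the fitted LightGBM estimator."""
    return 3.0 * x + 1.0

def base_value(background):
    """Mean model output over a background sample."""
    return sum(f(x) for x in background) / len(background)

train_bg = [0.0, 1.0, 2.0]
test_bg = [10.0, 11.0, 12.0]

print(base_value(train_bg))  # 4.0
print(base_value(test_bg))   # 34.0
```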
