Data leakage #12

Open
FutureGoose opened this issue Oct 22, 2023 · 1 comment

Comments

Hi 🐍 Matt Harrison,

I'm thoroughly enjoying your book on XGBoost, but I noticed what might be data leakage during hyperparameter tuning. Specifically, on pages 46, 47, 48, and 49, it seems that both the training and test data are used for model fitting.

If this approach is intentional, could you please clarify the rationale? I'd greatly appreciate your insights.

Thank you again for the excellent book.

Example from p. 47 with my inline comments:

# Reflecting on the below approach: Using GridSearchCV with both training and testing data combined 
# can introduce data leakage. By tuning hyperparameters this way, the model might be indirectly influenced 
# by the test data, leading to potentially biased selection of "best" hyperparameters. It's best practice 
# to perform hyperparameter tuning only on the training dataset to ensure that the chosen parameters 
# generalize well to unseen data.

from sklearn.model_selection import GridSearchCV
params = {
    'max_depth': [3, 5, 7, 8],
    'min_samples_leaf': [1, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 6],
}
grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(),
                          param_grid=params, cv=4, n_jobs=-1,
                          verbose=1, scoring='accuracy')
grid_search.fit(pd.concat([X_train, X_test]),
    pd.concat([kag_y_train, kag_y_test]))
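For comparison, here is a rough sketch of what I assume a leakage-free version would look like (reusing the book's X_train, X_test, kag_y_train, and kag_y_test splits): tune on the training split only, then score the refit best estimator once on the held-out test set.

# Sketch only (my assumption, not the book's code): fit the grid search on the
# training split so the test split stays unseen until the final evaluation.
from sklearn import tree
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [3, 5, 7, 8],
    'min_samples_leaf': [1, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 6],
}
grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(),
                           param_grid=params, cv=4, n_jobs=-1,
                           verbose=1, scoring='accuracy')
grid_search.fit(X_train, kag_y_train)   # training data only
print(grid_search.best_params_)
# refit=True (the default) retrains the best estimator on all of X_train,
# so the test set is only touched once, for the final score:
print(grid_search.best_estimator_.score(X_test, kag_y_test))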
FutureGoose (Author) commented Oct 28, 2023

On page 57, we encounter a similar problem, where the validation curve is fit on both the training and the test data:

7.5  Training the Number of Trees in the Forest
from yellowbrick.model_selection import validation_curve
fig, ax = plt.subplots(figsize=(10,4))
viz = validation_curve(xgb.XGBClassifier(random_state=42),
    x=pd.concat([X_train, X_test], axis='index'),
    y=np.concatenate([y_train, y_test]),
    param_name='n_estimators', param_range=range(1, 100, 2),
    scoring='accuracy', cv=3,
    ax=ax)

rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29)
rf_xg29.fit(X_train, y_train)
rf_xg29.score(X_test, y_test)
0.7480662983425415

EDIT: There are also some typos here: x=pd.concat([X_train, X_test], axis='index')
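For reference, a minimal sketch of how I'd run the curve without the leak (my assumption, reusing the book's X_train / y_train split, and with the uppercase X keyword that I believe yellowbrick's validation_curve expects):

# Sketch only (my assumption, not the book's code): compute the validation
# curve on the training split so the test split stays unseen.
import matplotlib.pyplot as plt
import xgboost as xgb
from yellowbrick.model_selection import validation_curve

fig, ax = plt.subplots(figsize=(10, 4))
viz = validation_curve(xgb.XGBClassifier(random_state=42),
    X=X_train, y=y_train,               # training split only
    param_name='n_estimators', param_range=range(1, 100, 2),
    scoring='accuracy', cv=3,
    ax=ax)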
