add CombinatorialGapKFold #41

Open
aldder wants to merge 9 commits into master

Conversation

@aldder commented Feb 1, 2022

From "Advances in Financial Machine Learning" book by Marcos López de Prado
the implemented version of Combinatorial Cross Validation with Purging and Embargoing

image

explaining video: https://www.youtube.com/watch?v=hDQssGntmFA
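
Below is a minimal usage sketch of the proposed splitter (not taken from the PR itself). The constructor arguments N and k follow the attributes shown in the review below; the gap_before/gap_after purging parameters are an assumption and may differ from the merged code.

    import numpy as np
    from tscv import CombinatorialGapKFold  # the splitter added by this PR

    X = np.arange(100).reshape(50, 2)
    y = np.arange(50)

    # Split the 50 samples into N=5 contiguous groups and use every
    # combination of k=2 groups as the test set (C(5, 2) = 10 splits);
    # samples adjacent to the test groups are purged via the gap arguments.
    cv = CombinatorialGapKFold(N=5, k=2, gap_before=1, gap_after=1)

    for train_index, test_index in cv.split(X, y):
        print(train_index.shape, test_index.shape)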

@pep8speaks commented Feb 1, 2022

Hello @aldder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 451:9: F841 local variable 'gap_before' is assigned to but never used
Line 451:21: F841 local variable 'gap_after' is assigned to but never used

Line 672:80: E501 line too long (102 > 79 characters)
Line 678:80: E501 line too long (103 > 79 characters)
Line 684:80: E501 line too long (115 > 79 characters)
Line 690:80: E501 line too long (119 > 79 characters)
Line 698:80: E501 line too long (106 > 79 characters)

Comment last updated at 2022-02-11 12:19:56 UTC

@WenjieZ (Owner) commented Feb 5, 2022

Hi @aldder, please try to add some test cases in the test_split.py file.

@codecov bot commented Feb 7, 2022

Codecov Report

Merging #41 (c7b2bed) into master (c05265a) will increase coverage by 0.37%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #41      +/-   ##
==========================================
+ Coverage   97.51%   97.88%   +0.37%     
==========================================
  Files           3        3              
  Lines         643      756     +113     
==========================================
+ Hits          627      740     +113     
  Misses         16       16              
Impacted Files             Coverage Δ
tscv/__init__.py           100.00% <100.00%> (ø)
tscv/_split.py             94.50% <100.00%> (+0.72%) ⬆️
tscv/tests/test_split.py   99.78% <100.00%> (+0.04%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c05265a...c7b2bed.

@WenjieZ (Owner) left a comment

Please see the comments and make changes accordingly.

tscv/_split.py (Outdated):

        self.n_groups = N
        self.test_splits = k

    def split(self, X, y=None, groups=None):
@WenjieZ (Owner):
The canonical way of doing this is to redefine _iter_test_indices(self, X, y=None, groups=None) from the base class and generate only the test indices. The base class will take care of the rest. Please refer to the other derived classes and make modifications accordingly.
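
For reference, here is a rough sketch of the structure being suggested (not the PR's actual code; purging and embargoing are omitted for brevity, and the import path of the GapCrossValidator base class is assumed):

    from itertools import combinations

    import numpy as np
    from sklearn.utils.validation import _num_samples

    from tscv._split import GapCrossValidator  # base class location assumed

    class CombinatorialGapKFold(GapCrossValidator):
        ...  # __init__ sets self.n_groups and self.test_splits

        def _iter_test_indices(self, X, y=None, groups=None):
            # Yield only the test indices for each combination of groups;
            # the base class derives the training indices as the complement.
            n_samples = _num_samples(X)
            group_indices = np.array_split(np.arange(n_samples), self.n_groups)
            for combo in combinations(range(self.n_groups), self.test_splits):
                yield np.concatenate([group_indices[i] for i in combo])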

@aldder (Author):

The problem here is that, in order to check whether the training set size is > 0, we need to compute the complement of the test indices after generating them. And if we already do this, there is no point in discarding that information only to recalculate it afterwards.
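
Concretely, the complement computation in question amounts to a set difference over the sample indices; a tiny illustration (not the PR's code):

    import numpy as np

    n_samples = 100
    test_indices = np.arange(80, 100)

    # The training indices are the complement of the test indices; the
    # non-emptiness check under discussion is then `train_indices.size > 0`.
    train_indices = np.setdiff1d(np.arange(n_samples), test_indices)
    assert train_indices.size > 0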

@WenjieZ (Owner) commented Feb 11, 2022:

The check of the training set size (and the test set size in some cases) can be delayed and implemented in GapCrossValidator._iter_train_indices() and its 3 siblings. You can also implement it in GapCrossValidator.split() if you find it a hassle to implement it four times.

@WenjieZ (Owner) commented Feb 11, 2022:

On second thought, I don't find it necessary to check the non-emptiness of the training set. Some models/algorithms can output a default estimator/strategy given an empty training set (e.g., equal weight portfolio). If your model/algorithm requires a non-empty training set, it's probably a good idea to check it in the model/algorithm rather than in the cross-validator. That's why I didn't implement this check when I released the package.

That said, I probably forgot to check the non-emptiness of the test set.
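
If such a check were added later, one possible shape for it is a thin wrapper around any cross-validator's split() (an illustrative sketch only, not part of this PR):

    def split_with_test_check(cv, X, y=None, groups=None):
        # Re-yield the folds of an existing cross-validator, rejecting any
        # fold whose test set turns out to be empty.
        for train_index, test_index in cv.split(X, y, groups):
            if len(test_index) == 0:
                raise ValueError("Encountered a fold with an empty test set.")
            yield train_index, test_index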

@aldder (Author):

OK then, I will replace the implementation of the split method with _iter_test_indices.

As for the non-emptiness of the test set, I think you would prefer to implement it in the base class in a dedicated PR, so I won't touch it, OK?

@WenjieZ (Owner):

Your judgement is correct.

tscv/_split.py (Outdated):

        n_splits : int
            Returns the number of splitting iterations in the cross-validator.
        """
        return len(list(combinations(range(self.n_groups), self.test_splits)))
@WenjieZ (Owner):

Use the combination number to compute the result directly rather than instantiating all combinations.
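
One way to follow this suggestion is scipy.special.comb with exact=True (a sketch of the revised method, not necessarily the PR's final wording):

    from scipy.special import comb

    class CombinatorialGapKFold(GapCrossValidator):
        ...

        def get_n_splits(self, X=None, y=None, groups=None):
            # C(n_groups, test_splits) computed directly, without
            # materializing every combination just to count it.
            return comb(self.n_groups, self.test_splits, exact=True)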

from inspect import signature
from scipy.special import comb
@WenjieZ (Owner):

Please add scipy in TSCV/setup.py (line 55 in 2abbc3d):

    install_requires=['numpy>=1.13.3', 'scikit-learn>=0.22']
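
For instance (the scipy version floor below is an assumption; the review does not specify one):

    install_requires=['numpy>=1.13.3', 'scikit-learn>=0.22', 'scipy>=1.0.0']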
