Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add robustness to leading null values in series in 'ewma_by_time'. #72

Merged
merged 6 commits into from
Mar 25, 2024

Conversation

wbeardall
Copy link

Following up #71, I've added an ignore_nulls flag to 'ewma_by_time' with a default True value. The main design choices are as follows:

  1. When the sequence starts with a NaN, all values in the EWMA should also be NaN, up until the first non-NaN index, which should return its own value.

  2. A string of consecutive NaN values should all have the same output as proceeding non-NaN value, as no additional information has been added to the sequence.

@MarcoGorelli
Copy link
Collaborator

thanks for your pr

I think we need to distinguish nan with null here

…line with pl.ewma_mean. Added robustness to null values at the start of value arrays
@wbeardall wbeardall changed the title Added mechanism for ignoring consecutive NaN values in 'ewma_by_time'. Added mechanism for ignoring null values in 'ewma_by_time'. Mar 23, 2024
@MarcoGorelli
Copy link
Collaborator

Nice one, thanks

Something to discuss is ignore_nulls, because it's different to Polars' own ewm_mean. There, ignore_nulls determines how the weights are calculated (based on absolute positions or not), but the output always contains a value:

In [18]: >>> import polars as pl
    ...: >>> s = pl.Series([1.1, 2.5, 2.6, 2.1, None, 5.1])
    ...: >>> s.fill_nan(None).ewm_mean(alpha=.1, ignore_nulls=True)
Out[18]:
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        2.113085
        2.842473
]

In [19]: >>> import polars as pl
    ...: >>> s = pl.Series([1.1, 2.5, 2.6, 2.1, None, 5.1])
    ...: >>> s.fill_nan(None).ewm_mean(alpha=.1, ignore_nulls=False)
Out[19]:
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        2.113085
        2.902107
]

Here, on the other hand, the weights are determined based on times, rather than based on positions. If we use the same argument name, then it risks being quite confusing

Maybe this is a chance to revisit the ewm_mean API in Polars itself...I think it was copied from pandas a few years ago, although now it probably makes sense to make some improvements on it

@wbeardall
Copy link
Author

Yeah you're right about this inconsistency, I can see how this could be confusing... would it be better to call the kwarg fill_nulls or similar here? It's probably also worth expanding the documentation to explain comprehensively how this performs

@MarcoGorelli
Copy link
Collaborator

Does there need to be another kwarg for this? It's easy enough to just do .forward_fill, maybe the default should just be to preserve null values, and if anyone wants to forward-fill them, they can just do .forward_fill?

I've made an issue about this in Polars if you want to voice your opinion pola-rs/polars#15258

@wbeardall
Copy link
Author

I think you're right here, making the null filling explicit using .forward_fill seems better both in terms of expression clarity, user control and interface simplicity.

@wbeardall
Copy link
Author

That brings the question back to ewma_by_time though; should the default behaviour for ewma_by_time be that null values are not filled by default? This seems idiomatically correct, is trivial to implement, and removes the need for an additional kwarg here too. This would change my earlier PR back to simply making the current implementation robust to leading nulls.

@MarcoGorelli
Copy link
Collaborator

sounds good to me, thanks!

…to check for consistent behaviour of ewma_by_time for series with leading nulls.
@wbeardall wbeardall changed the title Added mechanism for ignoring null values in 'ewma_by_time'. Add robustness to leading null values in series in 'ewma_by_time'. Mar 24, 2024
@wbeardall
Copy link
Author

I've added the changes, and changed the heading to reflect this.

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! minor comment but looks good

Comment on lines 937 to 938
ignore_nulls
Ignore missing values when calculating weights.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

Comment on lines 24 to 27
for (idx, value) in values.iter().enumerate() {
match value {
Some(value) => {
prev_time = times.get(idx).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this going to panic if times[idx] is None? do we need to zip over both values and times here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will; I've fixed.

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @wbeardall !

@MarcoGorelli MarcoGorelli merged commit f9e083d into pola-rs:main Mar 25, 2024
14 checks passed
@MarcoGorelli
Copy link
Collaborator

@wbeardall would you be interested in trying to upstream this to Polars itself?

@wbeardall
Copy link
Author

Yes, schedule allowing! What is your timeline looking like for upstreaming polars-xdt to the main codebase?

@MarcoGorelli
Copy link
Collaborator

There's no schedule, this was just created to allow to quickly make features available that it's not clear should be included in Polars itself.

There's appetite for upstreaming this feature, as well as business days, but the rest isn't as clear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants