-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving quantile #426
base: main
Are you sure you want to change the base?
Moving quantile #426
Conversation
a2b4455
to
77d109b
Compare
values. | ||
|
||
The quantile calculated in each window is equivalent to numpy's | ||
`"averaged_inverted_cdf"` method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comments that the op only work on floating point features (or make some implicit conversion)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently, I also support int as a valid input, however the output is converted to float
float -> float
double -> double
int -> float
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*Added a comment on the doc explaining the point above
f"Received {quantile}" | ||
) | ||
self.quantile = quantile | ||
super().__init__(input, window_length, sampling) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to have it at the top? If so, can you add a comment. If not, I would move it at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there was a reason but I had so many backs and forth with this one that I forgot why. Let me try to put at the top as well as change quantile to _quantile and see if anything breaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work, the reason is that the init of the base class runs a self.check()
at the end and this needs all the attributes to be defined including the quantile
CustomHeap(std::function<bool(T, T)> compare) : compare(compare) {} | ||
|
||
void push(T value) { | ||
heap.push_back(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is invalidating all the iterators in "val_to_node" right? Can you add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't because heap is a double-linked list so all the existing nodes should remain in the same place. That's the reason why I'm using lists instead of vectors
auto it = std::prev(heap.end()); | ||
val_to_node[value] = it; | ||
// TODO: better sorting? | ||
while (it != heap.begin()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks expensive, can you add an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, and I suspect this is where we spend the most time.
I need the heap to have the biggest (according to compare
) item at the top and this was the easiest way to ensure that, I'm open to suggestions for a better sorting algorithm or a better data structure for the heap (trees maybe?)
} else { | ||
auto value = heap.back(); | ||
heap.pop_back(); | ||
auto it = val_to_node.find(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if there are multiple time the same value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a problem I forgot to add a comment addressing that. It's not a problem in this case because I'm storing the indices as the heap "values" so they are never going to repeat
val_to_node.erase(it); | ||
} else { | ||
// TODO: exception meant for debugging, remove it | ||
throw std::invalid_argument("removing a value that doesn't exists"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch forgot to remove this one, it was only there for debugging
void remove(T value) { | ||
auto it = val_to_node.find(value); | ||
if (it != val_to_node.end()) { | ||
heap.erase(it->second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, the heap is not valid anymore, right?
Can you add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment above, it shouldn't because this is a double-linked list so this node is gone, then the next of the previous node and the previous of the next node are updated but all the nodes should remain in the same place.
This PR implements a slow but exact (as in not an approximation) of the moving quantile.
Implementation considerations:
The implementation splits the window into smaller and bigger values split into 2 heaps while balancing them to keep the desired value near the top, for instance, if looking for the median the heaps will be balanced, if looking for the 0.25 percentile the smaller heap will containing 20% of values.
It uses custom heaps because we need to remove items from the heap (that may or may not be at the top) as they are removed from the windows, for this a mapping is maintained in memory.
In cases where the desired quantile is between 2 values, the average of the 2 closer values is used.
The result is equivalent to Numpy's
"averaged_inverted_cdf"
as tested in one of the unit tests.I'm going to measure memory usage shortly but this could be an issue with long windows.
Timing considerations:
moving_quantile
is significantly slower than other moving operations (~20X slower thansimple_moving_average
)df.rolling(window=w, center=False).quantile(q)
. It's slower with longer windows and extreme quantiles (these make on of the heaps bigger which makes inserts more costly). See plots belowNote that our solution also works on non-uniformly sampled data where pandas wouldn't.
Other changes:
quantile