Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Improve efficiency of Histogram Distribution #405

Open
ShreeshaM07 opened this issue Jun 23, 2024 · 1 comment
Open

[ENH] Improve efficiency of Histogram Distribution #405

ShreeshaM07 opened this issue Jun 23, 2024 · 1 comment
Labels
maintenance Continuous integration, unit testing & package distribution

Comments

@ShreeshaM07
Copy link
Contributor

Describe the maintenance issue

The Histogram distribution implemented in #382, #335 takes 2 parameters bins and bin_mass which consists of ragged arrays in the 2D case. These ragged arrays are stored in a list as it is not possible to vectorize the ragged inputs easily. One way to vectorize them and make all of them the same shape is to pad them with 0s in case of bin_mass and pad bins with -np.inf and np.inf on the left and right side to make them equal in length.

But the problem with this approach of vectorizing is that it takes longer time than the current approach, as although the running times of the methods mean,var per se are improved by a factor of 5 but the time to pad the inputs in the above mentioned way itself is taking a lot of time which is giving worse efficiency results overall than the current approach.

Refer here to know more about the benchmarking of the Histogram Distribution.

The idea of taking the input from the user itself in this vectorized way with all the inputs padded with 0s and infs does not seem to be a very good idea as this would be very inconvenient for the user to pad them manually in cases where the lengths of the inputs vary by a big number and this would also not allow for tuple inputs in cases where the bins are of equal widths.

The Histogram Distribution inherits from the _BaseArrayDistribution which inherits the BaseDistribution with some overriding of private functions to accomodate the array distribuitons. Thoughts on ways of merging this with BaseDistribution without having to create a separate base class for arrays is also appreciated.

@ShreeshaM07 ShreeshaM07 added the maintenance Continuous integration, unit testing & package distribution label Jun 23, 2024
@fkiraly fkiraly changed the title [MNT] Improve efficiency of Histogram Distribution [ENH] Improve efficiency of Histogram Distribution Jun 23, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Jun 23, 2024

Indeed - there is also the open question how the parameters should be encoded, as an "inefficient" representation may mask any benfits from vectorization.

This is an open question, I would guess that pd.DataFrame with nested row index might be a way to go, since we can make use of groupby etc.

Less of a priority for now though, so parking thoughts in this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Continuous integration, unit testing & package distribution
Projects
None yet
Development

No branches or pull requests

2 participants