Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed Improvement in binscatter Method #7

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jameshgrn
Copy link

Hey there! had to use this function on millions on points and decided to try and make it faster! The refactored code is expected to be faster due to several changes that eliminate unnecessary steps, reduce redundancy, and improve performance. Specifically, the changes include:

  • Replacing np.linspace with np.arange in _get_bins function to avoid casting to int, which can be computationally expensive for large arrays. Using np.arange also avoids the need to add 1 to n_bins, as np.arange already excludes the endpoint.
  • Using np.sort([x, y], axis=1) in place of separate sorts for x and y in get_binscatter_objects function. This change eliminates the need for a separate argsort step and reduces redundancy.
  • Using the same linear regression object for both x and y demeaning in get_binscatter_objects function reduces redundancy and avoids the need to fit two separate regression models.
  • Removing unnecessary checks for sortedness and using np.all instead of np.any in the check for negative differences in x eliminates unnecessary computation.
  • Passing numpy arrays directly to get_binscatter_objects rather than converting them via np.asarray avoids unnecessary copying and casting.

Overall, these changes make the code more efficient and faster without sacrificing functionality. The updated binscatter method passes all tests and is ready for review.

…mprove performance and readability. Replaced np.linspace with np.arange in _get_bins to avoid casting to int. Replaced np.sort with np.sort([x, y], axis=1) to sort both x and y arrays simultaneously, eliminating the need for a separate argsort step. Reduced redundancy by using the same linear regression object for both x and y demeaning. Removed unnecessary checks for sortedness and replaced np.any with np.all in the check for negative differences in x. Passed numpy arrays directly to get_binscatter_objects rather than converting them to arrays via np.asarray. Made minor documentation improvements. All tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant