Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Enhancements: Batch Processing, Crosswalk Table, and Matching Toggle" #73

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

saamiyasiddiqui11
Copy link

@saamiyasiddiqui11 saamiyasiddiqui11 commented Dec 9, 2024

Description

This pull request introduces enhancements to the Harmony Python library by addressing two key improvements:

Batch Processing in matcher.py (Task 1):

Implemented batch processing to handle large datasets more efficiently.
Introduced an environment variable, BATCH_SIZE, which allows configurable batch sizes for processing.

Unit Tests for Batch Processing (Task 2):

Added comprehensive unit tests to ensure the batch processing functionality works as expected under various scenarios (e.g., empty lists, single-item batches, large datasets).
Verified edge cases to maintain reliability and backward compatibility.
No new third-party dependencies were added to requirements.txt or pyproject.toml.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Testing

The following tests were conducted to validate the changes:

Unit Tests:
Verified that items are correctly divided into batches based on the specified BATCH_SIZE.
Tested various edge cases such as empty lists, single-item inputs, and batch sizes larger than the dataset.
Ensured the batch processing logic integrates smoothly with existing functionality.
Manual Verification:
Ran unit tests on the Harmony Python library to confirm no regressions or unintended behavior.
Validated that the Harmony API continues to function correctly with these changes.

Since the Harmony Python package is used by the Harmony API (which is itself used by the R library and the web app), we need to avoid making any changes that break the Harmony API. Please also run the Harmony API unit tests and check that the API still runs with your changes to the Python package: https://github.com/harmonydata/harmonyapi

Test Configuration

  • Library version: Python 3.10
  • OS: macOS
  • Toolchain: unittest framework

Checklist

  • My code follows the style guidelines of this project. I have applied a Linter (recommended: Pycharm's code formatter) to make my whitespace consistent with the rest of the project.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings
  • The Harmony API is not broken by my change to the Harmony Python library
  • If I introduced a new feature, I documented it (e.g. making a script example in the script examples repository so that people will know how to use it.

Optionally: feel free to paste your Discord username in this format: discordapp.com/users/yourID in your pull request description, then we can know to tag you in the Harmony Discord server when we announce the PR.

@woodthom2
Copy link
Contributor

Hi, thanks so much for making this PR and I really appreciate the time you spent on it!

Just a couple of things: someone has already made a PR which has been merged for the batching:

#69
#66

Also for the crosswalk table:

#65

Since these are already merged into main would you please be able to check if your code improves on the PRs that were already merged and make separate PRs based on these?

Secondly since this looks like it's three separate issues, could you please split it up? Since the matching toggle is something that nobody has fixed yet so I can merge that into main if there is a PR just for that, whereas a big PR is difficult to merge.

So if possible please could you split it into three PRs? thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants