-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYSTEMDS-3782] Bag-of-words Encoder for SP #2145
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2145 +/- ##
============================================
+ Coverage 71.09% 71.13% +0.04%
- Complexity 43451 43527 +76
============================================
Files 1450 1450
Lines 166331 166500 +169
Branches 32424 32464 +40
============================================
+ Hits 118252 118444 +192
+ Misses 38836 38809 -27
- Partials 9243 9247 +4 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
test.functions.federated.primitives.part5 seems to be a bit unstable, I had to rerun the test couple times and I have seen that the test occasionally failed in previous commits as well |
total = 0 | ||
j = 0 | ||
# set to 20 for benchmarking | ||
while(i < 30){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be set to "i < 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test cases to improve code coverage
LGTM - thanks for the patch @e-strauss. During the merge I only fixed minor formatting things (use of this, empty lines) as well as removed assertions and unnecessary imports. |
This patch extends the bag-of-words operation to support transform apply and the distributed spark backend.
To support the bow encoder for sparse outputs I had to adapt the OutputMatrixPreProcessing step in the transform apply framework, since we need to know the number of non-zero values per row when allocating the output sparse matrix. This is easy for the other encoders, since they resulted in just one non-zero value per encoder.
In transform encode we gained this information about the number of non-zeroes in the Build phase. Since we don't have a build phase in transform apply, we don't dont have this information. To solve this issue, I added a simplified build-like phase in the OutputMatrixPreProcessing, which computes the #nnz for each bow encoder, if the output is sparse and the transformation involves a bow encoder. This phase is parallelized across the bow encoder columns.
The decision if the output is sparse is based on an estimation of the total #nnz of bow encoder, which is calculated based on a sample of the input.
In one of the spark test cases, where I cbind the input frame with itself before executing the transform encode, I encountered an alignment issue while the map-based frame append, which I could bypass for now by adding a breaker (while(FALSE). I am investigating the bug in a separate PR.