[SYSTEMDS-3782] Bag-of-words Encoder for SP #2145

e-strauss · 2024-11-21T22:49:00Z

This patch extends the bag-of-words operation to support transform apply and the distributed Spark backend.

To support the bag-of-words (bow) encoder for sparse outputs, I had to adapt the OutputMatrixPreProcessing step in the transform apply framework, because allocating the sparse output matrix requires the number of non-zero values per row. For the other encoders this is easy, since each of them results in just one non-zero value per encoder.
In transform encode, this information about the number of non-zeros is gathered in the build phase. Since transform apply has no build phase, the information is not available there. To solve this, I added a simplified build-like phase to OutputMatrixPreProcessing that computes the #nnz for each bow encoder whenever the output is sparse and the transformation involves a bow encoder. This phase is parallelized across the bow encoder columns.
Whether the output is sparse is decided from an estimate of the total #nnz of the bow encoders, which is computed from a sample of the input.
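
The idea behind the pre-pass and the sample-based sparsity decision can be illustrated with a small, self-contained sketch. This is not the SystemDS implementation: the class and method names, the whitespace tokenization, the "first rows" sampling, and the threshold parameter are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.IntStream;

// Hypothetical sketch; none of these names exist in SystemDS.
final class BowNnzSketch {

    // Number of distinct tokens in one cell, i.e. the non-zeros this cell
    // contributes to its output row. Whitespace tokenization is an assumption.
    public static int distinctTokens(String cell) {
        if (cell == null || cell.trim().isEmpty())
            return 0;
        Set<String> seen = new HashSet<>(Arrays.asList(cell.trim().split("\\s+")));
        return seen.size();
    }

    // Build-like pre-pass: per-row nnz summed over all bow columns.
    // columns[c][r] holds the text of bow column c in row r.
    // The bow columns are processed in parallel, one task per column.
    public static int[] nnzPerRow(String[][] columns, int numRows) {
        int[][] perCol = new int[columns.length][];
        IntStream.range(0, columns.length).parallel().forEach(c -> {
            int[] counts = new int[numRows];
            for (int r = 0; r < numRows; r++)
                counts[r] = distinctTokens(columns[c][r]);
            perCol[c] = counts; // each task writes only its own slot
        });
        int[] total = new int[numRows];
        for (int[] counts : perCol)
            for (int r = 0; r < numRows; r++)
                total[r] += counts[r];
        return total;
    }

    // Sample-based decision: count nnz in the first sampleSize rows, scale the
    // count to all rows, and compare the estimated sparsity with a threshold.
    // Both the sampling scheme and the threshold are assumptions of this sketch.
    public static boolean estimateSparse(String[] column, int sampleSize,
            long numOutCols, double threshold) {
        int n = Math.min(sampleSize, column.length);
        if (n == 0)
            return true;
        long sampleNnz = 0;
        for (int r = 0; r < n; r++)
            sampleNnz += distinctTokens(column[r]);
        double estTotalNnz = (double) sampleNnz / n * column.length;
        double estSparsity = estTotalNnz / ((double) column.length * numOutCols);
        return estSparsity < threshold;
    }
}
```

With per-row counts available up front, a sparse output row can be allocated once with the right capacity instead of being grown while the encoders write into it.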

In one of the Spark test cases, where I cbind the input frame with itself before executing transform encode, I encountered an alignment issue during the map-based frame append. For now I bypass it by adding a breaker (while(FALSE)). I am investigating the bug in a separate PR.

codecov · 2024-11-21T23:15:08Z

Codecov Report

Attention: Patch coverage is 92.19331% with 21 lines in your changes missing coverage. Please review.

Project coverage is 71.13%. Comparing base (08875cb) to head (02ce4bf).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...time/transform/encode/ColumnEncoderBagOfWords.java	86.66%	0 Missing and 10 partials ⚠️
...s/runtime/transform/encode/MultiColumnEncoder.java	94.73%	6 Missing and 3 partials ⚠️
...ntime/transform/encode/ColumnEncoderComposite.java	75.00%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2145      +/-   ##
============================================
+ Coverage     71.09%   71.13%   +0.04%     
- Complexity    43451    43527      +76     
============================================
  Files          1450     1450              
  Lines        166331   166500     +169     
  Branches      32424    32464      +40     
============================================
+ Hits         118252   118444     +192     
+ Misses        38836    38809      -27     
- Partials       9243     9247       +4

e-strauss · 2024-11-22T17:41:21Z

test.functions.federated.primitives.part5 seems to be a bit unstable. I had to rerun the test a couple of times, and I have seen it fail occasionally in previous commits as well.

e-strauss · 2024-11-22T22:15:48Z

src/test/scripts/functions/transform/TransformFrameEncodeApplyBagOfWords.dml

+total = 0
+j = 0
+# set to 20 for benchmarking
+while(i < 30){


can be set to "i < 1"

e-strauss · 2024-11-22T22:22:39Z

...est/java/org/apache/sysds/test/functions/transform/ColumnEncoderMixedFunctionalityTests.java

test cases to improve code coverage

mboehm7 · 2024-11-24T14:06:23Z

LGTM - thanks for the patch @e-strauss. During the merge, I only fixed minor formatting issues (use of this, empty lines) and removed assertions and unnecessary imports.

e-strauss changed the title ~~[WIP}Bag of words encoder Spark backend~~ [SYSTEMDS-3782] Bag-of-words Encoder for CP [WIP] Nov 21, 2024

e-strauss changed the title ~~[SYSTEMDS-3782] Bag-of-words Encoder for CP [WIP]~~ [SYSTEMDS-3782] Bag-of-words Encoder for SP [WIP] Nov 21, 2024

[SYSTEMDS-3782] Bag-of-words encoder for SP

48cb3c5

e-strauss force-pushed the bow_spark branch from 02ce4bf to 48cb3c5 Compare November 22, 2024 18:00

e-strauss changed the title ~~[SYSTEMDS-3782] Bag-of-words Encoder for SP [WIP]~~ [SYSTEMDS-3782] Bag-of-words Encoder for SP Nov 22, 2024

e-strauss requested a review from mboehm7 November 22, 2024 21:21

mboehm7 closed this in c21273c Nov 24, 2024
