
AJ-1227: parse incoming PFBs into tables #387

Merged: 57 commits into main from aj-1227-parse-tables, Nov 14, 2023

Conversation

@calypsomatic (Contributor) commented Oct 27, 2023:

WIP

Reminder:

PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to Kubernetes. That step must be triggered manually by running the following GitHub Action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Don't forget to provide a Jira ID when triggering the manual action; if no Jira ID is provided, the action will not fully succeed.

After you manually trigger the GitHub Action (and it completes with no errors), you must go to the terra-helmfile repo and verify that it generated a PR that merged successfully.

The terra-helmfile PR merge will then generate a PR in Leonardo. That PR will automerge if all tests pass, but if Jenkins tests fail it will not, so watch it to ensure it merges. To trigger a Jenkins retest, simply comment "jenkins retest" on the PR.

for (Map.Entry<RecordType, List<Record>> recList : sortedRecords.entrySet()) {
    RecordType recType = recList.getKey();
    List<Record> rList = recList.getValue();
    schema = inferer.inferTypes(records);
Contributor (Author):
With a single record type, we only check the record type for the first batch; afterwards, I suppose we fail if later records don't match the initial schema. Since a new record type could show up in any batch, I infer the type every time. This could be simplified by checking against the result map and not re-inferring, or potentially by using java-pfb to figure out the schema ahead of time. Opinions?

Collaborator:
I added some code around this to only change the schema the first time a given record type is seen. I'm also pushing a lot of this work off to https://broadworkbench.atlassian.net/browse/AJ-1452 (I still need to add details to the Jira ticket).
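
As a sketch of that approach (not the exact code merged here), a first-seen check lets inference run once per record type. It reuses sortedRecords, inferer, and schema from the snippet above; typesSeen is an illustrative addition:

// assumes java.util.{Set, HashSet, List, Map}
Set<RecordType> typesSeen = new HashSet<>();
for (Map.Entry<RecordType, List<Record>> entry : sortedRecords.entrySet()) {
    RecordType recType = entry.getKey();
    List<Record> recordsForType = entry.getValue();
    if (typesSeen.add(recType)) {
        // first time this type appears in the stream: infer and apply its schema
        schema = inferer.inferTypes(recordsForType);
    }
    // ... write recordsForType against the schema for recType ...
}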

RecordType.valueOf(genRec.get("name").toString()),
RecordAttributes.empty());
GenericRecord objectAttributes = (GenericRecord) genRec.get("object"); // contains attributes
Schema schema = objectAttributes.getSchema();
Collaborator:
Is the Schema dynamic per GenericRecord? I'm wondering if there's any way to get the schema once per record type instead of once per record, to potentially optimize the flow a bit.
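
For reference, Avro attaches a Schema to every GenericRecord via getSchema(), but records of the same type share one schema, so it could be fetched once per type and cached. A minimal sketch, where schemaCache is an illustrative addition to the snippet above:

// assumes java.util.{Map, HashMap} and org.apache.avro.Schema
Map<RecordType, Schema> schemaCache = new HashMap<>();

RecordType recType = RecordType.valueOf(genRec.get("name").toString());
GenericRecord objectAttributes = (GenericRecord) genRec.get("object");
// getSchema() comes from Avro's GenericContainer; reuse the cached schema
// once this record type has been seen
Schema schema = schemaCache.computeIfAbsent(recType, rt -> objectAttributes.getSchema());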


String fieldName = field.name();
// get(fieldName) returns null for missing fields, so no null check is needed
Object value = objectAttributes.get(fieldName);
attributes.putAttribute(fieldName, value);
Collaborator:
This is where we'd need to convert the native PFB types into the types that Record expects, such as datetimes and BigDecimals, right?

Contributor (Author):
It's figured out during BatchWriteService.consumeWriteStream, which calls inferTypes. But maybe if we did something here we could make that more reliable, or avoid calling inferTypes for PFBs altogether.

Collaborator:
I think it may end up being a two-step operation. Here, we translate from the Avro Java types (e.g. Long) to the expected WDS Java types (e.g. BigDecimal); then, in inferTypes, we notice that it's a BigDecimal and therefore categorize it as DataTypeMapping.NUMBER. At least that's my proposal.
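
A minimal sketch of what step one might look like; convertAttributeType is a hypothetical helper (not code from this PR), and the exact set of Avro types worth handling is an assumption:

// assumes java.math.BigDecimal
private Object convertAttributeType(Object value) {
    if (value == null) {
        return null;
    }
    // Avro numeric types become BigDecimal, which inferTypes would
    // then categorize as DataTypeMapping.NUMBER
    if (value instanceof Long || value instanceof Integer
            || value instanceof Float || value instanceof Double) {
        return new BigDecimal(value.toString());
    }
    // Avro strings often arrive as org.apache.avro.util.Utf8
    if (value instanceof CharSequence) {
        return value.toString();
    }
    return value;
}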


// type, so this will result in a grouping of 1.
// TSV and JSON inputs are validated against the recordType argument. PFB inputs pass
// a null recordType argument so there is nothing to validate.
Map<RecordType, List<Record>> sortedRecords =
Contributor:
How do you feel about Guava data structures? This (and several other complex data structures in this PR) looks like it might lend itself well to a higher-level representation. In this case, I wonder if Multimap (ref: https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html) might simplify things.
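
As a sketch of the suggestion: Guava's Multimaps.index does the grouping in one call. Record::getRecordType here stands in for however Record actually exposes its type:

// assumes com.google.common.collect.{Multimaps, ImmutableListMultimap}
ImmutableListMultimap<RecordType, Record> groupedRecords =
    Multimaps.index(records, Record::getRecordType);

// unlike a plain Map, get() never returns null; an absent key yields an empty list
List<Record> forType = groupedRecords.get(recType);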

Collaborator:
Switched to a multimap. I'm all for it, as long as it's readable to our engineers, who may or may not have experience with Guava. I think it's a good call, though, and worth the exposure.

@jladieu (Contributor) left a comment:
Looking good! I had a couple of comments/questions throughout.

The suggestions about using Guava are completely optional and I'm happy to punt on them, but it might be an opportunity to try some things out.

I am most interested in better understanding opType and the interactions it has with the code that has to know about it.

}

public void increaseCount(RecordType recordType, int count) {
    Preconditions.checkArgument(count >= 0, "Count cannot be negative");
Contributor:
Would we want to throw or log an error here? Having the count ever be negative is maybe worth looking into.

Collaborator:
This will throw an IllegalArgumentException if it is negative!
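
Concretely, Guava's Preconditions.checkArgument throws IllegalArgumentException with the supplied message whenever its condition is false, so the method above already fails fast:

// from com.google.common.base.Preconditions
int count = -1;
Preconditions.checkArgument(count >= 0, "Count cannot be negative");
// throws IllegalArgumentException: Count cannot be negative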


GenericData.Record objectAttributes = new GenericData.Record(myObjSchema);
objectAttributes.put("marco", "polo");
objectAttributes.put("pi", 3.14159);
Contributor:
Thinking of this bug (https://broadworkbench.atlassian.net/browse/AJ-1292), should we test with more decimal places? Maybe this is something to add in the future, but since it came to mind I wanted to call it out.

Collaborator:
I think we should defer that to AJ-1452. In this PR I'm not trying to handle datatypes at all, but when we tackle the datatype work we should add tests as necessary.

@yuliadub (Contributor) left a comment:
@davidangb Thank you for all the work on this between you and Bria; this is starting to make more sense in my brain!

@@ -105,7 +105,7 @@ private BatchWriteResult consumeWriteStream(
   // time we've seen this type, calculate a schema from its records and update the record type
   // as necessary. Then, write the records into the table.
   // TODO AJ-1452: for PFB imports, get schema from Avro, not from attribute values inference
-  for (RecordType recType : groupedRecords.keys()) {
+  for (RecordType recType : groupedRecords.keySet()) {
Contributor:
Just linking the thread about your odyssey to get to this change, which is just brilliant!
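
For anyone following along, the distinction behind the one-line fix above: a Guava Multimap's keys() returns a Multiset containing each key once per mapped value, while keySet() returns each distinct key exactly once. A small illustration with arbitrary values:

// assumes com.google.common.collect.ArrayListMultimap
ArrayListMultimap<String, Integer> byType = ArrayListMultimap.create();
byType.put("sample", 1);
byType.put("sample", 2);
byType.put("subject", 3);

byType.keys();    // Multiset [sample x 2, subject]: "sample" appears once per entry
byType.keySet();  // Set [sample, subject]: each key exactly once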

sonarcloud bot commented Nov 14, 2023:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 4 (A)

Coverage: 88.5%
Duplication: 0.0%

@davidangb merged commit 39538ba into main on Nov 14, 2023; all 14 checks passed. @davidangb deleted the aj-1227-parse-tables branch on November 14, 2023 at 14:33.