
AJ-1227: parse incoming PFBs into tables #387

Merged: 57 commits into main from aj-1227-parse-tables, Nov 14, 2023

Conversation

@calypsomatic (Contributor) commented Oct 27, 2023:

WIP

Reminder:

PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to Kubernetes. That step must be triggered manually by running the following GitHub Action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Don't forget to provide a Jira ID when triggering the manual action; if no Jira ID is provided, the action will not fully succeed.

After you manually trigger the GitHub Action (and it completes with no errors), you must go to the terra-helmfile repo and verify that it generated a PR that merged successfully.

The terra-helmfile PR merge will then generate a PR in Leonardo. That PR will automerge if all tests pass, but if Jenkins tests fail it will not, so watch it to ensure it merges. To trigger a Jenkins retest, simply comment "jenkins retest" on the PR.

for (Map.Entry<RecordType, List<Record>> recList : sortedRecords.entrySet()) {
    RecordType recType = recList.getKey();
    List<Record> rList = recList.getValue();
    schema = inferer.inferTypes(records);
Contributor (Author):
With a single record type, we only check the record type for the first batch; afterwards, I suppose we fail if later records don't match the initial schema. Since a new record type could show up in any batch, I infer the type every time. This could be simplified by checking against the result map and not re-inferring, or potentially by using java-pfb to figure out the schema ahead of time. Opinions?

Collaborator:
I added some code around this to only change the schema the first time a given record type is seen. I'm also pushing a lot of this work off to https://broadworkbench.atlassian.net/browse/AJ-1452 (I still need to add details to the Jira ticket).
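
As a sketch of that approach (not the exact code merged here), a first-seen check lets inference run once per record type. It reuses sortedRecords, inferer, and schema from the snippet above; typesSeen is an illustrative addition:

// assumes java.util.{Set, HashSet, List, Map}
Set<RecordType> typesSeen = new HashSet<>();
for (Map.Entry<RecordType, List<Record>> entry : sortedRecords.entrySet()) {
    RecordType recType = entry.getKey();
    List<Record> recordsForType = entry.getValue();
    if (typesSeen.add(recType)) {
        // first time this type appears in the stream: infer and apply its schema
        schema = inferer.inferTypes(recordsForType);
    }
    // ... write recordsForType against the schema for recType ...
}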

RecordType.valueOf(genRec.get("name").toString()),
RecordAttributes.empty());
GenericRecord objectAttributes = (GenericRecord) genRec.get("object"); // contains attributes
Schema schema = objectAttributes.getSchema();
Collaborator:
Is the Schema dynamic per GenericRecord? I'm wondering if there's any way to get the schema once per record type instead of once per record, to potentially optimize the flow a bit.
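
For reference, Avro attaches a Schema to every GenericRecord via getSchema(), but records of the same type share one schema, so it could be fetched once per type and cached. A minimal sketch, where schemaCache is an illustrative addition to the snippet above:

// assumes java.util.{Map, HashMap} and org.apache.avro.Schema
Map<RecordType, Schema> schemaCache = new HashMap<>();

RecordType recType = RecordType.valueOf(genRec.get("name").toString());
GenericRecord objectAttributes = (GenericRecord) genRec.get("object");
// getSchema() comes from Avro's GenericContainer; reuse the cached schema
// once this record type has been seen
Schema schema = schemaCache.computeIfAbsent(recType, rt -> objectAttributes.getSchema());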


String fieldName = field.name();
// get(fieldName) returns null for missing fields, so no null check is needed
Object value = objectAttributes.get(fieldName);
attributes.putAttribute(fieldName, value);
Collaborator:
This is where we'd need to convert the native PFB types into the types that Record expects, such as datetimes and BigDecimals, right?

Contributor (Author):
It's figured out during BatchWriteService.consumeWriteStream, which calls inferTypes. But maybe if we did something here we could make that more reliable, or avoid calling inferTypes for PFBs altogether.

Collaborator:
I think it may end up being a two-step operation. Here, we translate from the Avro Java types (e.g. Long) to the expected WDS Java types (e.g. BigDecimal); then, in inferTypes, we notice that it's a BigDecimal and therefore categorize it as DataTypeMapping.NUMBER. At least that's my proposal.
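
A minimal sketch of what step one might look like; convertAttributeType is a hypothetical helper (not code from this PR), and the exact set of Avro types worth handling is an assumption:

// assumes java.math.BigDecimal
private Object convertAttributeType(Object value) {
    if (value == null) {
        return null;
    }
    // Avro numeric types become BigDecimal, which inferTypes would
    // then categorize as DataTypeMapping.NUMBER
    if (value instanceof Long || value instanceof Integer
            || value instanceof Float || value instanceof Double) {
        return new BigDecimal(value.toString());
    }
    // Avro strings often arrive as org.apache.avro.util.Utf8
    if (value instanceof CharSequence) {
        return value.toString();
    }
    return value;
}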


// type, so this will result in a grouping of 1.
// TSV and JSON inputs are validated against the recordType argument. PFB inputs pass
// a null recordType argument so there is nothing to validate.
Map<RecordType, List<Record>> sortedRecords =
Contributor:
How do you feel about Guava data structures? This (and several other complex data structures in this PR) looks like it might lend itself well to a higher-level representation. In this case, I wonder if Multimap (ref: https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html) might simplify things.
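
As a sketch of the suggestion: Guava's Multimaps.index does the grouping in one call. Record::getRecordType here stands in for however Record actually exposes its type:

// assumes com.google.common.collect.{Multimaps, ImmutableListMultimap}
ImmutableListMultimap<RecordType, Record> groupedRecords =
    Multimaps.index(records, Record::getRecordType);

// unlike a plain Map, get() never returns null; an absent key yields an empty list
List<Record> forType = groupedRecords.get(recType);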

Collaborator:
Switched to a multimap. I'm all for it, as long as it's readable to our engineers, who may or may not have experience with Guava. I think it's a good call, though, and worth the exposure.

@jladieu (Contributor) left a comment:
Looking good! I had a couple of comments/questions throughout.

The suggestions about using Guava are completely optional and I'm happy to punt on them, but it might be an opportunity to try some things out.

I am most interested in better understanding opType and the interactions it has with the code that has to know about it.

}

public void increaseCount(RecordType recordType, int count) {
    Preconditions.checkArgument(count >= 0, "Count cannot be negative");
Contributor:
Would we want to throw or log an error here? Having the count ever be negative is maybe worth looking into.

Collaborator:
This will throw an IllegalArgumentException if it is negative!
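
Concretely, Guava's Preconditions.checkArgument throws IllegalArgumentException with the supplied message whenever its condition is false, so the method above already fails fast:

// from com.google.common.base.Preconditions
int count = -1;
Preconditions.checkArgument(count >= 0, "Count cannot be negative");
// throws IllegalArgumentException: Count cannot be negative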


GenericData.Record objectAttributes = new GenericData.Record(myObjSchema);
objectAttributes.put("marco", "polo");
objectAttributes.put("pi", 3.14159);
Contributor:
Thinking of this bug (https://broadworkbench.atlassian.net/browse/AJ-1292), should we test with more decimal places? Maybe this is something to add in the future, but since it came to mind I wanted to call it out.

Collaborator:
I think we should defer that to AJ-1452. In this PR I'm not trying to handle datatypes at all, but when we tackle the datatype work we should add tests as necessary.

@yuliadub (Contributor) left a comment:
@davidangb Thank you for all the work on this between you and Bria; this is starting to make more sense in my brain!

@@ -105,7 +105,7 @@ private BatchWriteResult consumeWriteStream(
   // time we've seen this type, calculate a schema from its records and update the record type
   // as necessary. Then, write the records into the table.
   // TODO AJ-1452: for PFB imports, get schema from Avro, not from attribute values inference
-  for (RecordType recType : groupedRecords.keys()) {
+  for (RecordType recType : groupedRecords.keySet()) {
Contributor:
Just linking the thread about your odyssey to get to this change, which is just brilliant!
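
For anyone following along, the distinction behind the one-line fix above: a Guava Multimap's keys() returns a Multiset containing each key once per mapped value, while keySet() returns each distinct key exactly once. A small illustration with arbitrary values:

// assumes com.google.common.collect.ArrayListMultimap
ArrayListMultimap<String, Integer> byType = ArrayListMultimap.create();
byType.put("sample", 1);
byType.put("sample", 2);
byType.put("subject", 3);

byType.keys();    // Multiset [sample x 2, subject]: "sample" appears once per entry
byType.keySet();  // Set [sample, subject]: each key exactly once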

sonarcloud bot commented Nov 14, 2023:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 4 (A)

Coverage: 88.5%
Duplication: 0.0%

@davidangb merged commit 39538ba into main on Nov 14, 2023; all 14 checks passed. @davidangb deleted the aj-1227-parse-tables branch on November 14, 2023 at 14:33.