AJ-1227: parse incoming PFBs into tables #387
Conversation
```java
for (Map.Entry<RecordType, List<Record>> recList : sortedRecords.entrySet()) {
  RecordType recType = recList.getKey();
  List<Record> rList = recList.getValue();
  schema = inferer.inferTypes(records);
```
With a single record type, we only check the record type for the first batch; afterwards, I suppose we fail if later records don't match the initial schema. Since a new record type could show up in any batch, I infer the type every time. This could be simplified by checking against the result map and not re-inferring, or potentially by using java-pfb to figure out the schema ahead of time. Opinions?
I added some code around this to only change the schema the first time a given record type is seen. I'm also pushing a lot of this work off to https://broadworkbench.atlassian.net/browse/AJ-1452 (I still need to add details to the Jira ticket).
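For readers following along, here is a minimal sketch of what that "infer once per record type" guard could look like. The helper names (`createOrModifySchema`, `writeBatch`) and the surrounding structure are illustrative assumptions, not the actual WDS implementation:

```java
// Illustrative sketch only: update the schema just once per record type,
// then write every batch. Helper names here are hypothetical.
Set<RecordType> typesSeen = new HashSet<>();
for (Map.Entry<RecordType, List<Record>> entry : sortedRecords.entrySet()) {
  RecordType recType = entry.getKey();
  List<Record> rList = entry.getValue();
  if (typesSeen.add(recType)) {
    // first batch for this type: infer a schema and create/alter the table
    Map<String, DataTypeMapping> schema = inferer.inferTypes(rList);
    createOrModifySchema(recType, schema);
  }
  // subsequent batches of the same type reuse the schema established above
  writeBatch(recType, rList);
}
```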
```java
        RecordType.valueOf(genRec.get("name").toString()),
        RecordAttributes.empty());
GenericRecord objectAttributes = (GenericRecord) genRec.get("object"); // contains attributes
Schema schema = objectAttributes.getSchema();
```
Is the `Schema` dynamic per `GenericRecord`? I'm wondering if there's any way to get the schema once per record type instead of once per record, to potentially optimize the flow a bit.
Pushing this off to https://broadworkbench.atlassian.net/browse/AJ-1452.
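If caching does end up in AJ-1452, one hedged possibility is to memoize the Avro `Schema` per record type rather than calling `getSchema()` on every `GenericRecord`. The cache below is an assumption for illustration, not existing WDS code:

```java
// Hypothetical cache, keyed by record type, so getSchema() runs once per type
private final Map<RecordType, Schema> schemaCache = new HashMap<>();

Schema schema =
    schemaCache.computeIfAbsent(
        recordType, t -> ((GenericRecord) genRec.get("object")).getSchema());
```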
```java
String fieldName = field.name();
Object value =
    objectAttributes.get(fieldName) == null ? null : objectAttributes.get(fieldName);
attributes.putAttribute(fieldName, value);
```
This is where we'd need to convert the native PFB types into the types that `Record` expects, such as datetimes and `BigDecimal`s, right?
It's figured out during `BatchWriteService.consumeWriteStream`, which calls `inferTypes`, but maybe if we did something here we could make that more reliable - or avoid calling `inferTypes` for PFBs altogether.
I think it may end up being a two-step operation. Here, we translate from the Avro Java types (e.g. `Long`) to the expected WDS Java types (e.g. `BigDecimal`), and then in `inferTypes` we notice that it's a `BigDecimal` and therefore categorize it as `DataTypeMapping.NUMBER`. At least that's my proposal.
Pushing this off to https://broadworkbench.atlassian.net/browse/AJ-1452.
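As a sketch of the first step in that proposal, the Avro-to-WDS translation might look something like the following. `convertAttributeType` is a hypothetical helper and the set of handled types is illustrative, not a definitive implementation:

```java
import java.math.BigDecimal;
import org.apache.avro.util.Utf8;

// Hypothetical helper: translate Avro Java types into the Java types WDS expects,
// so inferTypes can categorize them (e.g. BigDecimal -> DataTypeMapping.NUMBER)
private Object convertAttributeType(Object attribute) {
  if (attribute == null) {
    return null;
  }
  if (attribute instanceof Long longAttr) {
    // Avro longs arrive as java.lang.Long; WDS models numbers as BigDecimal
    return BigDecimal.valueOf(longAttr);
  }
  if (attribute instanceof Utf8 utf8Attr) {
    // Avro strings are often org.apache.avro.util.Utf8, not java.lang.String
    return utf8Attr.toString();
  }
  return attribute; // pass everything else through unchanged for now
}
```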
```java
// type, so this will result in a grouping of 1.
// TSV and JSON inputs are validated against the recordType argument. PFB inputs pass
// a null recordType argument so there is nothing to validate.
Map<RecordType, List<Record>> sortedRecords =
```
How do you feel about Guava data structures? This (and several other complex data structures in this PR) looks like it might lend itself well to being represented as a higher-level data structure. In this case, I wonder if `Multimap` (ref: https://guava.dev/releases/19.0/api/docs/com/google/common/collect/Multimap.html) might simplify things.
Switched to a multimap. I'm all for it, as long as it's readable to our engineers, who may or may not have experience with Guava. I think it's a good call, though, and worth the exposure.
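For anyone unfamiliar with Guava here, a minimal sketch of the grouping, assuming records arrive as a `List<Record>` with a `getRecordType()` accessor:

```java
import com.google.common.collect.ImmutableListMultimap;
import com.google.common.collect.Multimaps;

// group the incoming records by their record type in one call,
// replacing the hand-rolled Map<RecordType, List<Record>> bookkeeping
ImmutableListMultimap<RecordType, Record> groupedRecords =
    Multimaps.index(records, Record::getRecordType);

for (RecordType recType : groupedRecords.keySet()) {
  List<Record> rList = groupedRecords.get(recType); // all records of this type
  // ... infer schema / write the batch for recType
}
```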
Looking good! I had a couple of comments/questions throughout.
The suggestions about using Guava are completely optional and I'm happy to punt on them, but this might be an opportunity to try some things out.
I am most interested in better understanding `opType` and the interactions it has with the code that has to know about it.
```java
}

public void increaseCount(RecordType recordType, int count) {
  Preconditions.checkArgument(count >= 0, "Count cannot be negative");
```
Would we want to throw or log an error here? Having this ever be negative is probably worth looking into.
This will throw an `IllegalArgumentException` if it is negative!
```java
GenericData.Record objectAttributes = new GenericData.Record(myObjSchema);
objectAttributes.put("marco", "polo");
objectAttributes.put("pi", 3.14159);
```
Thinking of this bug (https://broadworkbench.atlassian.net/browse/AJ-1292): should we test with more decimal places? Maybe this is something to add in the future, but since it came to mind I wanted to call it out.
I think we should defer that to AJ-1452 - in this PR, I'm not trying to handle datatypes at all, but when we tackle datatype stuff we should add tests as necessary.
@davidangb Thank you for all the work on this between you and Bria, this is starting to make more sense in my brain!
```diff
@@ -105,7 +105,7 @@ private BatchWriteResult consumeWriteStream(
   // time we've seen this type, calculate a schema from its records and update the record type
   // as necessary. Then, write the records into the table.
   // TODO AJ-1452: for PFB imports, get schema from Avro, not from attribute values inference
-  for (RecordType recType : groupedRecords.keys()) {
+  for (RecordType recType : groupedRecords.keySet()) {
```
Just linking the thread about your odyssey to get to this change, which is just brilliant!
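For context on why the change matters: on a Guava `Multimap`, `keys()` returns a `Multiset` containing each key once per mapped value, while `keySet()` returns distinct keys, so iterating `keys()` would have processed each record type once per record:

```java
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

Multimap<String, Integer> m = ArrayListMultimap.create();
m.put("samples", 1);
m.put("samples", 2);

System.out.println(m.keys());   // [samples x 2] -- one occurrence per entry
System.out.println(m.keySet()); // [samples]     -- distinct keys only
```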
Kudos, SonarCloud Quality Gate passed!
WIP
Reminder:
PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to Kubernetes - this action needs to be triggered manually by running the following GitHub Action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Don't forget to provide a Jira ID when triggering the manual action; if no Jira ID is provided, the action will not fully succeed.
After you manually trigger the GitHub Action (and it completes with no errors), you must go to the terra-helmfile repo and verify that it generated a PR that merged successfully.
The terra-helmfile PR merge will then generate a PR in Leonardo. This will automerge if all tests pass, but if Jenkins tests fail it will not; be sure to watch it to ensure it merges. To trigger a Jenkins retest, simply comment on the PR with "jenkins retest".