Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarify missingness of point estimates; add examples for determining output_type_id format #197

Merged
merged 21 commits into from
Oct 31, 2024

Conversation

zkamvar
Copy link
Member

@zkamvar zkamvar commented Oct 21, 2024

This modifies the description of output_type_id for point estimates to clearly state that the entry should be a missing value, either NA in R or None in Python.

This also states that this applies to schemas prior to v4.0.0 (see hubverse-org/schemas#109) and adds the python one-liner as suggested by @MKupperman in #198.

The preview for this PR is at https://hubdocs--197.org.readthedocs.build/en/197/user-guide/model-output.html

@zkamvar
Copy link
Member Author

zkamvar commented Oct 21, 2024

I would like to get another modeler perspective on this if it sounds clear: @trobacker

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Copy link
Contributor

@micokoch micokoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anna made the technical comments, but from a readability angle, it looks good to me. It even made me realize I should uncapitalize a "parquet" which I had capitalized previously.

@zkamvar
Copy link
Member Author

zkamvar commented Oct 22, 2024

I've updated the text according to Anna's suggestions and what I found in #198

@elray1
Copy link
Contributor

elray1 commented Oct 22, 2024

I have not read all of this carefully. But I noticed Anna's comment above, "I believe also exists in python via pd.NA, is the closest thing to R's NA and I think that's actually what @MKupperman used in his successful exploration (but I might be wrong)."

Just noting that missing values in polars are best/most often/most generally represented using null (which is in practice set by specifying None). Again, haven't read this but just wanted to quickly mention it in case it's relevant to anything here.

@annakrystalli
Copy link
Member

It might indeed be relevant @elray1.

What we are interested here is what missing values will serialise in an interoperable way that will be correctly interepreted as a missing value by our software when read in. We will definitely have NAs in our files as we will always have R users writing to csv files so either way polars will need to deal with that. I believe it can though as you would expect the R version of the polars package to understand NAs. There are also arguments when reading in files that allow us to specify null values on read (in both python and R).

Having said all that, it's of course best not to make any assumptions and actually test to ensure our suggestions don't cause any problems to desired downstream functionality. So @elray1, would you be able to add what a successful test of polars functionality requirements would look like here 🙏? #198

@zkamvar
Copy link
Member Author

zkamvar commented Oct 23, 2024

This PR is intended to clarify that "NA" in the schema actually means missing data and not the character "NA". I've tested these assumptions in #198 (comment) and I think it satisfies this comment:

Having said all that, it's of course best not to make any assumptions and actually test to ensure our suggestions don't cause any problems to desired downstream functionality.

Copy link
Contributor

@micokoch micokoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very good, as far as I can tell. I made some small capitalization suggestions to be consistent with the rest of hubDocs.

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
@zkamvar
Copy link
Member Author

zkamvar commented Oct 25, 2024

@annakrystalli, did I address your concerns?

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Copy link
Member

@annakrystalli annakrystalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there from my perspective with some really useful and detailed additions!

Made a few corrections and a suggestion for moving a section and cleaning up some possible misconceptions.

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Copy link
Contributor

@bsweger bsweger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the missing data question. I reviewed the text and tried running all of the code, so made some notes about those things.

As for the requested "Python perspective"... after reading these changes, I'm honestly not sure what the desired outputs are. If I'm understanding the write-up correct, there's a distinction between "NA" for something not required and NA/None for something that is optional and missing?

I might need to hop on a zoom to better understand what we're going for

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Co-authored-by: Becky Sweger <[email protected]>
Co-authored-by: Anna Krystalli <[email protected]>
Copy link
Contributor

@bsweger bsweger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to make sure I'm understanding the problem space correctly! Based on that, I made a few more wording suggestions for your consideration (feel free to disregard)

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Copy link
Contributor

@LucieContamin LucieContamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update, it looks all clear to me, I just made some minor comments.

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
@zkamvar
Copy link
Member Author

zkamvar commented Oct 30, 2024

Thank you @annakrystalli, @bsweger, @LucieContamin, and @micokoch for your comments!

As per @annakrystalli and @LucieContamin's suggestions, I have moved the documentation about the template generation and arrow schema closer to the examples. Specifically, I added the section about the arrow schema inside the example so that users do not have to look far to know what the arrow schema would look like.

As per @bsweger's suggestions (and #198 (comment)), I have attempted to focus the discussion for the modelers on the column type.

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Show resolved Hide resolved
Copy link
Member

@annakrystalli annakrystalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a couple of minor last minute corrections and added a suggested note about another useful function, hubData::coerce_to_hub_schema().

Other than that, all my comments have been resolved. Thanks @zkamvar !

@annakrystalli
Copy link
Member

FYI as a response to functions being relevant to modelers appearing across too many packages, I've made a proposal to re-export them to hubValidations here: hubverse-org/hubValidations#149

Input welcome

Copy link
Contributor

@LucieContamin LucieContamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, it looks all good to me. I just made 2 small comment, feel free to ignore

docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
docs/source/user-guide/model-output.md Outdated Show resolved Hide resolved
Co-authored-by: Anna Krystalli <[email protected]>
Co-authored-by: Lucie Contamin <[email protected]>
It made more sense for this function to be what you would use before
writing out the schema to parquet in R. It doesn't really do much for
Python folks because they aren't likely to go through R to submit their
models.
@zkamvar
Copy link
Member Author

zkamvar commented Oct 31, 2024

Thank you all for sticking with me through this!

@zkamvar zkamvar merged commit 8c1fa4e into main Oct 31, 2024
1 check passed
@zkamvar zkamvar deleted the znk/clarify-missingness branch October 31, 2024 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants