LANL 10/23 predictions #116

Merged
merged 9 commits into from
Oct 22, 2024

Conversation

MKupperman
Contributor

LANL Theoretical Biology and Biophysics group predictions using CovTransformer (a transformer model for covid variant predictions) for 10/16 are included in this merge request.

We encountered an issue with "NA" not being correctly recognized as NA when performing parquet validation. Please email me if this is an issue on the back-end.

We have set our NA/NaN/Null predictions to 0, to comply with the normalization requirement that frequencies must sum to 1.
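A minimal sketch of that clipping step, assuming the predictions live in a pandas DataFrame and that frequencies are expected to sum to 1 within each location and target date. The rows, clade labels, and grouping keys below are illustrative, not taken from the submitted file.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the clades and values are made up for this sketch.
df = pd.DataFrame({
    "location": ["AL", "AL", "AL"],
    "target_date": ["2024-09-22"] * 3,
    "clade": ["24A", "24B", "24C"],
    "value": [0.6, np.nan, 0.4],
})

# Replace missing predictions with 0 so every clade has a numeric frequency.
df["value"] = df["value"].fillna(0.0)

# Assumption: frequencies must sum to 1 within each location and target date.
totals = df.groupby(["location", "target_date"])["value"].sum()
all_normalized = bool(np.allclose(totals.to_numpy(), 1.0))
print(all_normalized)
```

Filling with 0 only preserves normalization when the non-missing values already sum to 1, so the check after the fill is worth keeping.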

The views expressed here are those of the authors and do not necessarily represent the official views of the National Institutes of Health, Los Alamos National Laboratory, or the US Government.

@nickreich
Member

nickreich commented Oct 17, 2024

Hi @MKupperman, thanks for the submission! We will need to investigate this validation error, as it may be a bug on our end. A few of us have been traveling this week, hence our delay in looking at your submission.

I did some diagnostics on your file, and here is what I see:

  • your file contains only mean output type (this should pass validation)
  • all of the values in the "output_type_id" column are "NA" (this should also pass validation, as this is the correct thing to do if the output type is mean)

When I read the file into R, it is seeing the output_type_id column as character and is seeing "NA" as the string "NA" rather than the special NA value. I think this is causing the validations to throw an error erroneously. We will check into this and try to fix it ASAP.

Tagging @zkamvar and @annakrystalli as our validation gurus for support here.

> tmp <- read_parquet("~/Downloads/2024-10-16-LANL-CovTransformer.parquet")
> View(tmp)
> sum(is.na(tmp$output_type_id))
[1] 0
> sum(tmp$output_type_id=="NA")
[1] 13104
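For anyone producing the file from Python, roughly the same diagnostic can be sketched with pandas. The helper name `count_na_encodings` and the example column are hypothetical; the idea is only to separate true missing values from the literal string "NA".

```python
import pandas as pd


def count_na_encodings(column: pd.Series) -> dict:
    """Count true missing values versus literal "NA" strings in a column."""
    return {
        "missing": int(column.isna().sum()),
        "literal_NA": int((column == "NA").sum()),
    }


# Hypothetical column mixing both encodings.
example = pd.Series(["NA", "NA", None], dtype="object")
counts = count_na_encodings(example)
print(counts)
```

On the real file this would be applied to the `output_type_id` column after reading it with `pd.read_parquet`.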

@annakrystalli

Hello @MKupperman!

I had a look and it is in fact a valid error. It seems your NAs have been encoded as "NA" character strings and not as NA values (i.e. missing values).

If you convert your output_type_id values to NAs the check passes. See below:

repo_url <- "https://github.com/MKupperman/variant-nowcast-hub.git"
file_path <- "LANL-CovTransformer/2024-10-16-LANL-CovTransformer.parquet"

hub_path <- withr::local_tempdir()
# Clone the repository into the current temporary directory
gert::git_clone(url = repo_url, path = hub_path)

# Read File
tbl_chr <- hubValidations::read_model_out_file(file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
tbl_chr
#> # A tibble: 13,104 × 7
#>    location clade value target_date output_type output_type_id nowcast_date
#>    <chr>    <chr> <chr> <chr>       <chr>       <chr>          <chr>       
#>  1 AL       24A   0     2024-09-15  mean        NA             2024-10-16  
#>  2 AL       24A   0     2024-09-16  mean        NA             2024-10-16  
#>  3 AL       24A   0     2024-09-17  mean        NA             2024-10-16  
#>  4 AL       24A   0     2024-09-18  mean        NA             2024-10-16  
#>  5 AL       24A   0     2024-09-19  mean        NA             2024-10-16  
#>  6 AL       24A   0     2024-09-20  mean        NA             2024-10-16  
#>  7 AL       24A   0     2024-09-21  mean        NA             2024-10-16  
#>  8 AL       24A   0     2024-09-22  mean        NA             2024-10-16  
#>  9 AL       24A   0     2024-09-23  mean        NA             2024-10-16  
#> 10 AL       24A   0     2024-09-24  mean        NA             2024-10-16  
#> # ℹ 13,094 more rows

hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <error/check_error>
#> Error:
#> ! `tbl` contains invalid values/value combinations.  Column
#>   `output_type_id` contains invalid value "NA".

# Values in output_type_id column are not `NA`s
all(is.na(tbl_chr$output_type_id))
#> [1] FALSE
# Values in output_type_id column are actually characters
all(tbl_chr$output_type_id == "NA")
#> [1] TRUE

# Convert to NA values
tbl_chr$output_type_id <- NA_character_
tbl_chr
#> # A tibble: 13,104 × 7
#>    location clade value target_date output_type output_type_id nowcast_date
#>    <chr>    <chr> <chr> <chr>       <chr>       <chr>          <chr>       
#>  1 AL       24A   0     2024-09-15  mean        <NA>           2024-10-16  
#>  2 AL       24A   0     2024-09-16  mean        <NA>           2024-10-16  
#>  3 AL       24A   0     2024-09-17  mean        <NA>           2024-10-16  
#>  4 AL       24A   0     2024-09-18  mean        <NA>           2024-10-16  
#>  5 AL       24A   0     2024-09-19  mean        <NA>           2024-10-16  
#>  6 AL       24A   0     2024-09-20  mean        <NA>           2024-10-16  
#>  7 AL       24A   0     2024-09-21  mean        <NA>           2024-10-16  
#>  8 AL       24A   0     2024-09-22  mean        <NA>           2024-10-16  
#>  9 AL       24A   0     2024-09-23  mean        <NA>           2024-10-16  
#> 10 AL       24A   0     2024-09-24  mean        <NA>           2024-10-16  
#> # ℹ 13,094 more rows

hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.

# Write the file back, re-read and check again
arrow::write_parquet(tbl_chr, fs::path(hub_path, "model-output", file_path))

tbl_chr <- hubValidations::read_model_out_file(file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.

Created on 2024-10-18 with reprex v2.1.0

@zkamvar
Member

zkamvar commented Oct 18, 2024

@MKupperman, I have a couple of diagnostic questions for you:

  1. What did you use to create the model output (R, Python, Excel, Julia, small-batch hand crafted artisan spreadsheets, etc)?
  2. How did you create the parquet file? Did you write directly from the table in memory to parquet format or did you write to CSV and then convert that into a parquet file?
  3. Would you be able to share the code you use to create the output?

Thank you for the report!

@MKupperman
Contributor Author

MKupperman commented Oct 21, 2024

Hi all, thanks for looking into this!

@zkamvar - here's some information.

  1. Model outputs are assembled in Python using Pandas.
  2. Specifically, we form a list of dictionaries (one dictionary per row, setting "NA" for the output_type_id), and invoke pandas.DataFrame.from_dict. We then serialize the dataframe to a parquet file using the Pandas method, to_parquet, with the pyarrow engine.
  3. I can't share the full file, but here's an MWE for the process we used.

import pandas as pd

# Define the data as a list of dictionaries
data = [
    {"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-22", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"},
    {"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-23", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"}
]

# Convert the list of dictionaries to a pandas DataFrame
df_mwe = pd.DataFrame(data)

# Save the DataFrame
df_mwe.to_parquet("file.parquet")

If we coerce to pd.NA (or None), the dtype that the validation tool receives is vctrs_unspecified rather than chr. Similarly, encoding with np.nan gives a "double" (numeric) data type instead of chr. This seems like an issue to fix on the back-end.

For now, we'll add some R to the pipeline and do the conversion there, as @annakrystalli suggested.

@zkamvar
Member

zkamvar commented Oct 21, 2024

Thank you for the MWE, @MKupperman! I've opened hubverse-org/hubDocs#198 to track the issue.

@MKupperman
Contributor Author

Adding 10/23 submissions to the PR. Should pass integration tests now.

@MKupperman MKupperman changed the title LANL 10/16 predictions LANL 10/23 predictions Oct 21, 2024
@IsaacMacarthur IsaacMacarthur requested review from IsaacMacarthur and removed request for IsaacMacarthur October 22, 2024 00:39
@IsaacMacarthur IsaacMacarthur merged commit 07ff0e9 into reichlab:main Oct 22, 2024
1 check passed
5 participants