LANL 10/23 predictions #116

MKupperman · 2024-10-16T21:52:01Z

LANL Theoretical Biology and Biophysics group predictions using CovTransformer (a transformer model for covid variant predictions) for 10/16 are included in this merge request.

We encountered an issue with "NA" not being correctly recognized as NA when performing parquet validation. Please email me if this is an issue on the back-end.

We have set our NA/NaN/null predictions to 0 to comply with the normalization requirement that frequencies must sum to 1.
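As a rough illustration of the zero-filling step described above (not the actual CovTransformer pipeline code), a minimal sketch might look like the following. The helper name `fill_missing_with_zero` and the clade frequencies are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def fill_missing_with_zero(freqs):
    """Replace None/NaN entries with 0.0, leaving other values unchanged."""
    return {
        clade: 0.0 if value is None or math.isnan(value) else float(value)
        for clade, value in freqs.items()
    }


# Hypothetical clade frequencies for a single location and target date.
raw = {"24A": 0.25, "24B": 0.75, "24C": float("nan"), "24E": None}

filled = fill_missing_with_zero(raw)
total = sum(filled.values())

# Check the normalization requirement within a small tolerance.
sums_to_one = math.isclose(total, 1.0, abs_tol=1e-9)
print(filled, total, sums_to_one)
```

Note that filling with zero does not renormalize anything: the sum only comes out to 1 if the non-missing frequencies already sum to 1, which is why the sketch checks the total afterwards.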

The views expressed here are those of the authors and do not necessarily represent the official views of the National Institutes of Health, Los Alamos National Laboratory, or the US Government.

nickreich · 2024-10-17T22:25:20Z

Hi @MKupperman , thanks for the submission! We will need to investigate this validation error, as it may be a bug on our end. A few of us have been traveling this week, hence our delay in getting to look at your submission.

I did some diagnostics on your file, and here is what I see:

your file contains only mean output type (this should pass validation)
all of the values in the "output_type_id" column are "NA" (this should also pass validation, as this is the correct thing to do if the output type is mean)

When I read the file into R, it is seeing the output_type_id column as character and is seeing "NA" as the string "NA" rather than the special NA value. I think this is causing the validations to throw an error erroneously. We will check into this and try to fix it ASAP.

Tagging @zkamvar and @annakrystalli as our validation gurus for support here.

> tmp <- read_parquet("~/Downloads/2024-10-16-LANL-CovTransformer.parquet")
> View(tmp)
> sum(is.na(tmp$output_type_id))
[1] 0
> sum(tmp$output_type_id=="NA")
[1] 13104
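The same distinction can be checked on the Python side before a file is written. This is a minimal sketch using toy columns rather than the submitted file; the helper name `na_report` is hypothetical.

```python
import pandas as pd

# Toy column mirroring the submission: the literal string "NA" in every row.
as_strings = pd.Series(["NA", "NA", "NA"], name="output_type_id")

# The same column with genuinely missing values, typed as strings.
as_missing = pd.Series([None, None, None], dtype="string", name="output_type_id")


def na_report(column):
    """Count true missing values and literal "NA" strings in a column."""
    n_missing = int(column.isna().sum())
    # Comparing a missing value to "NA" yields a missing result, so treat it as False.
    n_literal = int((column == "NA").fillna(False).sum())
    return {"missing": n_missing, "literal_NA": n_literal}


report_strings = na_report(as_strings)
report_missing = na_report(as_missing)
print(report_strings)
print(report_missing)
```

A column of literal "NA" strings reports no missing values at all, which matches what the R session above shows for the submitted file.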

annakrystalli · 2024-10-18T08:51:24Z

Hello @MKupperman !

I had a look and it is in fact a valid error. It seems your NAs have been encoded as "NA" character strings and not as NA values (i.e. missing values).

If you convert your output_type_id values to NAs the check passes. See below:

repo_url <- "https://github.com/MKupperman/variant-nowcast-hub.git"
file_path <- "LANL-CovTransformer/2024-10-16-LANL-CovTransformer.parquet"

hub_path <- withr::local_tempdir()
# Clone the repository into the current temporary directory
gert::git_clone(url = repo_url, path = hub_path)

# Read File
tbl_chr <- hubValidations::read_model_out_file(file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
tbl_chr
#> # A tibble: 13,104 × 7
#>    location clade value target_date output_type output_type_id nowcast_date
#>    <chr>    <chr> <chr> <chr>       <chr>       <chr>          <chr>       
#>  1 AL       24A   0     2024-09-15  mean        NA             2024-10-16  
#>  2 AL       24A   0     2024-09-16  mean        NA             2024-10-16  
#>  3 AL       24A   0     2024-09-17  mean        NA             2024-10-16  
#>  4 AL       24A   0     2024-09-18  mean        NA             2024-10-16  
#>  5 AL       24A   0     2024-09-19  mean        NA             2024-10-16  
#>  6 AL       24A   0     2024-09-20  mean        NA             2024-10-16  
#>  7 AL       24A   0     2024-09-21  mean        NA             2024-10-16  
#>  8 AL       24A   0     2024-09-22  mean        NA             2024-10-16  
#>  9 AL       24A   0     2024-09-23  mean        NA             2024-10-16  
#> 10 AL       24A   0     2024-09-24  mean        NA             2024-10-16  
#> # ℹ 13,094 more rows

hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <error/check_error>
#> Error:
#> ! `tbl` contains invalid values/value combinations.  Column
#>   `output_type_id` contains invalid value "NA".

# Values in output_type_id column are not `NA`s
all(is.na(tbl_chr$output_type_id))
#> [1] FALSE
# Values in output_type_id column are actually characters
all(tbl_chr$output_type_id == "NA")
#> [1] TRUE

# Convert to NA values
tbl_chr$output_type_id <- NA_character_
tbl_chr
#> # A tibble: 13,104 × 7
#>    location clade value target_date output_type output_type_id nowcast_date
#>    <chr>    <chr> <chr> <chr>       <chr>       <chr>          <chr>       
#>  1 AL       24A   0     2024-09-15  mean        <NA>           2024-10-16  
#>  2 AL       24A   0     2024-09-16  mean        <NA>           2024-10-16  
#>  3 AL       24A   0     2024-09-17  mean        <NA>           2024-10-16  
#>  4 AL       24A   0     2024-09-18  mean        <NA>           2024-10-16  
#>  5 AL       24A   0     2024-09-19  mean        <NA>           2024-10-16  
#>  6 AL       24A   0     2024-09-20  mean        <NA>           2024-10-16  
#>  7 AL       24A   0     2024-09-21  mean        <NA>           2024-10-16  
#>  8 AL       24A   0     2024-09-22  mean        <NA>           2024-10-16  
#>  9 AL       24A   0     2024-09-23  mean        <NA>           2024-10-16  
#> 10 AL       24A   0     2024-09-24  mean        <NA>           2024-10-16  
#> # ℹ 13,094 more rows

hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.

# Write the file back, re-read and check again
arrow::write_parquet(tbl_chr, fs::path(hub_path, "model-output", file_path))

tbl_chr <- hubValidations::read_model_out_file(file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
hubValidations::check_tbl_values(tbl_chr,
  round_id = "2024-10-16",
  file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.

Created on 2024-10-18 with reprex v2.1.0

zkamvar · 2024-10-18T13:47:07Z

@MKupperman, I have a couple of diagnostic questions for you:

What did you use to create the model output (R, Python, Excel, Julia, small-batch hand crafted artisan spreadsheets, etc)?
How did you create the parquet file? Did you write directly from the table in memory to parquet format or did you write to CSV and then convert that into a parquet file?
Would you be able to share the code you use to create the output?

Thank you for the report!

MKupperman · 2024-10-21T16:16:15Z

Hi all, thanks for looking into this!

@zkamvar - here's some information.

Model outputs are assembled in Python using Pandas.
Specifically, we form a list of dictionaries (one dictionary per row, setting "NA" for the output_type_id), and invoke pandas.DataFrame.from_dict. We then serialize the dataframe to a parquet file using the Pandas method, to_parquet, with the pyarrow engine.
I can't share the full file, but here's a MWE for the process we used.

import pandas as pd

# Define the data as a list of dictionaries
data = [
    {"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-22", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"},
    {"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-23", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"}
]

# Convert the list of dictionaries to a pandas DataFrame
df_mwe = pd.DataFrame(data)

# Save the DataFrame
df_mwe.to_parquet("file.parquet")

If we coerce to pd.NA (or None), the corresponding dtype that the validation tool receives is vctrs_unspecified, rather than chr. Similarly, using np.nan to encode gives a "double" (numeric) data type instead of chr. This seems like an issue to fix on the back end.

For now, we'll add some R to the pipeline and do the conversion there, as @annakrystalli suggested.

zkamvar · 2024-10-21T19:03:50Z

Thank you for the MWE, @MKupperman! I've opened hubverse-org/hubDocs#198 to track the issue.

MKupperman · 2024-10-21T21:02:18Z

Adding 10/23 submissions to the PR. Should pass integration tests now.

MKupperman and others added 4 commits October 15, 2024 16:45

DRAFT: CovTransformer v1.0, 10/16/24 predictions

9617ce6

Merge branch 'reichlab:main' into main

5b07cc4

Make compliant with spec

377c689

clean up old values

05f73ca

nickreich mentioned this pull request Oct 18, 2024

Testing a submission with samples and means present. #117

Closed

zkamvar mentioned this pull request Oct 18, 2024

Add validate_model_tbl() convenience function to validate data frame before writing a file. hubverse-org/hubValidations#130

Open

Merge branch 'reichlab:main' into main

cd43da6

zkamvar mentioned this pull request Oct 22, 2024

Test and document how to produce typed NA/missing values in Python. hubverse-org/hubDocs#198

Open

MKupperman added 3 commits October 21, 2024 14:41

bugfix last submission

f624d9a

LANL Submission for 10/23

4142766

fix typos in metadata

ff9adac

Remove offending date

10e66cc

MKupperman changed the title ~~LANL 10/16 predictions~~ LANL 10/23 predictions Oct 21, 2024

IsaacMacarthur requested review from IsaacMacarthur and removed request for IsaacMacarthur October 22, 2024 00:39

IsaacMacarthur approved these changes Oct 22, 2024

IsaacMacarthur merged commit 07ff0e9 into reichlab:main Oct 22, 2024
1 check passed
