LANL 10/23 predictions #116
Hi @MKupperman, thanks for the submission! We will need to investigate this validation error, as it may be a bug on our end. A few of us have been traveling this week, hence our delay in looking at your submission. I ran some diagnostics on your file, and here is what I see:
When I read the file into R, the output_type_id column comes in as character, and "NA" is read as the literal string "NA" rather than the special missing value NA. Tagging @zkamvar and @annakrystalli as our validation gurus for support here.
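Hello @MKupperman! I had a look and it is in fact a valid error. It seems your output_type_id column contains the character string "NA" rather than true NA values. If you convert those values to NA, the file passes validation:
repo_url <- "https://github.com/MKupperman/variant-nowcast-hub.git"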
file_path <- "LANL-CovTransformer/2024-10-16-LANL-CovTransformer.parquet"
hub_path <- withr::local_tempdir()
# Clone the repository into the current temporary directory
gert::git_clone(url = repo_url, path = hub_path)
# Read File
tbl_chr <- hubValidations::read_model_out_file(file_path,
hub_path = hub_path,
coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
tbl_chr
#> # A tibble: 13,104 × 7
#> location clade value target_date output_type output_type_id nowcast_date
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AL 24A 0 2024-09-15 mean NA 2024-10-16
#> 2 AL 24A 0 2024-09-16 mean NA 2024-10-16
#> 3 AL 24A 0 2024-09-17 mean NA 2024-10-16
#> 4 AL 24A 0 2024-09-18 mean NA 2024-10-16
#> 5 AL 24A 0 2024-09-19 mean NA 2024-10-16
#> 6 AL 24A 0 2024-09-20 mean NA 2024-10-16
#> 7 AL 24A 0 2024-09-21 mean NA 2024-10-16
#> 8 AL 24A 0 2024-09-22 mean NA 2024-10-16
#> 9 AL 24A 0 2024-09-23 mean NA 2024-10-16
#> 10 AL 24A 0 2024-09-24 mean NA 2024-10-16
#> # ℹ 13,094 more rows
hubValidations::check_tbl_values(tbl_chr,
round_id = "2024-10-16",
file_path = file_path, hub_path = hub_path
)
#> <error/check_error>
#> Error:
#> ! `tbl` contains invalid values/value combinations. Column
#> `output_type_id` contains invalid value "NA".
# Values in output_type_id column are not `NA`s
all(is.na(tbl_chr$output_type_id))
#> [1] FALSE
# Values in output_type_id column are actually the character string "NA"
all(tbl_chr$output_type_id == "NA")
#> [1] TRUE
# Convert to NA values
tbl_chr$output_type_id <- NA_character_
tbl_chr
#> # A tibble: 13,104 × 7
#> location clade value target_date output_type output_type_id nowcast_date
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AL 24A 0 2024-09-15 mean <NA> 2024-10-16
#> 2 AL 24A 0 2024-09-16 mean <NA> 2024-10-16
#> 3 AL 24A 0 2024-09-17 mean <NA> 2024-10-16
#> 4 AL 24A 0 2024-09-18 mean <NA> 2024-10-16
#> 5 AL 24A 0 2024-09-19 mean <NA> 2024-10-16
#> 6 AL 24A 0 2024-09-20 mean <NA> 2024-10-16
#> 7 AL 24A 0 2024-09-21 mean <NA> 2024-10-16
#> 8 AL 24A 0 2024-09-22 mean <NA> 2024-10-16
#> 9 AL 24A 0 2024-09-23 mean <NA> 2024-10-16
#> 10 AL 24A 0 2024-09-24 mean <NA> 2024-10-16
#> # ℹ 13,094 more rows
hubValidations::check_tbl_values(tbl_chr,
round_id = "2024-10-16",
file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.
# Write the file back, re-read and check again
arrow::write_parquet(tbl_chr, fs::path(hub_path, "model-output", file_path))
tbl_chr <- hubValidations::read_model_out_file(file_path,
hub_path = hub_path,
coerce_types = "chr"
)
#> ℹ Updating superseded URL `Infectious-Disease-Modeling-hubs` to `hubverse-org`
hubValidations::check_tbl_values(tbl_chr,
round_id = "2024-10-16",
file_path = file_path, hub_path = hub_path
)
#> <message/check_success>
#> Message:
#> `tbl` contains valid values/value combinations.
Created on 2024-10-18 with reprex v2.1.0
@MKupperman, I have a couple of diagnostic questions for you:
Thank you for the report!
Hi all, thanks for looking into this! @zkamvar - here's some information.
import pandas as pd
# Define the data as a list of dictionaries
data = [
{"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-22", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"},
{"location": "AL", "clade": "24E", "value": 0.0, "target_date": "2024-09-23", "output_type": "mean", "output_type_id": "NA", "nowcast_date": "2024-10-23"}
]
# Convert the list of dictionaries to a pandas DataFrame
df_mwe = pd.DataFrame(data)
# Save the DataFrame
df_mwe.to_parquet("file.parquet")
If we coerce to ...
For now, we'll add some R to the pipeline and do the conversion there, as @annakrystalli suggested.
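For anyone who would rather do the conversion on the Python side before writing the file, here is a minimal sketch of the same idea as the R fix earlier in the thread. It assumes pandas; the helper name `na_strings_to_missing` is hypothetical and not part of any hubverse tooling, and whether the resulting parquet file passes hub validation has not been checked here.

```python
import pandas as pd


def na_strings_to_missing(df, column="output_type_id", sentinel="NA"):
    """Return a copy of df with the literal sentinel string replaced by a missing value."""
    out = df.copy()
    # Select the rows holding the literal string and overwrite them with None,
    # which pandas treats as missing.
    out.loc[out[column] == sentinel, column] = None
    return out


# Same idea as the MWE above, shortened to the relevant columns.
df_mwe = pd.DataFrame(
    [
        {"location": "AL", "clade": "24E", "value": 0.0, "output_type": "mean", "output_type_id": "NA"},
        {"location": "AL", "clade": "24E", "value": 0.0, "output_type": "mean", "output_type_id": "NA"},
    ]
)

fixed = na_strings_to_missing(df_mwe)
# Every entry should now register as missing rather than as the string "NA".
print(fixed["output_type_id"].isna().all())

# Writing is left commented out because it needs a parquet engine such as pyarrow:
# fixed.to_parquet("file.parquet")
```

One caveat: a column that is entirely missing may be written with a different Arrow type than a string column, so it is worth re-running the validation on the written file.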
Thank you for the MWE, @MKupperman! I've opened hubverse-org/hubDocs#198 to track the issue.
Adding 10/23 submissions to the PR. Should pass integration tests now.
LANL Theoretical Biology and Biophysics group predictions using CovTransformer (a transformer model for covid variant predictions) for 10/16 are included in this merge request.
We encountered an issue with "NA" not being correctly recognized as NA when performing parquet validation. Please email me if this is an issue on the back-end.
We have clipped our NA/NaN/Null predictions to 0, to comply with the normalization requirement that frequencies must sum to 1.
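As an illustration of the clipping step, here is a minimal sketch in plain Python. The function names, the example frequencies, and the tolerance are all hypothetical; the hub's actual normalization check is not shown in this thread and may differ.

```python
import math


def clip_missing_to_zero(freqs):
    """Replace None/NaN entries with 0.0 and leave the other values unchanged."""
    return [0.0 if f is None or math.isnan(f) else float(f) for f in freqs]


def sums_to_one(freqs, tol=1e-9):
    """Check that the frequencies sum to 1 within an assumed absolute tolerance."""
    return math.isclose(sum(freqs), 1.0, abs_tol=tol)


# Hypothetical clade frequencies for one location and target date.
clade_freqs = [0.7, float("nan"), 0.3, None]

clipped = clip_missing_to_zero(clade_freqs)
print(clipped, sums_to_one(clipped))
```

Clipping alone only satisfies the sum-to-1 requirement when the non-missing frequencies already sum to 1; otherwise a renormalization step would also be needed.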