
Test and document how to produce typed NA/missing values in Python. #198

Open

zkamvar opened this issue Oct 21, 2024 · 11 comments

Comments

@zkamvar
Member

zkamvar commented Oct 21, 2024

As mentioned in reichlab/variant-nowcast-hub#116 (comment):

If we coerce to pd.NA (or None), the corresponding dtype that the validation tool receives is vctrs_unspecified, rather than chr. Similarly, using np.nan to encode gives a "double" (numeric) data type, instead of char. Seems like an issue to fix on the backend.

This might be addressed partially in hubverse-org/schemas#109, but I wonder if it's possible to catch vctrs_unspecified and convert them to characters since we know those are going to always be missing values.

@annakrystalli
Member

annakrystalli commented Oct 21, 2024

Adding a relevant comment I made in response to a PR comment by @MKupperman

It seems casting to missing values in Python changes or even removes the column datatype. This also happens in R: setting a column's content to all NAs changes the data type to logical (Boolean) by default. In R, however, we can set an all-NA column to a specific data type by using a typed version of NA, e.g. to preserve a character column we could assign NA_character_ instead of NA. Is there something equivalent in Python?

Originally posted by @annakrystalli in reichlab/variant-nowcast-hub#117 (comment)

Overall I think trying to cast the column data type in Python before writing, if possible, would be preferable. I haven't tested it, but while handling it in hubValidations would be possible, it may well cause problems when opening, and especially filtering on, such columns through hubData.
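
For reference, the closest pandas analogue of a typed NA appears to be a missing value inside a column with an explicit nullable dtype; a minimal, untested sketch (the column name is just for illustration):

import pandas as pd

# An all-missing column that still carries a character (string) dtype,
# much like rep(NA_character_, 2) in R.
col = pd.array([pd.NA, pd.NA], dtype="string")
print(col.dtype)  # string

df = pd.DataFrame({"output_type_id": col})
print(df.dtypes)  # output_type_id    string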

@annakrystalli
Member

annakrystalli commented Oct 21, 2024

I think this is possible from my read of these docs, but I'm not versed enough in Python to be sure: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values

@MKupperman

That was my previous thought, using pd.NA tokens instead of "NA" (since the column schema suggests it should be "NA").

I worked on this for a bit and found that if you cast the pd.NA to string types, it correctly preserves the NA characters that the R check is expecting. It's a one-liner:

df["output_type_id"] = df["output_type_id"].astype("string")

A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.

@zkamvar
Member Author

zkamvar commented Oct 21, 2024

Thank you for the investigation @MKupperman! I had tried something similar with astype("str") and got bupkis (but not an error 🫤). I'm glad to know this works!

A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.

The documentation needs an overhaul on that section and the next iteration of the schema will likely fix that by clarifying that the output_type_id for point estimates is a missing/None/null type: hubverse-org/schemas#109

@annakrystalli
Member

annakrystalli commented Oct 22, 2024

This is great! Thanks @MKupperman for the investigation!

@zkamvar we should also document the importance of retaining the required column data type, with tips on how to do so in different languages.

For completeness I think we should explore all the options available to Python users for recording missing values, e.g. NaN and None, and also check what the output looks like when:

  • casting and writing pd.NAs to csv from python.
  • casting and writing python NaNs to parquet
  • casting and writing python NaNs to csv
  • casting and writing python Nones to parquet
  • casting and writing python Nones to csv

For complete downstream sanity:

  • Mix and match any of the above that pass validation in example model output files and ensure they do not cause problems for hubData::connect_hub() or hubData::connect_model_output()

It seems the experimental pd.NA is the closest to R's NA and likely what we want to promote, but it would be good to get our heads round what else works/doesn't work/folks might use and get ahead of the game in our documentation.

Overall I'm going to rename and move this issue to hubDocs, which seems the more appropriate location for it now.

@annakrystalli annakrystalli transferred this issue from hubverse-org/hubValidations Oct 22, 2024
@annakrystalli annakrystalli transferred this issue from hubverse-org/schemas Oct 22, 2024
@annakrystalli annakrystalli changed the title Pyarrow cannot produce character NA types. Test and document how to produce typed NA/missing values in Python. Oct 22, 2024
@zkamvar
Member Author

zkamvar commented Oct 22, 2024

I did a little experiment and it seems like Python tolerates None fairly well in the conversions.

The only sticking point is that pandas with the default NumPy dtype backend does not like to read integer columns with missing data. Instead, it will silently convert those columns to floats.

import polars as pl
import pandas as pd
import numpy as np
import math

# Define the data as a list of dictionaries
data = [
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
]

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(data)

df["NA"] = df["NA"].astype("string")
df["None"] = df["None"].astype("string")
# Save the DataFrame
df.to_csv("string.csv", index = False, na_rep = "NA")
df.to_parquet("string.parquet")

df["NA"] = df["NA"].astype("float")
df["None"] = df["None"].astype("float")
# Save the DataFrame
df.to_csv("float.csv", index = False, na_rep = "NA")
df.to_parquet("float.parquet")

df["NA"] = df["NA"].astype("Int64")
df["None"] = df["None"].astype("Int64")
# Save the DataFrame
df.to_csv("integer.csv", index = False, na_rep = "NA")
df.to_parquet("integer.parquet")

# polars is different because it wants a dictionary of lists
pldf = pl.DataFrame({
    "string": ["a", None],
    "integer": [1, None],
    "float": [1.0, None],
    "bool": [True, None],
})
pldf.write_csv("polars.csv", null_value = "NA")
pldf.write_parquet("polars.parquet")
pol = pl.read_parquet("polars.parquet")
pdnp = pd.read_parquet("polars.parquet")
pdar = pd.read_parquet("polars.parquet", dtype_backend = "pyarrow")
print("round trip data")
print("polars:")
print(pol)
print("pandas (with numpy_nullable dytpe):")
print(pdnp)
print(pdnp.dtypes)
print("pandas (with pyarrow dytpe):")
print(pdar)
print(pdar.dtypes)

This is the output:

$ python test.py 
round trip data
polars:
shape: (2, 4)
┌────────┬─────────┬───────┬──────┐
│ string ┆ integer ┆ float ┆ bool │
│ ---    ┆ ---     ┆ ---   ┆ ---  │
│ str    ┆ i64     ┆ f64   ┆ bool │
╞════════╪═════════╪═══════╪══════╡
│ a      ┆ 1       ┆ 1.0   ┆ true │
│ null   ┆ null    ┆ null  ┆ null │
└────────┴─────────┴───────┴──────┘
pandas (with default numpy dtype):
  string  integer  float  bool
0      a      1.0    1.0  True
1   None      NaN    NaN  None
string      object
integer    float64
float      float64
bool        object
dtype: object
pandas (with pyarrow dtype):
  string  integer  float  bool
0      a        1    1.0  True
1   <NA>     <NA>   <NA>  <NA>
string     large_string[pyarrow]
integer           int64[pyarrow]
float            double[pyarrow]
bool               bool[pyarrow]
dtype: object
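
As a standalone illustration of that integer-to-float coercion and one workaround, here is a minimal sketch (assuming pandas >= 2.0 with pyarrow installed; the file name is made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write an int64 column containing a null directly with pyarrow,
# so no pandas type metadata is stored alongside it.
table = pa.table({"x": pa.array([1, None], type=pa.int64())})
pq.write_table(table, "ints.parquet")

# Default read: the missing value forces a silent upcast to float64 (NaN).
print(pd.read_parquet("ints.parquet")["x"].dtype)  # float64

# Requesting nullable dtypes preserves the integer type (missing becomes <NA>).
print(pd.read_parquet("ints.parquet", dtype_backend="numpy_nullable")["x"].dtype)  # Int64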

@annakrystalli
Member

annakrystalli commented Oct 23, 2024

Nice one @zkamvar! I was wondering whether we should put together some example files in an example hub where we actually write some of these out and can then test whether they:

  • correctly validate
  • can be successfully accessed and filtered on
  • other downstream tools can correctly parse them in both R and Python (e.g. polars; see @elray1's comment)

So far the results of your experiment make me wonder if we should restrict output type ID columns to be one of "double" or "character" for all output types in the schema (currently we allow "double", "character" or "integer").

@annakrystalli
Member

I think this mini example hub should follow the v4 schema and probably live in hubUtils' inst/ directory to be available more broadly for examples and testing.

@zkamvar
Member Author

zkamvar commented Oct 23, 2024

So far the results of your experiment make me wonder if we should restrict output type ID columns to be one of "double" or "character" for all output types in the schema (currently we allow "double", "character" or "integer").

My mistake. I was using astype("int64") instead of astype("Int64"). I updated the example.

I found that the output files above are all able to produce missing data that is interoperable with R and Python.

It seems the experimental pd.NA is the closest to R's NA and likely what we want to promote

I would avoid trying to design the data structure in a way that makes it more compatible with R at the expense of natural semantics in other languages. Instead, we should focus on the data formats themselves and how they represent missing data.

For JSON and Arrow, that's a single null value. For CSV, it's more nuanced: Python, Julia, and R can all understand that an empty cell (two juxtaposed commas) represents missing data, but R established the explicit convention of labelling missing data as NA, which prevents a genuinely empty string from being misinterpreted as missing. NA is a safe standard as it's in all the documentation.
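
A small sketch of that CSV convention from the pandas side (hedged: na_rep is needed on write, while "NA" is already among the strings pandas treats as missing on read):

import io
import pandas as pd

df = pd.DataFrame({"value": pd.array(["a", pd.NA], dtype="string")})

# Write missing values using the explicit "NA" convention R expects.
buf = io.StringIO()
df.to_csv(buf, index=False, na_rep="NA")
print(buf.getvalue())

# On read, pandas recognises "NA" as missing by default.
buf.seek(0)
print(pd.read_csv(buf)["value"].isna().tolist())  # [False, True]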

@elray1
Contributor

elray1 commented Oct 30, 2024

I used the following code to read in a csv-format file and create data frames in both polars and pandas, which passed validations. This works with the example submission csv file in this folder: https://github.com/elray1/FluSight-forecast-hub/tree/main/auxiliary-data

import pandas as pd
import polars as pl

# polars: read the csv leniently, then repair the date columns,
# treating the literal string "NA" as missing.
submission_pl = pl.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv",
    ignore_errors=True
)

submission_pl = submission_pl.with_columns(
    pl.col("reference_date").str.to_date("%Y-%m-%d"),
    pl.col("target_end_date").replace("NA", None).str.to_date("%Y-%m-%d")
)

submission_pl.write_parquet("auxiliary-data/2024-11-16-example-submission.parquet")

# pandas: parse the date columns and restore the nullable integer horizon column.
submission_pd = pd.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv"
)

for col in ["reference_date", "target_end_date"]:
    submission_pd[col] = pd.to_datetime(submission_pd[col]).dt.date

submission_pd["horizon"] = submission_pd["horizon"].astype('Int64')

submission_pd.to_parquet("auxiliary-data/2024-11-16-example-submission.parquet")

@bsweger
Contributor

bsweger commented Oct 30, 2024

I did a little experiment and it seems like Python tolerates None fairly well in the conversions.

Nice investigation! Chiming in to say that @zkamvar helped me frame the problem like this:

  • How missing data is encoded in a model output file is a concern of the language being used to generate it.
  • The concern of modelers is not really the representation of missing data; it's the data type of the column that contains it.

If that framing is correct, the updated docs and other artifacts should focus on the second item.

The only sticking point is that pandas with the default NumPy dtype backend does not like to read integer columns with missing data. Instead, it will silently convert those columns to floats.

That's a PITA and also a good reason to focus people on getting their column data types correct.
