
Test and document how to produce typed NA/missing values in Python. #198

Open

zkamvar opened this issue Oct 21, 2024 · 11 comments

Comments

@zkamvar
Member

zkamvar commented Oct 21, 2024

As mentioned in reichlab/variant-nowcast-hub#116 (comment):

If we coerce to pd.NA (or None), the corresponding dtype that the validation tool receives is vctrs_unspecified, rather than chr. Similarly, using np.nan to encode gives a "double" (numeric) data type, instead of char. Seems like an issue to fix on the backend.

This might be addressed partially in hubverse-org/schemas#109, but I wonder if it's possible to catch vctrs_unspecified and convert them to characters since we know those are going to always be missing values.

@annakrystalli
Member

annakrystalli commented Oct 21, 2024

Adding a relevant comment I made in response to a PR comment by @MKupperman

It seems casting to missing values in Python changes or even removes the column datatype. This also happens in R: setting a column's content to all NAs changes the data type to logical (Boolean) by default. In R, however, we can set an all-NA column to a specific data type by using a typed version of NA, e.g. to preserve a character column we could assign NA_character_ instead of NA. Is there something equivalent in Python?

Originally posted by @annakrystalli in reichlab/variant-nowcast-hub#117 (comment)

Overall I think trying to cast the column data type in Python before writing, if possible, would be preferable. I haven't tested it, but while handling it in hubValidations would be possible, it may well cause problems when opening, and especially filtering on, such columns through hubData.
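
For reference, the closest pandas analogue of a typed NA appears to be a missing value inside a column with an explicit nullable dtype; a minimal, untested sketch (the column name is just for illustration):

import pandas as pd

# An all-missing column that still carries a character (string) dtype,
# much like rep(NA_character_, 2) in R.
col = pd.array([pd.NA, pd.NA], dtype="string")
print(col.dtype)  # string

df = pd.DataFrame({"output_type_id": col})
print(df.dtypes)  # output_type_id    string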

@annakrystalli
Member

annakrystalli commented Oct 21, 2024

I think this is possible from my read of these docs, but I'm not versed enough in Python to be sure: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values

@MKupperman

That was my previous thought, using pd.NA tokens instead of "NA" (since the column schema suggests it should be "NA").

I worked on this for a bit and found that if you cast the pd.NA to string types, it correctly preserves the NA characters that the R check is expecting. It's a one-liner:

df["output_type_id"] = df["output_type_id"].astype("string")

A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.

@zkamvar
Member Author

zkamvar commented Oct 21, 2024

Thank you for the investigation @MKupperman! I had tried something similar with astype("str") and got bupkis (but not an error 🫤). I'm glad to know this works!

A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.

The documentation needs an overhaul on that section and the next iteration of the schema will likely fix that by clarifying that the output_type_id for point estimates is a missing/None/null type: hubverse-org/schemas#109

@annakrystalli
Member

annakrystalli commented Oct 22, 2024

This is great! Thanks @MKupperman for the investigation!

@zkamvar we should also document the importance of retaining the required column data type, with tips on how to do so in different languages.

For completeness I think we should explore all the options available to Python users for recording missing values, e.g. NaN and None, and also check what the output looks like when:

  • casting and writing pd.NAs to csv from python.
  • casting and writing python NaNs to parquet
  • casting and writing python NaNs to csv
  • casting and writing python Nones to parquet
  • casting and writing python Nones to csv

For complete downstream sanity:

  • Mix and match any of the above that pass validation in example model output files and ensure they do not cause problems for hubData::connect_hub() or hubData::connect_model_output()

It seems the experimental pd.NA is the closest to R's NA and likely what we want to promote, but it would be good to get our heads round what else works/doesn't work/folks might use and get ahead of the game in our documentation.

Overall I'm going to rename and move this issue to hubDocs, which seems the more appropriate location for it now.

@annakrystalli annakrystalli transferred this issue from hubverse-org/hubValidations Oct 22, 2024
@annakrystalli annakrystalli transferred this issue from hubverse-org/schemas Oct 22, 2024
@annakrystalli annakrystalli changed the title Pyarrow cannot produce character NA types. Test and document how to produce typed NA/missing values in Python. Oct 22, 2024
@zkamvar
Member Author

zkamvar commented Oct 22, 2024

I did a little experiment and it seems like Python tolerates None fairly well in the conversions.

The only sticking point is that pandas with the default NumPy dtype backend does not like to read integer columns with missing data. Instead, it will silently convert those columns to floats.

import polars as pl
import pandas as pd
import numpy as np
import math

# Define the data as a list of dictionaries
data = [
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
]

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(data)

df["NA"] = df["NA"].astype("string")
df["None"] = df["None"].astype("string")
# Save the DataFrame
df.to_csv("string.csv", index = False, na_rep = "NA")
df.to_parquet("string.parquet")

df["NA"] = df["NA"].astype("float")
df["None"] = df["None"].astype("float")
# Save the DataFrame
df.to_csv("float.csv", index = False, na_rep = "NA")
df.to_parquet("float.parquet")

df["NA"] = df["NA"].astype("Int64")
df["None"] = df["None"].astype("Int64")
# Save the DataFrame
df.to_csv("integer.csv", index = False, na_rep = "NA")
df.to_parquet("integer.parquet")

# polars is different because it wants a dictionary of lists
pldf = pl.DataFrame({
    "string": ["a", None],
    "integer": [1, None],
    "float": [1.0, None],
    "bool": [True, None],
})
pldf.write_csv("polars.csv", null_value = "NA")
pldf.write_parquet("polars.parquet")
pol = pl.read_parquet("polars.parquet")
pdnp = pd.read_parquet("polars.parquet")
pdar = pd.read_parquet("polars.parquet", dtype_backend = "pyarrow")
print("round trip data")
print("polars:")
print(pol)
print("pandas (with numpy_nullable dytpe):")
print(pdnp)
print(pdnp.dtypes)
print("pandas (with pyarrow dytpe):")
print(pdar)
print(pdar.dtypes)

This is the output:

$ python test.py 
round trip data
polars:
shape: (2, 4)
┌────────┬─────────┬───────┬──────┐
│ string ┆ integer ┆ float ┆ bool │
│ ---    ┆ ---     ┆ ---   ┆ ---  │
│ str    ┆ i64     ┆ f64   ┆ bool │
╞════════╪═════════╪═══════╪══════╡
│ a      ┆ 1       ┆ 1.0   ┆ true │
│ null   ┆ null    ┆ null  ┆ null │
└────────┴─────────┴───────┴──────┘
pandas (with default numpy dtype):
  string  integer  float  bool
0      a      1.0    1.0  True
1   None      NaN    NaN  None
string      object
integer    float64
float      float64
bool        object
dtype: object
pandas (with pyarrow dtype):
  string  integer  float  bool
0      a        1    1.0  True
1   <NA>     <NA>   <NA>  <NA>
string     large_string[pyarrow]
integer           int64[pyarrow]
float            double[pyarrow]
bool               bool[pyarrow]
dtype: object
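
As a standalone illustration of that integer-to-float coercion and one workaround, here is a minimal sketch (assuming pandas >= 2.0 with pyarrow installed; the file name is made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write an int64 column containing a null directly with pyarrow,
# so no pandas type metadata is stored alongside it.
table = pa.table({"x": pa.array([1, None], type=pa.int64())})
pq.write_table(table, "ints.parquet")

# Default read: the missing value forces a silent upcast to float64 (NaN).
print(pd.read_parquet("ints.parquet")["x"].dtype)  # float64

# Requesting nullable dtypes preserves the integer type (missing becomes <NA>).
print(pd.read_parquet("ints.parquet", dtype_backend="numpy_nullable")["x"].dtype)  # Int64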

@annakrystalli
Member

annakrystalli commented Oct 23, 2024

Nice one @zkamvar! I was wondering whether we should put together some example files in an example hub where we actually write some of these out and can then test whether they:

  • correctly validate
  • can be successfully accessed and filtered on
  • other downstream tools can correctly parse them in both R and Python (e.g. polars; see @elray1's comment)

So far the results of your experiment make me wonder if we should restrict output type ID columns to be one of "double" or "character" for all output types in the schema (currently we allow "double", "character" or "integer").

@annakrystalli
Member

I think this mini example hub should follow the v4 schema and probably live in hubUtils' inst/ directory to be available more broadly for examples and testing.

@zkamvar
Member Author

zkamvar commented Oct 23, 2024

So far the results of your experiment make me wonder if we should restrict output type ID columns to be one of "double" or "character" for all output types in the schema (currently we allow "double", "character" or "integer").

My mistake. I was using astype("int64") instead of astype("Int64"). I updated the example.

I found that the output files above are all able to produce missing data that is interoperable with R and Python.

It seems the experimental pd.NA is the closest to R's NA and likely what we want to promote

I would avoid trying to design the data structure in a way that makes it more compatible with R at the expense of natural semantics in other languages. Instead, we should focus on the data formats themselves and how they represent missing data.

For JSON and Arrow, that's a single null value. For CSV, it's more nuanced: Python, Julia, and R can all understand that an empty cell (two juxtaposed commas) represents missing data, but R established the explicit convention of labelling missing data as NA, which prevents a genuinely empty string from being misinterpreted as missing. NA is a safe standard as it's in all the documentation.
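
A small sketch of that CSV convention from the pandas side (hedged: na_rep is needed on write, while "NA" is already among the strings pandas treats as missing on read):

import io
import pandas as pd

df = pd.DataFrame({"value": pd.array(["a", pd.NA], dtype="string")})

# Write missing values using the explicit "NA" convention R expects.
buf = io.StringIO()
df.to_csv(buf, index=False, na_rep="NA")
print(buf.getvalue())

# On read, pandas recognises "NA" as missing by default.
buf.seek(0)
print(pd.read_csv(buf)["value"].isna().tolist())  # [False, True]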

@elray1
Contributor

elray1 commented Oct 30, 2024

I used the following code to read in a csv-format file and create data frames in both polars and pandas, which passed validations. This works with the example submission csv file in this folder: https://github.com/elray1/FluSight-forecast-hub/tree/main/auxiliary-data

import pandas as pd
import polars as pl

# polars: read the csv leniently, then repair the date columns,
# treating the literal string "NA" as missing.
submission_pl = pl.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv",
    ignore_errors=True
)

submission_pl = submission_pl.with_columns(
    pl.col("reference_date").str.to_date("%Y-%m-%d"),
    pl.col("target_end_date").replace("NA", None).str.to_date("%Y-%m-%d")
)

submission_pl.write_parquet("auxiliary-data/2024-11-16-example-submission.parquet")

# pandas: parse the date columns and restore the nullable integer horizon column.
submission_pd = pd.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv"
)

for col in ["reference_date", "target_end_date"]:
    submission_pd[col] = pd.to_datetime(submission_pd[col]).dt.date

submission_pd["horizon"] = submission_pd["horizon"].astype('Int64')

submission_pd.to_parquet("auxiliary-data/2024-11-16-example-submission.parquet")

@bsweger
Contributor

bsweger commented Oct 30, 2024

I did a little experiment and it seems like Python tolerates None fairly well in the conversions.

Nice investigation! Chiming in to say that @zkamvar helped me frame the problem like this:

  • How missing data is encoded in a model output file is a concern of the language being used to generate it.
  • The concern of modelers is not really the representation of missing data; it's the data type of the column that contains it.

If that framing is correct, the updated docs and other artifacts should focus on the second item.

The only sticking point is that pandas with the default NumPy dtype backend does not like to read integer columns with missing data. Instead, it will silently convert those columns to floats.

That's a PITA and also a good reason to focus people on getting their column data types correct.
