Test and document how to produce typed NA/missing values in Python. #198
Comments
Adding a relevant comment I made in response to a PR comment by @MKupperman.

Originally posted by @annakrystalli in reichlab/variant-nowcast-hub#117 (comment): Overall I think trying to cast the column data type in Python before writing, if possible, would be preferable. I haven't tested it, but while handling it in […]
I think this is possible from my read of these docs, but I'm not Python-versed enough to be sure: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values
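For reference, a minimal sketch of the pd.NA behaviour those docs describe:

```python
import pandas as pd
import numpy as np

# pd.NA is a single, typed missing-value scalar: unlike np.nan it works
# in nullable integer/string columns and propagates through comparisons.
s = pd.Series([1, pd.NA], dtype="Int64")
print(s.dtype)           # Int64 (nullable integer, not float)
print(pd.isna(s[1]))     # True
print(np.nan == np.nan)  # False -- NaN never compares equal
print(pd.NA == 1)        # <NA>  -- comparisons with pd.NA propagate
```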
That was my previous thought, using […]. I worked on this for a bit and found that if you cast the pd.NA values to the string dtype, the column correctly preserves the NA characters that the R check is expecting. It's a one-liner:

```python
df["output_type_id"] = df["output_type_id"].astype("string")
```

A note in the documentation would be helpful for future reference if the resolution of this issue is a won't-fix.
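For context, here's what that looks like end to end (a minimal sketch: the frame contents are made up, but the column name and the na_rep="NA" convention come from this thread):

```python
import pandas as pd

# A column holding pd.NA, cast to the nullable "string" dtype so that
# na_rep writes the literal "NA" the R-side check expects.
df = pd.DataFrame({"output_type_id": [pd.NA, "0.5", pd.NA]})
df["output_type_id"] = df["output_type_id"].astype("string")
df.to_csv("example.csv", index=False, na_rep="NA")
print(df["output_type_id"].dtype)  # string
```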
Thank you for the investigation @MKupperman! I had tried something similar with […].
The documentation needs an overhaul of that section, and the next iteration of the schema will likely fix that by clarifying that the […].
This is great! Thanks @MKupperman for the investigation! @zkamvar, we should also document the importance of retaining the required column data types, with tips on how to do so in different languages. For completeness, I think we should explore all the options available to Python users for recording missing values, e.g. […]

For complete downstream sanity: […]
It seems the experimental […]. Overall I'm going to rename and move this issue to […].
I did a little experiment and it seems like Python tolerates […]. The only sticking thing is that pandas with a […]:

```python
import polars as pl
import pandas as pd
import numpy as np
import math

# Define the data as a list of dictionaries, one key per way of
# spelling a missing value in Python
data = [
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
    {"NA": pd.NA, "None": None, "math.nan": math.nan, "np.nan": np.nan, "floatnan": float('nan')},
]

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(data)

df["NA"] = df["NA"].astype("string")
df["None"] = df["None"].astype("string")
# Save the DataFrame with nullable string columns
df.to_csv("string.csv", index=False, na_rep="NA")
df.to_parquet("string.parquet")

df["NA"] = df["NA"].astype("float")
df["None"] = df["None"].astype("float")
# Save the DataFrame with float columns
df.to_csv("float.csv", index=False, na_rep="NA")
df.to_parquet("float.parquet")

df["NA"] = df["NA"].astype("Int64")
df["None"] = df["None"].astype("Int64")
# Save the DataFrame with nullable integer columns
df.to_csv("integer.csv", index=False, na_rep="NA")
df.to_parquet("integer.parquet")

# polars is different because it wants a dictionary of lists
pldf = pl.DataFrame({
    "string": ["a", None],
    "integer": [1, None],
    "float": [1.0, None],
    "bool": [True, None],
})
pldf.write_csv("polars.csv", null_value="NA")
pldf.write_parquet("polars.parquet")

# Round-trip the polars parquet file through three readers
pol = pl.read_parquet("polars.parquet")
pdnp = pd.read_parquet("polars.parquet")
pdar = pd.read_parquet("polars.parquet", dtype_backend="pyarrow")

print("round trip data")
print("polars:")
print(pol)
print("pandas (with numpy_nullable dtype):")
print(pdnp)
print(pdnp.dtypes)
print("pandas (with pyarrow dtype):")
print(pdar)
print(pdar.dtypes)
```

This is the output: […]
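As a quick check of the CSV side of that round trip (a sketch, assuming string.csv from the script above is in the working directory):

```python
import pandas as pd

# Re-read one of the CSVs written above to confirm the literal "NA"
# placeholders come back as missing values.
back = pd.read_csv("string.csv", na_values=["NA"])
print(back.isna().all().all())  # True -- every cell is missing
```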
Nice one @zkamvar! I was actually wondering whether we should put together some example files in an example hub, where we actually write some of these out and can then test whether they: […]

So far the results of your experiment make me wonder if we should restrict output type ID columns to be one of […].
I think this mini example hub should follow the v4 schema and probably live in […].
My mistake. I was using […]. I found that the output files above are all able to produce missing data that is interoperable with R and Python.
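As a quick check of that interoperability claim on the Python side (a sketch, assuming the parquet files from the experiment above are in the working directory):

```python
import pandas as pd

# Read the experiment's parquet files back in and check that the
# missing values survived the round trip along with their dtypes.
for name in ["string.parquet", "float.parquet", "integer.parquet"]:
    df = pd.read_parquet(name)
    print(name, dict(df.dtypes), df.isna().all().all())
```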
I would avoid trying to design the data structure in a way that makes it more compatible with R at the expense of natural semantics in other languages. Instead, we should focus on the data formats themselves and how they represent missing data. For JSON and Arrow, that's a single null value.
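To illustrate (a sketch using pyarrow, assuming the polars.parquet file from the earlier experiment is still on disk), the parquet file exposes the same null metadata no matter which library wrote it:

```python
import pyarrow.parquet as pq

# At the Arrow/parquet layer there is only one notion of missingness --
# a null -- whatever the producing language called it.
table = pq.read_table("polars.parquet")
for name in table.column_names:
    print(name, table.schema.field(name).type, table.column(name).null_count)
```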
I used the following code to read in a CSV-format file and create data frames in both polars and pandas, which passed validations. This is working with the example submission CSV file in this folder: https://github.com/elray1/FluSight-forecast-hub/tree/main/auxiliary-data

```python
import pandas as pd
import polars as pl

# polars: read the CSV, tolerating mixed/NA cells, then coerce the date
# columns (treating the literal "NA" as a null) before writing parquet
submission_pl = pl.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv",
    ignore_errors=True
)
submission_pl = submission_pl.with_columns(
    pl.col("reference_date").str.to_date("%Y-%m-%d"),
    pl.col("target_end_date").replace("NA", None).str.to_date("%Y-%m-%d")
)
submission_pl.write_parquet("auxiliary-data/2024-11-16-example-submission.parquet")

# pandas: read the CSV, convert the date columns, and use the nullable
# Int64 dtype so missing horizons survive the parquet round trip
submission_pd = pd.read_csv(
    "auxiliary-data/2024-11-16-example-submission.csv"
)
for col in ["reference_date", "target_end_date"]:
    submission_pd[col] = pd.to_datetime(submission_pd[col]).dt.date
submission_pd["horizon"] = submission_pd["horizon"].astype('Int64')
submission_pd.to_parquet("auxiliary-data/2024-11-16-example-submission.parquet")
```

(Note: the original snippet called write_parquet on submission_pd before that frame existed; it has been corrected to submission_pl, the polars frame, which is the object with a write_parquet method.)
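A quick way to confirm that worked (a sketch, assuming the parquet file written above): read it back and inspect the dtypes.

```python
import pandas as pd

# Read the file written above back in to confirm the date and
# nullable-integer dtypes survived the parquet round trip.
check = pd.read_parquet("auxiliary-data/2024-11-16-example-submission.parquet")
print(check.dtypes)
print(check["horizon"].isna().sum(), "missing horizon values")
```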
Nice investigation! Chiming in to say that @zkamvar helped me frame the problem like this: […]

If that framing is correct, the updated docs and other artifacts should focus on the second item.
That's a PITA and also a good reason to focus people on getting their column data types correct.
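One way to do that in pandas (a sketch: the path is hypothetical and the column names are borrowed from this thread) is to declare the dtypes when reading, rather than fixing them afterwards:

```python
import pandas as pd

# Declaring dtypes at read time avoids pandas' object-dtype guessing
# and keeps typed NAs intact from the start.
submission = pd.read_csv(
    "model-output/example-submission.csv",  # hypothetical path
    dtype={"output_type_id": "string", "horizon": "Int64"},
)
print(submission.dtypes)
```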
As mentioned in reichlab/variant-nowcast-hub#116 (comment):
This might be addressed partially in hubverse-org/schemas#109, but I wonder if it's possible to catch `vctrs_unspecified` columns and convert them to characters, since we know those are always going to be missing values.