Test suite is failing with Vegafusion 2 #3701

Open
MarcoGorelli opened this issue Nov 24, 2024 · 14 comments · Fixed by #3702
@MarcoGorelli
Contributor

What happened?

Running the test suite with the latest versions of all dependencies results in:

=========================== short test summary info ============================
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[False-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[True-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
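All four failures share one shape: a bare year string ('1900' or '1933') is handed to a parser that expects the format '%B %d, %Y %H:%M'. Below is a minimal sketch of that mismatch, using Python's `datetime.strptime` as a stand-in. This is an assumption for illustration only: DataFusion has its own parser, and the helper name `parses` is hypothetical, not part of Altair or VegaFusion.

```python
from datetime import datetime

# Format string copied from the error message in the test summary.
FORMAT = "%B %d, %Y %H:%M"


def parses(value: str, fmt: str = FORMAT) -> bool:
    """Return True if `value` matches `fmt` under Python's strptime.

    Hypothetical helper: it only mimics the kind of check that fails,
    it does not call VegaFusion or DataFusion.
    """
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


print(parses("1900"))                    # False: bare year, as in the failing tests
print(parses("January 01, 1900 00:00"))  # True: string that matches the format
print(parses("1900", "%Y"))              # True: bare year with a year-only format
```

This suggests the year values themselves are fine and that the format chosen for them is what changed, though the sketch cannot confirm where in VegaFusion 2 that format comes from.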

Full logs:

============================= test session starts ==============================
platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/runner/work/narwhals/narwhals/altair
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, cov-6.0.0, xdist-3.6.1
created: 4/4 workers
4 workers [1738 items]

........................................................................ [ 4%]
........................................................................ [ 8%]
........................................................................ [ 12%]
........................................................................ [ 16%]
........................................................................ [ 20%]
........................................................................ [ 24%]
........................................................................ [ 28%]
........................................................................ [ 33%]
........................................................................ [ 37%]
........................................................................ [ 41%]
........................................................................ [ 45%]
........................................................................ [ 49%]
........................................................................ [ 53%]
........................................................................ [ 57%]
..........................................................F............. [ 62%]
.........................................F.............................. [ 66%]
..s.ssss...F............................................................ [ 70%]
.....F................X................................................. [ 74%]
........................................................................ [ 78%]
.X....X................................................................. [ 82%]
........................................................................ [ 86%]
........................................................................ [ 91%]
..............X......................................................... [ 95%]
.......................X.........x...................................... [ 99%]
.......... [100%]
=================================== FAILURES ===================================
_____ test_primitive_chart_examples[False-natural_disasters.py-686-cols28] _____
[gw2] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'natural_disasters.py', rows = 686, cols = ['Deaths', 'Year']
to_reconstruct = False

@ignore_DataFrameGroupBy
@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,rows,cols", [
    ("annual_weather_heatmap.py", 366, ["monthdate_date_end", "max_temp_max"]),
    ("anscombe_plot.py", 44, ["Series", "X", "Y"]),
    ("bar_chart_sorted.py", 6, ["site", "sum_yield"]),
    ("bar_chart_faceted_compact.py", 27, ["p", "p_end"]),
    ("beckers_barley_facet.py", 120, ["year", "site"]),
    ("beckers_barley_wrapped_facet.py", 120, ["site", "median_yield"]),
    ("bump_chart.py", 96, ["rank", "yearmonth_date"]),
    ("comet_chart.py", 120, ["variety", "delta"]),
    ("diverging_stacked_bar_chart.py", 40, ["value", "percentage_start"]),
    ("donut_chart.py", 6, ["value_start", "value_end"]),
    ("gapminder_bubble_plot.py", 187, ["income", "population"]),
    ("grouped_bar_chart2.py", 9, ["Group", "Value_start"]),
    ("hexbins.py", 84, ["xFeaturePos", "mean_temp_max"]),
    pytest.param("histogram_heatmap.py", 378, ["bin_maxbins_40_Rotten_Tomatoes_Rating", "__count"], marks=slow),
    ("histogram_scatterplot.py", 64, ["bin_maxbins_10_Rotten_Tomatoes_Rating", "__count"]),
    pytest.param("interactive_legend.py", 1708, ["sum_count_start", "series"], marks=slow),
    ("iowa_electricity.py", 51, ["net_generation_start", "year"]),
    ("isotype.py", 37, ["animal", "x"]),
    ("isotype_grid.py", 100, ["row", "col"]),
    ("lasagna_plot.py", 492, ["yearmonthdate_date", "sum_price"]),
    ("layered_area_chart.py", 51, ["source", "net_generation"]),
    ("layered_bar_chart.py", 51, ["source", "net_generation"]),
    ("layered_histogram.py", 113, ["bin_maxbins_100_Measurement"]),
    ("line_chart_with_cumsum.py", 52, ["cumulative_wheat"]),
    ("line_custom_order.py", 55, ["miles", "gas"]),
    pytest.param("line_percent.py", 30, ["sex", "perc"], marks=slow),
    ("line_with_log_scale.py", 15, ["year", "sum_people"]),
    ("multifeature_scatter_plot.py", 150, ["petalWidth", "species"]),
    ("natural_disasters.py", 686, ["Deaths", "Year"]),
    ("normalized_stacked_area_chart.py", 51, ["source", "net_generation_start"]),
    ("normalized_stacked_bar_chart.py", 60, ["site", "sum_yield_start"]),
    ("parallel_coordinates.py", 600, ["key", "value"]),
    ("percentage_of_total.py", 5, ["PercentOfTotal", "TotalTime"]),
    ("pie_chart.py", 6, ["category", "value_start"]),
    ("pyramid.py", 3, ["category", "value_start"]),
    ("stacked_bar_chart_sorted_segments.py", 60, ["variety", "site"]),
    ("stem_and_leaf.py", 100, ["stem", "leaf"]),
    pytest.param("streamgraph.py", 1708, ["series", "sum_count"], marks=slow),
    ("top_k_items.py", 10, ["rank", "IMDB_Rating_start"]),
    ("top_k_letters.py", 9, ["rank", "letters"]),
    pytest.param("top_k_with_others.py", 10, ["ranked_director", "mean_aggregate_gross"], marks=slow),
    ("area_faceted.py", 492, ["date", "price"]),
    ("distributions_faceted_histogram.py", 20, ["Origin", "__count"]),
    ("us_population_over_time.py", 38, ["sex", "people_start"]),
    ("us_population_over_time_facet.py", 285, ["year", "sum_people"]),
    ("wilkinson-dot-plot.py", 21, ["data", "id"]),
    ("window_rank.py", 12, ["team", "diff"]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_primitive_chart_examples(filename, rows, cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())
    df = chart.transformed_data()

tests/test_transformed_data.py:82:


altair/vegalite/v5/api.py:4058: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(


self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'grid': False, 'labelFlush': True, 'labelOverlap':...: {'grid': False}, 'axisY': {'domain': False, 'offset': 10, 'ticks': False}, 'style': {'cell': {'stroke': None}}}, ...}
datasets = [('data_0', ())], local_tz = 'UTC', default_input_tz = None
row_limit = None, inline_datasets = {}, trim_unused_columns = False
dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )
    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] ______
[gw2] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'falkensee.py', all_rows = [2, 38, 38]
all_cols = [['event'], ['population'], ['population']], to_reconstruct = True

@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,all_rows,all_cols", [
    ("errorbars_with_std.py", [10, 10], [["upper_yield"], ["extent_yield"]]),
    ("candlestick_chart.py", [44, 44], [["low"], ["close"]]),
    ("co2_concentration.py", [713, 7, 7], [["first_date"], ["scaled_date"], ["end"]]),
    ("falkensee.py", [2, 38, 38], [["event"], ["population"], ["population"]]),
    ("heat_lane.py", [10, 10], [["bin_count_start"], ["y2"]]),
    ("histogram_responsive.py", [20, 20], [["__count"], ["__count"]]),
    ("histogram_with_a_global_mean_overlay.py", [9, 1], [["__count"], ["mean_IMDB_Rating"]]),
    ("horizon_graph.py", [20, 20], [["x"], ["ny"]]),
    pytest.param("interactive_cross_highlight.py", [64, 64, 13], [["__count"], ["__count"], ["Major_Genre"]], marks=slow),
    ("interval_selection.py", [123, 123], [["price_start"], ["date"]]),
    ("layered_chart_with_dual_axis.py", [12, 12], [["month_date"], ["average_precipitation"]]),
    ("layered_heatmap_text.py", [9, 9], [["Cylinders"], ["mean_horsepower"]]),
    ("multiline_highlight.py", [560, 560], [["price"], ["date"]]),
    ("multiline_tooltip.py", [300, 300, 300, 0, 300], [["x"], ["y"], ["y"], ["x"], ["x"]]),
    ("pie_chart_with_labels.py", [6, 6], [["category"], ["value"]]),
    ("radial_chart.py", [6, 6], [["values"], ["values_start"]]),
    ("scatter_linked_table.py", [392, 14, 14, 14], [["Year"], ["Year"], ["Year"], ["Year"]]),
    ("scatter_marginal_hist.py", [34, 150, 27], [["__count"], ["species"], ["__count"]]),
    pytest.param(
        "scatter_with_layered_histogram.py",
        [2, 19],
        [["gender"], ["__count"]],
        marks=(slow, pytest.mark.xfail(
            XDIST_ENABLED,
            reason="Possibly `numpy` conflict with `xdist`.\n"
            "Very intermittent, but only affects `to_reconstruct=False`."
        )),
    ),
    ("scatter_with_minimap.py", [1461, 1461], [["date"], ["date"]]),
    ("scatter_with_rolling_mean.py", [1461, 1461], [["date"], ["rolling_mean"]]),
    ("seattle_weather_interactive.py", [1461, 5], [["date"], ["__count"]]),
    ("select_detail.py", [20, 1000], [["id"], ["x"]]),
    ("simple_scatter_with_errorbars.py", [5, 5], [["x"], ["upper_ymin"]]),
    ("stacked_bar_chart_with_text.py", [60, 60], [["site"], ["site"]]),
    ("us_employment.py", [120, 1, 2], [["month"], ["president"], ["president"]]),
    ("us_population_pyramid_over_time.py", [19, 38, 19], [["gender"], ["year"], ["gender"]]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_compound_chart_examples(filename, all_rows, all_cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    assert isinstance(chart, (alt.LayerChart, alt.ConcatChart, alt.HConcatChart, alt.VConcatChart))
    dfs = chart.transformed_data()

tests/test_transformed_data.py:142:


altair/vegalite/v5/api.py:4688: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(


self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'aria': False, 'domain': False, 'grid': True, 'gri...Finite(+datum["year"]))) && isValid(datum["population"]) && isFinite(+datum["population"])', 'type': 'filter'}]}], ...}
datasets = [('data_0', ()), ('data_1', ()), ('data_2', ())], local_tz = 'UTC'
default_input_tz = None, row_limit = None, inline_datasets = {}
trim_unused_columns = False, dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )
    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] _____
[gw3] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'falkensee.py', all_rows = [2, 38, 38]
all_cols = [['event'], ['population'], ['population']], to_reconstruct = False

@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,all_rows,all_cols", [
    ("errorbars_with_std.py", [10, 10], [["upper_yield"], ["extent_yield"]]),
    ("candlestick_chart.py", [44, 44], [["low"], ["close"]]),
    ("co2_concentration.py", [713, 7, 7], [["first_date"], ["scaled_date"], ["end"]]),
    ("falkensee.py", [2, 38, 38], [["event"], ["population"], ["population"]]),
    ("heat_lane.py", [10, 10], [["bin_count_start"], ["y2"]]),
    ("histogram_responsive.py", [20, 20], [["__count"], ["__count"]]),
    ("histogram_with_a_global_mean_overlay.py", [9, 1], [["__count"], ["mean_IMDB_Rating"]]),
    ("horizon_graph.py", [20, 20], [["x"], ["ny"]]),
    pytest.param("interactive_cross_highlight.py", [64, 64, 13], [["__count"], ["__count"], ["Major_Genre"]], marks=slow),
    ("interval_selection.py", [123, 123], [["price_start"], ["date"]]),
    ("layered_chart_with_dual_axis.py", [12, 12], [["month_date"], ["average_precipitation"]]),
    ("layered_heatmap_text.py", [9, 9], [["Cylinders"], ["mean_horsepower"]]),
    ("multiline_highlight.py", [560, 560], [["price"], ["date"]]),
    ("multiline_tooltip.py", [300, 300, 300, 0, 300], [["x"], ["y"], ["y"], ["x"], ["x"]]),
    ("pie_chart_with_labels.py", [6, 6], [["category"], ["value"]]),
    ("radial_chart.py", [6, 6], [["values"], ["values_start"]]),
    ("scatter_linked_table.py", [392, 14, 14, 14], [["Year"], ["Year"], ["Year"], ["Year"]]),
    ("scatter_marginal_hist.py", [34, 150, 27], [["__count"], ["species"], ["__count"]]),
    pytest.param(
        "scatter_with_layered_histogram.py",
        [2, 19],
        [["gender"], ["__count"]],
        marks=(slow, pytest.mark.xfail(
            XDIST_ENABLED,
            reason="Possibly `numpy` conflict with `xdist`.\n"
            "Very intermittent, but only affects `to_reconstruct=False`."
        )),
    ),
    ("scatter_with_minimap.py", [1461, 1461], [["date"], ["date"]]),
    ("scatter_with_rolling_mean.py", [1461, 1461], [["date"], ["rolling_mean"]]),
    ("seattle_weather_interactive.py", [1461, 5], [["date"], ["__count"]]),
    ("select_detail.py", [20, 1000], [["id"], ["x"]]),
    ("simple_scatter_with_errorbars.py", [5, 5], [["x"], ["upper_ymin"]]),
    ("stacked_bar_chart_with_text.py", [60, 60], [["site"], ["site"]]),
    ("us_employment.py", [120, 1, 2], [["month"], ["president"], ["president"]]),
    ("us_population_pyramid_over_time.py", [19, 38, 19], [["gender"], ["year"], ["gender"]]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_compound_chart_examples(filename, all_rows, all_cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    assert isinstance(chart, (alt.LayerChart, alt.ConcatChart, alt.HConcatChart, alt.VConcatChart))
    dfs = chart.transformed_data()

tests/test_transformed_data.py:142:


altair/vegalite/v5/api.py:4688: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(


self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'aria': False, 'domain': False, 'grid': True, 'gri...m["population"])', 'type': 'filter'}], 'url': 'vegafusion+dataset://table_7c2f0057_a249_4dc8_8e0c_bd46ba873edb'}], ...}
datasets = [('source_0', ()), ('source_1', ()), ('source_2', ())]
local_tz = 'UTC', default_input_tz = None, row_limit = None
inline_datasets = {'table_0d3e98ef_595c_466b_8231_3067287bafae': start end event
0 1933 1945 Nazi Rule
1 ...9 40179
33 2010 40511
34 2011 40465
35 2012 40905
36 2013 41258
37 2014 41777}
trim_unused_columns = False, dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )
    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_primitive_chart_examples[True-natural_disasters.py-686-cols28] ______
[gw0] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'natural_disasters.py', rows = 686, cols = ['Deaths', 'Year']
to_reconstruct = True

@ignore_DataFrameGroupBy
@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,rows,cols", [
    ("annual_weather_heatmap.py", 366, ["monthdate_date_end", "max_temp_max"]),
    ("anscombe_plot.py", 44, ["Series", "X", "Y"]),
    ("bar_chart_sorted.py", 6, ["site", "sum_yield"]),
    ("bar_chart_faceted_compact.py", 27, ["p", "p_end"]),
    ("beckers_barley_facet.py", 120, ["year", "site"]),
    ("beckers_barley_wrapped_facet.py", 120, ["site", "median_yield"]),
    ("bump_chart.py", 96, ["rank", "yearmonth_date"]),
    ("comet_chart.py", 120, ["variety", "delta"]),
    ("diverging_stacked_bar_chart.py", 40, ["value", "percentage_start"]),
    ("donut_chart.py", 6, ["value_start", "value_end"]),
    ("gapminder_bubble_plot.py", 187, ["income", "population"]),
    ("grouped_bar_chart2.py", 9, ["Group", "Value_start"]),
    ("hexbins.py", 84, ["xFeaturePos", "mean_temp_max"]),
    pytest.param("histogram_heatmap.py", 378, ["bin_maxbins_40_Rotten_Tomatoes_Rating", "__count"], marks=slow),
    ("histogram_scatterplot.py", 64, ["bin_maxbins_10_Rotten_Tomatoes_Rating", "__count"]),
    pytest.param("interactive_legend.py", 1708, ["sum_count_start", "series"], marks=slow),
    ("iowa_electricity.py", 51, ["net_generation_start", "year"]),
    ("isotype.py", 37, ["animal", "x"]),
    ("isotype_grid.py", 100, ["row", "col"]),
    ("lasagna_plot.py", 492, ["yearmonthdate_date", "sum_price"]),
    ("layered_area_chart.py", 51, ["source", "net_generation"]),
    ("layered_bar_chart.py", 51, ["source", "net_generation"]),
    ("layered_histogram.py", 113, ["bin_maxbins_100_Measurement"]),
    ("line_chart_with_cumsum.py", 52, ["cumulative_wheat"]),
    ("line_custom_order.py", 55, ["miles", "gas"]),
    pytest.param("line_percent.py", 30, ["sex", "perc"], marks=slow),
    ("line_with_log_scale.py", 15, ["year", "sum_people"]),
    ("multifeature_scatter_plot.py", 150, ["petalWidth", "species"]),
    ("natural_disasters.py", 686, ["Deaths", "Year"]),
    ("normalized_stacked_area_chart.py", 51, ["source", "net_generation_start"]),
    ("normalized_stacked_bar_chart.py", 60, ["site", "sum_yield_start"]),
    ("parallel_coordinates.py", 600, ["key", "value"]),
    ("percentage_of_total.py", 5, ["PercentOfTotal", "TotalTime"]),
    ("pie_chart.py", 6, ["category", "value_start"]),
    ("pyramid.py", 3, ["category", "value_start"]),
    ("stacked_bar_chart_sorted_segments.py", 60, ["variety", "site"]),
    ("stem_and_leaf.py", 100, ["stem", "leaf"]),
    pytest.param("streamgraph.py", 1708, ["series", "sum_count"], marks=slow),
    ("top_k_items.py", 10, ["rank", "IMDB_Rating_start"]),
    ("top_k_letters.py", 9, ["rank", "letters"]),
    pytest.param("top_k_with_others.py", 10, ["ranked_director", "mean_aggregate_gross"], marks=slow),
    ("area_faceted.py", 492, ["date", "price"]),
    ("distributions_faceted_histogram.py", 20, ["Origin", "__count"]),
    ("us_population_over_time.py", 38, ["sex", "people_start"]),
    ("us_population_over_time_facet.py", 285, ["year", "sum_people"]),
    ("wilkinson-dot-plot.py", 21, ["data", "id"]),
    ("window_rank.py", 12, ["team", "diff"]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_primitive_chart_examples(filename, rows, cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())
  df = chart.transformed_data()

tests/test_transformed_data.py:82:


altair/vegalite/v5/api.py:4058: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(


self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'grid': False, 'labelFlush': True, 'labelOverlap':...: {'grid': False}, 'axisY': {'domain': False, 'offset': 10, 'ticks': False}, 'style': {'cell': {'stroke': None}}}, ...}
datasets = [('data_0', ())], local_tz = 'UTC', default_input_tz = None
row_limit = None, inline_datasets = {}, trim_unused_columns = False
dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )
  values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
=========================== short test summary info ============================
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[False-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[True-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

What would you like to happen instead?

To have tests green as 🥦

Which version of Altair are you using?

5.6.0dev

@mattijn
Contributor

mattijn commented Nov 24, 2024

Linking #3631 (comment). If I inspect https://github.com/vega/altair/actions/runs/11995228691/job/33438600170 it seems vegafusion is not being installed as a dependency.

@dangotbanned
Member

@mattijn
Contributor

mattijn commented Nov 24, 2024

Reproducible:

import altair as alt
from vega_datasets import data

source = data.disasters.url

chart = alt.Chart(source).transform_filter(
    alt.datum.Entity != 'All natural disasters'
).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1,
    strokeOpacity=0.4
).encode(
    x=alt.X('Year:T', title=None, scale=alt.Scale(domain=['1899','2018'])),
    y=alt.Y(
        'Entity:N',
        sort=alt.EncodingSortField(field="Deaths", op="sum", order='descending'),
        title=None
    ),
    size=alt.Size('Deaths:Q',
        scale=alt.Scale(range=[0, 2500]),
        legend=alt.Legend(title='Deaths', clipHeight=30, format='s')
    ),
    color=alt.Color('Entity:N', legend=None),
    tooltip=[
        "Entity:N", 
        alt.Tooltip("Year:T", format='%Y'), 
        alt.Tooltip("Deaths:Q", format='~s')
    ],
).properties(
    width=450,
    height=320,
    title=alt.Title(
        text="Global Deaths from Natural Disasters (1900-2017)",
        subtitle="The size of the bubble represents the total death count per year, by type of disaster",
        anchor='start'
    )
).configure_axisY(
    domain=False,
    ticks=False,
    offset=10
).configure_axisX(
    grid=False,
).configure_view(
    stroke=None
)
chart.transformed_data()
VegaFusionRuntime.pre_transform_datasets(self, spec, datasets, local_tz, default_input_tz, row_limit, inline_datasets, trim_unused_columns, dataset_format)
    527 # Serialize inline datasets
    528 inline_arrow_dataset = self._import_inline_datasets(
    529     inline_datasets,
    530     inline_dataset_usage=get_inline_column_usage(spec)
    531     if trim_unused_columns
    532     else None,
    533 )
--> 535 values, warnings = self.runtime.pre_transform_datasets(
    536     spec,
    537     pre_tx_vars,
    538     local_tz=local_tz,
    539     default_input_tz=default_input_tz,
    540     row_limit=row_limit,
    541     inline_datasets=inline_arrow_dataset,
    542 )
    544 def normalize_timezones(
    545     dfs: list[nw.DataFrame[IntoFrameT] | nw.LazyFrame[IntoFrameT]],
    546 ) -> list[DataFrameLike]:
    547     # Convert to `local_tz` (or, set to UTC and then convert if starting
    548     # from time-zone-naive data), then extract the native DataFrame to return.
    549     processed_datasets = []

ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

@dangotbanned
Member

dangotbanned commented Nov 24, 2024

#3701 (comment)

@mattijn does this work if you change to:

"Year:Q"

Or provide a format string somehow?

I'm not sure what mini-language DataFusion is using, but it could be https://docs.rs/chrono-wasi07/latest/chrono/format/strftime/index.html

Edit

Yeah it is chrono

https://github.com/apache/datafusion/blob/d9abdadda066808345b5d9f7ba234a51b8bb2d9c/Cargo.toml#L96

@jonmmease
Contributor

on my phone, but this should work if you provide a d3/vega time format string. I believe you can provide this in the chart constructor.

The reason it's happening is that I switched to using datafusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.

@dangotbanned
Member

dangotbanned commented Nov 24, 2024

on my phone, but this should work if you provide a d3/vega time format string. I believe you can provide this in the chart constructor.

The reason it's happening is that I switched to using datafusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.

If there is any possibility to solve this on the python-side, some bits that may be helpful

But I think since these examples are coming from a url, this route might not work out

Side note

@MarcoGorelli I feel like I remember reading some narwhals code recently that had some kind of universal parsing of strftime-like dialects.
Am I imagining this, or is there some function that I've been unable to find?

@mattijn
Contributor

mattijn commented Nov 24, 2024

For the record, it is not because of the altair v5.5.0 release, but because of the vegafusion v2.0.0 release, which came out almost simultaneously.

The issue can be reproduced with the following minimal specification:

import altair as alt
source = alt.Data(values=[{'Year': '1900'}])
chart = alt.Chart(source).mark_tick().encode(x='Year:T')
chart.transformed_data()

And when switching to 'Year:Q' the issue disappears.
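To see why a bare year trips over the format in the error message, here is a minimal sketch that uses Python's `datetime.strptime` as a stand-in for chrono. This is an assumption for illustration only: VegaFusion does not call Python here, and `try_parse` is a hypothetical helper. The directives `%B`, `%d`, `%Y`, `%H` and `%M` mean the same thing in both dialects.

```python
from datetime import datetime

# The format string reported in the DataFusion error message.
FULL_FORMAT = "%B %d, %Y %H:%M"


def try_parse(value: str, fmt: str):
    """Return a datetime if `value` matches `fmt`, otherwise None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# A bare year does not match the long format, so parsing fails.
print(try_parse("1900", FULL_FORMAT))

# The same string parses once the format is just "%Y".
print(try_parse("1900", "%Y"))
```

The second call succeeds because the format describes exactly what is in the string, which is the idea behind supplying an explicit format for the column.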

@MarcoGorelli
Contributor Author

@MarcoGorelli I feel like I remember reading some narwhals code recently that had some kind of universal parsing of strftime-like dialects.

nw.Expr.str.to_datetime and nw.Series.str.to_datetime can auto-infer some common formats, but not just a single year (but perhaps they should 🤔 currently Polars doesn't infer that one either)

@dangotbanned
Member

dangotbanned commented Nov 24, 2024

Working on a PR now #3702

@jonmmease
Contributor

jonmmease commented Nov 24, 2024

Yes, this is a VegaFusion 2 thing, not an Altair 5.5 thing. As I mentioned above:

The reason it's happening is that I switched to using DataFusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.

This can be addressed by specifying the date format for the column as %Y.

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

I'm torn on whether to make all date parsing a little less efficient by always checking for this as a special case (with a \d{4} regular expression) in VegaFusion. It's not a hard fix for our tests, and I can mark it prominently as a breaking change with instructions in the VegaFusion and Altair 5 changelogs. Open to suggestions.
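As a rough illustration of the special case being weighed here, the sketch below checks for a four-digit string before falling back to a list of formats. It is written in Python with the standard library only; the function name `parse_timestamp` and the fallback formats are hypothetical and are not VegaFusion's actual (Rust) implementation.

```python
import re
from datetime import datetime

# Matches a string made of exactly four digits, e.g. "1900".
YEAR_ONLY = re.compile(r"\d{4}")

# Hypothetical fallback formats, chosen only for this example.
FALLBACK_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")


def parse_timestamp(value: str) -> datetime:
    """Parse `value`, treating a bare four-digit string as a year."""
    # The extra check runs for every value, which is the cost being discussed.
    if YEAR_ONLY.fullmatch(value):
        # Note: datetime rejects year 0, so "0000" raises ValueError here.
        return datetime(int(value), 1, 1)
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"could not parse timestamp from {value!r}")


print(parse_timestamp("1900"))
print(parse_timestamp("2017-03-04"))
```

The trade-off is visible in the structure: every input pays for one regex match, even when the column never contains bare years.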

Separately, as we look at reworking the vega_datasets logic, I wonder if something like data.disasters.url could return alt.UrlData so that we have the ability to specify format strings like this.

@dangotbanned
Member

I'm torn on whether to make all date parsing a little less efficient by always checking for this as a special case (with a \d{4} regular expression) in VegaFusion. It's not a hard fix for our tests, and I can mark it prominently as a breaking change with instructions in the VegaFusion and Altair 5 changelogs. Open to suggestions.

@jonmmease IMO I don't think these should be interpreted as dates without a format specifier.

A 4 digit number (string?) is so ambiguous
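A small sketch of that ambiguity, using only Python's standard library (an illustration, not how Vega or VegaFusion decide): the same four characters are a valid year, a valid 24-hour clock time, and, read as a number, a count of milliseconds since the Unix epoch, which is how JavaScript's `Date` treats a bare number.

```python
from datetime import datetime, timezone

raw = "1900"

# Reading 1: a calendar year.
as_year = datetime.strptime(raw, "%Y")

# Reading 2: a 24-hour clock time (19:00).
as_clock = datetime.strptime(raw, "%H%M").time()

# Reading 3: a number of milliseconds since the Unix epoch.
as_epoch_ms = datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc)

print(as_year, as_clock, as_epoch_ms, sep="\n")
```

Without a format specifier, nothing in the data itself says which of these readings was intended.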

dangotbanned added a commit that referenced this issue Nov 24, 2024
Somewhat of a drive-by, but I fixed the `ruff` directives so they apply to only the parameterize blocks

All other changes are to resolve rule violations that were hidden

#3701
@joelostblom
Contributor

joelostblom commented Nov 24, 2024

I think it would be great if the default in altair was that four-digit integers would be understood as dates/years when using :T explicitly (as four-digit str are currently). I like that this is already the default behavior of VegaFusion as per vega/vegafusion#402 (comment), so that we don't need to use any of the workarounds in #3140. Possibly we will have this by default in Vega 6 vega/vega#1681.

@jonmmease
Contributor

Edit to the above. There was an error in VegaFusion 2.0.0 where it wasn't handling the custom date parse format. In 2.0.1, this URL format works:

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

Rather than xfail the tests, how would folks feel about updating the examples to provide an explicit date format like this?

@dangotbanned dangotbanned reopened this Nov 25, 2024
@dangotbanned
Member

dangotbanned commented Nov 25, 2024

Edit to the above. There was an error in VegaFusion 2.0.0 where it wasn't handling the custom date parse format. In 2.0.1, this URL format works:

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

Rather than xfail the tests, how would folks feel about updating the examples to provide an explicit date format like this?

@jonmmease I like the end result here, the main tweak I'd add is to use alt.CsvDataFormat instead of alt.DataFormat, since that hints at the parse parameter.

https://github.com/vega/altair/blob/922eac35da764d3faae928b870a2c70d0cf1435b/altair/vegalite/v5/schema/core.py#L5513-L5559

Side note

This does make me wonder if there would be any value in being able to specify this in a less verbose way.
E.g. maybe for altair.datasets.url we'd want an option that doesn't require as much nesting
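One possible shape for such a shortcut, sketched with plain dicts so it stays independent of any Altair API: `url_data` is a hypothetical helper (not an existing Altair or vega_datasets function), and `"disasters.csv"` is a placeholder path. It only shows how keyword arguments could be folded into the nested `format.parse` mapping used in the examples above.

```python
import json


def url_data(url: str, **parse: str) -> dict:
    """Hypothetical helper: build a URL data source with optional parse hints.

    Each keyword maps a column name to a parse directive such as "date:%Y".
    """
    data: dict = {"url": url}
    if parse:
        # Nest the hints under format.parse, as in the examples above.
        data["format"] = {"parse": dict(parse)}
    return data


# Placeholder URL; a real call would pass the dataset's actual location.
source = url_data("disasters.csv", Year="date:%Y")
print(json.dumps(source, indent=2))
```

Whether a flat keyword interface like this is preferable to the explicit nested classes is a design question for the maintainers; the sketch only makes the reduced nesting concrete.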


5 participants