Test suite is failing with VegaFusion 2 #3701

MarcoGorelli · 2024-11-24T10:34:26Z

What happened?

Running the test suite with the latest versions of all dependencies results in:

=========================== short test summary info ============================
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[False-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[True-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
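
All four failures share one symptom: a bare year string ('1900' or '1933') is parsed against the full date-time format '%B %d, %Y %H:%M'. As a rough illustration only, the sketch below reproduces that kind of mismatch with Python's standard `datetime.strptime`. This is a stand-in: the real error is raised by DataFusion's own parser inside VegaFusion, which may behave differently in detail, and the helper name `try_parse` is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Format string copied from the error message above.
FMT = "%B %d, %Y %H:%M"


def try_parse(value: str, fmt: str = FMT) -> Optional[datetime]:
    """Hypothetical helper: return the parsed datetime, or None on a mismatch."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# A bare year does not match the "Month day, Year hour:minute" layout.
print(try_parse("1900"))
# The same string parses once the format only asks for a year.
print(try_parse("1900", "%Y"))
# A string that follows the full layout parses as well.
print(try_parse("January 01, 1900 00:00"))
```

This only shows that the format and the input disagree; it does not say where VegaFusion 2 picks that format.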

Full logs:

============================= test session starts ==============================
platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/runner/work/narwhals/narwhals/altair
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, cov-6.0.0, xdist-3.6.1
created: 4/4 workers
4 workers [1738 items]

........................................................................ [ 4%]
........................................................................ [ 8%]
........................................................................ [ 12%]
........................................................................ [ 16%]
........................................................................ [ 20%]
........................................................................ [ 24%]
........................................................................ [ 28%]
........................................................................ [ 33%]
........................................................................ [ 37%]
........................................................................ [ 41%]
........................................................................ [ 45%]
........................................................................ [ 49%]
........................................................................ [ 53%]
........................................................................ [ 57%]
..........................................................F............. [ 62%]
.........................................F.............................. [ 66%]
..s.ssss...F............................................................ [ 70%]
.....F................X................................................. [ 74%]
........................................................................ [ 78%]
.X....X................................................................. [ 82%]
........................................................................ [ 86%]
........................................................................ [ 91%]
..............X......................................................... [ 95%]
.......................X.........x...................................... [ 99%]
.......... [100%]
=================================== FAILURES ===================================
_____ test_primitive_chart_examples[False-natural_disasters.py-686-cols28] _____
[gw2] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'natural_disasters.py', rows = 686, cols = ['Deaths', 'Year']
to_reconstruct = False

@ignore_DataFrameGroupBy
@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,rows,cols", [
    ("annual_weather_heatmap.py", 366, ["monthdate_date_end", "max_temp_max"]),
    ("anscombe_plot.py", 44, ["Series", "X", "Y"]),
    ("bar_chart_sorted.py", 6, ["site", "sum_yield"]),
    ("bar_chart_faceted_compact.py", 27, ["p", "p_end"]),
    ("beckers_barley_facet.py", 120, ["year", "site"]),
    ("beckers_barley_wrapped_facet.py", 120, ["site", "median_yield"]),
    ("bump_chart.py", 96, ["rank", "yearmonth_date"]),
    ("comet_chart.py", 120, ["variety", "delta"]),
    ("diverging_stacked_bar_chart.py", 40, ["value", "percentage_start"]),
    ("donut_chart.py", 6, ["value_start", "value_end"]),
    ("gapminder_bubble_plot.py", 187, ["income", "population"]),
    ("grouped_bar_chart2.py", 9, ["Group", "Value_start"]),
    ("hexbins.py", 84, ["xFeaturePos", "mean_temp_max"]),
    pytest.param("histogram_heatmap.py", 378, ["bin_maxbins_40_Rotten_Tomatoes_Rating", "__count"], marks=slow),
    ("histogram_scatterplot.py", 64, ["bin_maxbins_10_Rotten_Tomatoes_Rating", "__count"]),
    pytest.param("interactive_legend.py", 1708, ["sum_count_start", "series"], marks=slow),
    ("iowa_electricity.py", 51, ["net_generation_start", "year"]),
    ("isotype.py", 37, ["animal", "x"]),
    ("isotype_grid.py", 100, ["row", "col"]),
    ("lasagna_plot.py", 492, ["yearmonthdate_date", "sum_price"]),
    ("layered_area_chart.py", 51, ["source", "net_generation"]),
    ("layered_bar_chart.py", 51, ["source", "net_generation"]),
    ("layered_histogram.py", 113, ["bin_maxbins_100_Measurement"]),
    ("line_chart_with_cumsum.py", 52, ["cumulative_wheat"]),
    ("line_custom_order.py", 55, ["miles", "gas"]),
    pytest.param("line_percent.py", 30, ["sex", "perc"], marks=slow),
    ("line_with_log_scale.py", 15, ["year", "sum_people"]),
    ("multifeature_scatter_plot.py", 150, ["petalWidth", "species"]),
    ("natural_disasters.py", 686, ["Deaths", "Year"]),
    ("normalized_stacked_area_chart.py", 51, ["source", "net_generation_start"]),
    ("normalized_stacked_bar_chart.py", 60, ["site", "sum_yield_start"]),
    ("parallel_coordinates.py", 600, ["key", "value"]),
    ("percentage_of_total.py", 5, ["PercentOfTotal", "TotalTime"]),
    ("pie_chart.py", 6, ["category", "value_start"]),
    ("pyramid.py", 3, ["category", "value_start"]),
    ("stacked_bar_chart_sorted_segments.py", 60, ["variety", "site"]),
    ("stem_and_leaf.py", 100, ["stem", "leaf"]),
    pytest.param("streamgraph.py", 1708, ["series", "sum_count"], marks=slow),
    ("top_k_items.py", 10, ["rank", "IMDB_Rating_start"]),
    ("top_k_letters.py", 9, ["rank", "letters"]),
    pytest.param("top_k_with_others.py", 10, ["ranked_director", "mean_aggregate_gross"], marks=slow),
    ("area_faceted.py", 492, ["date", "price"]),
    ("distributions_faceted_histogram.py", 20, ["Origin", "__count"]),
    ("us_population_over_time.py", 38, ["sex", "people_start"]),
    ("us_population_over_time_facet.py", 285, ["year", "sum_people"]),
    ("wilkinson-dot-plot.py", 21, ["data", "id"]),
    ("window_rank.py", 12, ["team", "diff"]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_primitive_chart_examples(filename, rows, cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    df = chart.transformed_data()

tests/test_transformed_data.py:82:

altair/vegalite/v5/api.py:4058: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(

self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'grid': False, 'labelFlush': True, 'labelOverlap':...: {'grid': False}, 'axisY': {'domain': False, 'offset': 10, 'ticks': False}, 'style': {'cell': {'stroke': None}}}, ...}
datasets = [('data_0', ())], local_tz = 'UTC', default_input_tz = None
row_limit = None, inline_datasets = {}, trim_unused_columns = False
dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )

    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] ______
[gw2] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'falkensee.py', all_rows = [2, 38, 38]
all_cols = [['event'], ['population'], ['population']], to_reconstruct = True

@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,all_rows,all_cols", [
    ("errorbars_with_std.py", [10, 10], [["upper_yield"], ["extent_yield"]]),
    ("candlestick_chart.py", [44, 44], [["low"], ["close"]]),
    ("co2_concentration.py", [713, 7, 7], [["first_date"], ["scaled_date"], ["end"]]),
    ("falkensee.py", [2, 38, 38], [["event"], ["population"], ["population"]]),
    ("heat_lane.py", [10, 10], [["bin_count_start"], ["y2"]]),
    ("histogram_responsive.py", [20, 20], [["__count"], ["__count"]]),
    ("histogram_with_a_global_mean_overlay.py", [9, 1], [["__count"], ["mean_IMDB_Rating"]]),
    ("horizon_graph.py", [20, 20], [["x"], ["ny"]]),
    pytest.param("interactive_cross_highlight.py", [64, 64, 13], [["__count"], ["__count"], ["Major_Genre"]], marks=slow),
    ("interval_selection.py", [123, 123], [["price_start"], ["date"]]),
    ("layered_chart_with_dual_axis.py", [12, 12], [["month_date"], ["average_precipitation"]]),
    ("layered_heatmap_text.py", [9, 9], [["Cylinders"], ["mean_horsepower"]]),
    ("multiline_highlight.py", [560, 560], [["price"], ["date"]]),
    ("multiline_tooltip.py", [300, 300, 300, 0, 300], [["x"], ["y"], ["y"], ["x"], ["x"]]),
    ("pie_chart_with_labels.py", [6, 6], [["category"], ["value"]]),
    ("radial_chart.py", [6, 6], [["values"], ["values_start"]]),
    ("scatter_linked_table.py", [392, 14, 14, 14], [["Year"], ["Year"], ["Year"], ["Year"]]),
    ("scatter_marginal_hist.py", [34, 150, 27], [["__count"], ["species"], ["__count"]]),
    pytest.param(
        "scatter_with_layered_histogram.py",
        [2, 19],
        [["gender"], ["__count"]],
        marks=(slow, pytest.mark.xfail(
            XDIST_ENABLED,
            reason="Possibly `numpy` conflict with `xdist`.\n"
            "Very intermittent, but only affects `to_reconstruct=False`."
        )),
    ),
    ("scatter_with_minimap.py", [1461, 1461], [["date"], ["date"]]),
    ("scatter_with_rolling_mean.py", [1461, 1461], [["date"], ["rolling_mean"]]),
    ("seattle_weather_interactive.py", [1461, 5], [["date"], ["__count"]]),
    ("select_detail.py", [20, 1000], [["id"], ["x"]]),
    ("simple_scatter_with_errorbars.py", [5, 5], [["x"], ["upper_ymin"]]),
    ("stacked_bar_chart_with_text.py", [60, 60], [["site"], ["site"]]),
    ("us_employment.py", [120, 1, 2], [["month"], ["president"], ["president"]]),
    ("us_population_pyramid_over_time.py", [19, 38, 19], [["gender"], ["year"], ["gender"]]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_compound_chart_examples(filename, all_rows, all_cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    assert isinstance(chart, (alt.LayerChart, alt.ConcatChart, alt.HConcatChart, alt.VConcatChart))

    dfs = chart.transformed_data()

tests/test_transformed_data.py:142:

altair/vegalite/v5/api.py:4688: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(

self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'aria': False, 'domain': False, 'grid': True, 'gri...Finite(+datum["year"]))) && isValid(datum["population"]) && isFinite(+datum["population"])', 'type': 'filter'}]}], ...}
datasets = [('data_0', ()), ('data_1', ()), ('data_2', ())], local_tz = 'UTC'
default_input_tz = None, row_limit = None, inline_datasets = {}
trim_unused_columns = False, dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )

    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] _____
[gw3] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'falkensee.py', all_rows = [2, 38, 38]
all_cols = [['event'], ['population'], ['population']], to_reconstruct = False

@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,all_rows,all_cols", [
    ("errorbars_with_std.py", [10, 10], [["upper_yield"], ["extent_yield"]]),
    ("candlestick_chart.py", [44, 44], [["low"], ["close"]]),
    ("co2_concentration.py", [713, 7, 7], [["first_date"], ["scaled_date"], ["end"]]),
    ("falkensee.py", [2, 38, 38], [["event"], ["population"], ["population"]]),
    ("heat_lane.py", [10, 10], [["bin_count_start"], ["y2"]]),
    ("histogram_responsive.py", [20, 20], [["__count"], ["__count"]]),
    ("histogram_with_a_global_mean_overlay.py", [9, 1], [["__count"], ["mean_IMDB_Rating"]]),
    ("horizon_graph.py", [20, 20], [["x"], ["ny"]]),
    pytest.param("interactive_cross_highlight.py", [64, 64, 13], [["__count"], ["__count"], ["Major_Genre"]], marks=slow),
    ("interval_selection.py", [123, 123], [["price_start"], ["date"]]),
    ("layered_chart_with_dual_axis.py", [12, 12], [["month_date"], ["average_precipitation"]]),
    ("layered_heatmap_text.py", [9, 9], [["Cylinders"], ["mean_horsepower"]]),
    ("multiline_highlight.py", [560, 560], [["price"], ["date"]]),
    ("multiline_tooltip.py", [300, 300, 300, 0, 300], [["x"], ["y"], ["y"], ["x"], ["x"]]),
    ("pie_chart_with_labels.py", [6, 6], [["category"], ["value"]]),
    ("radial_chart.py", [6, 6], [["values"], ["values_start"]]),
    ("scatter_linked_table.py", [392, 14, 14, 14], [["Year"], ["Year"], ["Year"], ["Year"]]),
    ("scatter_marginal_hist.py", [34, 150, 27], [["__count"], ["species"], ["__count"]]),
    pytest.param(
        "scatter_with_layered_histogram.py",
        [2, 19],
        [["gender"], ["__count"]],
        marks=(slow, pytest.mark.xfail(
            XDIST_ENABLED,
            reason="Possibly `numpy` conflict with `xdist`.\n"
            "Very intermittent, but only affects `to_reconstruct=False`."
        )),
    ),
    ("scatter_with_minimap.py", [1461, 1461], [["date"], ["date"]]),
    ("scatter_with_rolling_mean.py", [1461, 1461], [["date"], ["rolling_mean"]]),
    ("seattle_weather_interactive.py", [1461, 5], [["date"], ["__count"]]),
    ("select_detail.py", [20, 1000], [["id"], ["x"]]),
    ("simple_scatter_with_errorbars.py", [5, 5], [["x"], ["upper_ymin"]]),
    ("stacked_bar_chart_with_text.py", [60, 60], [["site"], ["site"]]),
    ("us_employment.py", [120, 1, 2], [["month"], ["president"], ["president"]]),
    ("us_population_pyramid_over_time.py", [19, 38, 19], [["gender"], ["year"], ["gender"]]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_compound_chart_examples(filename, all_rows, all_cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    assert isinstance(chart, (alt.LayerChart, alt.ConcatChart, alt.HConcatChart, alt.VConcatChart))

    dfs = chart.transformed_data()

tests/test_transformed_data.py:142:

altair/vegalite/v5/api.py:4688: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(

self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'aria': False, 'domain': False, 'grid': True, 'gri...m["population"])', 'type': 'filter'}], 'url': 'vegafusion+dataset://table_7c2f0057_a249_4dc8_8e0c_bd46ba873edb'}], ...}
datasets = [('source_0', ()), ('source_1', ()), ('source_2', ())]
local_tz = 'UTC', default_input_tz = None, row_limit = None
inline_datasets = {'table_0d3e98ef_595c_466b_8231_3067287bafae': start end event
0 1933 1945 Nazi Rule
1 ...9 40179
33 2010 40511
34 2011 40465
35 2012 40905
36 2013 41258
37 2014 41777}
trim_unused_columns = False, dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )

    values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
_____ test_primitive_chart_examples[True-natural_disasters.py-686-cols28] ______
[gw0] linux -- Python 3.12.7 /opt/hostedtoolcache/Python/3.12.7/x64/bin/python

filename = 'natural_disasters.py', rows = 686, cols = ['Deaths', 'Year']
to_reconstruct = True

@ignore_DataFrameGroupBy
@pytest.mark.skipif(vf is None, reason="vegafusion not installed")
# fmt: off
@pytest.mark.parametrize("filename,rows,cols", [
    ("annual_weather_heatmap.py", 366, ["monthdate_date_end", "max_temp_max"]),
    ("anscombe_plot.py", 44, ["Series", "X", "Y"]),
    ("bar_chart_sorted.py", 6, ["site", "sum_yield"]),
    ("bar_chart_faceted_compact.py", 27, ["p", "p_end"]),
    ("beckers_barley_facet.py", 120, ["year", "site"]),
    ("beckers_barley_wrapped_facet.py", 120, ["site", "median_yield"]),
    ("bump_chart.py", 96, ["rank", "yearmonth_date"]),
    ("comet_chart.py", 120, ["variety", "delta"]),
    ("diverging_stacked_bar_chart.py", 40, ["value", "percentage_start"]),
    ("donut_chart.py", 6, ["value_start", "value_end"]),
    ("gapminder_bubble_plot.py", 187, ["income", "population"]),
    ("grouped_bar_chart2.py", 9, ["Group", "Value_start"]),
    ("hexbins.py", 84, ["xFeaturePos", "mean_temp_max"]),
    pytest.param("histogram_heatmap.py", 378, ["bin_maxbins_40_Rotten_Tomatoes_Rating", "__count"], marks=slow),
    ("histogram_scatterplot.py", 64, ["bin_maxbins_10_Rotten_Tomatoes_Rating", "__count"]),
    pytest.param("interactive_legend.py", 1708, ["sum_count_start", "series"], marks=slow),
    ("iowa_electricity.py", 51, ["net_generation_start", "year"]),
    ("isotype.py", 37, ["animal", "x"]),
    ("isotype_grid.py", 100, ["row", "col"]),
    ("lasagna_plot.py", 492, ["yearmonthdate_date", "sum_price"]),
    ("layered_area_chart.py", 51, ["source", "net_generation"]),
    ("layered_bar_chart.py", 51, ["source", "net_generation"]),
    ("layered_histogram.py", 113, ["bin_maxbins_100_Measurement"]),
    ("line_chart_with_cumsum.py", 52, ["cumulative_wheat"]),
    ("line_custom_order.py", 55, ["miles", "gas"]),
    pytest.param("line_percent.py", 30, ["sex", "perc"], marks=slow),
    ("line_with_log_scale.py", 15, ["year", "sum_people"]),
    ("multifeature_scatter_plot.py", 150, ["petalWidth", "species"]),
    ("natural_disasters.py", 686, ["Deaths", "Year"]),
    ("normalized_stacked_area_chart.py", 51, ["source", "net_generation_start"]),
    ("normalized_stacked_bar_chart.py", 60, ["site", "sum_yield_start"]),
    ("parallel_coordinates.py", 600, ["key", "value"]),
    ("percentage_of_total.py", 5, ["PercentOfTotal", "TotalTime"]),
    ("pie_chart.py", 6, ["category", "value_start"]),
    ("pyramid.py", 3, ["category", "value_start"]),
    ("stacked_bar_chart_sorted_segments.py", 60, ["variety", "site"]),
    ("stem_and_leaf.py", 100, ["stem", "leaf"]),
    pytest.param("streamgraph.py", 1708, ["series", "sum_count"], marks=slow),
    ("top_k_items.py", 10, ["rank", "IMDB_Rating_start"]),
    ("top_k_letters.py", 9, ["rank", "letters"]),
    pytest.param("top_k_with_others.py", 10, ["ranked_director", "mean_aggregate_gross"], marks=slow),
    ("area_faceted.py", 492, ["date", "price"]),
    ("distributions_faceted_histogram.py", 20, ["Origin", "__count"]),
    ("us_population_over_time.py", 38, ["sex", "people_start"]),
    ("us_population_over_time_facet.py", 285, ["year", "sum_people"]),
    ("wilkinson-dot-plot.py", 21, ["data", "id"]),
    ("window_rank.py", 12, ["team", "diff"]),
])
# fmt: on
@pytest.mark.parametrize("to_reconstruct", [True, False])
def test_primitive_chart_examples(filename, rows, cols, to_reconstruct):
    source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
    chart = eval_block(source, strict=True)
    if to_reconstruct:
        # When reconstructing a Chart, Altair uses different classes
        # then what might have been originally used. See
        # https://github.com/hex-inc/vegafusion/issues/354 for more info.
        chart = alt.Chart.from_dict(chart.to_dict())

    df = chart.transformed_data()

tests/test_transformed_data.py:82:

altair/vegalite/v5/api.py:4058: in transformed_data
return transformed_data(self, row_limit=row_limit, exclude=exclude)
altair/utils/_transformed_data.py:138: in transformed_data
datasets, _ = vf.runtime.pre_transform_datasets(

self = VegaFusionRuntime(cache_capacity=64, worker_threads=4)
spec = {'$schema': 'https://vega.github.io/schema/vega/v5.json', 'axes': [{'grid': False, 'labelFlush': True, 'labelOverlap':...: {'grid': False}, 'axisY': {'domain': False, 'offset': 10, 'ticks': False}, 'style': {'cell': {'stroke': None}}}, ...}
datasets = [('data_0', ())], local_tz = 'UTC', default_input_tz = None
row_limit = None, inline_datasets = {}, trim_unused_columns = False
dataset_format = 'auto'

def pre_transform_datasets(
    self,
    spec: Union[dict[str, Any], str],
    datasets: list[Union[str, tuple[str, list[int]]]],
    local_tz: str | None = None,
    default_input_tz: str | None = None,
    row_limit: int | None = None,
    inline_datasets: dict[str, DataFrameLike] | None = None,
    trim_unused_columns: bool = False,
    dataset_format: DatasetFormat = "auto",
) -> tuple[list[DataFrameLike], list[PreTransformWarning]]:
    """
    Extract the fully evaluated form of the requested datasets from a Vega
    specification.

    Args:
        spec: A Vega specification dict or JSON string.
        datasets: A list with elements that are either:

            * The name of a top-level dataset as a string
            * A two-element tuple where the first element is the name of a dataset
              as a string and the second element is the nested scope of the dataset
              as a list of integers
        local_tz: Name of timezone to be considered local. E.g.
            ``'America/New_York'``. Defaults to the value of vf.get_local_tz(),
            which defaults to the system timezone if one can be determined.
        default_input_tz: Name of timezone (e.g. ``'America/New_York'``) that naive
            datetime strings should be interpreted in. Defaults to ``local_tz``.
        row_limit: Maximum number of dataset rows to include in the returned
            datasets. If exceeded, datasets will be truncated to this number of
            rows and a RowLimitExceeded warning will be included in the resulting
            warnings list.
        inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification
            using the following url syntax 'vegafusion+dataset://{dataset_name}'
            or 'table://{dataset_name}'.
        trim_unused_columns: If True, unused columns are removed from returned
            datasets.
        dataset_format: Format for returned datasets. One of:

            * ``"auto"``: (default) Infer the result type based on the types of
              inline datasets. If no inline datasets are provided, return type will
              depend on installed packages.
            * ``"polars"``: polars.DataFrame
            * ``"pandas"``: pandas.DataFrame
            * ``"pyarrow"``: pyarrow.Table
            * ``"arro3"``: arro3.Table

    Returns:
        tuple[list[DataFrameLike], list[PreTransformWarning]]:
        Two-element tuple of

        * List of pandas DataFrames corresponding to the input datasets list
        * A list of warnings as dictionaries. Each warning dict has a 'type'
          key indicating the warning type, and a 'message' key containing a
          description of the warning.
    """
    local_tz = local_tz or get_local_tz()

    # Build input variables
    pre_tx_vars = parse_variables(datasets)

    # Serialize inline datasets
    inline_arrow_dataset = self._import_inline_datasets(
        inline_datasets,
        inline_dataset_usage=get_inline_column_usage(spec)
        if trim_unused_columns
        else None,
    )

>   values, warnings = self.runtime.pre_transform_datasets(
        spec,
        pre_tx_vars,
        local_tz=local_tz,
        default_input_tz=default_input_tz,
        row_limit=row_limit,
        inline_datasets=inline_arrow_dataset,
    )

E ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/vegafusion/runtime.py:535: ValueError
=========================== short test summary info ============================
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[False-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[True-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_compound_chart_examples[False-falkensee.py-all_rows3-all_cols3] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1933' using format '%B %d, %Y %H:%M': input contains invalid characters
FAILED tests/test_transformed_data.py::test_primitive_chart_examples[True-natural_disasters.py-686-cols28] - ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

What would you like to happen instead?

To have tests green as 🥦

Which version of Altair are you using?

5.6.0dev

mattijn · 2024-11-24T10:39:05Z

Linking #3631 (comment). If I inspect https://github.com/vega/altair/actions/runs/11995228691/job/33438600170 it seems vegafusion is not being installed as a dependency.

dangotbanned · 2024-11-24T10:47:54Z

Is this related?

ValueError: DataFusion error: Internal error: Failed to parse -07:00 as a timezone. vegafusion#549

mattijn · 2024-11-24T11:09:42Z

Reproducible:

import altair as alt
from vega_datasets import data

source = data.disasters.url

chart = alt.Chart(source).transform_filter(
    alt.datum.Entity != 'All natural disasters'
).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1,
    strokeOpacity=0.4
).encode(
    x=alt.X('Year:T', title=None, scale=alt.Scale(domain=['1899','2018'])),
    y=alt.Y(
        'Entity:N',
        sort=alt.EncodingSortField(field="Deaths", op="sum", order='descending'),
        title=None
    ),
    size=alt.Size('Deaths:Q',
        scale=alt.Scale(range=[0, 2500]),
        legend=alt.Legend(title='Deaths', clipHeight=30, format='s')
    ),
    color=alt.Color('Entity:N', legend=None),
    tooltip=[
        "Entity:N", 
        alt.Tooltip("Year:T", format='%Y'), 
        alt.Tooltip("Deaths:Q", format='~s')
    ],
).properties(
    width=450,
    height=320,
    title=alt.Title(
        text="Global Deaths from Natural Disasters (1900-2017)",
        subtitle="The size of the bubble represents the total death count per year, by type of disaster",
        anchor='start'
    )
).configure_axisY(
    domain=False,
    ticks=False,
    offset=10
).configure_axisX(
    grid=False,
).configure_view(
    stroke=None
)
chart.transformed_data()

VegaFusionRuntime.pre_transform_datasets(self, spec, datasets, local_tz, default_input_tz, row_limit, inline_datasets, trim_unused_columns, dataset_format)
    527 # Serialize inline datasets
    528 inline_arrow_dataset = self._import_inline_datasets(
    529     inline_datasets,
    530     inline_dataset_usage=get_inline_column_usage(spec)
    531     if trim_unused_columns
    532     else None,
    533 )
--> 535 values, warnings = self.runtime.pre_transform_datasets(
    536     spec,
    537     pre_tx_vars,
    538     local_tz=local_tz,
    539     default_input_tz=default_input_tz,
    540     row_limit=row_limit,
    541     inline_datasets=inline_arrow_dataset,
    542 )
    544 def normalize_timezones(
    545     dfs: list[nw.DataFrame[IntoFrameT] | nw.LazyFrame[IntoFrameT]],
    546 ) -> list[DataFrameLike]:
    547     # Convert to `local_tz` (or, set to UTC and then convert if starting
    548     # from time-zone-naive data), then extract the native DataFrame to return.
    549     processed_datasets = []

ValueError: DataFusion error: Execution error: Error parsing timestamp from '1900' using format '%B %d, %Y %H:%M': input contains invalid characters

dangotbanned · 2024-11-24T11:20:23Z

#3701 (comment)

@mattijn does this work if you change to:

"Year:Q"

Or provide a format string somehow?

I'm not sure what mini-language DataFusion is using, but it could be https://docs.rs/chrono-wasi07/latest/chrono/format/strftime/index.html

Edit

Yeah it is chrono

https://github.com/apache/datafusion/blob/d9abdadda066808345b5d9f7ba234a51b8bb2d9c/Cargo.toml#L96

jonmmease · 2024-11-24T11:48:49Z

on my phone, but this should work if you provide a d3/vega time format string. I believe you can provide this in the chart constructor.

The reason it's happening is that I switched to using datafusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.
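
The failure mode can be illustrated outside VegaFusion. The following is a minimal sketch that uses Python's standard-library `datetime.strptime` as a stand-in for chrono. That substitution is an assumption made for illustration only: the two parsers are not identical, but both reject a bare year when asked to match a full date-time format. The helper name `try_parse` is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Format string copied from the error message in this issue.
FULL_FORMAT = "%B %d, %Y %H:%M"


def try_parse(value: str, fmt: str) -> Optional[datetime]:
    """Return the parsed datetime, or None if `value` does not match `fmt`."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# A bare year does not match the full date-time format...
bare_year_full_format = try_parse("1900", FULL_FORMAT)
# ...but it parses once the format is stated explicitly.
bare_year_explicit = try_parse("1900", "%Y")
# A string that really is in the full format parses as expected.
full_string = try_parse("January 01, 1900 00:00", FULL_FORMAT)

print(bare_year_full_format, bare_year_explicit, full_string)
```

The point of the sketch is only that a strict format-driven parser has no way to guess that `'1900'` means a year unless it is told so.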

dangotbanned · 2024-11-24T12:00:42Z

on my phone, but this should work if you provide a d3/vega time format string. I believe you can provide this in the chart constructor.

The reason it's happening is that I switched to using datafusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.

If there is any possibility to solve this on the python-side, some bits that may be helpful

But I think since these examples are coming from a URL, this route might not work out

Side note

@MarcoGorelli I feel like I remember reading some narwhals code recently that had some kind of universal parsing of strftime-like dialects.
Am I imagining this, or is there some function that I've been unable to find?

mattijn · 2024-11-24T12:20:21Z

For the record, it is not because of the Altair v5.5.0 release, but because of the VegaFusion v2.0.0 release, which was published nearly simultaneously.

The issue can be reproduced with the following minimal specification:

import altair as alt
source = alt.Data(values=[{'Year': '1900'}])
chart = alt.Chart(source).mark_tick().encode(x='Year:T')
chart.transformed_data()

And when switching to 'Year:Q' the issue disappears.

MarcoGorelli · 2024-11-24T12:20:38Z

@MarcoGorelli I feel like I remember reading some narwhals code recently that had some kind of universal parsing of strftime-like dialects.

nw.Expr.str.to_datetime and nw.Series.str.to_datetime can auto-infer some common formats, but not just a single year (but perhaps they should 🤔 currently Polars doesn't infer that one either)

dangotbanned · 2024-11-24T12:26:29Z

Working on a PR now #3702

jonmmease · 2024-11-24T12:26:32Z

Yes, this is a VegaFusion 2 thing, not an Altair 5.5 thing. As I mentioned above:

The reason it's happening is that I switched to using DataFusion's native time parsing logic for simplicity and performance, and I wasn't able to get chrono to automatically parse just the year as a date. I'll have time later today to look at this if still needed.

This can be addressed by specifying the date format for the column as %Y.

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

I'm torn on whether to make all date parsing a little less efficient by always checking for this as a special case (with a \d{4} regular expression) in VegaFusion. It's not a hard fix for our tests, and I can mark it prominently as a breaking change with instructions in the VegaFusion and Altair 5 changelogs. Open to suggestions.
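
For discussion, the special case described above could look roughly like the sketch below. It is written in plain Python with the standard library, not in VegaFusion's Rust code, so it is an analogy rather than the actual implementation. The function name `parse_with_year_special_case` and the fallback formats in `DEFAULT_FORMATS` are hypothetical.

```python
import re
from datetime import MINYEAR, datetime
from typing import Optional, Sequence

# Matches exactly four ASCII digits, e.g. "1900".
YEAR_ONLY = re.compile(r"\d{4}", re.ASCII)

# Hypothetical fallback formats, chosen for illustration only.
DEFAULT_FORMATS = ("%Y-%m-%d", "%B %d, %Y %H:%M")


def parse_with_year_special_case(
    value: str, formats: Sequence[str] = DEFAULT_FORMATS
) -> Optional[datetime]:
    """Parse `value`, treating a bare four-digit string as January 1 of that year."""
    # This check runs for every value, which is the extra cost being weighed.
    if YEAR_ONLY.fullmatch(value):
        year = int(value)
        # "0000" is four digits but not a year that datetime can represent.
        return datetime(year, 1, 1) if year >= MINYEAR else None
    # Otherwise fall back to the regular format-driven parsing.
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


print(parse_with_year_special_case("1900"))
print(parse_with_year_special_case("2018-03-04"))
```

The trade-off is visible in the structure: every input pays for the regular-expression check before the normal parsing path is tried.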

Separately, as we look at reworking the vega_datasets logic, I wonder if something like data.disasters.url could return alt.UrlData so that we have the ability to specify format strings like this.

dangotbanned · 2024-11-24T12:42:23Z

I'm torn on whether to make all date parsing a little less efficient by always checking for this as a special case (with a \d{4} regular expression) in VegaFusion. It's not a hard fix for our tests, and I can mark it prominently as a breaking change with instructions in the VegaFusion and Altair 5 changelogs. Open to suggestions.

@jonmmease IMO I don't think these should be interpreted as dates without a format specifier.

A 4 digit number (string?) is so ambiguous
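
A small sketch of that ambiguity, using only Python's standard library (an illustration, not VegaFusion or chrono behaviour): the same four-character string from the failing falkensee example is valid as a year, as a time of day, and as a plain number, depending on what the reader assumes.

```python
from datetime import datetime, time

value = "1933"

# Read as a calendar year.
as_year = datetime.strptime(value, "%Y")
# Read as hours and minutes on a 24-hour clock.
as_time = datetime.strptime(value, "%H%M").time()
# Read as a plain integer.
as_number = int(value)

print(as_year, as_time, as_number)
```

Without a format specifier, nothing in the value itself says which of these readings is intended.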

joelostblom · 2024-11-24T16:37:26Z

I think it would be great if the default in Altair was that four-digit integers would be understood as dates/years when using :T explicitly (as four-digit strings currently are). I like that this is already the default behavior of VegaFusion as per vega/vegafusion#402 (comment), so that we don't need to use any of the workarounds in #3140. Possibly we will have this by default in Vega 6 vega/vega#1681.

jonmmease · 2024-11-25T13:24:58Z

Edit to the above. There was an error in VegaFusion 2.0.0 where it wasn't handling the custom date parse format. In 2.0.1, this URL format works:

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

Rather than xfail the tests, how would folks feel about updating the examples to provide an explicit date format like this?

dangotbanned · 2024-11-25T13:59:01Z

Edit to the above. There was an error in VegaFusion 2.0.0 where it wasn't handling the custom date parse format. In 2.0.1, this URL format works:

source = alt.UrlData(
    data.disasters.url,
    format=alt.DataFormat(parse={"Year": "date:%Y"})
)

Rather than xfail the tests, how would folks feel about updating the examples to provide an explicit date format like this?

@jonmmease I like the end result here. The main tweak I'd add is to use alt.CsvDataFormat instead of alt.DataFormat, since that hints at the parse parameter.

https://github.com/vega/altair/blob/922eac35da764d3faae928b870a2c70d0cf1435b/altair/vegalite/v5/schema/core.py#L5513-L5559

Side note

This does make me wonder if there would be any value in being able to specify this in a less verbose way.
E.g. maybe for altair.datasets.url we'd want an option that doesn't require as much nesting
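
As a rough idea of what a flatter option could look like, here is a sketch of a purely hypothetical helper, `url_data_spec`, that takes column-to-format pairs as keyword arguments and builds a plain dict. The dict layout is assumed to mirror what the nested alt.UrlData / alt.DataFormat call above serializes to; nothing like this helper exists in Altair today, and the URL is a placeholder.

```python
from typing import Any


def url_data_spec(url: str, **date_formats: str) -> dict[str, Any]:
    """Build a Vega-Lite style data dict with per-column date formats.

    Hypothetical helper: each keyword argument maps a column name to a
    d3/vega time format string, e.g. Year="%Y".
    """
    parse = {column: f"date:{fmt}" for column, fmt in date_formats.items()}
    return {"url": url, "format": {"parse": parse}}


# Placeholder URL, used only to show the call shape.
spec = url_data_spec("https://example.com/disasters.csv", Year="%Y")
print(spec)
```

The design choice here is to keep the format strings visible at the call site while hiding the two levels of nesting.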

MarcoGorelli added the bug label Nov 24, 2024

dangotbanned added the help wanted label Nov 24, 2024

dangotbanned assigned jonmmease Nov 24, 2024

dangotbanned mentioned this issue Nov 24, 2024

ci: Temporal examples in vegafusion #3702

Merged

dangotbanned added a commit that referenced this issue Nov 24, 2024

test: Add an xfail for temporal examples

5fdf2cb

Somewhat of a drive-by, but I fixed the `ruff` directives so they apply to only the parameterize blocks. All other changes are to resolve rule violations that were hidden. #3701

mattijn closed this as completed in #3702 Nov 24, 2024

dangotbanned reopened this Nov 25, 2024
