
Update pins on some examples #400

Draft · wants to merge 5 commits into main

Conversation

@Azaya89 (Collaborator) commented Jun 26, 2024

This PR updates some of the dependencies in the nyc_taxi and glaciers examples.

@Azaya89 self-assigned this Jun 26, 2024
@Azaya89 requested a review from maximlt June 26, 2024 11:57
Review threads on glaciers/anaconda-project.yml and nyc_taxi/anaconda-project.yml (all outdated and resolved).
@maximlt (Contributor) commented Jun 30, 2024

New issue: it appears the .parq file in the nyc_taxi example is no longer being read correctly by fastparquet.

This is a problem I also ran into in #369. The last comment there was:

Ok so I ended up keeping pyarrow as the engine but adding this before the imports:

import dask

dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})

HoloViews does that too in its test suite, meaning there isn't yet "official" support for these two features (the query planner and pyarrow strings): https://github.com/holoviz/holoviews/blob/6b0121d5a3685989fca58a1687961523a5fd575c/holoviews/tests/conftest.py#L61-L62

However, since then, HoloViews no longer sets dask.config.set({"dataframe.query-planning": False}) (it still sets dask.config.set({"dataframe.convert-string": False})):

https://github.com/holoviz/holoviews/blob/e5f7aede7a58902677eb995b8fd67c54ae9ae3ab/holoviews/tests/conftest.py#L55-L60

My suggestions:

  • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet, maybe it works without them.
  • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).
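
For concreteness, a minimal sketch of what that first attempt could look like (the data path here is illustrative, not necessarily the notebook's actual one):

import dask.dataframe as dd

# Second-step fallback, commented out for the first attempt; enable it
# only if the plain pyarrow read fails (mirrors the HoloViews test suite):
# import dask
# dask.config.set({"dataframe.convert-string": False})

df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="pyarrow")  # illustrative path
df.head()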

@jbednar (Contributor) commented Jul 1, 2024

Note that in the past, pyarrow and fastparquet had very different performance from each other in certain workloads, so ideally you'd at least qualitatively compare the old pinned version with the new version, and make sure that performance has not significantly degraded.
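
For a quick qualitative check, something like this could time the same read under both engines (a sketch only; the path is illustrative, and it assumes an environment where both engines can still read the file):

import time
import dask.dataframe as dd

for engine in ("fastparquet", "pyarrow"):
    start = time.perf_counter()
    df = dd.read_parquet("data/nyc_taxi_wide.parq", engine=engine)  # illustrative path
    df = df.persist()  # force the actual read, as the notebook does
    print(f"{engine}: {time.perf_counter() - start:.1f} s")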

@Azaya89 (Collaborator, Author) commented Jul 2, 2024

> • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet, maybe it works without them.
> • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).

I have tried each of the suggestions individually and all together, but it still didn't work: it shows the same traceback as before. Here's the full traceback:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File <timed exec>:2

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:348, in DaskMethodsMixin.persist(self, **kwargs)
    309 def persist(self, **kwargs):
    310     """Persist this dask collection into memory
    311 
    312     This turns a lazy Dask collection into a Dask collection with the same
   (...)
    346     dask.persist
    347     """
--> 348     (result,) = persist(self, traverse=False, **kwargs)
    349     return result

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:998, in persist(traverse, optimize_graph, scheduler, *args, **kwargs)
    995     postpersists.append((rebuild, a_keys, state))
    997 with shorten_traceback():
--> 998     results = schedule(dsk, keys, **kwargs)
   1000 d = dict(zip(keys, results))
   1001 results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:97, in ParquetFunctionWrapper.__call__(self, part)
     94 if not isinstance(part, list):
     95     part = [part]
---> 97 return read_parquet_part(
     98     self.fs,
     99     self.engine,
    100     self.meta,
    101     [
    102         # Temporary workaround for HLG serialization bug
    103         # (see: https://github.com/dask/dask/issues/8581)
    104         (p.data["piece"], p.data.get("kwargs", {}))
    105         if hasattr(p, "data")
    106         else (p["piece"], p.get("kwargs", {}))
    107         for p in part
    108     ],
    109     self.columns,
    110     self.index,
    111     self.common_kwargs,
    112 )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:645, in read_parquet_part(fs, engine, meta, part, columns, index, kwargs)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
--> 645     dfs = [
    646         func(
    647             fs,
    648             rg,
    649             columns.copy(),
    650             index,
    651             **toolz.merge(kwargs, kw),
    652         )
    653         for (rg, kw) in part
    654     ]
    655     df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:646, in <listcomp>(.0)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
    645     dfs = [
--> 646         func(
    647             fs,
    648             rg,
    649             columns.copy(),
    650             index,
    651             **toolz.merge(kwargs, kw),
    652         )
    653         for (rg, kw) in part
    654     ]
    655     df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:641, in ArrowDatasetEngine.read_partition(cls, fs, pieces, columns, index, dtype_backend, categories, partitions, filters, schema, **kwargs)
    638     row_group = [row_group]
    640 # Read in arrow table and convert to pandas
--> 641 arrow_table = cls._read_table(
    642     path_or_frag,
    643     fs,
    644     row_group,
    645     columns,
    646     schema,
    647     filters,
    648     partitions,
    649     partition_keys,
    650     **kwargs,
    651 )
    652 if multi_read:
    653     tables.append(arrow_table)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1774, in ArrowDatasetEngine._read_table(cls, path_or_frag, fs, row_groups, columns, schema, filters, partitions, partition_keys, **kwargs)
   1767     arrow_table = frag.to_table(
   1768         use_threads=False,
   1769         schema=schema,
   1770         columns=cols,
   1771         filter=_filters_to_expression(filters) if filters else None,
   1772     )
   1773 else:
-> 1774     arrow_table = _read_table_from_path(
   1775         path_or_frag,
   1776         fs,
   1777         row_groups,
   1778         columns,
   1779         schema,
   1780         filters,
   1781         **kwargs,
   1782     )
   1784 # For pyarrow.dataset api, if we did not read directly from
   1785 # fragments, we need to add the partitioned columns here.
   1786 if partitions and isinstance(partitions, list):

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:271, in _read_table_from_path(path, fs, row_groups, columns, schema, filters, **kwargs)
    264     return pq.ParquetFile(fil, **pre_buffer).read(
    265         columns=columns,
    266         use_threads=False,
    267         use_pandas_metadata=True,
    268         **read_kwargs,
    269     )
    270 else:
--> 271     return pq.ParquetFile(fil, **pre_buffer).read_row_groups(
    272         row_groups,
    273         columns=columns,
    274         use_threads=False,
    275         use_pandas_metadata=True,
    276         **read_kwargs,
    277     )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/parquet/core.py:537, in ParquetFile.read_row_groups(self, row_groups, columns, use_threads, use_pandas_metadata)
    495 """
    496 Read a multiple row groups from a Parquet file.
    497 
   (...)
    533 animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
    534 """
    535 column_indices = self._get_column_indices(
    536     columns, use_pandas_metadata=use_pandas_metadata)
--> 537 return self.reader.read_row_groups(row_groups,
    538                                    column_indices=column_indices,
    539                                    use_threads=use_threads)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/_parquet.pyx:1418, in pyarrow._parquet.ParquetReader.read_row_groups()

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: RLE encoding only supports BOOLEAN

@maximlt (Contributor) commented Jul 2, 2024

OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.
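
Roughly, the one-off conversion could look like this (a sketch; the paths are illustrative, and note that the config line has to run before dask.dataframe is imported):

import dask

# Disable dask-expr (the query planner) so that the legacy dask.dataframe
# implementation, which still supports fastparquet, is used; this must
# happen before dask.dataframe is imported:
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

# Read the old file with fastparquet and rewrite it with pyarrow
# (to_parquet writes a directory of part files):
df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="fastparquet")  # illustrative path
df.to_parquet("data/nyc_taxi_wide_pyarrow.parq", engine="pyarrow")

The notebook could then point at the rewritten file with engine='pyarrow'.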

@Azaya89 (Collaborator, Author) commented Jul 3, 2024

> OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.

Can you guide me on how I can do this?

> Needed to avoid a warning emitted when datashader internally imports dask.dataframe.

OK. I'll make it clearer.

@Azaya89 requested a review from maximlt August 13, 2024 12:35
@Azaya89 marked this pull request as draft October 28, 2024 09:25
@Azaya89 (Collaborator, Author) commented Oct 28, 2024

This PR needs to be rebased on top of main. Will do that soon...

@maximlt added the NF SDG NumFocus Software Development Grant 2024 label Nov 22, 2024