Update pins on some examples #400
base: main
Conversation
This is a problem I also ran into in #369. The last comment there was: "However, since then, HoloViews no longer sets [...]"

My suggestions:
Note that in the past, pyarrow and fastparquet had very different performance from each other in certain workloads, so ideally you'd at least qualitatively compare the old pinned version with the new one and make sure that performance has not significantly degraded.
I have tried each of the suggestions individually and all together, but it still didn't work; it shows the same error. Here's the full traceback:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File <timed exec>:2
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:348, in DaskMethodsMixin.persist(self, **kwargs)
309 def persist(self, **kwargs):
310 """Persist this dask collection into memory
311
312 This turns a lazy Dask collection into a Dask collection with the same
(...)
346 dask.persist
347 """
--> 348 (result,) = persist(self, traverse=False, **kwargs)
349 return result
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:998, in persist(traverse, optimize_graph, scheduler, *args, **kwargs)
995 postpersists.append((rebuild, a_keys, state))
997 with shorten_traceback():
--> 998 results = schedule(dsk, keys, **kwargs)
1000 d = dict(zip(keys, results))
1001 results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:97, in ParquetFunctionWrapper.__call__(self, part)
94 if not isinstance(part, list):
95 part = [part]
---> 97 return read_parquet_part(
98 self.fs,
99 self.engine,
100 self.meta,
101 [
102 # Temporary workaround for HLG serialization bug
103 # (see: https://github.com/dask/dask/issues/8581)
104 (p.data["piece"], p.data.get("kwargs", {}))
105 if hasattr(p, "data")
106 else (p["piece"], p.get("kwargs", {}))
107 for p in part
108 ],
109 self.columns,
110 self.index,
111 self.common_kwargs,
112 )
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:645, in read_parquet_part(fs, engine, meta, part, columns, index, kwargs)
642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
643 # Part kwargs expected
644 func = engine.read_partition
--> 645 dfs = [
646 func(
647 fs,
648 rg,
649 columns.copy(),
650 index,
651 **toolz.merge(kwargs, kw),
652 )
653 for (rg, kw) in part
654 ]
655 df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
656 else:
657 # No part specific kwargs, let engine read
658 # list of parts at once
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:646, in <listcomp>(.0)
642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
643 # Part kwargs expected
644 func = engine.read_partition
645 dfs = [
--> 646 func(
647 fs,
648 rg,
649 columns.copy(),
650 index,
651 **toolz.merge(kwargs, kw),
652 )
653 for (rg, kw) in part
654 ]
655 df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
656 else:
657 # No part specific kwargs, let engine read
658 # list of parts at once
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:641, in ArrowDatasetEngine.read_partition(cls, fs, pieces, columns, index, dtype_backend, categories, partitions, filters, schema, **kwargs)
638 row_group = [row_group]
640 # Read in arrow table and convert to pandas
--> 641 arrow_table = cls._read_table(
642 path_or_frag,
643 fs,
644 row_group,
645 columns,
646 schema,
647 filters,
648 partitions,
649 partition_keys,
650 **kwargs,
651 )
652 if multi_read:
653 tables.append(arrow_table)
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1774, in ArrowDatasetEngine._read_table(cls, path_or_frag, fs, row_groups, columns, schema, filters, partitions, partition_keys, **kwargs)
1767 arrow_table = frag.to_table(
1768 use_threads=False,
1769 schema=schema,
1770 columns=cols,
1771 filter=_filters_to_expression(filters) if filters else None,
1772 )
1773 else:
-> 1774 arrow_table = _read_table_from_path(
1775 path_or_frag,
1776 fs,
1777 row_groups,
1778 columns,
1779 schema,
1780 filters,
1781 **kwargs,
1782 )
1784 # For pyarrow.dataset api, if we did not read directly from
1785 # fragments, we need to add the partitioned columns here.
1786 if partitions and isinstance(partitions, list):
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:271, in _read_table_from_path(path, fs, row_groups, columns, schema, filters, **kwargs)
264 return pq.ParquetFile(fil, **pre_buffer).read(
265 columns=columns,
266 use_threads=False,
267 use_pandas_metadata=True,
268 **read_kwargs,
269 )
270 else:
--> 271 return pq.ParquetFile(fil, **pre_buffer).read_row_groups(
272 row_groups,
273 columns=columns,
274 use_threads=False,
275 use_pandas_metadata=True,
276 **read_kwargs,
277 )
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/parquet/core.py:537, in ParquetFile.read_row_groups(self, row_groups, columns, use_threads, use_pandas_metadata)
495 """
496 Read a multiple row groups from a Parquet file.
497
(...)
533 animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
534 """
535 column_indices = self._get_column_indices(
536 columns, use_pandas_metadata=use_pandas_metadata)
--> 537 return self.reader.read_row_groups(row_groups,
538 column_indices=column_indices,
539 use_threads=use_threads)
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/_parquet.pyx:1418, in pyarrow._parquet.ParquetReader.read_row_groups()
File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
OSError: RLE encoding only supports BOOLEAN
OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.
Can you guide me on how I can do this?
OK. I'll make it clearer.
This PR needs to be rebased on top of main. Will do that soon...
This PR updates some of the dependencies in the nyc_taxi and glaciers examples.