
Update pins on some examples #400

Draft · wants to merge 5 commits into main

Conversation

@Azaya89 (Collaborator) commented Jun 26, 2024

This PR updates some of the dependencies in the nyc_taxi and glaciers examples.

@Azaya89 self-assigned this Jun 26, 2024
@Azaya89 requested a review from maximlt June 26, 2024 11:57
Review threads on glaciers/anaconda-project.yml and nyc_taxi/anaconda-project.yml (all outdated and resolved).
@maximlt (Contributor) commented Jun 30, 2024

New issue: it appears the .parq file in the nyc_taxi example is no longer being read correctly by fastparquet.

This is a problem I also ran into in #369. The last comment there was:

Ok so I ended up keeping pyarrow as the engine but adding this before the imports:

import dask

dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})

HoloViews does that too in its test suite, meaning there isn't yet "official" support for these two features (the query planner and pyarrow strings): https://github.com/holoviz/holoviews/blob/6b0121d5a3685989fca58a1687961523a5fd575c/holoviews/tests/conftest.py#L61-L62

However, since then, HoloViews no longer sets dask.config.set({"dataframe.query-planning": False}) (it still sets dask.config.set({"dataframe.convert-string": False})):

https://github.com/holoviz/holoviews/blob/e5f7aede7a58902677eb995b8fd67c54ae9ae3ab/holoviews/tests/conftest.py#L55-L60

My suggestions:

  • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet, maybe it works without them.
  • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).
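
For concreteness, a minimal sketch of what that first attempt could look like (the data path here is illustrative, not necessarily the notebook's actual one):

import dask.dataframe as dd

# Second-step fallback, commented out for the first attempt; enable it
# only if the plain pyarrow read fails (mirrors the HoloViews test suite):
# import dask
# dask.config.set({"dataframe.convert-string": False})

df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="pyarrow")  # illustrative path
df.head()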

@jbednar (Contributor) commented Jul 1, 2024

Note that in the past, pyarrow and fastparquet had very different performance from each other in certain workloads, so ideally you'd at least qualitatively compare the old pinned version with the new version, and make sure that performance has not significantly degraded.
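
For a quick qualitative check, something like this could time the same read under both engines (a sketch only; the path is illustrative, and it assumes an environment where both engines can still read the file):

import time
import dask.dataframe as dd

for engine in ("fastparquet", "pyarrow"):
    start = time.perf_counter()
    df = dd.read_parquet("data/nyc_taxi_wide.parq", engine=engine)  # illustrative path
    df = df.persist()  # force the actual read, as the notebook does
    print(f"{engine}: {time.perf_counter() - start:.1f} s")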

@Azaya89 (Collaborator, Author) commented Jul 2, 2024

> • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet, maybe it works without them.
> • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).

I have tried each of the suggestions individually and all together, but it still didn't work: it shows the same traceback as before. Here's the full traceback:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File <timed exec>:2

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:348, in DaskMethodsMixin.persist(self, **kwargs)
    309 def persist(self, **kwargs):
    310     """Persist this dask collection into memory
    311 
    312     This turns a lazy Dask collection into a Dask collection with the same
   (...)
    346     dask.persist
    347     """
--> 348     (result,) = persist(self, traverse=False, **kwargs)
    349     return result

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:998, in persist(traverse, optimize_graph, scheduler, *args, **kwargs)
    995     postpersists.append((rebuild, a_keys, state))
    997 with shorten_traceback():
--> 998     results = schedule(dsk, keys, **kwargs)
   1000 d = dict(zip(keys, results))
   1001 results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:97, in ParquetFunctionWrapper.__call__(self, part)
     94 if not isinstance(part, list):
     95     part = [part]
---> 97 return read_parquet_part(
     98     self.fs,
     99     self.engine,
    100     self.meta,
    101     [
    102         # Temporary workaround for HLG serialization bug
    103         # (see: https://github.com/dask/dask/issues/8581)
    104         (p.data["piece"], p.data.get("kwargs", {}))
    105         if hasattr(p, "data")
    106         else (p["piece"], p.get("kwargs", {}))
    107         for p in part
    108     ],
    109     self.columns,
    110     self.index,
    111     self.common_kwargs,
    112 )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:645, in read_parquet_part(fs, engine, meta, part, columns, index, kwargs)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
--> 645     dfs = [
    646         func(
    647             fs,
    648             rg,
    649             columns.copy(),
    650             index,
    651             **toolz.merge(kwargs, kw),
    652         )
    653         for (rg, kw) in part
    654     ]
    655     df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:646, in <listcomp>(.0)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
    645     dfs = [
--> 646         func(
    647             fs,
    648             rg,
    649             columns.copy(),
    650             index,
    651             **toolz.merge(kwargs, kw),
    652         )
    653         for (rg, kw) in part
    654     ]
    655     df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:641, in ArrowDatasetEngine.read_partition(cls, fs, pieces, columns, index, dtype_backend, categories, partitions, filters, schema, **kwargs)
    638     row_group = [row_group]
    640 # Read in arrow table and convert to pandas
--> 641 arrow_table = cls._read_table(
    642     path_or_frag,
    643     fs,
    644     row_group,
    645     columns,
    646     schema,
    647     filters,
    648     partitions,
    649     partition_keys,
    650     **kwargs,
    651 )
    652 if multi_read:
    653     tables.append(arrow_table)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1774, in ArrowDatasetEngine._read_table(cls, path_or_frag, fs, row_groups, columns, schema, filters, partitions, partition_keys, **kwargs)
   1767     arrow_table = frag.to_table(
   1768         use_threads=False,
   1769         schema=schema,
   1770         columns=cols,
   1771         filter=_filters_to_expression(filters) if filters else None,
   1772     )
   1773 else:
-> 1774     arrow_table = _read_table_from_path(
   1775         path_or_frag,
   1776         fs,
   1777         row_groups,
   1778         columns,
   1779         schema,
   1780         filters,
   1781         **kwargs,
   1782     )
   1784 # For pyarrow.dataset api, if we did not read directly from
   1785 # fragments, we need to add the partitioned columns here.
   1786 if partitions and isinstance(partitions, list):

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:271, in _read_table_from_path(path, fs, row_groups, columns, schema, filters, **kwargs)
    264     return pq.ParquetFile(fil, **pre_buffer).read(
    265         columns=columns,
    266         use_threads=False,
    267         use_pandas_metadata=True,
    268         **read_kwargs,
    269     )
    270 else:
--> 271     return pq.ParquetFile(fil, **pre_buffer).read_row_groups(
    272         row_groups,
    273         columns=columns,
    274         use_threads=False,
    275         use_pandas_metadata=True,
    276         **read_kwargs,
    277     )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/parquet/core.py:537, in ParquetFile.read_row_groups(self, row_groups, columns, use_threads, use_pandas_metadata)
    495 """
    496 Read a multiple row groups from a Parquet file.
    497 
   (...)
    533 animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
    534 """
    535 column_indices = self._get_column_indices(
    536     columns, use_pandas_metadata=use_pandas_metadata)
--> 537 return self.reader.read_row_groups(row_groups,
    538                                    column_indices=column_indices,
    539                                    use_threads=use_threads)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/_parquet.pyx:1418, in pyarrow._parquet.ParquetReader.read_row_groups()

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: RLE encoding only supports BOOLEAN

@maximlt (Contributor) commented Jul 2, 2024

OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.
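
Roughly, the one-off conversion could look like this (a sketch; the paths are illustrative, and note that the config line has to run before dask.dataframe is imported):

import dask

# Disable dask-expr (the query planner) so that the legacy dask.dataframe
# implementation, which still supports fastparquet, is used; this must
# happen before dask.dataframe is imported:
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

# Read the old file with fastparquet and rewrite it with pyarrow
# (to_parquet writes a directory of part files):
df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="fastparquet")  # illustrative path
df.to_parquet("data/nyc_taxi_wide_pyarrow.parq", engine="pyarrow")

The notebook could then point at the rewritten file with engine='pyarrow'.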

@Azaya89 (Collaborator, Author) commented Jul 3, 2024

> OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.

Can you guide me on how I can do this?

> Needed to avoid a warning emitted when datashader internally imports dask.dataframe.

OK. I'll make it clearer.

@Azaya89 requested a review from maximlt August 13, 2024 12:35
@Azaya89 marked this pull request as draft October 28, 2024 09:25
@Azaya89 (Collaborator, Author) commented Oct 28, 2024

This PR needs to be rebased on top of main. Will do that soon...

@maximlt added the NF SDG NumFocus Software Development Grant 2024 label Nov 22, 2024