Saving parquet to AWS S3 with df.write_parquet() fails with FileNotFound #19930

Open
2 tasks done
atzannes opened this issue Nov 22, 2024 · 1 comment
Labels: bug, needs triage, python


Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

>>> import polars as pl
>>> df = pl.DataFrame({"foo": [1, 2, 3]})
>>> df.write_parquet("s3://nomadresearch-research-storage/alex/empty.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/atzannes/code/ice/ice-ingest/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3853, in write_parquet
    self._df.write_parquet(
FileNotFoundError: No such file or directory (os error 2)
>>> 

Log output

No response

Issue description

When I pass a path-like object to write_parquet, I get a FileNotFoundError. This is almost the same as #14630, except that here I'm trying to write to an S3 URI.

The workaround from that issue works for S3 as well:
df.write_parquet("s3://nomadresearch-research-storage/alex/empty.parquet", use_pyarrow=True)

Until this is supported natively in Rust, could the write_parquet documentation include a sentence saying that, for certain storage backends, a path-like file argument requires use_pyarrow=True?
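In the meantime, the workaround above could be wrapped in a small helper so that call sites don't need to remember the flag. This is only a sketch: `write_parquet_compat`, `is_remote_uri` and the `REMOTE_SCHEMES` set are hypothetical names, not part of Polars, and the list of schemes is an assumption. It relies only on `df` having a `write_parquet` method that accepts `use_pyarrow`.

```python
from urllib.parse import urlparse

# Assumption: schemes treated as "remote" here; adjust to the backends you use.
REMOTE_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "abfss"}


def is_remote_uri(path) -> bool:
    """Return True if `path` looks like a URI with a cloud-storage scheme."""
    return urlparse(str(path)).scheme.lower() in REMOTE_SCHEMES


def write_parquet_compat(df, path, **kwargs):
    """Hypothetical wrapper around df.write_parquet.

    For remote URIs it adds use_pyarrow=True (the workaround reported in
    this issue) unless the caller set the flag explicitly. Local paths are
    passed through unchanged.
    """
    if is_remote_uri(path):
        kwargs.setdefault("use_pyarrow", True)
    return df.write_parquet(path, **kwargs)
```

Usage would be `write_parquet_compat(df, "s3://bucket/key.parquet")`. Note that this does not help with paths containing spaces, where the pyarrow route fails as described in the comment below.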

Expected behavior

I would expect write_parquet to create or overwrite the file at the given path. I recently changed my code because write_parquet was emitting warnings that I should pass path-like values instead of file handles, but that change has broken my code.

Installed versions

--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Linux-6.8.0-48-generic-x86_64-with-glibc2.39
Python:              3.10.8 (main, Feb  6 2024, 14:18:09) [GCC 11.4.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.34.106
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.30.0
great_tables         <not installed>
matplotlib           3.9.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.4
pandas               2.2.2
pyarrow              16.1.0
pydantic             1.10.16
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Note: I also tried with fsspec==2024.06.0.

atzannes commented Nov 22, 2024

Here's a wrinkle with the use_pyarrow=True workaround: if the path contains a space, pyarrow fails with:

File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 's3://bucket/file with space.parquet'

By contrast, this fixes my problem:

from upath import UPath

# Parquet is a binary format, so open the target in binary mode.
with UPath('s3://bucket/file with space.parquet').open("wb") as fd:
    df.write_parquet(fd)

So it looks like I will have to avoid passing path-like arguments to write_parquet for now, except perhaps for local paths.
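For anyone who wants the file-handle approach in reusable form, here is a rough sketch. `write_parquet_via_handle` and its `opener` parameter are hypothetical names. The default opener assumes fsspec and the matching backend (for S3, s3fs) are installed, and I have not checked it against other backends. Since this passes a file handle, it may still trigger the warnings mentioned in the issue description.

```python
def write_parquet_via_handle(df, uri, opener=None):
    """Hypothetical helper: open `uri` in binary mode and hand the file
    object to df.write_parquet, instead of passing the path itself."""
    if opener is None:
        # Assumption: fsspec and the backend for the URI scheme are installed.
        import fsspec

        opener = fsspec.open
    # `opener(uri, "wb")` must return a context manager yielding a
    # writable binary file object (fsspec.open and built-in open both do).
    with opener(uri, "wb") as fd:
        df.write_parquet(fd)
```

Passing `opener=open` makes the same helper work for local paths, which also makes it easy to try out without any cloud credentials.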
