fix: replace data/flights-3m.csv with data/flights-3m.parquet #628

Merged
domoritz merged 2 commits into vega:main on Nov 16, 2024

Conversation

dsmedia
Collaborator

@dsmedia dsmedia commented Nov 14, 2024

Changes

  • Modified scripts/flights.py to handle parquet output with customizable compression
  • To address #627, generated data/flights-3m dataset using:
python scripts/flights.py /path/to/DOT/zip/files \
    -f parquet \
    --parquet-compression zstd \
    --start-date 2001-01-01 \
    --end-date 2001-06-30 \
    -c date,delay,distance,origin,destination \
    -o data/flights-3m \
    -n 3000000

Note: Replace /path/to/DOT/zip/files with the local directory containing the Bureau of Transportation Statistics (BTS) monthly ZIP files from their website. Download prezipped files, one per month.
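
For readers unfamiliar with the invocation above, here is a minimal sketch of how a script could parse those flags with `argparse`. It is not the actual `scripts/flights.py`: the long option names `--format`, `--columns`, `--output`, and `--num-rows`, and the positional name `input_dir`, are assumptions made for illustration. Only the short flags and `--parquet-compression`, `--start-date`, `--end-date` appear in the command.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(prog="flights.py")
    # Directory holding the BTS monthly ZIP files (positional name is assumed).
    parser.add_argument("input_dir")
    # Long names for -f, -c, -o and -n are assumptions, not taken from the PR.
    parser.add_argument("-f", "--format")
    parser.add_argument("--parquet-compression")
    parser.add_argument("--start-date", type=date.fromisoformat)
    parser.add_argument("--end-date", type=date.fromisoformat)
    parser.add_argument("-c", "--columns", type=lambda s: s.split(","))
    parser.add_argument("-o", "--output")
    parser.add_argument("-n", "--num-rows", type=int)
    return parser


# Parse the same arguments as the command shown above.
args = build_parser().parse_args([
    "/path/to/DOT/zip/files",
    "-f", "parquet",
    "--parquet-compression", "zstd",
    "--start-date", "2001-01-01",
    "--end-date", "2001-06-30",
    "-c", "date,delay,distance,origin,destination",
    "-o", "data/flights-3m",
    "-n", "3000000",
])
print(args.format, args.parquet_compression, args.columns, args.num_rows)
```

Parsing the dates with `date.fromisoformat` rejects malformed values at the command line rather than later in the pipeline.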

@dsmedia dsmedia changed the title from "Replace data/flights-3m.csv with data/flights-3m.parquet" to "fix: replace data/flights-3m.csv with data/flights-3m.parquet" Nov 14, 2024
@dangotbanned
Member

@dsmedia thanks for getting to this so quickly!

One thing stood out here to me.
I used polars.DataFrame.write_parquet to export the files in #627 (comment)

The size for all .parquet files came in around 12MB.
However, flights-3m.parquet (https://github.com/dsmedia/vega-datasets/blob/b64c1df7ec55ca4c0645af7f3b9bd7008571f963/data/flights-3m.parquet) is 22MB.

I wouldn't have expected such a dramatic difference between the two.

Do you get the same result using pandas.DataFrame.to_parquet?

@domoritz
Member

Actually, do you think we should try arrow instead of parquet?

It doesn't support compression in JavaScript yet but can be read with common libraries in the browser. Parquet is good as a storage format but if we want to expose the flights file, we might want to use a format that can be easily read in browsers. Thoughts?

@dangotbanned
Member

> Actually, do you think we should try arrow instead of parquet?
>
> It doesn't support compression in JavaScript yet but can be read with common libraries in the browser. Parquet is good as a storage format but if we want to expose the flights file, we might want to use a format that can be easily read in browsers. Thoughts?

Related vega/vega#3961

From a Python-oriented perspective, either would work, but .arrow appeared to produce a larger file.

IIRC there is already one dataset that is exported as .arrow.

@domoritz
Member

How large would it be? Is it like avro in #627 (comment)?

@dangotbanned
Member

> How large would it be? Is it like avro in #627 (comment)?

@domoritz I've updated the spec with a larger font size.

Open the (updated) Chart in the Vega Editor

It already included 3 .arrow versions; the smallest was 20.19MB.

@domoritz
Member

Thanks. Now I think let's use parquet to have another file format people can use to demo loaders.

@dsmedia
Collaborator Author

dsmedia commented Nov 15, 2024

Great catch, @dangotbanned. The generation script was including an (unneeded) index in the parquet file by default. Removing the index reduced the parquet file size to 12MB, in line with your expectations. The index is now excluded by default in flights.py.


@domoritz
Member

Hehe, as I predicted in #627 (comment).

Thank you for the pull request.

@domoritz domoritz merged commit 12644bf into vega:main Nov 16, 2024
2 checks passed
@dangotbanned dangotbanned linked an issue Nov 16, 2024 that may be closed by this pull request
dangotbanned added a commit to vega/altair that referenced this pull request Nov 16, 2024
Development

Successfully merging this pull request may close these issues.

flights-3m on v2.10.0 exceeds configured limit
3 participants