Store tables as PARQUET files (#419)
* Ensure correct boolean dtype in misc table index

* Remove unneeded code

* Use pyarrow to read CSV files

* Start debugging

* Continue debugging

* Fix tests

* Remove unneeded code

* Improve code

* Fix test for older pandas versions

* Exclude benchmark folder from tests

* Test other implementation

* Remove support for Python 3.8

* Store tables as PARQUET

* Cleanup code + Table.levels

* Use dict for CSV dtype mappings

* Rename helper function

* Simplify code

* Add helper function for CSV schema

* Fix typo in docstring

* Remove levels attribute

* Merge stash

* Remove levels from doctest output

* Convert method to property

* Add comment

* Simplify code

* Simplify code

* Add test for md5sum of parquet file

* Switch back to snappy compression

* Fix linter

* Store hash inside parquet file

* Fix code coverage

* Stay with CSV as default table format

* Test pyarrow==15.0.2

* Test pyarrow==14.0.2

* Test pyarrow==13.0

* Test pyarrow==12.0

* Test pyarrow==11.0

* Test pyarrow==10.0

* Test pyarrow==10.0.1

* Require pyarrow>=10.0.1

* Test pandas<2.1.0

* Add explanations for requirements

* Add test using minimum pip requirements

* Fix alphabetical order of requirements

* Enhance test matrix definition

* Debug failing test

* Test different hash method

* Use different hashing approach

* Require pandas>=2.2.0 and fix hashes

* CI: re-enable all minimal requirements

* Hashing algorithm to respect row order

* Clean up tests

* Fix minimum install of audiofile

* Fix docstring of Table.load()

* Fix docstring of Database.load()

* Ensure correct order in time when storing tables

* Simplify comment

* Add docstring to _load_pickle()

* Fix _save_parquet() docstring

* Improve comment in _dataframe_hash()

* Document arguments of test_table_update...

* Relax test for table saving order

* Update audformat/core/table.py

Co-authored-by: ChristianGeng <[email protected]>

* Revert "Update audformat/core/table.py"

This reverts commit 3f21e3c.

* Use numpy representation for hashing (#436)

* Use numpy representation for hashing

* Enable tests and require pandas>=1.4.1

* Use numpy<2.0 in minimum test

* Skip doctests in minimum

* Require pandas>=2.1.0

* Require numpy<=2.0.0 in minimum test

* Remove print statements

* Fix numpy<2.0.0 for minimum test

* Remove max_rows argument

* Simplify code

* Use test class

* CI: remove pyarrow from branch to start test

---------

Co-authored-by: ChristianGeng <[email protected]>
hagenw and ChristianGeng authored Jun 19, 2024
1 parent eddf224 commit f6c475e
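Several commit messages above ("Use numpy representation for hashing", "Hashing algorithm to respect row order", "Store hash inside parquet file") outline how a table hash is derived and attached to the written file. Below is a minimal sketch of such a scheme, assuming pyarrow schema metadata under a key named "hash"; the helper dataframe_hash and the exact hashing recipe are illustrative, not necessarily what audformat ended up using.

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet


def dataframe_hash(df: pd.DataFrame) -> str:
    # Hash the numpy representation of every column (index included),
    # so the result depends on both the values and the row order.
    md5 = hashlib.md5()
    for _, column in df.reset_index().items():
        md5.update(bytes(str(column.to_numpy()), "utf-8"))
    return md5.hexdigest()


df = pd.DataFrame(
    {"speaker": ["spk1", "spk2"]},
    index=pd.Index(["f1.wav", "f2.wav"], name="file"),
)

# Attach the hash to the parquet file as schema metadata (key name assumed)
table = pa.Table.from_pandas(df, preserve_index=True)
metadata = {b"hash": dataframe_hash(df).encode(), **(table.schema.metadata or {})}
table = table.replace_schema_metadata(metadata)
parquet.write_table(table, "db.files.parquet", compression="snappy")

# The stored hash can be read back without loading the table data
print(parquet.read_schema("db.files.parquet").metadata[b"hash"])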
Showing 9 changed files with 906 additions and 132 deletions.
15 changes: 13 additions & 2 deletions .github/workflows/test.yml
@@ -15,12 +15,13 @@ jobs:
         os: [ ubuntu-20.04, windows-latest, macOS-latest ]
         python-version: [ '3.10' ]
         include:
-          - os: ubuntu-latest
-            python-version: '3.8'
           - os: ubuntu-latest
             python-version: '3.9'
           - os: ubuntu-latest
             python-version: '3.11'
+          - os: ubuntu-latest
+            python-version: '3.9'
+            requirements: 'minimum'

     steps:
       - uses: actions/checkout@v4
@@ -50,6 +51,16 @@ jobs:
           pip install -r requirements.txt
           pip install -r tests/requirements.txt
+      - name: Downgrade to minimum dependencies
+        run: |
+          pip install "audeer==2.0.0"
+          pip install "audiofile==0.4.0"
+          pip install "numpy<2.0.0"
+          pip install "pandas==2.1.0"
+          pip install "pyarrow==10.0.1"
+          pip install "pyyaml==5.4.1"
+        if: matrix.requirements == 'minimum'
+
       - name: Test with pytest
         run: |
           python -m pytest
6 changes: 3 additions & 3 deletions audformat/core/database.py
@@ -979,7 +979,7 @@ def save(
         r"""Save database to disk.

         Creates a header ``<root>/<name>.yaml``
-        and for every table a file ``<root>/<name>.<table-id>.[csv,pkl]``.
+        and for every table a file ``<root>/<name>.<table-id>.[csv,parquet,pkl]``.
         Existing files will be overwritten.
         If ``update_other_formats`` is provided,
@@ -1383,7 +1383,7 @@ def load(
         r"""Load database from disk.

         Expects a header ``<root>/<name>.yaml``
-        and for every table a file ``<root>/<name>.<table-id>.[csv|pkl]``
+        and for every table a file ``<root>/<name>.<table-id>.[csv|parquet|pkl]``
         Media files should be located under ``root``.

         Args:
@@ -1409,7 +1409,7 @@
         Raises:
             FileNotFoundError: if the database header file cannot be found
                 under ``root``
-            RuntimeError: if a CSV table file is newer
+            RuntimeError: if a CSV or PARQUET table file is newer
                 than the corresponding PKL file

         """
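The docstring updates above describe the new on-disk layout for parquet tables. A short usage sketch, assuming the existing storage_format argument of Database.save() simply accepts the new "parquet" value (the database content is made up for illustration):

import audformat

# Build a tiny database with one filewise table
db = audformat.Database("mydb")
db.schemes["speaker"] = audformat.Scheme("str")
db["files"] = audformat.Table(audformat.filewise_index(["f1.wav", "f2.wav"]))
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["spk1", "spk2"])

# Writes <root>/db.yaml and <root>/db.files.parquet instead of the CSV default
db.save("./mydb", storage_format="parquet")

# Loading picks up the parquet table file next to the header again
db = audformat.Database.load("./mydb")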
3 changes: 3 additions & 0 deletions audformat/core/define.py
@@ -337,6 +337,9 @@ class TableStorageFormat(DefineBase):
     CSV = "csv"
     """File extension for tables stored in CSV format."""

+    PARQUET = "parquet"
+    """File extension for tables stored in PARQUET format."""
+
     PICKLE = "pkl"
     """File extension for tables stored in PKL format."""

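The new constant can be passed wherever audformat expects a table storage format, for example when saving a single table. A small sketch, assuming Table.save() keeps its storage_format argument and expects the path without file extension:

import audformat

db = audformat.Database("mydb")
db["files"] = audformat.Table(audformat.filewise_index(["f1.wav", "f2.wav"]))

# Writes ./db.files.parquet; the constant supplies the "parquet" file extension
db["files"].save(
    "./db.files",
    storage_format=audformat.define.TableStorageFormat.PARQUET,
)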
