Store tables as PARQUET files (#419)
* Ensure correct boolean dtype in misc table index

* Remove unneeded code

* Use pyarrow to read CSV files

* Start debugging

* Continue debugging

* Fix tests

* Remove unneeded code

* Improve code

* Fix test for older pandas versions

* Exclude benchmark folder from tests

* Test other implementation

* Remove support for Python 3.8

* Store tables as PARQUET

* Cleanup code + Table.levels

* Use dict for CSV dtype mappings

* Rename helper function

* Simplify code

* Add helper function for CSV schema

* Fix typo in docstring

* Remove levels attribute

* Merge stash

* Remove levels from doctest output

* Convert method to property

* Add comment

* Simplify code

* Simplify code

* Add test for md5sum of parquet file

* Switch back to snappy compression

* Fix linter

* Store hash inside parquet file

* Fix code coverage

* Stay with CSV as default table format

* Test pyarrow==15.0.2

* Test pyarrow==14.0.2

* Test pyarrow==13.0

* Test pyarrow==12.0

* Test pyarrow==11.0

* Test pyarrow==10.0

* Test pyarrow==10.0.1

* Require pyarrow>=10.0.1

* Test pandas<2.1.0

* Add explanations for requirements

* Add test using minimum pip requirements

* Fix alphabetical order of requirements

* Enhance test matrix definition

* Debug failing test

* Test different hash method

* Use different hashing approach

* Require pandas>=2.2.0 and fix hashes

* CI: re-enable all minimal requirements

* Hashing algorithm to respect row order

* Clean up tests

* Fix minimum install of audiofile

* Fix docstring of Table.load()

* Fix docstring of Database.load()

* Ensure correct order in time when storing tables

* Simplify comment

* Add docstring to _load_pickle()

* Fix _save_parquet() docstring

* Improve comment in _dataframe_hash()

* Document arguments of test_table_update...

* Relax test for table saving order

* Update audformat/core/table.py

Co-authored-by: ChristianGeng <[email protected]>

* Revert "Update audformat/core/table.py"

This reverts commit 3f21e3c.

* Use numpy representation for hashing (#436)

* Use numpy representation for hashing

* Enable tests and require pandas>=1.4.1

* Use numpy<2.0 in minimum test

* Skip doctests in minimum

* Require pandas>=2.1.0

* Require numpy<=2.0.0 in minimum test

* Remove print statements

* Fix numpy<2.0.0 for minimum test

* Remove max_rows argument

* Simplify code

* Use test class

* CI: remove pyarrow from branch to start test

---------

Co-authored-by: ChristianGeng <[email protected]>
hagenw and ChristianGeng authored Jun 19, 2024
1 parent eddf224 commit f6c475e
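Several commit messages above ("Use numpy representation for hashing", "Hashing algorithm to respect row order", "Store hash inside parquet file") outline how a table hash is derived and attached to the written file. Below is a minimal sketch of such a scheme, assuming pyarrow schema metadata under a key named "hash"; the helper dataframe_hash and the exact hashing recipe are illustrative, not necessarily what audformat ended up using.

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet


def dataframe_hash(df: pd.DataFrame) -> str:
    # Hash the numpy representation of every column (index included),
    # so the result depends on both the values and the row order.
    md5 = hashlib.md5()
    for _, column in df.reset_index().items():
        md5.update(bytes(str(column.to_numpy()), "utf-8"))
    return md5.hexdigest()


df = pd.DataFrame(
    {"speaker": ["spk1", "spk2"]},
    index=pd.Index(["f1.wav", "f2.wav"], name="file"),
)

# Attach the hash to the parquet file as schema metadata (key name assumed)
table = pa.Table.from_pandas(df, preserve_index=True)
metadata = {b"hash": dataframe_hash(df).encode(), **(table.schema.metadata or {})}
table = table.replace_schema_metadata(metadata)
parquet.write_table(table, "db.files.parquet", compression="snappy")

# The stored hash can be read back without loading the table data
print(parquet.read_schema("db.files.parquet").metadata[b"hash"])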
Showing 9 changed files with 906 additions and 132 deletions.
15 changes: 13 additions & 2 deletions .github/workflows/test.yml
@@ -15,12 +15,13 @@ jobs:
         os: [ ubuntu-20.04, windows-latest, macOS-latest ]
         python-version: [ '3.10' ]
         include:
-          - os: ubuntu-latest
-            python-version: '3.8'
           - os: ubuntu-latest
             python-version: '3.9'
           - os: ubuntu-latest
             python-version: '3.11'
+          - os: ubuntu-latest
+            python-version: '3.9'
+            requirements: 'minimum'

     steps:
       - uses: actions/checkout@v4
@@ -50,6 +51,16 @@ jobs:
           pip install -r requirements.txt
           pip install -r tests/requirements.txt
+      - name: Downgrade to minimum dependencies
+        run: |
+          pip install "audeer==2.0.0"
+          pip install "audiofile==0.4.0"
+          pip install "numpy<2.0.0"
+          pip install "pandas==2.1.0"
+          pip install "pyarrow==10.0.1"
+          pip install "pyyaml==5.4.1"
+        if: matrix.requirements == 'minimum'
+
       - name: Test with pytest
         run: |
           python -m pytest
6 changes: 3 additions & 3 deletions audformat/core/database.py
@@ -979,7 +979,7 @@ def save(
         r"""Save database to disk.

         Creates a header ``<root>/<name>.yaml``
-        and for every table a file ``<root>/<name>.<table-id>.[csv,pkl]``.
+        and for every table a file ``<root>/<name>.<table-id>.[csv,parquet,pkl]``.
         Existing files will be overwritten.
         If ``update_other_formats`` is provided,
@@ -1383,7 +1383,7 @@ def load(
         r"""Load database from disk.

         Expects a header ``<root>/<name>.yaml``
-        and for every table a file ``<root>/<name>.<table-id>.[csv|pkl]``
+        and for every table a file ``<root>/<name>.<table-id>.[csv|parquet|pkl]``
         Media files should be located under ``root``.

         Args:
@@ -1409,7 +1409,7 @@
         Raises:
             FileNotFoundError: if the database header file cannot be found
                 under ``root``
-            RuntimeError: if a CSV table file is newer
+            RuntimeError: if a CSV or PARQUET table file is newer
                 than the corresponding PKL file

         """
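The docstring updates above describe the new on-disk layout for parquet tables. A short usage sketch, assuming the existing storage_format argument of Database.save() simply accepts the new "parquet" value (the database content is made up for illustration):

import audformat

# Build a tiny database with one filewise table
db = audformat.Database("mydb")
db.schemes["speaker"] = audformat.Scheme("str")
db["files"] = audformat.Table(audformat.filewise_index(["f1.wav", "f2.wav"]))
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["spk1", "spk2"])

# Writes <root>/db.yaml and <root>/db.files.parquet instead of the CSV default
db.save("./mydb", storage_format="parquet")

# Loading picks up the parquet table file next to the header again
db = audformat.Database.load("./mydb")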
3 changes: 3 additions & 0 deletions audformat/core/define.py
@@ -337,6 +337,9 @@ class TableStorageFormat(DefineBase):
     CSV = "csv"
     """File extension for tables stored in CSV format."""

+    PARQUET = "parquet"
+    """File extension for tables stored in PARQUET format."""
+
     PICKLE = "pkl"
     """File extension for tables stored in PKL format."""

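The new constant can be passed wherever audformat expects a table storage format, for example when saving a single table. A small sketch, assuming Table.save() keeps its storage_format argument and expects the path without file extension:

import audformat

db = audformat.Database("mydb")
db["files"] = audformat.Table(audformat.filewise_index(["f1.wav", "f2.wav"]))

# Writes ./db.files.parquet; the constant supplies the "parquet" file extension
db["files"].save(
    "./db.files",
    storage_format=audformat.define.TableStorageFormat.PARQUET,
)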
