Speedup audb.available() for S3/Minio #458

Merged
merged 6 commits into from
Nov 18, 2024

Conversation

hagenw
Member

@hagenw hagenw commented Nov 11, 2024

This speeds up audb.available() for the S3 and MinIO backends. Instead of listing all files recursively, it lists only the database names and the versions for which a database header file exists.

Execution times of audb.available() for the repositories audb-public (containing 7 smaller datasets) and audb-internal (containing 1 big dataset):

Branch                 audb-public   audb-internal
speedup-available-s3   1.6 s         0.4 s
main                   1.7 s         45.0 s
Benchmark code
import time

import audb

# Restrict the lookup to a single S3 repository
repository = audb.Repository(
    "audb-public",
    "s3.dualstack.eu-north-1.amazonaws.com",
    "s3",
)
audb.config.REPOSITORIES = [repository]

# Measure the wall-clock time of a single call
t0 = time.perf_counter()
df = audb.available()
t = time.perf_counter() - t0
print(f"Execution time: {t:.1f}s")

The pull request also makes sure audb.available() is tested against two S3 repositories, in addition to the Artifactory repositories.

Summary by Sourcery

Enhancements:

  • Optimize the audb.available() function for S3 and MinIO backends by avoiding recursive file listing and focusing on database names and header file versions.

Contributor

sourcery-ai bot commented Nov 11, 2024

Reviewer's Guide by Sourcery

The implementation optimizes the audb.available() function for S3 and MinIO backends by introducing a more efficient way to list databases and their versions. Instead of recursively listing all files, it now specifically targets database names and header files, resulting in significant performance improvements, especially for large databases.
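
The effect on large databases can be illustrated with a rough request count. The following is a back-of-the-envelope sketch, not a measurement: it assumes that a recursive listing has to page through every object (S3 returns at most 1000 keys per listing response), that each non-recursive listing fits on one page, and that every version costs one header existence check. The function names and the object counts are hypothetical.

```python
import math
from typing import List

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per listing response


def requests_recursive(n_objects: int) -> int:
    """Requests needed to page through every object in the bucket."""
    return max(1, math.ceil(n_objects / PAGE_SIZE))


def requests_two_level(versions_per_database: List[int]) -> int:
    """Requests for the two-level strategy.

    One listing of the root (database names),
    one listing per database (version folders),
    and one header existence check per version.
    Assumes each listing fits on a single page.
    """
    root_listing = 1
    database_listings = len(versions_per_database)
    header_checks = sum(versions_per_database)
    return root_listing + database_listings + header_checks


# Hypothetical repository with one big dataset: 1,000,000 objects, 2 versions
big_recursive = requests_recursive(1_000_000)
big_two_level = requests_two_level([2])

# Hypothetical repository with seven small datasets: 700 objects, 1 version each
small_recursive = requests_recursive(700)
small_two_level = requests_two_level([1] * 7)

print(big_recursive, big_two_level)
print(small_recursive, small_two_level)
```

Under these assumptions the big repository drops from 1000 listing requests to 4, while the small repository goes from 1 request to 15. This would be consistent with a large gain for one big dataset and little change for several small ones, though the real request pattern may differ.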

Sequence diagram for optimized audb.available() function

sequenceDiagram
    participant User
    participant audb
    participant Repository
    participant Backend

    User->>audb: Call available()
    audb->>Repository: Iterate over REPOSITORIES
    Repository->>Backend: Create backend interface
    Backend-->>Repository: Return interface
    Repository->>audb: List database names and versions
    audb->>User: Return available databases

Class diagram for audb.available() optimization

classDiagram
    class audb {
        +available()
    }
    class Repository {
        +create_backend_interface()
        +name
        +backend
        +host
    }
    class Backend {
        +list_objects()
        +exists()
    }
    audb --> Repository
    Repository --> Backend
    note for audb "Optimized available() to avoid recursive listing for S3/MinIO"

File-Level Changes

Change: Introduced a helper function to reduce code duplication when adding databases to the result list
Details:
  • Extracted common database addition logic into a new add_database() function
  • Updated all database addition calls to use the new helper function
Files: audb/core/api.py

Change: Implemented optimized listing strategy for S3 and MinIO backends
Details:
  • Added specific handling for 'minio' and 's3' backend types
  • Implemented two-level listing approach: first list database names, then versions
  • Added check for header file existence to validate database versions
  • Filtered out special folders like 'attachment', 'media', and 'meta'
Files: audb/core/api.py

Change: Updated test configuration to include S3 repositories
Details:
  • Added S3 public and private repositories to the test configuration
  • Updated repository documentation to reflect new S3 endpoints
  • Separated host definitions for better maintainability
Files: tests/conftest.py
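
The two-level listing described above can be sketched as follows. This is a minimal illustration and not the code from audb/core/api.py: FakeBucket is a hypothetical in-memory stand-in for an S3/MinIO bucket, and the key layout <name>/<version>/db.yaml as well as the header file name db.yaml are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

HEADER_FILE = "db.yaml"  # assumed name of the database header file
SPECIAL_FOLDERS = {"attachment", "media", "meta"}  # not version folders


class FakeBucket:
    """Hypothetical in-memory stand-in for an S3/MinIO bucket."""

    def __init__(self, keys: Iterable[str]):
        self.keys = sorted(keys)

    def list_folder(self, prefix: str = "") -> List[str]:
        """Return the names directly below ``prefix`` without recursing.

        Mimics a delimiter-based listing: keys sharing the next path
        segment are collapsed into a single entry.
        """
        names = set()
        for key in self.keys:
            if key.startswith(prefix):
                names.add(key[len(prefix):].split("/", 1)[0])
        return sorted(names)

    def exists(self, key: str) -> bool:
        """Return True if an object with exactly this key is stored."""
        return key in self.keys


def available(bucket: FakeBucket) -> List[Tuple[str, str]]:
    """Collect (name, version) pairs using a two-level listing."""
    databases = []
    # Level 1: database names at the root of the bucket
    for name in bucket.list_folder(""):
        # Level 2: folders directly below the database name
        for version in bucket.list_folder(f"{name}/"):
            if version in SPECIAL_FOLDERS:
                continue
            # A version counts only if its header file exists
            if bucket.exists(f"{name}/{version}/{HEADER_FILE}"):
                databases.append((name, version))
    return databases


bucket = FakeBucket(
    [
        "emodb/1.0.0/db.yaml",
        "emodb/1.1.0/db.yaml",
        "emodb/media/1.0.0/wav/a.zip",
        "emodb/meta/1.0.0/files.parquet",
        "musan/1.0.0/db.yaml",
        "musan/attachment/1.0.0/docs.zip",
        "broken/2.0.0/readme.txt",  # no header, must be skipped
    ]
)
print(available(bucket))
```

The media, meta, and attachment folders are never descended into, which is where most objects of a large database would live under the assumed layout.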

@hagenw hagenw marked this pull request as draft November 11, 2024 11:03
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @hagenw - I've reviewed your changes - here's some feedback:

Overall Comments:

  • The code changes look well-structured, but the benchmarks show this is actually slightly slower (1.9s vs 1.7s). Please investigate why this 'optimization' is resulting in worse performance - perhaps by profiling the code or adding logging to compare the number of backend API calls being made in both approaches.
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


Review comment on audb/core/api.py (outdated, resolved)
@hagenw hagenw marked this pull request as ready for review November 15, 2024 19:38
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @hagenw - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding proper interface methods to the backend class instead of accessing _client directly. While the current approach works, it would be cleaner to maintain the abstraction layer. This could be addressed in a follow-up PR.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟡 Testing: 1 issue found
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


Review comment on tests/conftest.py (resolved)
@hagenw
Member Author

hagenw commented Nov 15, 2024

> Hey @hagenw - I've reviewed your changes - here's some feedback:
>
> Overall Comments:
>
>   • Consider adding proper interface methods to the backend class instead of accessing _client directly. While the current approach works, it would be cleaner to maintain the abstraction layer. This could be addressed in a follow-up PR.

My original plan was to add this feature directly to audbackend by adding a recursive argument to backend.ls(). The problem is that audbackend also supports backends without any folder structure, for which recursive is meaningless. We could still discuss whether it makes sense to add it to audbackend, but for now I would stay with the workaround of implementing it directly in audb. I created audeering/audbackend#252 to track this.
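
To make the trade-off concrete, here is a minimal sketch of what a listing function with a recursive flag could look like. It is purely hypothetical: this ls() works on a plain list of keys and is not the audbackend API, and the example keys are invented.

```python
from typing import List


def ls(
    keys: List[str],
    path: str = "/",
    *,
    recursive: bool = True,
    sep: str = "/",
) -> List[str]:
    """Hypothetical listing helper, not part of audbackend.

    recursive=True returns every key below ``path``.
    recursive=False returns only direct children;
    sub-folders are reported once, with a trailing separator.
    """
    prefix = path if path.endswith(sep) else path + sep
    below = [key for key in keys if key.startswith(prefix)]
    if recursive:
        return sorted(below)
    children = set()
    for key in below:
        head, found, _ = key[len(prefix):].partition(sep)
        # ``found`` is non-empty when more path segments follow
        children.add(prefix + head + (sep if found else ""))
    return sorted(children)


# Backend with a folder structure: the flag changes the result
nested = [
    "/emodb/1.0.0/db.yaml",
    "/emodb/media/1.0.0/a.zip",
    "/musan/1.0.0/db.yaml",
]
print(ls(nested, "/", recursive=False))
print(ls(nested, "/emodb", recursive=False))

# Backend without a folder structure: both modes return the same keys
flat = ["/a.zip", "/b.zip"]
print(ls(flat, "/", recursive=True) == ls(flat, "/", recursive=False))
```

For the flat key set both modes return the same result, which illustrates why such a flag carries no meaning on a backend without folders.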

@hagenw hagenw requested a review from ChristianGeng November 15, 2024 19:52
Member

@ChristianGeng ChristianGeng left a comment


Review

For S3/MinIO, the recursive listing of the root directory can be avoided.
This PR implements that.

The claim raised by the bot that the new implementation is slower than main is not correct afaics.

The comment on the branching conditional for minio/s3 is self-explanatory and detailed enough to understand the purpose of the added code.

I cannot run the benchmarks myself, so I will conclude with an approval.

@hagenw hagenw merged commit bc4bfcb into main Nov 18, 2024
8 checks passed
@hagenw hagenw deleted the speedup-available-s3 branch November 18, 2024 14:25