Added an explicit call out about the data and metadata flow differences between gateway and client #8400

Merged 1 commit · Dec 19, 2024
Binary file added docs/assets/img/s3gatewayvsclientdataflow.png
9 changes: 9 additions & 0 deletions docs/understand/architecture.md
@@ -112,6 +112,15 @@ Using [lakeFSFileSystem][hadoopfs] increases Spark ETL jobs performance by execu
and all data operations directly through the same underlying object store that lakeFS uses.


## How lakeFS Clients and Gateway Handle Metadata and Data Access


When using the Python client, lakectl, or the lakeFS Spark client, these clients communicate with the lakeFS server only to retrieve metadata. For example, they may query lakeFS to determine which version of a file is needed or to track changes across branches and commits. This communication does not carry the actual data; it exchanges only metadata about data locations and versions.
Once a client knows the exact data location from the lakeFS metadata, it accesses the data directly in the underlying object storage (potentially using presigned URLs), without routing it through lakeFS. For instance, if the data is stored in S3, the Spark client retrieves the S3 paths from lakeFS and then reads and writes those paths directly in S3, without involving lakeFS in the data transfer.

<img src="{{ site.baseurl }}/assets/img/s3gatewayvsclientdataflow.png" alt="lakeFS Clients vs Gateway Data Flow" width="500px"/>

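As a rough illustration of the two-step flow described above, the sketch below first asks lakeFS for an object's metadata (its physical address) and then reads the bytes directly from S3. This is a minimal sketch of the idea, not the actual implementation of any client: the endpoint, credentials, repository, and object path are hypothetical placeholders, and the exact API paths and response fields may differ between lakeFS versions.

```python
# Minimal sketch: metadata from lakeFS, data directly from the object store.
# The endpoint, credentials, repository and path below are placeholders.
import boto3
import requests

LAKEFS_API = "https://lakefs.example.com/api/v1"        # hypothetical lakeFS endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")       # lakeFS credentials

# Step 1 - metadata: where does main:datasets/users.parquet physically live?
resp = requests.get(
    f"{LAKEFS_API}/repositories/my-repo/refs/main/objects/stat",
    params={"path": "datasets/users.parquet"},
    auth=AUTH,
)
resp.raise_for_status()
physical_address = resp.json()["physical_address"]      # e.g. s3://actual-bucket/namespace/data/...

# Step 2 - data: read the object straight from S3; lakeFS is not in this path.
# (Some setups would use a presigned URL here instead of direct S3 credentials.)
bucket, key = physical_address.removeprefix("s3://").split("/", 1)
body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

By contrast, a read through the lakeFS S3 gateway (for example, boto3 with its `endpoint_url` pointed at the lakeFS server) sends both the metadata lookup and the object bytes through lakeFS itself, which is the difference the diagram above illustrates.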

[data-quality-gates]: {% link understand/use_cases/cicd_for_data.md %}#using-hooks-as-data-quality-gates
[dynamodb-permissions]: {% link howto/deploy/aws.md %}#grant-dynamodb-permissions-to-lakefs
[roadmap]: {% link project/index.md %}#roadmap