From 04287d153b287272e7c1f479f6538451a92fb6b0 Mon Sep 17 00:00:00 2001
From: Oz Katz
Date: Tue, 23 Jul 2024 17:06:44 +0300
Subject: [PATCH] Docs: explain mounts on quickstart, working locally (#8001)

---
 docs/howto/local-checkouts.md             | 75 ++++++++++++++++++++---
 docs/quickstart/work-with-data-locally.md | 70 +++++++++++++++++++--
 2 files changed, 133 insertions(+), 12 deletions(-)

diff --git a/docs/howto/local-checkouts.md b/docs/howto/local-checkouts.md
index c1b0e7dfd1a..2266d184204 100644
--- a/docs/howto/local-checkouts.md
+++ b/docs/howto/local-checkouts.md
@@ -8,11 +8,12 @@ parent: How-To
 
 lakeFS is a scalable data version control system designed to scale to billions of objects. The larger the data, the less
 feasible it becomes to consume it from a single machine. lakeFS addresses this challenge by enabling efficient management
-of large-scale data stored remotely. In addition to its capability to manage large datasets, lakeFS offers the flexibility
-to perform partial checkouts when necessary for working with specific portions of the data locally.
+of large-scale data stored remotely.
 
-This page explains `lakectl local`, a command that lets you clone specific portions of lakeFS' data to your local environment,
-and to keep remote and local locations in sync.
+In addition to its capability to manage large datasets, lakeFS offers the flexibility
+to work with versioned data by exposing it as a local filesystem directory.
+
+This page explains [lakeFS Mount](../reference/mount.html) and `lakectl local`: two common ways of exposing lakeFS data locally, with different performance characteristics.
 
 {% include toc.html %}
 
@@ -39,10 +40,70 @@ storage, resulting in cost savings.
 
 
 
-## **lakectl local**: The way to work with lakeFS data locally
+## **lakeFS Mount**: Efficiently expose lakeFS Data as a local directory
+
+⚠️ lakeFS Mount is currently in preview.
There is no installation required, please [contact us](https://info.lakefs.io/thanks-lakefs-mounts) to get access.
+{: .note }
+
+#### Prerequisites:
+
+- A working lakeFS Server running either [lakeFS Enterprise](../enterprise) or [lakeFS Cloud](../cloud)
+- You’ve installed the [`lakectl`](../reference/cli.html) command line utility: this is the official lakeFS command line interface, on top of which lakeFS Mount is built.
+- lakectl is configured properly to access your lakeFS server as detailed in the configuration instructions
+
+### Mounting a lakeFS reference as a local directory
+
+lakeFS Mount works by exposing a virtual mountpoint on the host computer.
+
+This "acts" as a local directory, allowing applications to read write and interact with data as it is all local to the machine, while lakeFS Mount optimizes this behind the scenes by lazily fetching data as requested, caching accessed objects and efficiently managing metadata to ensure best in class performance. [Read more about how lakeFS Mount optimizes performance](../reference/mount.html)
+
+Mounting a reference is a single command:
+
+```bash
+everest mount lakefs://example-repo/example-branch/path/to/data/ ./my_local_dir
+```
+
+Once executed, the `my_local_dir` directory should appear to have the contents of the remote path we provided. We can verify this:
+
+```bash
+ls -l ./my_local_dir/
+```
+
+Which should return the listing of the mounted path.
+
+lakeFS Mount allows quite a bit of tuning to ensure optimal performance. [Read more](../reference/mount.html) about how lakeFS Mount works and how to configure it.
+{: .note }
+
+### Reading from a mount
+
+Reading from a lakeFS Mount requires no special tools, integrations or SDKs!
Simply point your code to the directory and read from it as if it was in fact local:
+
+
+```python
+#!/usr/bin/env python
+import glob
+
+for image_path in glob.glob('./my_local_dir/*.png'):
+    with open(image_path, 'rb') as f:
+        process(f)
+
+```
+
+### Unmounting
+
+When done, simply run:
+
+```bash
+everest umount ./my_local_dir
+```
+
+This will unmount the lakeFS Mount, cleaning up background tasks
+
+
+## **lakectl local**: Sync lakeFS data with a local directory
 
-The _local_ command of lakeFS' CLI _lakectl_ enables working with lakeFS data locally.
-It allows cloning lakeFS data into a directory on any machine, syncing local directories with remote lakeFS locations,
+The _local_ command of lakeFS' CLI _lakectl_ enables working with lakeFS data locally by copying the data onto the host machine.
+It allows syncing local directories with remote lakeFS locations,
 and to [seamlessly integrate lakeFS with Git](#example-using-lakectl-local-in-tandem-with-git).
 
 Here are the available _lakectl local_ commands:
diff --git a/docs/quickstart/work-with-data-locally.md b/docs/quickstart/work-with-data-locally.md
index a8c9e1f509a..5d7e84006e4 100644
--- a/docs/quickstart/work-with-data-locally.md
+++ b/docs/quickstart/work-with-data-locally.md
@@ -14,10 +14,66 @@ locally is machine learning model development. Machine learning model developmen
 process, experiments need to be conducted with speed, tracking ease, and reproducibility. Localizing model data during development
 accelerates the process by enabling interactive and offline development and reducing data access latency.
 
-We're going to use [lakectl local](../howto/local-checkouts.md) to bring a subset of our lakeFS data to a local directory within the lakeFS
-container and edit an image dataset used for ML model development.
+lakeFS provides 2 ways to expose versioned data locally
 
-## Cloning a Subset of lakeFS Data into a Local Directory
+{% include toc.html %}
+
+## lakeFS Mount
+
+lakeFS Mount is available (in preview) for [lakeFS Enterprise](../enterprise/index.html) and [lakeFS Cloud](../cloud/index.html) customers.
+You can try it out by [signing up for the preview](https://info.lakefs.io/thanks-lakefs-mounts)
+{: .note }
+
+### Getting started with lakeFS Mount
+
+Prerequisites:
+
+- A working lakeFS Server running either lakeFS Enterprise or lakeFS Cloud
+- You’ve installed the lakectl command line utility: this is the official lakeFS command line interface, on top of which lakeFS Mount is built.
+- lakectl is configured properly to access your lakeFS server as detailed in the configuration instructions
+
+### Mounting a path to a local directory:
+
+1. In lakeFS create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:
+
+    ```bash
+    docker exec lakefs \
+        lakectl branch create \
+            lakefs://quickstart/my-experiment \
+            --source lakefs://quickstart/main
+    ```
+
+2. Mount images from your quickstart repository into a local directory named `my_local_dir`
+```bash
+everest mount lakefs://quickstart/my-experiment/images my_local_dir
+```
+
+Once complete, `my_local_dir` should be mounted with the specified path.
+
+3. Verify that `my_local_dir` is linked to the correct path in your lakeFS remote:
+
+```bash
+ls -l my_local_dir
+```
+
+
+4. To unmount the directory, simply run:
+
+```bash
+everest umount ./my_local_dir
+```
+
+Which will unmount the path and terminate the local mount-server.
+
+
+## lakectl local
+
+Alternatively, we can use [lakectl local](../howto/local-checkouts.md#sync-a-local-directory-with-lakefs) to bring a subset of our lakeFS data to a local directory within the lakeFS
+container and edit an image dataset used for ML model development. Unlike lakeFS Mount, using `lakectl local` requires copying data to/from lakeFS and your local machine.
+
+
+### Cloning a Subset of lakeFS Data into a Local Directory
+{: .no_toc }
 
 1. In lakeFS create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:
 
@@ -57,7 +113,8 @@ container and edit an image dataset used for ML model development.
     No diff found.
     ```
 
-## Making Changes to Data Locally
+### Making Changes to Data Locally
+{: .no_toc }
 
 1. Download a new image of an Axolotl and add it to the dataset cloned into `my_local_dir`:
 
@@ -92,7 +149,8 @@ container and edit an image dataset used for ML model development.
     ╚════════╩══════════╩═════════════════════╝
     ```
 
-## Pushing Local Changes to lakeFS
+### Pushing Local Changes to lakeFS
+{: .no_toc }
 
 Once we are done with editing the image dataset in our local environment, we will push our changes to the lakeFS remote so that
 the improved dataset is shared and versioned.
@@ -112,6 +170,7 @@ the improved dataset is shared and versioned.
 A comparison between a branch that includes local changes to the main branch
 
 ## Bonus Challenge
+{: .no_toc }
 
 And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you.
 
@@ -123,6 +182,7 @@ object 2023-03-21 14:45:38 +0000 UTC 916.4 kB lakes.parquet
 ```
 
 # Finishing Up
+{: .no_toc }
 
 Once you've finished the quickstart, shut down your local environment with the following command: