Docs: explain mounts on quickstart, working locally (#8001)
ozkatz authored Jul 23, 2024
1 parent b4f070c commit 04287d1
Showing 2 changed files with 133 additions and 12 deletions.
75 changes: 68 additions & 7 deletions docs/howto/local-checkouts.md
@@ -8,11 +8,12 @@ parent: How-To

lakeFS is a data version control system designed to scale to billions of objects. The larger the data, the less
feasible it becomes to consume it from a single machine. lakeFS addresses this challenge by enabling efficient management
of large-scale data stored remotely. In addition to its capability to manage large datasets, lakeFS offers the flexibility
to perform partial checkouts when necessary for working with specific portions of the data locally.
of large-scale data stored remotely.

This page explains `lakectl local`, a command that lets you clone specific portions of lakeFS' data to your local environment,
and to keep remote and local locations in sync.
In addition to its capability to manage large datasets, lakeFS offers the flexibility
to work with versioned data by exposing it as a local filesystem directory.

This page explains [lakeFS Mount](../reference/mount.html) and `lakectl local`: two common ways of exposing lakeFS data locally, with different performance characteristics.

{% include toc.html %}

@@ -39,10 +40,70 @@ storage, resulting in cost savings.

<iframe width="420" height="315" src="https://www.youtube.com/embed/afgQnmesLZM"></iframe>

## **lakectl local**: The way to work with lakeFS data locally
## **lakeFS Mount**: Efficiently expose lakeFS Data as a local directory

⚠️ lakeFS Mount is currently in preview. No installation is required; please [contact us](https://info.lakefs.io/thanks-lakefs-mounts) to get access.
{: .note }

#### Prerequisites:

- A working lakeFS Server running either [lakeFS Enterprise](../enterprise) or [lakeFS Cloud](../cloud)
- You’ve installed the [`lakectl`](../reference/cli.html) command line utility: this is the official lakeFS command line interface, on top of which lakeFS Mount is built.
- lakectl is configured properly to access your lakeFS server as detailed in the configuration instructions

### Mounting a lakeFS reference as a local directory

lakeFS Mount works by exposing a virtual mountpoint on the host computer.

This "acts" as a local directory, allowing applications to read, write, and interact with data as if it were all local to the machine. Behind the scenes, lakeFS Mount optimizes access by lazily fetching data as requested, caching accessed objects, and efficiently managing metadata. [Read more about how lakeFS Mount optimizes performance](../reference/mount.html)
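
To make "lazily fetching" and "caching accessed objects" concrete, here is a toy sketch of the general idea. It is not lakeFS Mount's implementation: the `LazyObjectCache` class, the in-memory `remote` dictionary and the `remote_calls` counter are all hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class LazyObjectCache:
    """Toy model of lazy fetching with a local cache (not lakeFS Mount's code)."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # hypothetical remote fetch function
        self._cache: Dict[str, bytes] = {}  # objects already read at least once
        self.remote_calls = 0               # how many times the "remote" was hit

    def read(self, key: str) -> bytes:
        # Fetch only on first access; later reads are served from the cache.
        if key not in self._cache:
            self.remote_calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


# Hypothetical in-memory "remote" standing in for object storage.
remote = {"images/a.png": b"aaa", "images/b.png": b"bb"}
cache = LazyObjectCache(remote.__getitem__)

cache.read("images/a.png")
cache.read("images/a.png")  # second read does not touch the remote
print(cache.remote_calls)   # prints 1
```

Nothing is transferred until an object is first read, and repeated reads of the same object cost no further remote calls. A real mount would also need to handle eviction, metadata and writes, which this sketch ignores.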

Mounting a reference is a single command:

```bash
everest mount lakefs://example-repo/example-branch/path/to/data/ ./my_local_dir
```

Once executed, the `my_local_dir` directory should appear to have the contents of the remote path we provided. We can verify this:

```bash
ls -l ./my_local_dir/
```

This should return the listing of the mounted path.
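
The `lakefs://` URI passed to the mount command has the shape `lakefs://<repository>/<ref>/<path>`. As a rough sketch, it can be split into its parts with the standard library; `LakeFSURI` and `parse_lakefs_uri` are hypothetical helper names, not part of `lakectl` or any lakeFS SDK.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class LakeFSURI(NamedTuple):
    repository: str
    ref: str
    path: str


def parse_lakefs_uri(uri: str) -> LakeFSURI:
    """Split lakefs://<repository>/<ref>/<path> into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "lakefs" or not parts.netloc:
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # The first path segment is the ref; everything after it is the object path.
    ref, _, path = parts.path.lstrip("/").partition("/")
    if not ref:
        raise ValueError(f"missing ref in URI: {uri!r}")
    return LakeFSURI(parts.netloc, ref, path)


print(parse_lakefs_uri("lakefs://example-repo/example-branch/path/to/data/"))
```

A check like this can catch a malformed URI before it is handed to the mount command.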

lakeFS Mount allows quite a bit of tuning to ensure optimal performance. [Read more](../reference/mount.html) about how lakeFS Mount works and how to configure it.
{: .note }

### Reading from a mount

Reading from a lakeFS Mount requires no special tools, integrations or SDKs! Simply point your code to the directory and read from it as if it were local:


```python
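#!/usr/bin/env python
import glob


def process(f):
    # Placeholder for your own logic; here we just report the file size.
    print(f.name, len(f.read()))


for image_path in glob.glob('./my_local_dir/*.png'):
    with open(image_path, 'rb') as f:
        process(f)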
```
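
Because the mount behaves like an ordinary directory, other standard-library tools should work on it unchanged. The sketch below totals file sizes per extension with `pathlib`; `summarize_directory` is a hypothetical helper, and `./my_local_dir` is the mount point used in the examples on this page.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(root: str) -> Counter:
    """Return the total number of bytes per file extension under root."""
    totals: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "<none>".
            totals[path.suffix.lower() or "<none>"] += path.stat().st_size
    return totals


# Only runs if the mount point from the examples above exists.
if Path("./my_local_dir").is_dir():
    for ext, size in sorted(summarize_directory("./my_local_dir").items()):
        print(f"{ext}\t{size} bytes")
```

Note that walking the whole tree reads metadata for every object under the mounted path, so on very large prefixes it may be worth pointing this at a subdirectory.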

### Unmounting

When done, simply run:

```bash
everest umount ./my_local_dir
```

This will unmount the lakeFS Mount and clean up its background tasks.
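
When mounting from a script, it is easy to forget the unmount step if something fails midway. One possible approach is to wrap the two commands in a context manager, sketched below. The helper names (`mount_command`, `umount_command`, `mounted`) are hypothetical, and the sketch assumes that `everest mount` returns once the mount is ready and that both commands exit non-zero on failure; neither assumption is confirmed on this page.

```python
import subprocess
from contextlib import contextmanager
from typing import Iterator, List


def mount_command(uri: str, target: str) -> List[str]:
    # Mirrors: everest mount <lakefs-uri> <local-dir>
    return ["everest", "mount", uri, target]


def umount_command(target: str) -> List[str]:
    # Mirrors: everest umount <local-dir>
    return ["everest", "umount", target]


@contextmanager
def mounted(uri: str, target: str, run=subprocess.run) -> Iterator[str]:
    """Mount on entry and always unmount on exit, even if the body raises."""
    # Assumption: the mount command returns once the mount is usable.
    run(mount_command(uri, target), check=True)
    try:
        yield target
    finally:
        run(umount_command(target), check=True)
```

Usage would look like `with mounted("lakefs://example-repo/example-branch/path/to/data/", "./my_local_dir") as d: ...`. The `run` parameter exists only so the subprocess call can be swapped out, for example in tests.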


## **lakectl local**: Sync lakeFS data with a local directory

The _local_ command of lakeFS' CLI _lakectl_ enables working with lakeFS data locally.
It allows cloning lakeFS data into a directory on any machine, syncing local directories with remote lakeFS locations,
The _local_ command of lakeFS' CLI _lakectl_ enables working with lakeFS data locally by copying the data onto the host machine.
It allows syncing local directories with remote lakeFS locations,
and [seamlessly integrating lakeFS with Git](#example-using-lakectl-local-in-tandem-with-git).

Here are the available _lakectl local_ commands:
70 changes: 65 additions & 5 deletions docs/quickstart/work-with-data-locally.md
@@ -14,10 +14,66 @@ locally is machine learning model development. Machine learning model developmen
process, experiments need to be conducted with speed, tracking ease, and reproducibility. Localizing model data during development
accelerates the process by enabling interactive and offline development and reducing data access latency.

We're going to use [lakectl local](../howto/local-checkouts.md) to bring a subset of our lakeFS data to a local directory within the lakeFS
container and edit an image dataset used for ML model development.
lakeFS provides two ways to expose versioned data locally:

## Cloning a Subset of lakeFS Data into a Local Directory
{% include toc.html %}

## lakeFS Mount

lakeFS Mount is available (in preview) for [lakeFS Enterprise](../enterprise/index.html) and [lakeFS Cloud](../cloud/index.html) customers. <br/>
You can try it out by [signing up for the preview](https://info.lakefs.io/thanks-lakefs-mounts).
{: .note }

### Getting started with lakeFS Mount

Prerequisites:

- A working lakeFS Server running either lakeFS Enterprise or lakeFS Cloud
- You’ve installed the lakectl command line utility: this is the official lakeFS command line interface, on top of which lakeFS Mount is built.
- lakectl is configured properly to access your lakeFS server as detailed in the configuration instructions

### Mounting a path to a local directory:

1. In lakeFS create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:

```bash
docker exec lakefs \
lakectl branch create \
lakefs://quickstart/my-experiment \
--source lakefs://quickstart/main
```

2. Mount images from your quickstart repository into a local directory named `my_local_dir`:
```bash
everest mount lakefs://quickstart/my-experiment/images my_local_dir
```

Once complete, `my_local_dir` should be mounted with the specified path.

3. Verify that `my_local_dir` is linked to the correct path in your lakeFS remote:

```bash
ls -l my_local_dir
```


4. To unmount the directory, simply run:

```bash
everest umount ./my_local_dir
```

This will unmount the path and terminate the local mount server.


## lakectl local

Alternatively, we can use [lakectl local](../howto/local-checkouts.md#sync-a-local-directory-with-lakefs) to bring a subset of our lakeFS data to a local directory within the lakeFS
container and edit an image dataset used for ML model development. Unlike lakeFS Mount, `lakectl local` requires copying data between lakeFS and your local machine.


### Cloning a Subset of lakeFS Data into a Local Directory
{: .no_toc }

1. In lakeFS create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:

@@ -57,7 +113,8 @@ container and edit an image dataset used for ML model development.
No diff found.
```

## Making Changes to Data Locally
### Making Changes to Data Locally
{: .no_toc }

1. Download a new image of an Axolotl and add it to the dataset cloned into `my_local_dir`:

@@ -92,7 +149,8 @@ container and edit an image dataset used for ML model development.
╚════════╩══════════╩═════════════════════╝
```

## Pushing Local Changes to lakeFS
### Pushing Local Changes to lakeFS
{: .no_toc }

Once we are done with editing the image dataset in our local environment, we will push our changes to the lakeFS remote so that
the improved dataset is shared and versioned.
@@ -112,6 +170,7 @@ the improved dataset is shared and versioned.
<img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-02.png" alt="A comparison between a branch that includes local changes to the main branch" class="quickstart"/>

## Bonus Challenge
{: .no_toc }

And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you.

@@ -123,6 +182,7 @@ object 2023-03-21 14:45:38 +0000 UTC 916.4 kB lakes.parquet
```
# Finishing Up
{: .no_toc }
Once you've finished the quickstart, shut down your local environment with the following command: