# Docs: Add KV documentation (#3878)
Co-authored-by: Barak Amar <[email protected]>
Co-authored-by: itaiad200 <[email protected]>
Co-authored-by: itai-david <[email protected]>
Co-authored-by: eden-ohana <[email protected]>
5 people authored Sep 4, 2022
1 parent 4f13738 commit 249c3e2
Showing 8 changed files with 507 additions and 64 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Changelog

## v0.80.0 - 2022-08-31
This release requires running a database migration.
The lakeFS service will not run if the migration version isn't compatible with the binary.
Before starting the new version, you must run its `migrate` command, as sketched below.
Please refer to the [upgrade documentation](https://docs.lakefs.io/reference/upgrade.html#lakefs-0800-or-greater-kv-migration) for more information on the specific migration to KV.
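
A minimal sketch of that run, assuming the new (v0.80.0 or later) binary and a configuration file at an assumed path:

```sh
# Hypothetical config path; point --config at your own configuration.
lakefs --config /etc/lakefs/config.yaml migrate up
```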

This is the first lakeFS version running over a key-value store.
lakeFS is decoupling from PostgreSQL and moving to a KV store interface.
This provides greater flexibility and allows teams running lakeFS in production to select the backing database of their choice.
Check our updated [Deploy lakeFS](https://docs.lakefs.io/deploy/#deploy-lakefs) page for deployment instructions.
Also make sure to check our [Sizing Guide](https://docs.lakefs.io/understand/sizing-guide.html#lakefs-kv-store) for best practices, requirements, and benchmarks.

## v0.70.6 - 2022-08-30
- UI: fix focus on branch lookup while creating tag (#4005)

60 changes: 52 additions & 8 deletions docs/deploy/aws.md
@@ -12,15 +12,21 @@ redirect_from:
---

# Deploy lakeFS on AWS

{: .no_toc }
Expected deployment time: 25 min

{% include toc.html %}

{% include_relative includes/prerequisites.md %}

## Preparing the Database for the Key Value Store

lakeFS uses a key-value store to synchronize actions on your repositories. Out of the box, this key-value store can rely on DynamoDB or PostgreSQL. Since lakeFS is open source, you can also write your own implementation and use any other database.
The following two sections explain how to set up either PostgreSQL or DynamoDB as the key-value backing database.

### Creating a PostgreSQL Database on AWS RDS

We will show you how to create a database on AWS RDS, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the [next step](#installation-options).
@@ -33,19 +39,37 @@ If you already have a database, take note of the connection string and skip to t

3. Make sure your security group rules allow you to connect to the database instance.
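
Once the database is available, note its connection string. A sketch of the environment-variable form used later in this guide, with placeholder values:

```sh
# Placeholder endpoint, user, and password; substitute your own RDS values.
export LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="postgres://[USER]:[PASSWORD]@[RDS_ENDPOINT]:5432/postgres"
```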

### DynamoDB on AWS

DynamoDB on AWS does not require any specific preparation, other than configuring lakeFS to use it and providing valid AWS credentials. Please refer to the `database.dynamodb` section in the [configuration reference](../reference/configuration.md#reference) for the complete configuration options.
AWS credentials for DynamoDB can also be provided via environment variables, as described in the [configuration reference](../reference/configuration.md#using-environment-variables).
Please refer to the [AWS documentation](https://aws.amazon.com/dynamodb/getting-started/) for further information on DynamoDB.
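
For example, a minimal sketch of the environment-variable form. The variable names follow the mapping rule in the configuration reference, and all values are placeholders:

```sh
export LAKEFS_DATABASE_TYPE="dynamodb"
export LAKEFS_DATABASE_DYNAMODB_TABLE_NAME="[DYNAMODB_TABLE_NAME]"
export LAKEFS_DATABASE_DYNAMODB_AWS_REGION="[DYNAMODB_REGION]"
# Only needed when not relying on the default AWS credential chain:
export LAKEFS_DATABASE_DYNAMODB_AWS_ACCESS_KEY_ID="[AWS_ACCESS_KEY_ID]"
export LAKEFS_DATABASE_DYNAMODB_AWS_SECRET_ACCESS_KEY="[AWS_SECRET_ACCESS_KEY]"
```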

## Installation Options

### On EC2
1. Edit and save the following configuration file as `config.yaml`:

```yaml
---
database:
  type: "postgres" # or "dynamodb"

  # when using dynamodb
  dynamodb:
    table_name: "[DYNAMODB_TABLE_NAME]"
    aws_region: "[DYNAMODB_REGION]"

  # when using postgres
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"

auth:
  encrypt:
    # replace this with a randomly-generated string:
    secret_key: "[ENCRYPTION_SECRET_KEY]"

blockstore:
  type: s3
  s3:
```

@@ -54,20 +78,37 @@ If you already have a database, take note of the connection string and skip to t
1. [Download the binary](../index.md#downloads) to the EC2 instance.
1. Run the `lakefs` binary on the EC2 instance:
```sh
lakefs --config config.yaml run
```
**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities; a sketch follows.
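
A minimal sketch of such a setup, assuming the binary at `/usr/local/bin/lakefs`, the configuration at `/etc/lakefs/config.yaml`, and a dedicated `lakefs` user (all three are assumptions):

```sh
# Write a hypothetical unit file, then enable and start the service.
sudo tee /etc/systemd/system/lakefs.service <<'EOF'
[Unit]
Description=lakeFS server
After=network.target

[Service]
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
User=lakefs

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now lakefs
```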

### On ECS
To support container-based environments like AWS ECS, lakeFS can be configured using environment variables. Here are a couple of `docker run`
commands to demonstrate starting lakeFS using Docker:

#### With PostgreSQL

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="s3" \
treeverse/lakefs:latest run
```

#### With DynamoDB

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="dynamodb" \
-e LAKEFS_DATABASE_DYNAMODB_TABLE_NAME="[DYNAMODB_TABLE_NAME]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="s3" \
treeverse/lakefs:latest run
```

@@ -76,9 +117,11 @@ docker run \
See the [reference](../reference/configuration.md#using-environment-variables) for a complete list of environment variables.

### On EKS

See [Kubernetes Deployment](./k8s.md).

## Load balancing

Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server.
By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks.

@@ -91,4 +134,5 @@ By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which
1. Configure the health-check to use the exposed `/_health` URL (see the sketch below)
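
A quick manual check of the health endpoint (the host below is a placeholder; use your load balancer or instance address):

```sh
# A healthy lakeFS instance answers with a successful (2xx) response.
curl -i http://localhost:8000/_health
```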

## Next Steps

Your next step is to [prepare your storage](../setup/storage/index.md). If you already have a storage bucket/container, you're ready to [create your first lakeFS repository](../setup/create-repo.md).
7 changes: 5 additions & 2 deletions docs/deploy/azure.md
@@ -34,7 +34,9 @@ If you already have a database, take note of the connection string and skip to t
```yaml
---
database:
  type: "postgres"
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    # replace this with a randomly-generated string:
```

@@ -64,7 +66,8 @@ command to demonstrate starting lakeFS using Docker:

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="azure" \
-e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
```
7 changes: 5 additions & 2 deletions docs/deploy/gcp.md
@@ -37,7 +37,9 @@ For example, if you install lakeFS on GKE, you need to deploy the SQL Auth Proxy
```yaml
---
database:
  type: "postgres"
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    # replace this with a randomly-generated string:
```

@@ -64,7 +66,8 @@ command to demonstrate starting lakeFS using Docker:

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="gs" \
treeverse/lakefs:latest run
```
62 changes: 50 additions & 12 deletions docs/reference/configuration.md
@@ -28,15 +28,43 @@ This reference uses `.` to denote the nesting of values.

* `logging.format` `(one of ["json", "text"] : "text")` - Format to output log messages in
* `logging.level` `(one of ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "NONE"] : "DEBUG")` - Logging level to output
* `logging.audit_log_level` `(one of ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "NONE"] : "DEBUG")` - Audit logs level to output.

  **Note:** If you configure this field lower than the main logger level, you won't get the audit logs
  {: .note }
* `logging.output` `(string : "-")` - A path or paths to write logs to. A `-` means the standard output, `=` means the standard error.
* `logging.file_max_size_mb` `(int : 100)` - Output file maximum size in megabytes.
* `logging.files_keep` `(int : 0)` - Number of log files to keep, default is all.
* `actions.enabled` `(bool : true)` - Setting this to false will block hooks from being executed
* ~~`database.connection_string` `(string : "postgres://localhost:5432/postgres?sslmode=disable")` - PostgreSQL connection string to use~~
* ~~`database.max_open_connections` `(int : 25)` - Maximum number of open connections to the database~~
* ~~`database.max_idle_connections` `(int : 25)` - Sets the maximum number of connections in the idle connection pool~~
* ~~`database.connection_max_lifetime` `(duration : 5m)` - Sets the maximum amount of time a connection may be reused~~

  **Note:** Deprecated. See the `database` section below
  {: .note }
* `database` - Configuration section for the lakeFS key-value store database
  + `database.type` `(string : ["postgres"|"dynamodb"])` - lakeFS database type
  + `database.postgres` - Configuration section when using `database.type="postgres"`
    + `database.postgres.connection_string` `(string : "postgres://localhost:5432/postgres?sslmode=disable")` - PostgreSQL connection string to use
    + `database.postgres.max_open_connections` `(int : 25)` - Maximum number of open connections to the database
    + `database.postgres.max_idle_connections` `(int : 25)` - Maximum number of connections in the idle connection pool
    + `database.postgres.connection_max_lifetime` `(duration : 5m)` - Maximum amount of time a connection may be reused `(valid units: ns|us|ms|s|m|h)`
  + `database.dynamodb` - Configuration section when using `database.type="dynamodb"`
    + `database.dynamodb.table_name` `(string : "kvstore")` - Table used to store the data
    + `database.dynamodb.read_capacity_units` `(int : 1000)` - Read capacity units, measured in requests per second
    + `database.dynamodb.write_capacity_units` `(int : 1000)` - Write capacity units, measured in requests per second
    + `database.dynamodb.scan_limit` `(int : )` - Maximum number of items per page during a scan operation

      **Note:** Refer to the [AWS documentation](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit) for further information
      {: .note }
    + `database.dynamodb.endpoint` `(string : )` - Endpoint URL for the database instance
    + `database.dynamodb.aws_region` `(string : )` - AWS region of the database instance
    + `database.dynamodb.aws_access_key_id` `(string : )` - AWS access key ID
    + `database.dynamodb.aws_secret_access_key` `(string : )` - AWS secret access key

      **Note:** `endpoint`, `aws_region`, `aws_access_key_id`, and `aws_secret_access_key` are not required, and are used mainly for experimental purposes when working with DynamoDB using non-default AWS credentials
      {: .note }
* `listen_address` `(string : "0.0.0.0:8000")` - A `<host>:<port>` structured string representing the address to listen on
* `auth.cache.enabled` `(bool : true)` - Whether to cache access credentials and user policies in-memory. Can greatly improve throughput when enabled.
* `auth.cache.size` `(int : 1024)` - How many items to store in the auth cache. Systems with a very high user count should use a larger value at the expense of ~1kb of memory per cached user.
@@ -146,14 +174,16 @@ To set an environment variable, prepend `LAKEFS_` to its name, convert it to upp
For example, `logging.format` becomes `LAKEFS_LOGGING_FORMAT`, `blockstore.s3.region` becomes `LAKEFS_BLOCKSTORE_S3_REGION`, etc.
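
A short sketch of this mapping, applied to settings that appear in the examples below:

```sh
# logging.format                       -> LAKEFS_LOGGING_FORMAT
export LAKEFS_LOGGING_FORMAT="json"
# database.type                        -> LAKEFS_DATABASE_TYPE
export LAKEFS_DATABASE_TYPE="postgres"
# database.postgres.connection_string  -> LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING
export LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="postgres://localhost:5432/postgres?sslmode=disable"
```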


## Example: Local Development with a PostgreSQL database

```yaml
---
listen_address: "0.0.0.0:8000"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://localhost:5432/postgres?sslmode=disable"

logging:
  format: text
@@ -175,7 +205,7 @@ gateways:
```
## Example: AWS Deployment with a DynamoDB database
```yaml
---
@@ -185,7 +215,9 @@ logging:
  output: "-"

database:
  type: "dynamodb"
  dynamodb:
    table_name: "kvstore"

auth:
  encrypt:
```

@@ -213,7 +245,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```

@@ -236,7 +270,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```

@@ -263,7 +299,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```
30 changes: 30 additions & 0 deletions docs/reference/database-migration.md
@@ -0,0 +1,30 @@
---
layout: default
title: Database Migration
description: A guide to migrating lakeFS database.
parent: Reference
nav_order: 51
has_children: false
---

# lakeFS Database Migrate
{: .no_toc }

**Note:** Feature in development
{: .note }

The lakeFS database migration tool simplifies switching from one database implementation to another.
More information can be found [here](https://github.com/treeverse/lakeFS/issues/3899).

## lakeFS with Key Value Store

Starting with version 0.80.0, lakeFS abandoned its tight coupling to [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL) and moved all database operations to work over a [key-value store](https://en.wikipedia.org/wiki/Key%E2%80%93value_database).

While SQL databases, Postgres among them, have their obvious advantages, we felt that the tight coupling to Postgres was limiting our users, and so lakeFS over a key-value store was introduced.
Our KV store implements a generic interface, with methods for `Get`, `Set`, `Compare-and-Set`, `Delete`, and `Scan`. Each entry is represented by a [`partition`, `key`, `value`] triplet. All of these fields are generic byte arrays, giving the module using them maximal flexibility in the format of each field.
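
A minimal Go sketch of such an interface; the names and signatures are illustrative, not the exact lakeFS API:

```go
package kv

import "context"

// Store is a hypothetical rendering of the generic KV interface described
// above; partitions, keys, and values are all opaque byte slices.
type Store interface {
	Get(ctx context.Context, partition, key []byte) ([]byte, error)
	Set(ctx context.Context, partition, key, value []byte) error
	// SetIf performs a compare-and-set: the write succeeds only if the
	// currently stored value equals valuePredicate.
	SetIf(ctx context.Context, partition, key, value, valuePredicate []byte) error
	Delete(ctx context.Context, partition, key []byte) error
	// Scan iterates over the keys of a partition, starting at start.
	Scan(ctx context.Context, partition, start []byte) (EntriesIterator, error)
}

// EntriesIterator walks scan results in key order.
type EntriesIterator interface {
	Next() bool
	Entry() (key, value []byte)
	Err() error
	Close()
}
```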

Under the hood, our KV implementation relies on a backing database, which persists the data. Theoretically, it could be any type of database; out of the box, we have already implemented drivers for [DynamoDB](https://en.wikipedia.org/wiki/Amazon_DynamoDB), for AWS users, and [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL), using its relational nature to store a KV store. More databases will be supported in the future, and lakeFS users and contributors can develop their own drivers for their favorite databases. For experimentation, an in-memory KV store can be used, though it obviously lacks persistence.

In order to store ref-store objects (that is, `Repositories`, `Branches`, `Commits`, `Tags`, and `Uncommitted Objects`), lakeFS implements another layer over the generic KV store, which supports serialization and deserialization of these objects as [protobuf](https://en.wikipedia.org/wiki/Protocol_Buffers). As this layer relies on the generic interface of the KV store layer, it is completely agnostic to whichever store implementation is in use, giving our users maximal flexibility.

For further reading, please refer to our [KV Design](https://github.com/treeverse/lakeFS/blob/master/design/accepted/metadata_kv/index.md).
