# Docs: Add KV documentation (#3878)
Co-authored-by: Barak Amar <[email protected]>
Co-authored-by: itaiad200 <[email protected]>
Co-authored-by: itai-david <[email protected]>
Co-authored-by: eden-ohana <[email protected]>
5 people authored Sep 4, 2022
1 parent 4f13738 commit 249c3e2
Showing 8 changed files with 507 additions and 64 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Changelog

## v0.80.0 - 2022-08-31
This release requires running a database migration.
The lakeFS service will not run if the migration version isn't compatible with the binary.
Before starting the new version, you must run its `migrate` command, as sketched below.
Please refer to the [upgrade documentation](https://docs.lakefs.io/reference/upgrade.html#lakefs-0800-or-greater-kv-migration) for more information on the specific migration to KV.
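
A minimal sketch of that run, assuming the new (v0.80.0 or later) binary and a configuration file at an assumed path:

```sh
# Hypothetical config path; point --config at your own configuration.
lakefs --config /etc/lakefs/config.yaml migrate up
```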

This is the first lakeFS version running over a key-value store.
lakeFS is decoupling from PostgreSQL and moving to a KV store interface.
This provides greater flexibility and allows teams running lakeFS in production to select the backing database of their choice.
Check our updated [Deploy lakeFS](https://docs.lakefs.io/deploy/#deploy-lakefs) page for deployment instructions.
Also make sure to check our [Sizing Guide](https://docs.lakefs.io/understand/sizing-guide.html#lakefs-kv-store) for best practices, requirements, and benchmarks.

## v0.70.6 - 2022-08-30
- UI: fix focus on branch lookup while creating tag (#4005)

60 changes: 52 additions & 8 deletions docs/deploy/aws.md
@@ -12,15 +12,21 @@ redirect_from:
---

# Deploy lakeFS on AWS

{: .no_toc }
Expected deployment time: 25 min

{% include toc.html %}

{% include_relative includes/prerequisites.md %}

## Preparing the Database for the Key Value Store

lakeFS uses a key-value store to synchronize actions on your repositories. Out of the box, this key-value store can rely on DynamoDB or PostgreSQL. Since lakeFS is open source, you can also write your own implementation and use any other database.
The following two sections explain how to set up either PostgreSQL or DynamoDB as the key-value backing database.

### Creating a PostgreSQL Database on AWS RDS

We will show you how to create a database on AWS RDS, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the [next step](#installation-options).
@@ -33,19 +39,37 @@ If you already have a database, take note of the connection string and skip to t

3. Make sure your security group rules allow you to connect to the database instance.
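
Once the database is available, note its connection string. A sketch of the environment-variable form used later in this guide, with placeholder values:

```sh
# Placeholder endpoint, user, and password; substitute your own RDS values.
export LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="postgres://[USER]:[PASSWORD]@[RDS_ENDPOINT]:5432/postgres"
```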

### DynamoDB on AWS

DynamoDB on AWS does not require any specific preparation, other than configuring lakeFS to use it and providing valid AWS credentials. Please refer to the `database.dynamodb` section in the [configuration reference](../reference/configuration.md#reference) for the complete configuration options.
AWS credentials for DynamoDB can also be provided via environment variables, as described in the [configuration reference](../reference/configuration.md#using-environment-variables).
Please refer to the [AWS documentation](https://aws.amazon.com/dynamodb/getting-started/) for further information on DynamoDB.
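
For example, a minimal sketch of the environment-variable form. The variable names follow the mapping rule in the configuration reference, and all values are placeholders:

```sh
export LAKEFS_DATABASE_TYPE="dynamodb"
export LAKEFS_DATABASE_DYNAMODB_TABLE_NAME="[DYNAMODB_TABLE_NAME]"
export LAKEFS_DATABASE_DYNAMODB_AWS_REGION="[DYNAMODB_REGION]"
# Only needed when not relying on the default AWS credential chain:
export LAKEFS_DATABASE_DYNAMODB_AWS_ACCESS_KEY_ID="[AWS_ACCESS_KEY_ID]"
export LAKEFS_DATABASE_DYNAMODB_AWS_SECRET_ACCESS_KEY="[AWS_SECRET_ACCESS_KEY]"
```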

## Installation Options

### On EC2
1. Edit and save the following configuration file as `config.yaml`:

```yaml
---
database:
  type: "postgres" # or "dynamodb"

  # when using dynamodb
  dynamodb:
    table_name: "[DYNAMODB_TABLE_NAME]"
    aws_region: "[DYNAMODB_REGION]"

  # when using postgres
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"

auth:
  encrypt:
    # replace this with a randomly-generated string:
    secret_key: "[ENCRYPTION_SECRET_KEY]"

blockstore:
  type: s3
  s3:
```

@@ -54,20 +78,37 @@ If you already have a database, take note of the connection string and skip to t
1. [Download the binary](../index.md#downloads) to the EC2 instance.
1. Run the `lakefs` binary on the EC2 instance:
```sh
lakefs --config config.yaml run
```
**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities; a sketch follows.
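
A minimal sketch of such a setup, assuming the binary at `/usr/local/bin/lakefs`, the configuration at `/etc/lakefs/config.yaml`, and a dedicated `lakefs` user (all three are assumptions):

```sh
# Write a hypothetical unit file, then enable and start the service.
sudo tee /etc/systemd/system/lakefs.service <<'EOF'
[Unit]
Description=lakeFS server
After=network.target

[Service]
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
User=lakefs

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now lakefs
```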

### On ECS
To support container-based environments like AWS ECS, lakeFS can be configured using environment variables. Here are a couple of `docker run`
commands to demonstrate starting lakeFS using Docker:

#### With PostgreSQL

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="s3" \
treeverse/lakefs:latest run
```

#### With DynamoDB

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="dynamodb" \
-e LAKEFS_DATABASE_DYNAMODB_TABLE_NAME="[DYNAMODB_TABLE_NAME]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="s3" \
treeverse/lakefs:latest run
```

@@ -76,9 +117,11 @@ docker run \
See the [reference](../reference/configuration.md#using-environment-variables) for a complete list of environment variables.

### On EKS

See [Kubernetes Deployment](./k8s.md).

## Load balancing

Depending on how you chose to install lakeFS, you should have a load balancer direct requests to the lakeFS server.
By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which you can use for health checks.

@@ -91,4 +134,5 @@ By default, lakeFS operates on port 8000, and exposes a `/_health` endpoint which
1. Configure the health-check to use the exposed `/_health` URL (see the sketch below)
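
A quick manual check of the health endpoint (the host below is a placeholder; use your load balancer or instance address):

```sh
# A healthy lakeFS instance answers with a successful (2xx) response.
curl -i http://localhost:8000/_health
```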

## Next Steps

Your next step is to [prepare your storage](../setup/storage/index.md). If you already have a storage bucket/container, you're ready to [create your first lakeFS repository](../setup/create-repo.md).
7 changes: 5 additions & 2 deletions docs/deploy/azure.md
@@ -34,7 +34,9 @@ If you already have a database, take note of the connection string and skip to t
```yaml
---
database:
  type: "postgres"
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    # replace this with a randomly-generated string:
```

@@ -64,7 +66,8 @@ command to demonstrate starting lakeFS using Docker:

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="azure" \
-e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
```
7 changes: 5 additions & 2 deletions docs/deploy/gcp.md
@@ -37,7 +37,9 @@ For example, if you install lakeFS on GKE, you need to deploy the SQL Auth Proxy
```yaml
---
database:
  type: "postgres"
  postgres:
    connection_string: "[DATABASE_CONNECTION_STRING]"
auth:
  encrypt:
    # replace this with a randomly-generated string:
```

@@ -64,7 +66,8 @@ command to demonstrate starting lakeFS using Docker:

```sh
docker run \
--name lakefs \
-p 8000:8000 \
-e LAKEFS_DATABASE_TYPE="postgres" \
-e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
-e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
-e LAKEFS_BLOCKSTORE_TYPE="gs" \
treeverse/lakefs:latest run
```
62 changes: 50 additions & 12 deletions docs/reference/configuration.md
@@ -28,15 +28,43 @@ This reference uses `.` to denote the nesting of values.

* `logging.format` `(one of ["json", "text"] : "text")` - Format to output log messages in
* `logging.level` `(one of ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "NONE"] : "DEBUG")` - Logging level to output
* `logging.audit_log_level` `(one of ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "NONE"] : "DEBUG")` - Audit logs level to output.

  **Note:** If you configure this field lower than the main logger level, you won't get the audit logs
  {: .note }
* `logging.output` `(string : "-")` - A path or paths to write logs to. A `-` means the standard output, `=` means the standard error.
* `logging.file_max_size_mb` `(int : 100)` - Output file maximum size in megabytes.
* `logging.files_keep` `(int : 0)` - Number of log files to keep, default is all.
* `actions.enabled` `(bool : true)` - Setting this to false will block hooks from being executed
* ~~`database.connection_string` `(string : "postgres://localhost:5432/postgres?sslmode=disable")` - PostgreSQL connection string to use~~
* ~~`database.max_open_connections` `(int : 25)` - Maximum number of open connections to the database~~
* ~~`database.max_idle_connections` `(int : 25)` - Sets the maximum number of connections in the idle connection pool~~
* ~~`database.connection_max_lifetime` `(duration : 5m)` - Sets the maximum amount of time a connection may be reused~~

  **Note:** Deprecated. See the `database` section below
  {: .note }
* `database` - Configuration section for the lakeFS key-value store database
  + `database.type` `(string : ["postgres"|"dynamodb"])` - lakeFS database type
  + `database.postgres` - Configuration section when using `database.type="postgres"`
    + `database.postgres.connection_string` `(string : "postgres://localhost:5432/postgres?sslmode=disable")` - PostgreSQL connection string to use
    + `database.postgres.max_open_connections` `(int : 25)` - Maximum number of open connections to the database
    + `database.postgres.max_idle_connections` `(int : 25)` - Maximum number of connections in the idle connection pool
    + `database.postgres.connection_max_lifetime` `(duration : 5m)` - Maximum amount of time a connection may be reused `(valid units: ns|us|ms|s|m|h)`
  + `database.dynamodb` - Configuration section when using `database.type="dynamodb"`
    + `database.dynamodb.table_name` `(string : "kvstore")` - Table used to store the data
    + `database.dynamodb.read_capacity_units` `(int : 1000)` - Read capacity units, measured in requests per second
    + `database.dynamodb.write_capacity_units` `(int : 1000)` - Write capacity units, measured in requests per second
    + `database.dynamodb.scan_limit` `(int : )` - Maximum number of items per page during a scan operation

      **Note:** Refer to the [AWS documentation](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit) for further information
      {: .note }
    + `database.dynamodb.endpoint` `(string : )` - Endpoint URL for the database instance
    + `database.dynamodb.aws_region` `(string : )` - AWS region of the database instance
    + `database.dynamodb.aws_access_key_id` `(string : )` - AWS access key ID
    + `database.dynamodb.aws_secret_access_key` `(string : )` - AWS secret access key

      **Note:** `endpoint`, `aws_region`, `aws_access_key_id`, and `aws_secret_access_key` are not required, and are used mainly for experimental purposes when working with DynamoDB using non-default AWS credentials
      {: .note }
* `listen_address` `(string : "0.0.0.0:8000")` - A `<host>:<port>` structured string representing the address to listen on
* `auth.cache.enabled` `(bool : true)` - Whether to cache access credentials and user policies in-memory. Can greatly improve throughput when enabled.
* `auth.cache.size` `(int : 1024)` - How many items to store in the auth cache. Systems with a very high user count should use a larger value at the expense of ~1kb of memory per cached user.
@@ -146,14 +174,16 @@ To set an environment variable, prepend `LAKEFS_` to its name, convert it to upp
For example, `logging.format` becomes `LAKEFS_LOGGING_FORMAT`, `blockstore.s3.region` becomes `LAKEFS_BLOCKSTORE_S3_REGION`, etc.
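
A short sketch of this mapping, applied to settings that appear in the examples below:

```sh
# logging.format                       -> LAKEFS_LOGGING_FORMAT
export LAKEFS_LOGGING_FORMAT="json"
# database.type                        -> LAKEFS_DATABASE_TYPE
export LAKEFS_DATABASE_TYPE="postgres"
# database.postgres.connection_string  -> LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING
export LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="postgres://localhost:5432/postgres?sslmode=disable"
```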


## Example: Local Development with a PostgreSQL database

```yaml
---
listen_address: "0.0.0.0:8000"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://localhost:5432/postgres?sslmode=disable"

logging:
  format: text
@@ -175,7 +205,7 @@ gateways:
```
## Example: AWS Deployment with a DynamoDB database
```yaml
---
@@ -185,7 +215,9 @@ logging:
  output: "-"

database:
  type: "dynamodb"
  dynamodb:
    table_name: "kvstore"

auth:
  encrypt:
```

@@ -213,7 +245,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```

@@ -236,7 +270,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```

@@ -263,7 +299,9 @@ logging:

```yaml
  output: "-"

database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:[email protected]:5432/postgres"

auth:
  encrypt:
```
30 changes: 30 additions & 0 deletions docs/reference/database-migration.md
@@ -0,0 +1,30 @@
---
layout: default
title: Database Migration
description: A guide to migrating lakeFS database.
parent: Reference
nav_order: 51
has_children: false
---

# lakeFS Database Migrate
{: .no_toc }

**Note:** Feature in development
{: .note }

The lakeFS database migration tool simplifies switching from one database implementation to another.
More information can be found [here](https://github.com/treeverse/lakeFS/issues/3899).

## lakeFS with Key Value Store

Starting with version 0.80.0, lakeFS abandoned its tight coupling to [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL) and moved all database operations to work over a [key-value store](https://en.wikipedia.org/wiki/Key%E2%80%93value_database).

While SQL databases, Postgres among them, have their obvious advantages, we felt that the tight coupling to Postgres was limiting our users, and so lakeFS over a key-value store was introduced.
Our KV store implements a generic interface, with methods for `Get`, `Set`, `Compare-and-Set`, `Delete`, and `Scan`. Each entry is represented by a [`partition`, `key`, `value`] triplet. All of these fields are generic byte arrays, giving the module using them maximal flexibility in the format of each field.
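
A minimal Go sketch of such an interface; the names and signatures are illustrative, not the exact lakeFS API:

```go
package kv

import "context"

// Store is a hypothetical rendering of the generic KV interface described
// above; partitions, keys, and values are all opaque byte slices.
type Store interface {
	Get(ctx context.Context, partition, key []byte) ([]byte, error)
	Set(ctx context.Context, partition, key, value []byte) error
	// SetIf performs a compare-and-set: the write succeeds only if the
	// currently stored value equals valuePredicate.
	SetIf(ctx context.Context, partition, key, value, valuePredicate []byte) error
	Delete(ctx context.Context, partition, key []byte) error
	// Scan iterates over the keys of a partition, starting at start.
	Scan(ctx context.Context, partition, start []byte) (EntriesIterator, error)
}

// EntriesIterator walks scan results in key order.
type EntriesIterator interface {
	Next() bool
	Entry() (key, value []byte)
	Err() error
	Close()
}
```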

Under the hood, our KV implementation relies on a backing database, which persists the data. Theoretically, it could be any type of database; out of the box, we have already implemented drivers for [DynamoDB](https://en.wikipedia.org/wiki/Amazon_DynamoDB), for AWS users, and [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL), using its relational nature to store a KV store. More databases will be supported in the future, and lakeFS users and contributors can develop their own drivers for their favorite databases. For experimentation, an in-memory KV store can be used, though it obviously lacks persistence.

In order to store ref-store objects (that is, `Repositories`, `Branches`, `Commits`, `Tags`, and `Uncommitted Objects`), lakeFS implements another layer over the generic KV store, which supports serialization and deserialization of these objects as [protobuf](https://en.wikipedia.org/wiki/Protocol_Buffers). As this layer relies on the generic interface of the KV store layer, it is completely agnostic to whichever store implementation is in use, giving our users maximal flexibility.

For further reading, please refer to our [KV Design](https://github.com/treeverse/lakeFS/blob/master/design/accepted/metadata_kv/index.md).
