diff --git a/docs/deploying/offboarding.md b/docs/deploying/offboarding.md
index c47aa29d3d8..7fea8bf53a9 100644
--- a/docs/deploying/offboarding.md
+++ b/docs/deploying/offboarding.md
@@ -26,53 +26,3 @@ The simplest way to migrate away from lakeFS is to copy data from a lakeFS repos
 
 For smaller repositories, this could be done using the [AWS cli](../using/aws_cli.md) or [rclone](../using/rclone.md).
 For larger repositories, running [distcp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html){: target="_blank"} with lakeFS as the source is also an option.
-
-## Using treeverse-distcp
-
-If, for some reason, lakeFS is not accessible, we can still migrate data to S3 using [treeverse-distcp](https://github.com/treeverse/treeverse-distcp){: target="_blank"}, assuming the underlying S3 bucket is intact. Here's how to do it:
-
-1. Create a Copy Manifest - this file describes the source and destination for every object we want to copy. It is a mapping between lakeFS' internal storage addressing and the paths of the objects as we'd expect to see them in S3.
-
-   To generate a manifest, connect to the PostgreSQL instance used by lakeFS and run the following command:
-
-   ```shell
-   psql \
-     --var "repository_name=repo1" \
-     --var "branch_name=master" \
-     --var "dst_bucket_name=bucket1" \
-     postgres < create-extraction-manifest.sql > manifest.csv
-   ```
-
-   You can download the `create-extraction-manifest.sql` script from the [lakeFS GitHub repository](https://github.com/treeverse/lakeFS/blob/master/scripts/create-extraction-manifest.sql){: target="_blank" }.
-
-   **Note** This manifest is also useful for recovery - it will allow you to restore service if the PostgreSQL database becomes inaccessible.
-   For safety, you can automate the creation of this manifest to run daily.
-   {: .note .note-info }
-1. Copy the manifest to S3. Once copied, take note of its ETag - we'll need it to run the copy batch job:
-
-   ```shell
-   aws s3 cp /path/to/manifest.csv s3://my-bucket/path/to/manifest.csv
-   aws s3api head-object --bucket my-bucket --key path/to/manifest.csv | jq -r .ETag # Or look for ETag in the output
-   ```
-1. Once we have a manifest, let's define an S3 batch job that will copy all files for us.
-   To do this, start by creating an IAM role called `lakeFSExportJobRole` and grant it permissions as described in ["Granting permissions for Batch Operations"](https://docs.aws.amazon.com/AmazonS3/latest/dev/batch-ops-iam-role-policies.html#batch-ops-iam-role-policies-create){: target="_blank" }.
-1. Once we have an IAM role, install the [`treeverse-distcp` Lambda function](https://github.com/treeverse/treeverse-distcp/blob/master/lambda_handler.py){: target="_blank" }.
-
-   Make a note of the Lambda function ARN - it is required for running an S3 Batch Job.
-1. Take note of your account ID - it is also required for running an S3 Batch Job:
-
-   ```shell
-   aws sts get-caller-identity | jq -r .Account
-   ```
-1. Dispatch a copy job using the [`run_copy.py`](https://github.com/treeverse/treeverse-distcp/blob/master/run_copy.py){: target="_blank" } script:
-
-   ```shell
-   run_copy.py \
-     --account-id "123456789" \
-     --csv-path "s3://my-bucket/path/to/manifest.csv" \
-     --csv-etag "..." \
-     --report-path "s3://another-bucket/prefix/for/reports" \
-     --lambda-handler-arn "arn:aws:lambda:..."
-   ```
-1. You will get a job number.
-Now go to the [AWS S3 Batch Operations Console](https://s3.console.aws.amazon.com/s3/jobs){: target="_blank" }, switch to the region of your bucket, and confirm execution of that job.
diff --git a/scripts/create-extraction-manifest.sql b/scripts/create-extraction-manifest.sql
deleted file mode 100644
index 2275dbb3912..00000000000
--- a/scripts/create-extraction-manifest.sql
+++ /dev/null
@@ -1,117 +0,0 @@
--- PostgreSQL script to create a manifest file for extracting files
--- from lakeFS to S3. Use this manifest to run treeverse-distcp,
--- which will extract the files.
-
--- -- -- -- -- -- -- -- --
--- Set these variables by running "psql --var VARIABLE=VALUE", e.g.:
---
---     "psql --var repository_name=foo --var branch_name=master --var dst_bucket_name=foo-extract".
-
--- Variable repository_name: repository to be extracted. Must be
--- specified.
---
--- Variable branch_name: branch to be extracted. Must be specified
--- (otherwise object paths can identify more than a single object).
---
--- Variable dst_bucket_name: name of bucket to place files. Must be
--- specified.
-
--- Avoid superfluous output (such as "CREATE FUNCTION")
-\set QUIET 1
-
-CREATE FUNCTION pg_temp.maybe_concat_slash(p text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT regexp_replace(p, '/$', '') || '/';
-$$;
-
-CREATE FUNCTION pg_temp.join_paths(p text, q text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT pg_temp.maybe_concat_slash(p) || q;
-$$;
-
--- encode URI from https://stackoverflow.com/a/60260190/192263
-CREATE FUNCTION pg_temp.encode_uri_component(text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT string_agg(
-    CASE WHEN bytes > 1 OR c !~ '[0-9a-zA-Z_.!~*''()-]+' THEN
-      regexp_replace(encode(convert_to(c, 'utf-8')::bytea, 'hex'), '(..)', E'%\\1', 'g')
-    ELSE
-      c
-    END,
-    ''
-  )
-  FROM (
-    SELECT c, octet_length(c) bytes
-    FROM regexp_split_to_table($1, '') c
-  ) q;
-$$;
-
--- Return the first part of path.
-CREATE FUNCTION pg_temp.get_head(path text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT regexp_replace($1, '^s3://([^/]*)/.*$', '\1')
-$$;
-
--- Return the bucket name of path: its head if it has slashes, or all
--- of it if it does not.
-CREATE FUNCTION pg_temp.get_bucket(path text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT CASE WHEN head = '' THEN $1 ELSE head END FROM (
-    SELECT pg_temp.get_head(path) head
-  ) i;
-$$;
-
--- If path is an S3 path with a key after the bucket, return the rest
--- ("path") of path (everything after the first slash) and a trailing
--- slash. Otherwise return ''.
-CREATE FUNCTION pg_temp.get_rest(path text)
-RETURNS text
-LANGUAGE sql IMMUTABLE STRICT
-AS $$
-  SELECT CASE WHEN tail = '' THEN '' ELSE pg_temp.maybe_concat_slash(tail) END FROM (
-    SELECT substr($1, length(pg_temp.get_head($1)) + 7) tail
-  ) i;
-$$;
-
--- Format output appropriately
-\pset format csv
-\pset tuples_only on
-
--- TODO(ariels): Works only for S3-based namespaces. Current
--- alternatives (mem, local) do not need support; this may change
--- in the future.
-
-SELECT DISTINCT ON (physical_address)
-    regexp_replace(pg_temp.get_bucket(storage_namespace), '^s3://', 'arn:aws:s3:::') src_bucket_arn,
-    pg_temp.encode_uri_component(json_build_object(
-        'dstBucket', :'dst_bucket_name',
-        'dstKey', pg_temp.join_paths(:'repository_name', path),
-        'srcKey', concat(pg_temp.get_rest(storage_namespace), physical_address)) #>> '{}')
-FROM (
-    SELECT entry.path path, entry.physical_address physical_address, entry.min_commit min_commit,
-           entry.max_commit = 0 tombstone, -- true for an uncommitted deletion
-           repository.storage_namespace storage_namespace
-    FROM (catalog_entries entry
-          JOIN catalog_branches branch ON entry.branch_id = branch.id
-          JOIN catalog_repositories repository ON branch.repository_id = repository.id)
-    WHERE repository.name = :'repository_name' AND
-          branch.name = :'branch_name' AND
-          -- uncommitted OR current
-          (entry.min_commit = 0 OR (entry.min_commit > 0 AND entry.max_commit = catalog_max_commit_id())) AND
-          -- Skip explicit physical addresses: imported from elsewhere
-          -- with a meaningful name, so do not export
-          regexp_match(entry.physical_address, '^[a-zA-Z0-9]+://') IS NULL
-) i
-WHERE NOT tombstone
-ORDER BY physical_address, min_commit;
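
Each manifest row produced by this script is a source bucket ARN followed by a URL-encoded JSON payload with `dstBucket`, `dstKey`, and `srcKey` fields. The sketch below is not part of the lakeFS or treeverse-distcp tooling; it assumes a hypothetical local copy named `manifest.csv` and shows one way to decode and sanity-check the rows before dispatching the batch job:

```python
import csv
import json
from urllib.parse import unquote

# Each row of the manifest generated by create-extraction-manifest.sql is:
#   <source bucket ARN>,<URL-encoded JSON with dstBucket, dstKey, srcKey>
# "manifest.csv" is a hypothetical local copy of the generated manifest.
with open("manifest.csv", newline="") as f:
    for src_bucket_arn, encoded in csv.reader(f):
        payload = json.loads(unquote(encoded))  # undo encode_uri_component
        assert src_bucket_arn.startswith("arn:aws:s3:::"), src_bucket_arn
        for field in ("dstBucket", "dstKey", "srcKey"):
            assert payload.get(field), f"missing {field}: {payload}"
        print(f"{src_bucket_arn}/{payload['srcKey']} -> "
              f"s3://{payload['dstBucket']}/{payload['dstKey']}")
```

The checks run entirely locally and make no AWS calls; a row that fails them points at a manifest that should not be submitted as an S3 Batch Operations job.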