Create Repository: dummy file location change (#6497)

Isan-Rivkin authored Aug 29, 2023
1 parent f2e9e7f commit 1149f79
Showing 7 changed files with 37 additions and 109 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

## UNRELEASED

- When creating Repository: `dummy` file location changed from `<storage-namespace>/dummy` to `<storage-namespace>/_lakefs/dummy`
- Removed support for migration from lakeFS version < `v0.50.0`

# v0.107.1

:bug: Bug fixed:
23 changes: 0 additions & 23 deletions cmd/lakefs/cmd/run.go
@@ -418,8 +418,6 @@ func checkRepos(ctx context.Context, logger logging.Logger, authMetadataManager
}

checkForeignRepo(repoStorageType, logger, adapterStorageType, repo.Name)
checkMetadataPrefix(ctx, repo, logger, blockStore, repoStorageType)

next = repo.Name
}
}
@@ -453,27 +451,6 @@ func getScheduler() *gocron.Scheduler {
return gocron.NewScheduler(time.UTC)
}

// checkMetadataPrefix checks for non-migrated repos of issue #2397 (https://github.com/treeverse/lakeFS/issues/2397)
func checkMetadataPrefix(ctx context.Context, repo *catalog.Repository, logger logging.Logger, adapter block.Adapter, repoStorageType block.StorageType) {
if repoStorageType != block.StorageTypeGS &&
repoStorageType != block.StorageTypeAzure {
return
}

const dummyFile = "dummy"
if _, err := adapter.Get(ctx, block.ObjectPointer{
StorageNamespace: repo.StorageNamespace,
Identifier: dummyFile,
IdentifierType: block.IdentifierTypeRelative,
}, -1); err != nil {
logger.WithFields(logging.Fields{
"path": dummyFile,
"storage_namespace": repo.StorageNamespace,
}).Fatal("Can't find dummy file in storage namespace, did you run the migration? " +
"(https://docs.lakefs.io/reference/upgrade.html#data-migration-for-version-v0500)")
}
}

// checkForeignRepo checks whether a repo storage namespace matches the block adapter.
// A foreign repo is a repository which namespace doesn't match the current block adapter.
// A foreign repo might exist if the lakeFS instance configuration changed after a repository was
81 changes: 1 addition & 80 deletions docs/howto/deploy/upgrade.md
@@ -139,83 +139,4 @@ cataloger:

## Data Migration for Version v0.50.0

We discovered a bug in the way lakeFS stores objects in the underlying object store.
It affects only some repositories on Azure and GCP.
[Issue #2397](https://github.com/treeverse/lakeFS/issues/2397#issuecomment-908397229) describes the repository storage namespace patterns
that are affected by this bug.

When first upgrading to a version greater than or equal to v0.50.0, you must follow these steps:
1. Stop lakeFS.
1. Perform the data migration (details below).
1. Start lakeFS with the new version.
1. After the new version runs successfully and you have validated that the objects are accessible, delete the old data prefix.

Note: Migrating data is a delicate procedure. The lakeFS team is here to help; reach out to us on Slack and we'll be happy to walk you through the process.
{: .note .pb-3 }

### Data migration

The following patterns have been impacted by the bug:

| Type | Storage Namespace pattern | Copy From | Copy To |
|-------|-----------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| gs | gs://bucket/prefix | gs://bucket//prefix/* | gs://bucket/prefix/* |
| gs | gs://bucket/prefix/ | gs://bucket//prefix/* | gs://bucket/prefix/* |
| azure | https://account.blob.core.windows.net/containerid | https://account.blob.core.windows.net/containerid//* | https://account.blob.core.windows.net/containerid/* |
| azure | https://account.blob.core.windows.net/containerid/ | https://account.blob.core.windows.net/containerid//* | https://account.blob.core.windows.net/containerid/* |
| azure | https://account.blob.core.windows.net/containerid/prefix/ | https://account.blob.core.windows.net/containerid/prefix// | https://account.blob.core.windows.net/containerid/prefix/* |

You can find the repositories storage namespaces with:

```shell
lakectl repo list
```

Or the settings tab in the UI.

#### Migrating Google Storage data with gsutil

[gsutil](https://cloud.google.com/storage/docs/gsutil) is a Python application that lets you access Cloud Storage from the command line.
We can use it to copy the data between prefixes in the Google bucket, and later remove the old prefix.

For every affected repository, copy its data with:
```shell
gsutil -m cp -r gs://<BUCKET>//<PREFIX>/ gs://<BUCKET>/
```

Note the double slash after the bucket name.

#### Migrating Azure Blob Storage data with AzCopy

[AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) is a command-line utility that you can use to copy blobs or files to or from a storage account.
We can use it to copy the data between prefixes in the Azure storage account container, and later remove the old prefix.

First, you need to acquire an [Account SAS](https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview#account-sas).
Using the Azure CLI:
```shell
az storage container generate-sas \
--account-name <ACCOUNT> \
--name <CONTAINER> \
--permissions cdrw \
--auth-mode key \
--expiry 2021-12-31
```

With the resulting SAS token, use AzCopy to copy the files.
If a prefix exists after the container:
```shell
azcopy copy \
"https://<ACCOUNT>.blob.core.windows.net/<CONTAINER>/<PREFIX>//?<SAS_TOKEN>" \
"https://<ACCOUNT>.blob.core.windows.net/<CONTAINER>?<SAS_TOKEN>" \
--recursive=true
```

Or when using the container without a prefix:

```shell
azcopy copy \
"https://<ACCOUNT>.blob.core.windows.net/<CONTAINER>//?<SAS_TOKEN>" \
"https://<ACCOUNT>.blob.core.windows.net/<CONTAINER>/./?<SAS_TOKEN>" \
--recursive=true
```
If you are using a version before v0.50.0, you must first perform the [previous upgrade to that version](https://docs.lakefs.io/v0.50/reference/upgrade.html#data-migration-for-version-v0500).
{: .note .note-warning }
1 change: 1 addition & 0 deletions docs/reference/configuration.md
@@ -141,6 +141,7 @@ This reference uses `.` to denote the nesting of values.
* `graveler.repository_cache.size` `(int : 1000)` - How many items to store in the repository cache.
* `graveler.repository_cache.ttl` `(time duration : "5s")` - How long to store an item in the repository cache.
* `graveler.repository_cache.jitter` `(time duration : "2s")` - A random amount of time between 0 and this value is added to each item's TTL.
* `graveler.ensure_readable_root_namespace` `(bool : true)` - When creating a new repository, verify that lakeFS has access to the root of the underlying storage namespace. Set to `false` only if lakeFS should not have access (i.e. pre-sign mode only).
* `graveler.commit_cache.size` `(int : 50000)` - How many items to store in the commit cache.
* `graveler.commit_cache.ttl` `(time duration : "10m")` - How long to store an item in the commit cache.
* `graveler.commit_cache.jitter` `(time duration : "2s")` - A random amount of time between 0 and this value is added to each item's TTL.
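To illustrate the new option documented above: a minimal configuration fragment (a sketch, assuming the usual YAML layout of the lakeFS config file) that disables the root-namespace probe for a pre-sign-only deployment:

```yaml
graveler:
  # lakeFS has no direct access to the storage root (pre-sign mode only),
  # so skip the root-namespace readability probe on repository creation.
  ensure_readable_root_namespace: false
```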
36 changes: 30 additions & 6 deletions pkg/api/controller.go
@@ -1627,19 +1627,43 @@ func (c *Controller) validateStorageNamespace(storageNamespace string) error {

 func (c *Controller) ensureStorageNamespace(ctx context.Context, storageNamespace string) error {
 	const (
-		dummyKey  = "dummy"
-		dummyData = "this is dummy data - created by lakeFS in order to check accessibility"
+		dummyData    = "this is dummy data - created by lakeFS to check accessibility"
+		dummyObjName = "dummy"
 	)
+	dummyKey := fmt.Sprintf("%s/%s", c.Config.Committed.BlockStoragePrefix, dummyObjName)
+
+	objLen := int64(len(dummyData))
+
+	// Check if the dummy file exists in the root of the storage namespace.
+	// This serves two purposes: first, we maintain the safety check for older lakeFS versions;
+	// second, we cover scenarios where lakeFS shouldn't have access to the root namespace (i.e. pre-sign URL only).
+	if c.Config.Graveler.EnsureReadableRootNamespace {
+		rootObj := block.ObjectPointer{
+			StorageNamespace: storageNamespace,
+			IdentifierType:   block.IdentifierTypeRelative,
+			Identifier:       dummyObjName,
+		}
+
+		if s, err := c.BlockAdapter.Get(ctx, rootObj, objLen); err == nil {
+			s.Close()
+			return fmt.Errorf("found lakeFS objects in the storage namespace root(%s): %w",
+				storageNamespace, ErrStorageNamespaceInUse)
+		} else if !errors.Is(err, block.ErrDataNotFound) {
+			return err
+		}
+	}

 	// check if the dummy file exists
 	obj := block.ObjectPointer{
 		StorageNamespace: storageNamespace,
 		IdentifierType:   block.IdentifierTypeRelative,
 		Identifier:       dummyKey,
 	}
-	objLen := int64(len(dummyData))
-	if _, err := c.BlockAdapter.Get(ctx, obj, objLen); err == nil {
-		return fmt.Errorf("found lakeFS objects in the storage namespace(%s): %w",
-			storageNamespace, ErrStorageNamespaceInUse)
+
+	if s, err := c.BlockAdapter.Get(ctx, obj, objLen); err == nil {
+		s.Close()
+		return fmt.Errorf("found lakeFS objects in the storage namespace(%s) key(%s): %w",
+			storageNamespace, obj.Identifier, ErrStorageNamespaceInUse)
 	} else if !errors.Is(err, block.ErrDataNotFound) {
 		return err
 	}
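The two-phase check introduced in the hunk above can be sketched as a self-contained program. This is a simplified model, not lakeFS code: `fakeStore`, `ensureNamespaceFree`, and `errNotFound` are hypothetical stand-ins for the block adapter, `ensureStorageNamespace`, and `block.ErrDataNotFound`; the `_lakefs` prefix follows the CHANGELOG entry's new dummy-file location.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for lakeFS's block.ErrDataNotFound.
var errNotFound = errors.New("data not found")

// fakeStore is a hypothetical in-memory stand-in for a block adapter,
// mapping object keys to their contents.
type fakeStore map[string]string

func (s fakeStore) Get(key string) (string, error) {
	v, ok := s[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

// ensureNamespaceFree mirrors the flow added in controller.go: optionally
// probe the legacy root-level "dummy" object, then probe the new key under
// the block storage prefix. Any hit means the namespace is already in use.
func ensureNamespaceFree(store fakeStore, blockStoragePrefix string, checkRoot bool) error {
	const dummyObjName = "dummy"
	if checkRoot {
		if _, err := store.Get(dummyObjName); err == nil {
			return fmt.Errorf("namespace in use: found root object %q", dummyObjName)
		} else if !errors.Is(err, errNotFound) {
			return err
		}
	}
	dummyKey := fmt.Sprintf("%s/%s", blockStoragePrefix, dummyObjName)
	if _, err := store.Get(dummyKey); err == nil {
		return fmt.Errorf("namespace in use: found object %q", dummyKey)
	} else if !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	empty := fakeStore{}
	legacy := fakeStore{"dummy": "x"}          // repo created by an older lakeFS
	current := fakeStore{"_lakefs/dummy": "x"} // repo created after this change

	fmt.Println(ensureNamespaceFree(empty, "_lakefs", true))   // <nil>
	fmt.Println(ensureNamespaceFree(legacy, "_lakefs", true))  // namespace in use
	fmt.Println(ensureNamespaceFree(current, "_lakefs", true)) // namespace in use
}
```

Note how passing `checkRoot = false` models the new `ensure_readable_root_namespace = false` setting: the legacy root-level probe is skipped entirely, and only the prefixed key is consulted.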
1 change: 1 addition & 0 deletions pkg/config/config.go
@@ -302,6 +302,7 @@ type Config struct {
PrepareInterval time.Duration `mapstructure:"prepare_interval"`
} `mapstructure:"ugc"`
Graveler struct {
EnsureReadableRootNamespace bool `mapstructure:"ensure_readable_root_namespace"`
BatchDBIOTransactionMarkers bool `mapstructure:"batch_dbio_transaction_markers"`
RepositoryCache struct {
Size int `mapstructure:"size"`
1 change: 1 addition & 0 deletions pkg/config/defaults.go
@@ -122,6 +122,7 @@ func setDefaults(cfgType string) {
viper.SetDefault("database.postgres.max_idle_connections", 25)
viper.SetDefault("database.postgres.connection_max_lifetime", "5m")

viper.SetDefault("graveler.ensure_readable_root_namespace", true)
viper.SetDefault("graveler.repository_cache.size", 1000)
viper.SetDefault("graveler.repository_cache.expiry", 5*time.Second)
viper.SetDefault("graveler.repository_cache.jitter", 2*time.Second)
