Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Bunny.net object storage support #4325

Closed
wants to merge 35 commits into from

Conversation

l0wl3vel
Copy link
Contributor

@l0wl3vel l0wl3vel commented Jan 9, 2024

This PR implements support for Bunny.net object storage.

Fixes #4317

@CLAassistant
Copy link

CLAassistant commented Jan 10, 2024

CLA assistant check
All committers have signed the CLA.

@l0wl3vel l0wl3vel marked this pull request as ready for review January 10, 2024 17:02
@solracsf

This comment was marked as resolved.

@zhijian-pro zhijian-pro self-assigned this Jan 11, 2024
@zhijian-pro
Copy link
Contributor

image In fact, the S3 API is better suited for juicefs, and if the S3 api is released in the future, we may need to reimplement it with the S3 API. @l0wl3vel @solracsf

@l0wl3vel
Copy link
Contributor Author

I know about the pending implementation of S3 for bunny storage. I would say this is a similar situation to B2 with their native API and their S3 compatible API.

I think it might be useful to have their native API supported for two reasons:

  • It is really simple compared to S3. Four API calls that can be done via curl, if you desire

  • Performance. Bunny Storage seems to perform much better than their competitors in a Multi-Cloud scenario. (I tested B2 and Cloudflare R2)

They advertise extremely low TTFB latency. I think it might be a good idea to have their native API implemented to compare to S3 when they offer it. Bunny's performance advantage could be caused partially by their own API which would justify having a bespoke implementation.

@solracsf
Copy link
Contributor

solracsf commented Jan 12, 2024

S3 is "coming soon" since 2022... 💭
SFTP is broken (not using standards).

So API is the better option (also because there are no egress fees using the API).

@solracsf
Copy link
Contributor

It is really simple compared to S3. Four API calls that can be done via curl, if you desire.

@l0wl3vel regarding this, if this can simplify things, is the external library git.sr.ht/~jamesponddotco/bunnystorage-go really necessary? Could the necessary code to make the API calls be integrated directly into JuiceFS codebase, instead of relying on an external lib?

@l0wl3vel
Copy link
Contributor Author

Maybe, but I do not think that is a good idea to basically put an API SDK inside of a specialized codebase the size of JuiceFS.

I thought about forking/redoing the git.sr.ht/~jamesponddotco/bunnystorage-go library, since I think that it features some weird abstractions and it is using some homebrew libraries written by the author, some of which (e.g. git.sr.ht/~jamesponddotco/httpx-go) are deprecated.

@solracsf Would you be interested in building new bunnystorage-go library using more standard dependencies with me?

If the JuiceFS maintainers are fine with pulling in the jamesponddotco/bunnystorage-go dependency for now I would prefer it to just get it merged and working on replacing the API SDK with another implementation later on.

@solracsf
Copy link
Contributor

@l0wl3vel I'm not a Go developer, I'm afraid i can't help so much here 😿

@zhijian-pro
Copy link
Contributor

We want to merge a stable code. And one that is useful to users.

@l0wl3vel
Copy link
Contributor Author

Would you consider merging the code in its current form? And if not, what changes would be required to consider it for merging?

What kind of stability are we talking about? Upgrade Stability (Users can upgrade JuiceFS without need for migration), functional stability, dependency stability? Are there guidelines on what you would consider stable and what assurances are necessary to proof a certain kind of stability?

@jamesponddotco
Copy link

jamesponddotco commented Jan 14, 2024

@l0wl3vel Let me know what you need changed in the base module and I can see if I can find the time to work on it. At the very least, I'll remove the need for httpx-go, since I deprecated that, and replace it with the standard net/http module.

@l0wl3vel
Copy link
Contributor Author

l0wl3vel commented Jan 14, 2024

@jamesponddotco Hey, really appreciating you chiming in here. Thanks for writing bunnystorage-go. Here are a few things I think would improve it.

  1. Combine the functionality of Config and Client. I do not see the reason to keep them separate.
  2. Initialize the new Client with a single New function or with a builder-style semantic
  3. Replace the xstrings logic for building URLs with net/url. I think it is much more idiomatic and easier to read.
  4. Change the Enum-Style Abstraction for the endpoints. It requires updating the library every time a new endpoint is added by Bunny.net. Replace the field in the Client struct with a net/url and providing some constants to quickly fill in the endpoint with a known endpoint.
  5. Remove the dependency on xerrors
  6. Replace UserAgent with a string without any custom datatype.
  7. Move the code to Github for visibility and better collaboration.

I think this is getting a bit off-topic here, but I already started converting bunnystorage-go to resty on my fork. Uploading does not work properly, unfortunately. Do you think you could move the main bunnystorage-go repo to github?

No offense, I would really like to remove the dependency on sr.ht and your xstd-go entirely to make the module it more idiomatic and accessible.

@jamesponddotco
Copy link

jamesponddotco commented Jan 14, 2024

@l0wl3vel, thank you for the feedback, it's invaluable!

Combine the functionality of Config and Client. I do not see the reason to keep them separate.
Initialize the new Client with a single New function or with a builder-style semantic

I mostly like to keep configuration in its own struct, hence why they're separate. This is personal preference, really, so I don't see it changing. If you look at other clients I wrote, like the cloudcraft-go one for Datadog, you'll see the same style.

Replace the xstrings logic for building URLs with net/url. I think it is much more idiomatic and easier to read.

xstrings is used to avoid allocations, but since I didn't really took precautions anywhere else, I could definitely remove it.

Change the Enum-Style Abstraction for the endpoints. It requires updating the library every time a new endpoint is added by Bunny.net. Replace the field in the Client struct with a net/url and providing some constants to quickly fill in the endpoint with a known endpoint.

That was a mistake indeed, but fixing it would require a breaking change that I'm not sure I'm willing to make right now. I'll consider it, though.

Same goes for that logging interface. I should probably have gone with something that would allow me to use log/slog instead.

Remove the dependency on xerrors.

Using errors as constants is also a personal preference, but there is some reason behind it:
https://dave.cheney.net/2016/04/07/constant-errors

Replace UserAgent with a string without any custom datatype.

Another mistake that would require a breaking change to undo, one that I make on every client, really. Not sure if this will be undone, though.

Move the code to Github for visibility and better collaboration.

Not happening, sorry. I much prefer the email workflow for working with code and git in general, and prefer SourceHut to GitHub as a platform for several reasons. I still use GitHub as a read-only mirror and would probably accept PRs over here since I can't disable them, but that is as far as I'll go with the platform.


As for the usage of xstd, you'll see that module in pretty much all code I write, because I wrote it to decrease duplication. If I need to write the same function for two different projects, it goes into xstd instead.

I'll look into those changes this week, but I assume you'll use your fork instead for this PR. Still, glad my code could be of some help, haha.

Since this is indeed becoming off-topic, reach out to me at [james at cipher dot host] if you want to continue the conversation.

@zhijian-pro zhijian-pro removed their assignment Jan 16, 2024
@l0wl3vel
Copy link
Contributor Author

l0wl3vel commented Jan 17, 2024

Ok, as I see it, the plan is the following:

  • @jamesponddotco moves bunnystorage-go away from using httpx
  • @zhijian-pro We move away from using the deprecated httpx dependency. Would the JuiceData team be willing to merge this PR? If not, what are your concerns and preferred methods for dealing with them?
  • The only thing that concerns me about this PR is the format of the endpoint passed via CLI: I think something like https://<endpoint>/<bucketname> makes more sense than bunny://<bucketname>.<endpoint>. We stay closer to how the bunny API works internally and opens the path to easily moving away from bunnystorage-go, if the JuiceFS maintainers see a problem with using it, since it is just the base URL for all API calls. I will implement the endpoint change this week.

Does anyone have anything to add?

@l0wl3vel
Copy link
Contributor Author

l0wl3vel commented Feb 7, 2024

object_storage_test.go is passing now 🎉

@zhijian-pro
Copy link
Contributor

object_storage_test.go is passing now 🎉

I will review this pr again after the Spring Festival.

previously the path parsed out of list responses did not contain the whole path. This commit allows juicefs delete, sync etc. to work properly
Required some fixes around error masking due to changed errors
@l0wl3vel
Copy link
Contributor Author

Updated the bunny-storage-go-sdk to 0.0.10, which uses fasthttp instead of net/http and broke 1GiB/s in write speed and got some nice boost in read speed as well:

BlockSize: 1 MiB, BigFileSize: 1024 MiB, SmallFileSize: 128 KiB, SmallFileCount: 100, NumThreads: 1
Time used: 0.0 s, CPU: NaN%, Memory: 0.0 MiB
+------------------+----------------+---------------+
|       ITEM       |      VALUE     |      COST     |
+------------------+----------------+---------------+
|   Write big file |  1001.05 MiB/s |   1.02 s/file |
|    Read big file |   706.92 MiB/s |   1.45 s/file |
| Write small file |   24.6 files/s | 40.63 ms/file |
|  Read small file | 1540.2 files/s |  0.65 ms/file |
|        Stat file | 6314.6 files/s |  0.16 ms/file |
|   FUSE operation |   0 operations |    0.00 ms/op |
|      Update meta |   0 operations |    0.00 ms/op |
|       Put object |   0 operations |    0.00 ms/op |
|       Get object |   0 operations |    0.00 ms/op |
|    Delete object |   0 operations |    0.00 ms/op |
| Write into cache |   0 operations |    0.00 ms/op |
|  Read from cache |   0 operations |    0.00 ms/op |
+------------------+----------------+---------------+

pkg/object/bunny.go Outdated Show resolved Hide resolved
// Deleting keys that do not exist are expected to not throw an error
func (b *bunnyClient) Delete(key string) error {
if err := b.client.Delete(key, false); err != nil {
if err.Error() == "Not Found" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is confusing, we should only retry it when the key is ending with /.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or call b.client.Delete(key, strings.HasSuffix("/"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workaround is necessary to pass the tests. The boolean flag on b.client.Delete() only appends a / character to the path, which means that setting it at all should not be necessary.

Tests fail on repeated test runs with the following implementation:

func (b *bunnyClient) Delete(key string) error {
	if err := b.client.Delete(key, false); err != nil {
		if err.Error() == "Not Found" {
			return nil
		} else {
			return err
		}
	}
	return nil
}

Well, apparently some tests do not consistently pass paths with a trailing /. The first run works fine, but on any subsequent run a list test fails because there are empty directories left over from the first test run.

First run

=== RUN   TestBunny
    object_storage_test.go:131: PUT testEncodeFile failed: Unauthorized
    object_storage_test.go:180: out-of-range get: '', but got 0, error: Requested Range Not Satisfiable
    object_storage_test.go:284: list with delimiter is not supported
    object_storage_test.go:326: list with delimiter is not supported
    object_storage_test.go:527: https://storage.bunnycdn.com/juicefs-testunit-test/ does not support multipart upload: not supported
--- PASS: TestBunny (10.97s)
PASS

Object storage directory after the first run

image

Second Run

go1.20.13 test -v -tags gluster -run TestBunny github.com/juicedata/juicefs/pkg/object      
=== RUN   TestBunny
    object_storage_test.go:131: PUT testEncodeFile failed: Unauthorized
    object_storage_test.go:180: out-of-range get: '', but got 0, error: Requested Range Not Satisfiable
    object_storage_test.go:222: First key should be test, but got a/
--- FAIL: TestBunny (0.65s)
FAIL
FAIL    github.com/juicedata/juicefs/pkg/object 1.020s
FAIL

This workaround seems to be necessary, since it is not guaranteed that JuiceFS passes in the trailing /.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retested it after replacing ListAll with List and the tests fail again. With or without my deletion workaround there are objects left in the unit-test directory. Subsequent runs will always fail again. Is it possible that testStorage does not handle the cleanup of the unit-test directory on object stores, that allow prefixes containing only empty directories? I would propose cleaning up the root directory of the storage prefix like this:

diff --git a/pkg/object/object_storage_test.go b/pkg/object/object_storage_test.go
index cfab5b97..a806c4ef 100644
--- a/pkg/object/object_storage_test.go
+++ b/pkg/object/object_storage_test.go
@@ -120,6 +120,7 @@ func testStorage(t *testing.T, s ObjectStorage) {
        }
        prefix := "unit-test/"
        s = WithPrefix(s, prefix)
+       defer s.Delete("/")
        defer func() {
                if err := s.Delete("test"); err != nil {
                        t.Fatalf("delete failed: %s", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this problem only appears on Object Storage providers that do support the concept of an empty directory, which S3 style APIs do not support. We would need to adjust the tests to clean up the unit-test directory properly.

pkg/object/bunny.go Outdated Show resolved Hide resolved
pkg/object/bunny.go Outdated Show resolved Hide resolved
pkg/object/bunny.go Outdated Show resolved Hide resolved

//listInParentDirectory := !strings.HasSuffix(prefix, "/")

listedObjects, err := b.client.List(path.Dir(prefix))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take the file.go as an example

@SandyXSD
Copy link
Contributor

@l0wl3vel Hi, it seems there's still an issue relating to the 'list' method that is not solved yet. Are you still working on this?

@l0wl3vel
Copy link
Contributor Author

@SandyXSD I still want to finish it. I already have the changes implemented, but I can´t make sure it works, since Bunny Object Storage has the concept of empty directories, which trips up the test suite when running it multiple times, because there are empty directories left from the previous run. I just pushed my changes.

}
} else if marker == "" { // If Directory && no marker: Return prefix directory as well
parentPath := path.Dir(path.Dir(prefix))
objects, err := b.client.List(parentPath+dirSuffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just need to add the directory itself here, will list it later. Please follow the example in file.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what changes you would like to see. The code here is based on sftp.go, like you suggested before. Could you elaborate please on what about it is incorrect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how sftp implements List:

	var objs []Object
	dir := f.path(prefix)
	if !strings.HasSuffix(dir, "/") {
		dir = filepath.Dir(dir)
		if !strings.HasSuffix(dir, dirSuffix) {
			dir += dirSuffix
		}
	} else if marker == "" {
		obj, err := f.Head(prefix)
		if err != nil {
			if os.IsNotExist(err) {
				return nil, nil
			}
			return nil, err
		}
		objs = append(objs, obj)
	}
	infos, err := c.sftpClient.ReadDir(dir)
	if err != nil {
		if os.IsPermission(err) {
			logger.Warnf("skip %s: %s", dir, err)
			return nil, nil
		}
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, err
	}

	entries := f.sortByName(c.sftpClient, dir, infos, followLink)
	for _, o := range entries {
		key := o.Key()
		if !strings.HasPrefix(key, prefix) || (marker != "" && key <= marker) {
			continue
		}
		objs = append(objs, o)
		if len(objs) == int(limit) {
			break
		}
	}

When the marker is '', it try to return the prefix self (could be a directory). So instead of list the parent of prefix, can we just call Head() instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but listing a prefix in S3 does not only return the virtual prefix "directory", rather all keys that start with that prefix. This would mean that I implemented the correct behavior. Working on this is also really difficult, since the tests for object storage are very incomplete and miss a lot of corner cases around listing objects.

@davies
Copy link
Contributor

davies commented Apr 2, 2024

@l0wl3vel please rebase to main branch and fix the conflict, I think we are very close to merge this one.

@davies
Copy link
Contributor

davies commented Apr 17, 2024

This one is took over by #4719, thanks very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Bunny Object Storage
7 participants