IO Implementation using Go CDK #176
@dwilson1988 I saw your note about wanting to work on the CDK features, if you're able to provide some feedback that would be great. |
@loicalleyne - happy to take a look. We use this internally in some of our software with Parquet and implemented a ReaderAt. I'll do a more thorough review when I get a chance, but my first thought was to leave it completely separate from the |
My goal today was just to "get something on paper" to move this forward since the other PR has been stalled since July, I used the other PR as a starting point so I mostly followed the existing patterns. Very open to moving things around if it makes sense. Do you have any idea how your idea would work with the interfaces defined in io.go? |
Understood! I'll dig into your last question and get back to you. |
Okay, played around a bit and here's where my head is at. The main reason I'd like to isolate the creation of a

What I came up with is changing

    // CreateBlobFileIO creates a new BlobFileIO instance
    func CreateBlobFileIO(parsed *url.URL, bucket *blob.Bucket) *BlobFileIO {
    	ctx := context.Background()
    	return &BlobFileIO{Bucket: bucket, ctx: ctx, opts: &blob.ReaderOptions{}, prefix: parsed.Host + parsed.Path}
    }

The URL is still critical there, but now we don't have to concern ourselves with credentials to open the bucket except for in

Thoughts on this? |
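As a rough illustration of what the `prefix: parsed.Host + parsed.Path` expression above produces, here is a minimal sketch using only `net/url`. The helper name `splitLocation` and the example location are hypothetical and not taken from the PR; the point is only to show how Go splits an object-store URL, and that the resulting prefix carries the bucket name as its first segment.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitLocation is a hypothetical helper (not part of the PR). It parses an
// object-store location and returns the bucket, the object key, and the
// prefix as computed by the expression parsed.Host + parsed.Path.
func splitLocation(location string) (bucket, key, prefix string, err error) {
	parsed, err := url.Parse(location)
	if err != nil {
		return "", "", "", err
	}
	bucket = parsed.Host                        // e.g. "warehouse"
	key = strings.TrimPrefix(parsed.Path, "/") // key relative to the bucket
	prefix = parsed.Host + parsed.Path          // same expression as in CreateBlobFileIO
	return bucket, key, prefix, nil
}

func main() {
	// The location below is an assumed example, not one from the PR.
	bucket, key, prefix, err := splitLocation("s3://warehouse/db/table/metadata/v1.metadata.json")
	if err != nil {
		panic(err)
	}
	fmt.Println("bucket:", bucket)
	fmt.Println("key:   ", key)
	fmt.Println("prefix:", prefix)
}
```

Because the prefix starts with the bucket name, any code that later joins it with a key needs to agree on whether keys are bucket-relative or prefix-relative.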
@dwilson1988 |
@loicalleyne, This looks really good to me! I'm not a maintainer of this repo, so I can't give the final word or anything, but this is exactly the direction I was thinking.
I'm happy to give azure a go after this is merged.
@loicalleyne is this still on your radar? |
hi @dwilson1988 |
Cool - just checking. I'll be patient. 🙂 |
@dwilson1988 made the suggested changes, there's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into, maybe you could take a look? |
Yes, can probably take a look next week |
Hi @dwilson1988, do you think you'll have time to take a look at this? |
I opened a PR on your branch earlier today |
@zeroshade hoping you can review when you've got time. |
I should be able to give this a review tomorrow or Friday. In the meantime can you resolve the conflict in the go.mod? Thanks! |
The go.mod conflicts are mostly minor version bumps, and a few additions/removals due to the switch to the CDK. Does resolving go.mod conflicts mean undoing all changes to go.mod from the PR branch and letting the CI update the go.mod at build? |
@loicalleyne - you should just be able to manually remove conflicts in |
@loicalleyne looks like the integration tests are failing, unable to read the manifest files from the minio instance. |
I did some debugging by copying some of the test scenarios into a regular Go program (if anyone can tell me how to run Delve in VsCode on a test that uses testify please let me know), running the docker compose file and manually running the commands in It seems there's something wrong with the bucket prefix and how it interacts with subsequent calls, the prefix is assigned here
Unfortunately I don't have time to investigate any further right now, @dwilson1988 if you've seen this before please feel free to jump in. |
Signed-off-by: Loïc Alleyne <[email protected]>
I've been able to replicate and debug the issue myself locally. Aside from needing to make a bunch of changes to fix the prefix, bucket and key strings, I was still unable to get gocloud.dev/blob/s3blob to find the file appropriately. I followed it down to the call to I'll try poking at this tomorrow a bit more and see if I can make a small mainprog that is able to use s3blob to access a file from minio locally as a place to start. |
Then I suspect it might be the |
@loicalleyne I haven't dug too far into the blob code, is it a relatively easy fix to handle that |
My understanding is that it's just another property to pass in |
@loicalleyne can you take a look at the latest changes I made here? |
Is it intended to not provide the choice between virtual hosted bucket addressing and path-style addressing? |
@loicalleyne following pyiceberg's example, I've added an option to force virtual addressing. That work for you? |
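For readers unfamiliar with the two S3 addressing styles discussed here: path-style puts the bucket in the URL path, while virtual-hosted style puts it in the hostname. The sketch below shows only that difference. The `AddressingStyle` type, the `objectURL` function, and the endpoint are hypothetical names chosen for illustration; they are not the option added in this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// AddressingStyle is a hypothetical enum for this sketch only.
type AddressingStyle int

const (
	PathStyle          AddressingStyle = iota // http://host/bucket/key
	VirtualHostedStyle                        // http://bucket.host/key
)

// objectURL builds the request URL for an object under the given style.
// It assumes the key needs no special escaping.
func objectURL(endpoint, bucket, key string, style AddressingStyle) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("endpoint %q has no host", endpoint)
	}
	switch style {
	case VirtualHostedStyle:
		u.Host = bucket + "." + u.Host // bucket becomes a subdomain
		u.Path = "/" + key
	default:
		u.Path = "/" + bucket + "/" + key // bucket is the first path segment
	}
	return u.String(), nil
}

func main() {
	// Assumed local S3-compatible endpoint, for illustration.
	for _, style := range []AddressingStyle{PathStyle, VirtualHostedStyle} {
		s, err := objectURL("http://localhost:9000", "warehouse", "db/t/data.parquet", style)
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```

Local S3-compatible endpoints are often reached path-style, since the virtual-hosted form needs DNS to resolve the bucket subdomain, which is one reason to keep both choices available.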
LGTM 👍 |
@dwilson1988 When you get a chance, can you take a look at the changes I made here. I liked your thought on isolating things, but there was still a bunch of specific options for particular bucket types that needed to get accounted for as the options are not always passed via URL due to how Iceberg config properties work. So I'd like your thoughts or comments on what I ultimately came up with to simplify what @loicalleyne already had while solving the failing tests and whether it fits what you were thinking and using internally. Once this is merged, I'd definitely greatly appreciate contributions for Azure as you said :) |
@zeroshade - I'll take a look this weekend! |
LGTM!
Extends PR #111
Implements #92. The Go CDK has well-maintained implementations for accessing object stores such as S3, Azure, and GCS via an io/fs.FS-like interface. However, its file interface doesn't support the io.ReaderAt interface or the Seek() function that Iceberg-Go requires for files. Furthermore, the File components are private, so we copied the wrappers and implemented the remaining functions inside Iceberg-Go directly.
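To make the missing pieces concrete, here is a minimal sketch of how `io.ReaderAt` and `Seek` can be layered on top of a store that only offers ranged reads. It is not the code from this PR: `rangeOpener`, `blobFile`, and `memStore` are hypothetical names, and the in-memory store stands in for a real bucket (in the Go CDK, `blob.Bucket.NewRangeReader` plays a similar role).

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// rangeOpener is a hypothetical stand-in for a client that can open a
// reader over a byte range of an object.
type rangeOpener interface {
	OpenRange(key string, offset, length int64) (io.ReadCloser, error)
}

// blobFile adapts a rangeOpener to io.Reader, io.ReaderAt and io.Seeker.
type blobFile struct {
	store rangeOpener
	key   string
	size  int64 // total object size, assumed known up front
	pos   int64 // current offset used by Read and Seek
}

// ReadAt issues one ranged read per call and never touches f.pos.
func (f *blobFile) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	if off >= f.size {
		return 0, io.EOF
	}
	want := int64(len(p))
	if off+want > f.size {
		want = f.size - off // clamp to the end of the object
	}
	r, err := f.store.OpenRange(f.key, off, want)
	if err != nil {
		return 0, err
	}
	defer r.Close()
	n, err := io.ReadFull(r, p[:want])
	if err != nil {
		return n, err
	}
	if n < len(p) {
		return n, io.EOF // short read because the object ended
	}
	return n, nil
}

// Seek only moves the logical position; no I/O happens here.
func (f *blobFile) Seek(offset int64, whence int) (int64, error) {
	var next int64
	switch whence {
	case io.SeekStart:
		next = offset
	case io.SeekCurrent:
		next = f.pos + offset
	case io.SeekEnd:
		next = f.size + offset
	default:
		return 0, errors.New("invalid whence")
	}
	if next < 0 {
		return 0, errors.New("negative position")
	}
	f.pos = next
	return next, nil
}

// Read reads from the current position and advances it.
func (f *blobFile) Read(p []byte) (int, error) {
	n, err := f.ReadAt(p, f.pos)
	f.pos += int64(n)
	return n, err
}

// memStore is an in-memory rangeOpener used only for this demo.
type memStore map[string][]byte

func (m memStore) OpenRange(key string, offset, length int64) (io.ReadCloser, error) {
	data, ok := m[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	if offset < 0 || length < 0 || offset+length > int64(len(data)) {
		return nil, errors.New("range out of bounds")
	}
	return io.NopCloser(bytes.NewReader(data[offset : offset+length])), nil
}

func main() {
	f := &blobFile{store: memStore{"data.bin": []byte("hello iceberg")}, key: "data.bin", size: 13}
	buf := make([]byte, 7)
	if _, err := f.ReadAt(buf, 6); err != nil {
		panic(err)
	}
	fmt.Println(string(buf))
}
```

Opening a fresh ranged reader per `ReadAt` call keeps the adapter stateless for random access, which suits formats that jump around a file, at the cost of one request per call.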
In addition, we add support for S3 Read IO using the CDK, providing the option to choose between the existing and new implementation using an extra property.
GCS connection options can be passed in the properties map.