
IO Implementation using Go CDK #176

Merged · 14 commits into apache:main · Dec 16, 2024

Conversation

loicalleyne
Contributor

Extends PR #111

Implements #92. The Go CDK has well-maintained implementations for accessing object stores on S3, Azure, and GCS via an io/fs.FS-like interface. However, its file interface doesn't support the io.ReaderAt interface or the Seek() function that Iceberg-Go requires for files. Furthermore, the File components are private. So we copied the wrappers and implemented the remaining functions inside Iceberg-Go directly.

In addition, we add support for S3 read IO using the CDK, with an extra property providing the option to choose between the existing and the new implementation.

GCS connection options can be passed in properties map.

@loicalleyne
Contributor Author

@dwilson1988 I saw your note about wanting to work on the CDK features, if you're able to provide some feedback that would be great.

@dwilson1988
Contributor

@loicalleyne - happy to take a look. We use this internally in some of our software with Parquet and implemented a ReaderAt. I'll do a more thorough review when I get a chance, but my first thought was to leave it completely separate from the blob.Bucket implementation and let the Create/New funcs simply accept a *blob.Bucket, leaving the rest as an exercise for the user. This keeps it more or less completely isolated from the implementation. Thoughts on this direction?

@loicalleyne
Contributor Author

My goal today was just to "get something on paper" to move this forward, since the other PR has been stalled since July. I used the other PR as a starting point, so I mostly followed the existing patterns. Very open to moving things around if it makes sense. Do you have any idea how your idea would work with the interfaces defined in io.go?

@dwilson1988
Contributor

Understood! I'll dig into your last question and get back to you.

@dwilson1988
Contributor

dwilson1988 commented Oct 16, 2024

Okay, played around a bit and here's where my head is at.

The main reason I'd like to isolate the creation of a *blob.Bucket is that I've found the particular implementation of bucket access can get tricky. Rather than support it in this package for all situations, we can support the most common usage in io.LoadFS/inferFileIOFromSchema and change io.CreateBlobFileIO to accept a *url.URL and a *blob.Bucket. This lets a user open a bucket with whatever implementation they choose (GCS, Azure, S3, MinIO, Mem, FileSystem, etc.), and there's less code here to maintain.

What I came up with is changing CreateBlobFileIO to:

// CreateBlobFileIO creates a new BlobFileIO instance
func CreateBlobFileIO(parsed *url.URL, bucket *blob.Bucket) *BlobFileIO {
	ctx := context.Background()
	return &BlobFileIO{Bucket: bucket, ctx: ctx, opts: &blob.ReaderOptions{}, prefix: parsed.Host + parsed.Path}
}

The URL is still critical there, but now we don't have to concern ourselves with credentials to open the bucket except for in LoadFS.
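
The prefix computation in that sketch only needs the parsed URL, which is why the bucket can be opened elsewhere. A hedged stdlib-only illustration (the helper name bucketPrefix is invented here; gs://warehouse/default/test_table is a made-up location):

```go
package main

import (
	"fmt"
	"net/url"
)

// bucketPrefix mirrors the prefix computation in the CreateBlobFileIO
// sketch above: the host plus path of the table location URL.
// (Helper name invented for this illustration.)
func bucketPrefix(location string) (string, error) {
	parsed, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	return parsed.Host + parsed.Path, nil
}

func main() {
	p, err := bucketPrefix("gs://warehouse/default/test_table")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // prints "warehouse/default/test_table"
}
```
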

Thoughts on this?

@loicalleyne
Contributor Author

@dwilson1988
Sounds good, I've made the changes, please take a look.

@dwilson1988 (Contributor) left a review comment

@loicalleyne, This looks really good to me! I'm not a maintainer of this repo, so I can't give the final word or anything, but this is exactly the direction I was thinking.

I'm happy to give Azure a go after this is merged.

@zeroshade

Review comments (all resolved) on: io/blob.go, io/gcs.go, io/io.go, io/local.go, io/s3.go
@dwilson1988
Contributor

@loicalleyne is this still on your radar?

@loicalleyne
Contributor Author

hi @dwilson1988
yes, I'm wrapping up some work on another project and will be jumping back on this in a day or two.

@dwilson1988
Contributor

Cool - just checking. I'll be patient. 🙂

@loicalleyne
Contributor Author

@dwilson1988 made the suggested changes. There's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into; maybe you could take a look?

@dwilson1988
Contributor

> @dwilson1988 made the suggested changes, there's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into, maybe you could take a look?

Yes, can probably take a look next week

@loicalleyne
Contributor Author

Hi @dwilson1988, do you think you'll have time to take a look at this?

@dwilson1988
Contributor

> Hi @dwilson1988, do you think you'll have time to take a look at this?

I opened a PR on your branch earlier today

@loicalleyne
Contributor Author

@zeroshade hoping you can review when you've got time.

@zeroshade
Member

I should be able to give this a review tomorrow or Friday. In the meantime can you resolve the conflict in the go.mod? Thanks!

@loicalleyne
Contributor Author

@zeroshade

> In the meantime can you resolve the conflict in the go.mod? Thanks!
I tried updating the go.mod version and toolchain versions to match main and ran go mod tidy; it still shows conflicts. Then I tried deleting go.mod and go.sum, reran go mod init github.com/apache/iceberg-go, re-updated the versions, and ran go mod tidy, and it's still flagging conflicts. It appears to be minor version updates on a lot of packages.

The go.mod conflicts are mostly minor version bumps, and a few additions/removals due to the switch to the CDK.

Does resolving go.mod conflicts mean undoing all changes to go.mod from the PR branch and letting the CI update the go.mod at build?

@dwilson1988
Contributor

@loicalleyne - you should just be able to manually remove the conflicts in go.mod, delete go.sum, and run go mod tidy. Probably best to do this after syncing your fork and rebasing onto main.

@zeroshade
Member

@loicalleyne looks like the integration tests are failing, unable to read the manifest files from the minio instance.

@loicalleyne
Contributor Author

loicalleyne commented Dec 12, 2024

I did some debugging by copying some of the test scenarios into a regular Go program (if anyone can tell me how to run Delve in VS Code on a test that uses testify, please let me know), running the docker compose file, and manually running the commands in iceberg-go/.github/workflows/go-integration.yml (note: to point to the local MinIO in Docker I had to run export AWS_S3_ENDPOINT=http://127.0.0.1:9000).

It seems there's something wrong with the bucket prefix and how it interacts with subsequent calls (the prefix is assigned here); i.e., it's trying to HEAD the object

default/test_null_nan/metadata/00000-770ce240-af4c-49dd-bae9-6871f55f8be1.metadata.jsonwarehouse/default/test_null_nan/metadata/snap-2616202072048292962-1-6c011b0d-0f2a-4b62-bc17-158f94b1c470.avro

Unfortunately I don't have time to investigate any further right now, @dwilson1988 if you've seen this before please feel free to jump in.
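
The doubled path in that HEAD request suggests a key was joined to a prefix it already contains. A hedged sketch of the kind of normalization that avoids this (the function name and the example path are invented for illustration, not the PR's actual fix):

```go
package main

import (
	"fmt"
	"strings"
)

// keyWithinBucket strips the bucket prefix from an absolute object path so
// that the prefix is not prepended twice when building store requests.
// (Function name invented for this sketch.)
func keyWithinBucket(prefix, path string) string {
	key := strings.TrimPrefix(path, prefix)
	return strings.TrimPrefix(key, "/")
}

func main() {
	// Hypothetical example path within a "warehouse" bucket.
	path := "warehouse/default/test_table/metadata/manifest.avro"
	fmt.Println(keyWithinBucket("warehouse", path))
	// prints "default/test_table/metadata/manifest.avro"
}
```
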

@zeroshade
Member

I've been able to replicate and debug the issue myself locally. Aside from needing to make a bunch of changes to fix the prefix, bucket and key strings, I was still unable to get gocloud.dev/blob/s3blob to find the file appropriately. I followed it down to the call to clientV2.GetObject and the s3v2.GetObjectInput has all the correct values: Bucket: "warehouse", Key: "default/test_all_types/....." etc. and yet minio still reports a 404. So I'm not sure what's going on.

I'll try poking at this a bit more tomorrow and see if I can make a small main program that uses s3blob to access a file from MinIO locally, as a place to start.

@loicalleyne
Contributor Author

Then I suspect it might be the s3ForcePathStyle option referred to here. It affected MinIO in particular once they moved to S3 v2.

@zeroshade
Member

@loicalleyne I haven't dug too far into the blob code, is it a relatively easy fix to handle that s3ForcePathStyle?

@loicalleyne
Contributor Author

My understanding is that it's just another property to pass in props. We would also have to add it as a recognized property/constant in io/s3.go, I should think.

@loicalleyne
Contributor Author

s3.UsePathStyle:

// Allows you to enable the client to use path-style addressing, i.e.,
// https://s3.amazonaws.com/BUCKET/KEY . By default, the S3 client will use virtual
// hosted bucket addressing when possible ( https://bucket.s3.amazonaws.com/KEY ).
UsePathStyle bool
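
The two addressing modes named in that doc comment differ only in where the bucket appears in the request URL. A stdlib-only sketch of the two shapes (the helper is invented for illustration; the real SDK builds these URLs internally when UsePathStyle is set):

```go
package main

import "fmt"

// objectURL builds the request URL for a bucket/key pair in either
// addressing mode, following the two forms quoted above.
// (Helper invented for this illustration.)
func objectURL(endpoint, bucket, key string, pathStyle bool) string {
	if pathStyle {
		// path-style: https://s3.amazonaws.com/BUCKET/KEY
		return fmt.Sprintf("https://%s/%s/%s", endpoint, bucket, key)
	}
	// virtual-hosted: https://bucket.s3.amazonaws.com/KEY
	return fmt.Sprintf("https://%s.%s/%s", bucket, endpoint, key)
}

func main() {
	fmt.Println(objectURL("s3.amazonaws.com", "warehouse", "default/t/metadata.json", true))
	fmt.Println(objectURL("s3.amazonaws.com", "warehouse", "default/t/metadata.json", false))
}
```

Virtual-hosted addressing requires the bucket name to resolve as a DNS label, which a local MinIO endpoint typically can't do; hence MinIO's need for path-style.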

@zeroshade
Member

@loicalleyne can you take a look at the latest changes I made here?

@loicalleyne
Contributor Author

Is it intended not to provide the choice between virtual-hosted bucket addressing and path-style addressing?
LGTM otherwise; the tests are passing :)

@zeroshade
Member

@loicalleyne following pyiceberg's example, I've added an option to force virtual addressing. That work for you?

@loicalleyne
Contributor Author

LGTM 👍

@zeroshade
Member

@dwilson1988 When you get a chance, can you take a look at the changes I made here? I liked your thought on isolating things, but there were still a number of bucket-specific options that needed to be accounted for, since options are not always passed via the URL due to how Iceberg config properties work.

So I'd like your thoughts or comments on what I ultimately came up with to simplify what @loicalleyne already had while solving the failing tests and whether it fits what you were thinking and using internally. Once this is merged, I'd definitely greatly appreciate contributions for Azure as you said :)

@dwilson1988
Contributor

@zeroshade - I'll take a look this weekend!

@dwilson1988 (Contributor) left a review comment

LGTM!

@zeroshade zeroshade merged commit a2c1d2e into apache:main Dec 16, 2024
10 checks passed