Add s3 batch consumer #43
base: main
Conversation
ctx      context.Context
cancelFn context.CancelFunc
If possible, let's avoid contexts on structs. From experience this was always possible with some restructuring. I think this is one of my more controversial opinions when it comes to writing Go code :D I just checked the rest of the code and the Kinesis Datastream also follows this pattern. So better be consistent instead of mixing patterns.
yeah, I had the same dilemma in mind when I saw the AWS Kinesis DataStream implementation, so I decided to keep it consistent.
Happy to reorganize the logic if you think it's a dealbreaker; it shouldn't take much time to implement.
host/s3.go
func (b *traceBatcher) reset() []ParquetTraceEvent {
	b.Lock()
	prevTraces := make([]ParquetTraceEvent, len(b.traces))
	for i, trace := range b.traces {
		prevTraces[i] = *trace.toParquet()
	}
	b.traces = make([]*TraceEvent, 0)
	b.Unlock()
	return prevTraces
}
To avoid the data copy I think (not sure) that you can do something like this:

func (b *traceBatcher) reset() []*TraceEvent {
	b.Lock()
	prevTraces := b.traces
	b.traces = nil
	b.Unlock()
	return prevTraces
}

I'm really not sure about this.
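The swap itself is safe: once `b.traces` is set to `nil` under the lock, no other goroutine can reach the old slice, so it can be read without the mutex. Since `reset` returns `[]ParquetTraceEvent` in the PR, the conversion still has to happen somewhere, but it can move outside the critical section. A hedged sketch with stand-in types (the real `TraceEvent`/`ParquetTraceEvent` fields are not shown in this thread):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the PR's types.
type TraceEvent struct{ ID int }
type ParquetTraceEvent struct{ ID int }

func (t *TraceEvent) toParquet() *ParquetTraceEvent { return &ParquetTraceEvent{ID: t.ID} }

type traceBatcher struct {
	sync.Mutex
	traces []*TraceEvent
}

// reset swaps the slice out under the lock (no element copy),
// then converts to parquet outside the critical section.
func (b *traceBatcher) reset() []ParquetTraceEvent {
	b.Lock()
	prev := b.traces
	b.traces = nil // the next append allocates a fresh backing array
	b.Unlock()

	out := make([]ParquetTraceEvent, len(prev))
	for i, tr := range prev {
		out[i] = *tr.toParquet()
	}
	return out
}

func main() {
	b := &traceBatcher{traces: []*TraceEvent{{ID: 1}, {ID: 2}}}
	fmt.Println(len(b.reset()), len(b.traces)) // → 2 0
}
```

This keeps the lock held only for two pointer assignments, which matters if `toParquet` is expensive.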
Awesome that you have added all the tests!
host/s3_test.go
require.NoError(t, err)

// wait 2.5 secs (flusher should kick in)
time.Sleep(2500 * time.Millisecond)
Let's avoid sleeps in tests. Is there another synchronization way?
host/s3_test.go
// submit the traces
s3ds.submitRecords(ctx)
time.Sleep(300 * time.Millisecond)
same here. I'm happy to brainstorm how to add synchronization
Description
Due to the large number of traces Hermes can generate, we've decided to include the option to submit the traces (batched) into a given S3 bucket. Furthermore, we've agreed (at least so far) to rely on parquet files to make the traces easier to import on the Data processing side.
This PR adds that functionality into Hermes, adding all the necessary checks and tests to ensure nothing is broken.
Tasks:
- localstack s3 docker image

NOTE: this PR is still WIP, as I'll need to check the performance of both the parquet formatter and the s3 submitter.