[sc-24431] Replace FileSink with StreamSink #776
Closed
🤔 Why?
To reduce crawler memory usage, it is probably more efficient to write each MCE to file as soon as it has been parsed, instead of holding every parsed MCE in memory. To do that, the sink has to support writing a single MCE at a time.
The new StreamSink has the exact same chunking mechanism as FileSink, but it has to be used as a context manager. When the context ends, the sink finalizes the last batch file and writes the execution logs and metadata.
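To illustrate the context-managed pattern described above, here is a minimal sketch. It is not the code from this PR: the class name `StreamSinkSketch`, the `write_event` method, the `batch_size` parameter, and the `batch-N.json` / `metadata.json` file names are all hypothetical, and the execution logs are left out for brevity.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class StreamSinkSketch:
    """Hypothetical stand-in for StreamSink: accepts one event at a time,
    writes a batch file whenever the buffer is full, and finalizes on exit."""

    def __init__(self, directory: str, batch_size: int = 2) -> None:
        self._dir = Path(directory)
        self._batch_size = batch_size  # assumed chunking knob
        self._buffer: List[Dict[str, Any]] = []
        self._batches = 0
        self._total = 0

    def __enter__(self) -> "StreamSinkSketch":
        self._dir.mkdir(parents=True, exist_ok=True)
        return self

    def write_event(self, event: Dict[str, Any]) -> None:
        # Only the current batch is held in memory, never the full crawl.
        self._buffer.append(event)
        self._total += 1
        if len(self._buffer) >= self._batch_size:
            self._flush()

    def _flush(self) -> None:
        if not self._buffer:
            return
        self._batches += 1
        path = self._dir / f"batch-{self._batches}.json"
        path.write_text(json.dumps(self._buffer))
        self._buffer = []

    def __exit__(self, exc_type, exc, tb) -> None:
        # Finalize the last (possibly partial) batch, then write metadata.
        self._flush()
        metadata = {"total_events": self._total, "batch_files": self._batches}
        (self._dir / "metadata.json").write_text(json.dumps(metadata))


def demo() -> List[str]:
    """Pipe five events through the sink and list the files it produced."""
    with tempfile.TemporaryDirectory() as tmp:
        with StreamSinkSketch(tmp, batch_size=2) as sink:
            for i in range(5):
                sink.write_event({"id": i})
        return sorted(p.name for p in Path(tmp).iterdir())
```

In this sketch a crawler would call `write_event` as each MCE is parsed. Using the sink outside a `with` block would leave the last partial batch and the metadata unwritten, which is why the context manager is required.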
🤓 What?
StreamSink class to support piped MCEs from crawler classes.
Sink ABC and FileSink.
🧪 Tested?
Tested on personal dev env with Snowflake crawler:
crawler output:
ingestion logs:
live-tail-results.csv
All 427 MCEs are ingested successfully.
☑️ Checks
pyproject.toml.