Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sc-24431] Replace FileSink with StreamSink #776

Conversation

usefulalgorithm
Copy link
Contributor

@usefulalgorithm usefulalgorithm commented Feb 2, 2024

🤔 Why?

In order to reduce crawler memory usage, instead of having all parsed MCE hogging up memory it's perhaps more memory efficient to pipe a MCE to the file once it's been parsed. In order to do that, the sink has to support writing a single MCE at a time.

The new StreamSink has the exact same chunking mechanism as FileSink, but it has to be context managed:

with StreamSink(config, metadata) as sink:
    for event in events: # TODO: in the near future `events` will be replaced by `extractor.run_async()`, which is a iterator that yields a MCE once at a time.
        sink.write_event(event)

When the context ends, the sink finalizes the last batch file, and writes the execution logs and metadata.

🤓 What?

  • Implemented new StreamSink class to support piped MCEs from crawler classes.
  • Removed Sink ABC and FileSink.
  • Modified unit test.

🧪 Tested?

Tested on personal dev env with Snowflake crawler:

crawler output:
截圖 2024-02-02 下午6 18 42
ingestion logs:
live-tail-results.csv

All 427 MCEs are ingested successfully.

☑️ Checks

  • My PR contains actual code changes, and I have updated the version number in pyproject.toml.

Copy link

This pull request has been linked to Shortcut Story #24431: Refactor BaseExtractor for memory efficiency.

Copy link

github-actions bot commented Feb 2, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
15794 14569 92% 85% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
metaphor/common/base_config.py 100% 🟢
metaphor/common/cli.py 100% 🟢
metaphor/common/runner.py 80% 🟢
metaphor/common/sink.py 97% 🟢
metaphor/common/storage.py 72% 🟢
TOTAL 90% 🟢

updated for commit: 94f0937 by action🐍

Copy link

codecov bot commented Feb 2, 2024

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (568318a) 92.10% compared to head (94f0937) 92.24%.

Files Patch % Lines
metaphor/common/storage.py 54.54% 10 Missing ⚠️
metaphor/common/runner.py 70.00% 3 Missing ⚠️
metaphor/common/sink.py 98.16% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #776      +/-   ##
==========================================
+ Coverage   92.10%   92.24%   +0.14%     
==========================================
  Files         194      153      -41     
  Lines       15918    15794     -124     
==========================================
- Hits        14661    14569      -92     
+ Misses       1257     1225      -32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@usefulalgorithm
Copy link
Contributor Author

looks like S3 simply does not allow append:
https://stackoverflow.com/questions/41783903/append-data-to-an-s3-object

@usefulalgorithm
Copy link
Contributor Author

figured out a new way to do this, closing this PR

@mars-lan mars-lan deleted the tsung-julii/sc-24431/deprecate-file-sink-with-stream-sink branch May 21, 2024 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant