Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Change incremental sync to support stateless containers #56

Open
TylerHillery opened this issue Dec 1, 2024 · 0 comments
Open
Labels
enhancement New feature or request

Comments

@TylerHillery
Copy link
Owner

Issue Description

I want to migrate workload to something like Google Cloud Run. To do this I am going to need to change the data sync as Google Cloud Run is for stateless containers.

I am essentially using SQLite as a read only cache but I need a way to update the cache incrementally.

I am thinking I can have a separate container called "pypacktrends-sync" that can run a on daily cron. This container will be responsible for refreshing dbt and incrementally updating sqlite. The container will first pull down the last updated SQLite file, stored on some object store, probably Cloudflare D2 as they have zero egress fees, update the SQLite database ( including migrations ) and then publish the database to D2. After publishing I want to emit an event that any active containers can subscribe to to pull in the new file. Otherwise on container startup the container will pull in most recent updated file.

With this model I should be able to relax many of the pragma statements on SQLite when using as a read only cache that doesn't need any transactions. I will have to figure out how I can have different settings when doing the incremental sync and when the containers pulling it in. I believe journal mode is one of the few statements that persists after connection.

I want to avoid any volume mounts as this will add more latency on the reads.

I may also take the time into looking at using DuckDB instead because this new sync model I never have to worry about having a writer and read accessing the file at the same time, the writer will have exclusive access to the file during incremental sync and readers each have their own copy.

In my micro benchmarking DuckDB did not perform as well on queries will multiple package names as ART indexes per the discord

IN-clauses are not currently pushed into the scan, the development version has (partial) support for that - although it is not yet utilized by the ART index

Resources

@TylerHillery TylerHillery added the enhancement New feature or request label Dec 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant