Issue Description
I want to migrate the workload to something like Google Cloud Run. To do this I am going to need to change the data sync, since Google Cloud Run is designed for stateless containers.
I am essentially using SQLite as a read-only cache, but I need a way to update that cache incrementally.
I am thinking I can have a separate container called "pypacktrends-sync" that runs on a daily cron. This container will be responsible for refreshing dbt and incrementally updating SQLite. The container will first pull down the last published SQLite file from an object store, probably Cloudflare R2 since they have zero egress fees, update the SQLite database (including migrations), and then publish the database back to R2. After publishing, I want to emit an event that any active containers can subscribe to so they pull in the new file; otherwise, a container will pull in the most recently published file on startup.
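As a rough sketch, the sync job could look something like the following, assuming R2 is accessed through its S3-compatible API with boto3. The bucket name, object key, endpoint placeholder, and the migration/dbt commands are all hypothetical, not the project's actual setup:

```python
import subprocess

import boto3

BUCKET = "pypacktrends"            # hypothetical bucket name
KEY = "pypacktrends.sqlite3"       # hypothetical object key
LOCAL_PATH = "/tmp/pypacktrends.sqlite3"

# R2 exposes an S3-compatible endpoint; credentials come from the environment.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
)


def sync() -> None:
    # 1. Pull down the last published database.
    s3.download_file(BUCKET, KEY, LOCAL_PATH)

    # 2. Apply schema migrations, then refresh the models that write
    #    into the SQLite file (both commands are placeholders).
    subprocess.run(["alembic", "upgrade", "head"], check=True)
    subprocess.run(["dbt", "run"], check=True)

    # 3. Publish the updated database back to the object store.
    s3.upload_file(LOCAL_PATH, BUCKET, KEY)

    # 4. Emit a "new file available" event here (e.g. a Pub/Sub publish)
    #    so active containers can re-pull without waiting for a restart.


if __name__ == "__main__":
    sync()
```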
With this model I should be able to relax many of the SQLite pragma statements when using it as a read-only cache that doesn't need any transactions. I will have to figure out how to use different settings during the incremental sync versus when the containers are pulling it in. I believe journal mode is one of the few pragmas that persists in the database file across connections.
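For instance, the sync job and the app containers could open their connections with different settings. A minimal sketch, where the specific pragma values are assumptions rather than settled choices:

```python
import sqlite3


def writer_connection(path: str) -> sqlite3.Connection:
    # Sync container: the file is being mutated, so keep durability on.
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = DELETE")  # journal mode is stored in the file itself
    conn.execute("PRAGMA synchronous = FULL")
    return conn


def reader_connection(path: str) -> sqlite3.Connection:
    # App container: the file is a private, never-written copy, so locking
    # and durability work can be skipped entirely.
    conn = sqlite3.connect(f"file:{path}?mode=ro&immutable=1", uri=True)
    conn.execute("PRAGMA query_only = ON")      # belt and braces on top of mode=ro
    conn.execute("PRAGMA cache_size = -64000")  # ~64 MiB page cache
    return conn
```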
I want to avoid any volume mounts, as they would add more latency on reads.
I may also take the time to look into using DuckDB instead, because with this new sync model I never have to worry about a writer and a reader accessing the file at the same time: the writer has exclusive access to the file during the incremental sync, and readers each have their own copy.
In my micro-benchmarking, DuckDB did not perform as well on queries with multiple package names, because ART indexes are not yet used for IN-clauses; per the DuckDB Discord: "IN-clauses are not currently pushed into the scan, the development version has (partial) support for that - although it is not yet utilized by the ART index."
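For reference, this is roughly the query shape that was benchmarked (the table, column, and index names here are hypothetical, not the project's real schema). In SQLite the IN predicate can be driven through a b-tree index on package_name, which is exactly what DuckDB's ART index does not yet do:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weekly_downloads (package_name TEXT, week TEXT, downloads INTEGER)"
)
conn.execute("CREATE INDEX idx_pkg ON weekly_downloads (package_name)")

packages = ["requests", "httpx", "urllib3"]
placeholders = ",".join("?" * len(packages))
sql = f"""
    SELECT package_name, week, SUM(downloads)
    FROM weekly_downloads
    WHERE package_name IN ({placeholders})
    GROUP BY package_name, week
"""

start = time.perf_counter()
rows = conn.execute(sql, packages).fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.6f}s")

# The plan shows SQLite searching idx_pkg for each IN value; per the quote
# above, DuckDB does not yet route IN-clauses through its ART index this way.
print(conn.execute(f"EXPLAIN QUERY PLAN {sql}", packages).fetchall())
```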
Resources