
The many benefits of full house Postgres & its extensions #216

Open

tucnak opened this issue Dec 22, 2022 · 1 comment

tucnak commented Dec 22, 2022

I joined only recently, but it didn't take long before the site brought back that lepra feeling, albeit with a modern twist. This surprises me because I had assumed, for the longest time, that the lepra phenomenon simply couldn't be replicated. It turns out that, given the right circumstances, it absolutely can, and my assessment is that it will soon surpass lepra fair and square. That said, there is one aspect of lepra that I think requires quite a bit of forethought, and that is circus: all the fun, exciting things (think elections, funny switches, modes, auto-corrections) that we used to cherish in the lepra of the past. Obviously, the old and tired tricks won't do; the good news is that there's time and place for new ones. (Those should be developed in secrecy and merged publicly only eventually, so as not to give away the surprise.)

My line of work has to do with data analysis, forecasting & large language model research.

Before we go into tricks, though, I would like to offer something more substantial from an engineering standpoint. Postgres has a wide array of advantages when it comes to projects with a clear, siloed data flow. First I will enumerate some of these advantages, and then I will formulate a proposition on exactly how they would play out to Orbitar's advantage in the long run.

  • Extensive JSON support that allows aggregating complex payloads within the relational model, pre-planning said queries and thereby alleviating the need for ORM-based back-and-forth in cases where the logic is better expressed in SQL. I know it may seem like a silly idea to perform json_agg() in the data layer, but in my experience it's the clearest and most performant way to implement this sort of thing. It doesn't have to be done in every query, but when you can get away with it, it's very rewarding; there's a small sketch after this list.
  • ZomboDB is a native Postgres extension offering a zdb index access method that brings the full extent of Elasticsearch's capability into Postgres: simply turn your text column into an indexed zdb.text and ZomboDB will replicate its contents to a nearby Elasticsearch instance, with ranking quality and a query language vastly superior to the tf-idf/tsvector/tsquery machinery native to Postgres. This is incredibly useful and would allow adding state-of-the-art search functionality at a fraction of the engineering cost (sketch after this list).
  • Postgres is a great pub/sub and job server: all you need is a table to represent the queue, then SELECT ... FOR UPDATE SKIP LOCKED guarantees that no job is processed by more than one consumer in any given state, and NOTIFY delivers job updates to the back-end services asynchronously. No more long polling, no more Kafka, or anything else for that matter. Unless you're dealing with hundreds of thousands of events per second, you stay in the relational model and keep your rollbacks and whatnot (sketch after this list).
  • Timescale is perhaps the most impressive native Postgres extension: it does time series very well, but it also brings time- and space-partitioned hypertables, distributed hypertables, hyperfunctions, continuous aggregates, and state-of-the-art columnar compression, all while maintaining 100% compatibility with the Postgres APIs, so you talk to a hypertable exactly as you would to any other table. Zero cost, massive benefit when it comes to irregular loads, and you can get away with hoarding and actively using far more data at a fraction of the storage. I have used it successfully to ingest thousands of requests per second of product and analytical data on a single 16-core machine that also served the primary set of back-end services, including a Redis cache. Their single most general-purpose invention, which goes well beyond the domain of time series, is continuous aggregates: a hybrid between views and materialised views, where any SELECT against one translates into a UNION of the most recent not-yet-materialised data points with the already materialised part. Materialisation happens incrementally in the background and takes updates/deletes into account using lightweight triggers. Think of it as the realtime-correct solution to something like like/comment counts for posts and comments: no constant re-aggregation, yet full correctness at any point in time. With user-defined aggregates you can push hefty logic, which would be nontrivial to even begin laying out in MySQL, into these as lightweight, realtime, compressible indexes at a fraction of the cost (sketch after this list).
  • PostgresML is a native Postgres extension offering bindings to complex classifier/regression models, as well as HuggingFace-compatible large language models based on Transformers. You can quite literally do SELECT pgml.train() and SELECT pgml.predict(), which massively simplifies experimentation, development, and deployment of machine learning with the least effort imaginable. Normally you would have to provision an ML service of some sort, maintain the contract, and bear the communication cost, but here it's really as trivial as it sounds: feature-engineer a view, pass it in as the training rowset, and Bob's your uncle. All of your back-end services can now make predictions as part of an ordinary SELECT (sketch after this list).
  • Foreign data wrappers are the Postgres way of data integration: you can simply CREATE FOREIGN TABLE against any database in the world, including NoSQL stores like MongoDB, by specifying a loosely-defined schema and the FDW options. There's even an FDW plugin for Telegram bots, can you believe it? You can SELECT messages from chats, INSERT to reply, et cetera. Food for thought. This means no more ETL: all your data stays in the relational model, with no back-end coordination whatsoever. It's not vastly interesting for a small project like this one, but it's nice to know it's there in case you ever need it. And boy, you do (sketch after this list).
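
To make the JSON point concrete, here is a minimal sketch of the kind of query I mean; the posts and comments tables and their columns are hypothetical stand-ins, not Orbitar's actual schema:

```sql
-- One round-trip: every post together with its comments as a ready-to-serialise JSON array.
SELECT p.id,
       p.title,
       COALESCE(
         json_agg(json_build_object('id', c.id, 'author', c.author, 'body', c.body)
                  ORDER BY c.created_at)
           FILTER (WHERE c.id IS NOT NULL),
         '[]'::json
       ) AS comments
FROM posts p
LEFT JOIN comments c ON c.post_id = p.id
GROUP BY p.id, p.title;
```

The back-end gets the whole payload from one planned query instead of an ORM issuing N+1 round-trips.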
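
For ZomboDB, the setup is roughly the following sketch; the index name, URL and query string are illustrative, so check the ZomboDB documentation for the exact options:

```sql
CREATE EXTENSION zombodb;

-- Index the whole row; ZomboDB mirrors it into the Elasticsearch cluster at the given URL.
CREATE INDEX idx_posts_zdb ON posts
  USING zombodb ((posts.*))
  WITH (url = 'http://localhost:9200/');

-- Full Elasticsearch query syntax, ranked by Elasticsearch relevance rather than ts_rank.
SELECT id, title
FROM posts
WHERE posts ==> 'title:postgres AND body:"continuous aggregates"';
```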
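
The queue point is easiest to show in code; a minimal sketch, assuming a hypothetical jobs table:

```sql
-- Claim at most one pending job; SKIP LOCKED makes concurrent workers
-- skip rows that another transaction has already locked.
WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
UPDATE jobs
SET status = 'running', started_at = now()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;

-- Asynchronously push an update to every back-end service that ran LISTEN job_updates.
SELECT pg_notify('job_updates', json_build_object('id', 42, 'status', 'done')::text);
```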
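
And here is roughly what a continuous aggregate looks like for something like post scores; the votes hypertable and its columns are hypothetical:

```sql
-- votes(post_id, vote, voted_at), partitioned on time.
SELECT create_hypertable('votes', 'voted_at');

CREATE MATERIALIZED VIEW post_score_hourly
WITH (timescaledb.continuous) AS
SELECT post_id,
       time_bucket('1 hour', voted_at) AS bucket,
       sum(vote) AS score
FROM votes
GROUP BY post_id, bucket;

-- Materialise incrementally in the background; queries union in the
-- most recent, not-yet-materialised rows automatically.
SELECT add_continuous_aggregate_policy('post_score_hourly',
  start_offset      => INTERVAL '1 day',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '15 minutes');

-- Always-correct running score per post, with no full re-aggregation.
SELECT post_id, sum(score) AS score
FROM post_score_hourly
GROUP BY post_id;
```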
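
PostgresML really is about this terse; the project name, view, and feature columns below are hypothetical:

```sql
-- Train a classifier on a feature-engineered view; PostgresML stores the trained model.
SELECT * FROM pgml.train(
  'multi_account_detector',   -- project name
  'classification',           -- task
  'account_features',         -- relation holding features and the label
  'is_duplicate'              -- label column
);

-- Any back-end query can now score rows inline.
SELECT account_id,
       pgml.predict('multi_account_detector',
                    ARRAY[posts_per_day, avg_session_len, shared_ip_count]) AS risk
FROM account_features;
```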
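
Finally, a foreign data wrapper sketch: here mysql_fdw is pointed at the existing MySQL database, which could even bridge a gradual migration; all names are illustrative:

```sql
CREATE EXTENSION mysql_fdw;

CREATE SERVER orbitar_mysql
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host '127.0.0.1', port '3306');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER orbitar_mysql
  OPTIONS (username 'orbitar', password 'secret');

-- The MySQL table now behaves like a local relation: joins, aggregates, the lot.
CREATE FOREIGN TABLE legacy_posts (
  id         bigint,
  author_id  bigint,
  title      text,
  created_at timestamp
)
SERVER orbitar_mysql
OPTIONS (dbname 'orbitar', table_name 'posts');

SELECT count(*) FROM legacy_posts WHERE created_at > now() - interval '1 day';
```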

Now, the proposition is as follows. Orbitar is a young, dynamic website that is yet to realise its full potential, and some small tweaks now, when it comes to schema and data integration, would lower the bar in complexity, integration, and maintenance to the point where complex things genuinely become trivial. Please don't get me wrong: the value proposition here is not performance but everything else, namely ergonomics, extensibility, simplified reasoning, time-to-market. Things like PostgresML would make it trivial to roll out all kinds of models, be it for detecting multiple accounts, adding GPT-like generation (there are now plenty of decent Russian- and Ukrainian-speaking models of modest size which can be fine-tuned on Orbitar content to produce relevant generations) and other circus tricks that people like so much, all without ever bothering to set up a machine learning pipeline, plus sophisticated ranking that would correct for big clusters of bidirectional likes (karma grinding) and so much more.

MySQL holds a dear place in my heart, but these days the benefits of rolling out full house Postgres in new projects outweigh everything else. Just so you know: I'm very passionate about this project and would love to help out with the data engineering and machine learning work, and bring some fun tricks in the process, regardless of what your take on the contents of this issue turns out to be. I have a pre-made Docker container with the majority of the extensions outlined here, if that's something you'd like to try. I also have numerous instructive examples of how to improve things like the vote aggregation logic in a realtime-correct fashion, by roughly a factor of 100x over the current implementation (cache notwithstanding), using continuous aggregates.

This is not merely a feature request: I would be happy to consult on and implement anything from the above myself.

Best regards

-Ilya

Aivean (Collaborator) commented Dec 22, 2022

Thank you, Ilya, for the extensive write-up; it's quite an interesting proposition. I completely agree with your point about Postgres being more feature-rich than MySQL (which we use mostly for legacy reasons).

However, I must point out that the engineering effort required for the switch (especially to make it seamless) is non-trivial, both in terms of operations and of changes to the codebase. It's certainly nice to have the features you listed; however, it's not as if they can't be implemented with the current data storage.

For example, we already have Elasticsearch integration mostly ready; we already have workarounds in place for data aggregation where it's needed; at our scale there is no problem in using a MySQL table as a queue or installing RabbitMQ as a service; and we have had bot integrations before, and arguably using the DB API for that is not the best approach design-wise.

That being said, I appreciate your enthusiasm and help, and I think we should definitely continue this discussion, weighing the options and coming to a mutual agreement on the path forward. I've sent you an invite to our dev Discord; hope to see you there!
