Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add one more Engine for PostgreSQL #986

Open
chandr-andr opened this issue Apr 24, 2024 · 19 comments
Open

Add one more Engine for PostgreSQL #986

chandr-andr opened this issue Apr 24, 2024 · 19 comments

Comments

@chandr-andr
Copy link

Hello! Thank you for your awesome ORM.

Recently, I've created a new driver for PostgreSQL - https://github.com/qaspen-python/psqlpy. It shows a significant performance upgrade in comparison to asyncpg.
It has already been tested in some production services and I can say it is production-ready.

Do you mind if I create a big PR that adds a new engine based on PSQLPy?

@sinisaos
Copy link
Member

@chandr-andr Thank you for your interest in Piccolo. Your work with psqlpy and qaspen (I didn't go into details and had a quick look at the source code and I like that it has a lot of similarities with Piccolo :) ) is also great.
I have nothing against the alternative driver, but I have one question regarding the performance that you stated have a significant performance upgrade in comparison to asyncpg.
Apologies in advance if I'm doing something wrong as I'm not a benchmark expert and I'm using the http bencmark test as the ORM or driver will most likely be used for a web application. I've tried the FastAPI example from the docs and built a similar example with asyncpg like this

# users table only have 2 columns (id and name)
import asyncpg
import uvicorn

from contextlib import asynccontextmanager
from typing import AsyncGenerator

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Startup database connection pool and close it on shutdown."""
    db_pool = await asyncpg.create_pool(
        dsn="postgres://postgres:postgres@localhost:5432/psqlpydb",
        max_size=10,
    )
    app.state.db_pool = db_pool
    yield
    await db_pool.close()


app = FastAPI(lifespan=lifespan)


@app.get("/asyncpg")
async def pg_pool_example(request: Request):
    query_result = await request.app.state.db_pool.fetch(
        "SELECT * FROM users",
    )
    return JSONResponse(content=[dict(item) for item in query_result])


if __name__ == "__main__":
    uvicorn.run("app:app" )

and the http benchmark results are pretty similar and asyncpg does a bit better? Here are the results.

rkl@mint21:~$ bombardier -c 500 -d 10s -l http://localhost:8000/asyncpg
Bombarding http://localhost:8000/asyncpg for 10s using 500 connection(s)
[============================================================================================]10s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec       707.47     109.03    1021.59
  Latency      678.21ms    70.37ms      1.54s
  Latency Distribution
     50%   687.24ms
     75%   698.83ms
     90%   709.14ms
     95%   718.07ms
     99%      0.88s
  HTTP codes:
    1xx - 0, 2xx - 7556, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:   218.29KB/s
rkl@mint21:~$ bombardier -c 500 -d 10s -l http://localhost:8000/psqlpy
Bombarding http://localhost:8000/psqlpy for 10s using 500 connection(s)
[============================================================================================] 10s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec       639.18      90.12     975.61
  Latency      745.92ms    72.92ms      1.08s
  Latency Distribution
     50%   755.51ms
     75%   762.43ms
     90%   784.68ms
     95%   797.32ms
     99%      0.97s
  HTTP codes:
    1xx - 0, 2xx - 6860, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:   197.62KB/s

Sorry again if I missed the point about driver performance.

@chandr-andr
Copy link
Author

@sinisaos Thanks for the quick feedback!
I found your benchmarks interesting and decided to find out what's going on.
So, in my example in the docs, I set only 2 connections for the connection pool, so if you used it without any change it has to lose to the asyncpg.

I created the FastAPI application (the same as yours) but set 10 connections for the connection pool in psqlpy.
The results are above:
PSQLPy

> ./bombardier -c 500 -d 10s -l http://127.0.0.1:8000/psqlpy
Bombarding http://127.0.0.1:8000/psqlpy for 10s using 500 connection(s)
[================================================================================================================================================================================================================] 10s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      4487.78     575.16    6509.45
  Latency      110.76ms     8.40ms   272.14ms
  Latency Distribution
     50%   109.78ms
     75%   113.55ms
     90%   117.31ms
     95%   121.84ms
     99%   146.59ms
  HTTP codes:
    1xx - 0, 2xx - 45289, 3xx - 0, 4xx - 0, 5xx - 0
    others - 3
  Errors:
    dial tcp 127.0.0.1:8000: connect: connection reset by peer - 3
  Throughput:     1.14MB/s

And AsyncPG

> ./bombardier -c 500 -d 10s -l http://127.0.0.1:8000/asyncpg
Bombarding http://127.0.0.1:8000/asyncpg for 10s using 500 connection(s)
[================================================================================================================================================================================================================] 10s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      4465.72     362.02    6188.94
  Latency      111.26ms   108.25ms      1.31s
  Latency Distribution
     50%   108.62ms
     75%   124.19ms
     90%   314.84ms
     95%   336.71ms
     99%   549.38ms
  HTTP codes:
    1xx - 0, 2xx - 45064, 3xx - 0, 4xx - 0, 5xx - 0
    others - 17
  Errors:
    dial tcp 127.0.0.1:8000: connect: connection reset by peer - 17
  Throughput:     1.14MB/s

In this test, PSQLPy beats asyncpg.
The difference, of course, is not so significant, but the query is as simple as possible too.
PSQLPy can give significant improvement with high-load big queries with a lot of output data.

In this example, there are not a lot of places for speeding up.

@chandr-andr
Copy link
Author

chandr-andr commented Apr 25, 2024

BTW, one guy made a database performance repo - https://github.com/ymezencev/db_perf with PSQLPy already.

@sinisaos
Copy link
Member

BTW, one guy made a database performance repo - https://github.com/ymezencev/db_perf with PSQLPy already.

@chandr-andr Thank you for your reply. According to the tests from that repo, the difference is significant for larger amounts of data, as you mentioned before, although I doubt anyone will load 50.000 rows from a single web api call without some sort of pagination. For example, PiccoloCRUD (which is used by Piccolo Admin), has a max_page_size parameter that limits the maximum number of rows returned per single API call. I have nothing against an alternative PostgreSQL engine (another option is always good), but you'll have to wait for the Piccolo ORM author to say what he thinks about all this. Cheers.

@dantownsend
Copy link
Member

It looks like a cool project.

Since we've only supported asyncpg up until now, I'm not sure how much work is involved in adding a new Postgres engine.

If we're lucky, it's just a case of creating a new Engine subclass, with no other changes to Piccolo, in which case it could even be a third party library at first.

@chandr-andr
Copy link
Author

@dantownsend Hello!
Thank you for the positive assessment of the PSQLPy.
I can do MVP engine based on the driver myself, I've had some experience with piccolo at work, so have knowledge about the internals.
I think I can do it in a new repo near PSQLPy.

@dantownsend
Copy link
Member

@chandr-andr That would be great, thanks!

@insani7y
Copy link

insani7y commented Jul 20, 2024

Hello everyone!

I'm one of the developers behind psqlpy. We've created a third-party library, which you can check out here. However, we've encountered a small problem that makes the integration nearly impossible without your assistance.

The issue lies within the _process_results method, where some unwrapping is performed. Asyncpg returns a Record object, but psqlpy does not. Instead, it behaves more like SQLite in this context, returning a list of dictionaries. Therefore, we don't need this unwrapping. Unfortunately, there is no appropriate engine type for our use case. We cannot use the SQLite engine type for obvious reasons.

Would it be possible to add a psqlpy-postgres type, so we can eliminate this Record object unwrapping?

Additionally, this change needs to be reflected in the Query class within the querystrings property. There might be other places affected, but these are what I identified during my reverse engineering.

Looking forward to our collaboration!

@dantownsend
Copy link
Member

@insani7y Interesting point - the logic to standardise responses as lists of dicts should go into the engine itself, rather than in the query. I'll have a look.

@dantownsend
Copy link
Member

@insani7y If you try piccolo==1.14.0 hopefully this issue is fixed.

@insani7y
Copy link

insani7y commented Jul 22, 2024

Hello, @dantownsend, thanks for your help, everything is working just fine! But there is another problem with nodes.
There is this code inside Query._run

if node is not None:
    from piccolo.engine.postgres import PostgresEngine

    if isinstance(engine, PostgresEngine):
        engine = engine.extra_nodes[node]

But psqlpy is also about postgres. Let's add some is_nodes_available attribute/method to the engine, so this functionality would not depend on the Query._run realization, but on the engine itself.
What do you think?

@dantownsend
Copy link
Member

@insani7y Yes, good point.

I had a look and there are a couple of other places which use isinstance(engine, PostgresEngine), which also need tweaking.

@insani7y
Copy link

insani7y commented Jul 23, 2024

@dantownsend would you like to fix it yourself? I can open an issue and fix it myself, actually.

@dantownsend
Copy link
Member

@insani7y I wonder whether instead of having an is_nodes_available property, we just move extra_nodes to the base Engine class.

It's unlikely anybody would use it for SQLite, but not impossible.

Or for SQLite, we just remove it from SQLiteEngine.__init__ and pass an empty list to super().__init__.

What do you think?

@dantownsend
Copy link
Member

@insani7y Also, one thought - is there any benefit in creating something like BasePosgresEngine that all Postgres engines inherit from?

Or what if your own custom engine inherited from PostgresEngine? Do you think that's too messy as there's so much asyncpg specific logic in there?

@insani7y
Copy link

insani7y commented Jul 23, 2024

@dantownsend I believe there's limited benefit in using BasePostgresEngine because the various drivers like asyncpg, psqlpy, and psycopg have significant internal differences. Adapting a single method to accommodate all these drivers would be extremely challenging and would essentially require a complete rewrite. Additionally, other components such as AsyncBatch, Savepoint, Transaction, e.t.c. differ considerably among these drivers.

Inheritance presents similar issues, offering minimal advantage in this context imo.

I prefer the concept of EngineProtocol or some EngineABC that includes abstract methods only. With this approach, Piccolo would expect an interface rather than a specific implementation. For a protocol-based approach, moving extra_nodes to the engine class seems more appropriate.

If SQLite does not require this attribute, allowing engines to accept different configurations could be acceptable. In such a scenario, the configuration could be represented by a dataclass that the engine depends on via typing.Generic. However, this might be overkill. We could simply ignore those nodes and issue a warning, but it would be better for the user if this option was completely removed from the engine config.

@sinisaos
Copy link
Member

@insani7y @chandr-andr I've tried your library and I think it's great. Maybe I should have created an issue in your repo and if needed I can do that. I found a few problems with the columns. If asyncpg is used as the driver, there are no such problems. Here it is and I hope it is well represented.

Intervalcolumn:

Object save example :

data = MegaTable(interval_col=datetime.timedelta(seconds=10))
await data.save()

Error:
psqlpy.exceptions.PyToRustValueMappingError: Can't convert value from python to rust type: Can not covert you type 0:00:10 into inner one

JSONB column:

Object save example :

data = MegaTable(jsonb_col={"a": 1})
await data.save()

Error:
psqlpy.exceptions.ConnectionExecuteError: Connection execute error: Cannot execute statement, error - db error: ERROR: unsupported jsonb version number 123.

BigInt column:

Error:
psqlpy.exceptions.ConnectionExecuteError: Connection execute error: Cannot execute statement, error - db error: ERROR: insufficient data left in message.

SmallInt column:

Error:
psqlpy.exceptions.ConnectionExecuteError: Connection execute error: Cannot execute statement, error - db error: ERROR: incorrect binary data format in bind parameter 6.

DoublePrecision column:

Error:
psqlpy.exceptions.ConnectionExecuteError: Connection execute error: Cannot execute statement, error - db error: ERROR: insufficient data left in message.

Error trace is same for all column errors:

File "/home/rkl/dev/env/lib/python3.11/site-packages/piccolo/query/base.py", line 189, in run
    return await self._run(node=node, in_pool=in_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkl/dev/env/lib/python3.11/site-packages/piccolo/query/base.py", line 171, in _run
    results = await engine.run_querystring(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkl/dev/env/lib/python3.11/site-packages/psqlpy_piccolo/engine.py", line 668, in run_querystring
    response = await self._run_in_pool(query, query_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rkl/dev/env/lib/python3.11/site-packages/psqlpy_piccolo/engine.py", line 609, in _run_in_pool
    response = await connection.execute(
               ^^^^^^^^^^^^^^^^^^^^^^^^^

Sorry for the long post and sorry if I'm doing something wrong. I hope you find it useful.

@insani7y
Copy link

@sinisaos Thanks for identifying these issues. We'll look into them and get back to you once everything is fixed!

@sinisaos
Copy link
Member

@insani7y Great. Thanks. You probably know that the rest of the columns errors come from object creating like this

data = MegaTable(
    bigint_col=100000000,
    double_precision_col=1.23,
    smallint_col=1,
)
await data.save()

I forgot to write that in the previous comment. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants