SNOW-893080: session.bulk_save_objects does not put all objects in one INSERT #441
Comments
Just came across this as well. Seems like a pretty big issue...
Has there been any progress on this issue?
Hi, and thank you for raising this issue. I'm checking with the team to see whether we can address this before the SQLAlchemy 2.0 support release (which currently has priority). Thank you for bearing with us!
My colleague took a look and shared the example below:

```python
# Imports assumed for this snippet (SQLAlchemy 1.4); the Snowflake engine
# creation is not shown in the original and is left out here as well.
from sqlalchemy import Column, Integer, Sequence, String, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class SampleBulk(Base):
    __tablename__ = "sample_bulk"

    pk = Column(Integer, Sequence("sample_bulk_pk_seq", order=True), primary_key=True)
    name = Column(String(30))
    amount = Column(Integer, default=0)

    def __repr__(self) -> str:
        return f"SampleBulk(pk={self.pk}, name={self.name}, amount={self.amount})"


def main(engine):
    try:
        Base.metadata.create_all(engine)
        with Session(engine) as session:
            todds = (SampleBulk(name=f"Tod_{i}", amount=i) for i in range(1, 59999))
            session.bulk_save_objects(todds)
            session.commit()
            result = session.query(func.count(SampleBulk.pk)).scalar()
            print(f" *** {result=}")
    finally:
        Base.metadata.drop_all(engine)


if __name__ == "__main__":
    # assumes `engine` was created earlier via create_engine() with a
    # Snowflake URL (connection details omitted from the snippet)
    main(engine)
```

This was able to insert 60k rows in one single command. Also, on a side note, to be able to handle NULL in ORM models this example uses … All in all: if any of you still has this problem, could you please check with v1.5.1 and see if it works for you?
Thanks for the response @sfc-gh-dszmolka. I ran your test and it did write out all 60k records in a single statement. I'm not seeing the same behavior in our app, though, so I'm trying to put together a test case that reproduces what I'm seeing. The entity I'm working with has about 40 columns and when I submit a batch of ~5k records, it inserts in batches of only about 15 at a time. I'll provide more info on the batch insert once I have a better test to illustrate it. However, the updates are an even bigger issue. If you modify your test case to include the below snippet, you will see that the updates are happening one at a time:
Output:
My expectation would be that it should be able to batch the updates into a single query with something like:
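For illustration, such a single-statement batch UPDATE could look roughly like the sketch below. This is only a sketch: it reuses the `sample_bulk` table and the `engine` from the example above, and the `UPDATE … FROM (VALUES …)` form it relies on is an assumption about supported syntax, not something the connector generates today.

```python
from sqlalchemy import text

# Hypothetical sketch: apply many row changes in one statement by joining the
# target table to an inline VALUES list on the primary key. Table and column
# names come from the SampleBulk example above; whether this exact SQL is
# accepted by the target database is an assumption.
rows = [(1, 100), (2, 200), (3, 300)]  # (pk, new amount) pairs
values_sql = ", ".join(f"({pk}, {amount})" for pk, amount in rows)

stmt = text(
    f"""
    UPDATE sample_bulk
    SET amount = v.amount
    FROM (VALUES {values_sql}) AS v (pk, amount)
    WHERE sample_bulk.pk = v.pk
    """
)

with engine.begin() as conn:  # assumes the `engine` from the example above
    conn.execute(stmt)
```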
Thanks for the repro, we're taking a look. However, this issue started to branch out a little from the original INSERT batching problem. Also, considering you now have problems with the UPDATE behaviour, … Once the …
@sfc-gh-dszmolka: I have a reproducible example. The issue occurs when an entity has optional/nullable columns. Here's an example where inserts are not done in bulk:
Here's a snippet of the output:
It appears that the inserts will only be batched if consecutive objects added to the list have the same set of columns populated. EDIT: adding that I ran this test with …
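Given that observation, one way to keep the batches intact is to make every object carry the same set of populated attributes, for example by explicitly assigning None to the optional columns. Below is a minimal sketch only; it assumes a SampleBulk model with nullable col1/col2/col3 columns like the one in the Postgres repro further down, and a `session` created as in the earlier examples.

```python
import random

# Sketch: populate every attribute on every object (even if only with None)
# so all rows share the same column set and bulk_save_objects can keep them
# in a single executemany batch. Caveat: columns explicitly set to None will
# not pick up their client- or server-side defaults.
NULLABLE_COLS = ("col1", "col2", "col3")

todds = []
for i in range(1, 10000):
    d = {"pk": i, "name": f"Tod_{i}", "amount": i}
    for col in NULLABLE_COLS:
        d[col] = f"{col}_{i}" if random.getrandbits(1) else None
    todds.append(SampleBulk(**d))

session.bulk_save_objects(todds)  # assumes the `session` from the repro
session.commit()
```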
Thank you for adding this repro for the "not batching INSERTs" issue (the issue handled here in #441). Out of curiosity, I tried running the reproduction without Snowflake, using a Postgres instance:

```python
# cat testpg.py
from sqlalchemy import create_engine, Column, String, Integer, Sequence, func
from sqlalchemy.engine import URL
from sqlalchemy.orm import declarative_base, Session
import random

url = URL.create(
    drivername="postgresql",
    username="postgres",
    host="/var/run/postgresql",
    database="test_db",
)
engine = create_engine(url, echo=True)

Base = declarative_base()


class SampleBulk(Base):
    __tablename__ = "sample_bulk"

    pk = Column(Integer, Sequence('sample_bulk_pk_seq', order=True), primary_key=True)
    name = Column(String(30))
    amount = Column(Integer, default=0)
    col1 = Column(String(4000))
    col2 = Column(String(4000))
    col3 = Column(String(4000))

    def __repr__(self) -> str:
        return f"SampleBulk(pk={self.pk}, name={self.name}, amount={self.amount})"


def main(engine):
    try:
        Base.metadata.create_all(engine)
        with Session(engine) as session:
            todds = []
            for i in range(1, 10000):
                d = {
                    "pk": i,
                    "name": f"Tod_{i}",
                    "amount": i,
                }
                for col in ['col1', 'col2', 'col3']:
                    if bool(random.getrandbits(1)):
                        d[col] = f"{col}_{i}"
                todds.append(SampleBulk(**d))
            ### tried both with the defaults and with the flags from your repro; INSERTs still come one at a time
            session.bulk_save_objects(todds, update_changed_only=False, return_defaults=False, preserve_order=False)
            # session.bulk_save_objects(todds)
            session.commit()
            result = session.query(func.count(SampleBulk.pk)).scalar()
            print(f" *** {result=}")
    finally:
        Base.metadata.drop_all(engine)


if __name__ == "__main__":
    main(engine)
```

Result: every INSERT still came one row at a time, even with Postgres. For the issue with the …
My colleague also found the following interesting Stack Overflow post: https://stackoverflow.com/questions/48874745/sqlalchemy-bulk-insert-mappings-generates-a-large-number-of-insert-batches-is-t It looks to be very relevant to your use case and has possible solutions as well; please take a look once you get a chance. It has to do with how the input data is structured, rather than any Snowflake aspect.
Yes, I'm aware of that SO post. That is the method I used to work around the batching issue when using …
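For reference, a grouping workaround along those lines might look like the sketch below. This is only a sketch: `bulk_insert_grouped` is a hypothetical helper, and the rows are assumed to be plain dicts of column values.

```python
from itertools import groupby


def bulk_insert_grouped(session, model, mappings):
    """Hypothetical helper: sort and group row dicts by their populated column
    set, so each bulk_insert_mappings() call gets homogeneous rows that the
    driver can send as one executemany batch."""
    key = lambda row: tuple(sorted(row))
    for _, group in groupby(sorted(mappings, key=key), key=key):
        session.bulk_insert_mappings(model, list(group))


# e.g. bulk_insert_grouped(session, SampleBulk, rows)  # rows: list of dicts
```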
I can very much agree that it's not ideal, but I'm also a bit baffled by the fact that the same behaviour of `bulk_save_objects` shows up with other dialects too. During my tests with Postgres, I also tried using SQLAlchemy 2.0 instead of 1.4.52; the result was similar, and the INSERTs still came one at a time. Do you know perhaps of any dialect with which your reproduction works as expected, i.e. sends the generated 10k rows in a single INSERT?
Since the behaviour does not seem to originate from `snowflake-sqlalchemy` itself, … In the meantime, please reach out to your Snowflake Account Team and let them know how important this improvement would be for your use case; this might put additional traction on the request and might help with reprioritizing.
Please answer these questions before submitting your issue. Thanks!
Python version: Python 3.10.11 (main, Apr 5 2023, 14:15:10) [GCC 9.4.0]
Operating system: Linux-5.14.0-1029-oem-x86_64-with-glibc2.31
Component versions in the environment (pip freeze): …
What did you do? Using the SQLAlchemy 1.4 bulk_save_objects API, I added 5000 objects in one SQLAlchemy session and then committed the session.
What did you expect to see? I expected to see one INSERT with 5000 VALUES rows. Instead, I see a variety of INSERT sizes, from 1 row to ~50 rows, and the insert of 5k objects takes 3+ minutes.