
SNOW-219884: Pandas datetime with timezone converts to timestamp_ntz in snowflake #199

Open
kychanbp opened this issue Nov 9, 2020 · 19 comments
Labels: bug (Something isn't working), status-triage_done (Initial triage done, will be further handled by the driver team)

Comments


kychanbp commented Nov 9, 2020

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?
    Python 3.8.2

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?
    macOS-10.15.7-x86_64-i386-64bit

  3. What are the component versions in the environment (pip list)?
    Package Version


asn1crypto 1.4.0
awswrangler 1.10.0
azure-common 1.1.25
azure-core 1.8.2
azure-storage-blob 12.5.0
boto3 1.15.18
botocore 1.18.18
certifi 2020.6.20
cffi 1.14.3
chardet 3.0.4
cryptography 3.2.1
idna 2.10
isodate 0.6.0
jmespath 0.10.0
msrest 0.6.19
numpy 1.19.4
oauthlib 3.1.0
oscrypto 1.2.1
packaging 20.4
pandas 1.1.4
pip 20.2.1
psycopg2-binary 2.8.6
pyarrow 2.0.0
pycparser 2.20
pycryptodomex 3.9.9
PyJWT 1.7.1
PyMySQL 0.10.1
pyOpenSSL 19.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.4
requests 2.23.0
requests-oauthlib 1.3.0
s3transfer 0.3.3
setuptools 50.2.0
six 1.15.0
snowflake-connector-python 2.3.5
snowflake-sqlalchemy 1.2.4
SQLAlchemy 1.3.20
sqlalchemy-redshift 0.8.1
urllib3 1.25.11
wheel 0.35.1

  4. What did you do?
    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.

I extracted data from MySQL and constructed a pandas DataFrame. For example:

                     transaction_date                       created_on                      modified_on
0            2017-10-04 08:00:00+00:00        2018-10-05 18:56:39+08:00        2018-10-05 18:56:39+08:00
1            2017-10-04 08:00:00+00:00        2018-10-05 18:56:39+08:00        2018-10-05 18:56:39+08:00
2            2017-10-04 08:00:00+00:00        2018-10-05 18:56:39+08:00        2018-10-05 18:56:39+08:00
3            2017-10-04 08:00:00+00:00        2018-10-05 18:56:39+08:00        2018-10-05 18:56:39+08:00
4            2017-10-04 08:00:00+00:00        2018-10-05 18:56:39+08:00        2018-10-05 18:56:39+08:00
...                                ...                              ...                              ...
33934        2020-11-05 00:00:00+00:00 2020-11-06 18:45:03.681785+08:00 2020-11-06 18:45:03.711735+08:00
33935        2020-11-05 00:00:00+00:00 2020-11-06 18:45:03.685278+08:00 2020-11-06 18:45:18.441977+08:00
33936        2020-11-05 00:00:00+00:00 2020-11-06 18:45:03.688080+08:00 2020-11-06 18:45:21.579158+08:00
33937 2020-11-09 10:00:13.190445+00:00 2020-11-09 10:00:13.195200+08:00 2020-11-09 10:00:13.195237+08:00
33938        2020-11-09 10:28:57+00:00 2020-11-09 10:30:02.515862+08:00 2020-11-09 10:30:02.515884+08:00

the column types are:

transaction_date               datetime64[ns, UTC]
created_on          datetime64[ns, Asia/Hong_Kong]
modified_on         datetime64[ns, Asia/Hong_Kong]

I used the following code to load the data into Snowflake:
data.to_sql(target_table, conn, if_exists='replace', index=False, method=pd_writer)
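A minimal sketch of the repro, assuming a live SQLAlchemy Snowflake connection `conn` and `pd_writer` for the final (commented-out) write; the DataFrame construction alone shows the tz-aware dtypes involved:

```python
import pandas as pd

# Build a small frame with the same tz-aware dtypes as the report above.
df = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2017-10-04 08:00:00"]).tz_localize("UTC"),
    "created_on": pd.to_datetime(["2018-10-05 18:56:39"]).tz_localize("Asia/Hong_Kong"),
    "modified_on": pd.to_datetime(["2018-10-05 18:56:39"]).tz_localize("Asia/Hong_Kong"),
})
print(df.dtypes)  # datetime64[ns, UTC] / datetime64[ns, Asia/Hong_Kong]

# The write below reproduces the issue; it requires a live Snowflake
# connection (`conn`) and snowflake.connector.pandas_tools.pd_writer:
# df.to_sql("target_table", conn, if_exists="replace", index=False, method=pd_writer)
```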

  5. What did you expect to see?
    TIMESTAMP_TZ columns

  6. What did you see instead?
    TIMESTAMP_NTZ(9)

  7. Can you set logging to DEBUG and collect the logs?

import logging
import os

for logger_name in ['snowflake.sqlalchemy', 'snowflake.connector', 'botocore']: 
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
@github-actions github-actions bot changed the title Pandas datetime with timezone converts to timestamp_ntz in snowflake SNOW-219884: Pandas datetime with timezone converts to timestamp_ntz in snowflake Nov 9, 2020
@kychanbp (Author)

Any idea why this happens?

@kychanbp (Author)

I inspected the SQLAlchemy log file. The generated SQL statement is:

CREATE TABLE x_ap_account_transactions (
	"ID" BIGINT, 
	"OBJECT_PK" BIGINT, 
	"OBJECT_REF" TEXT, 
	"TRANSACTION_DESC" TEXT, 
	"TRANSACTION_TYPE" TEXT, 
	"TRANSACTION_DATE" TIMESTAMP, 
	"ORIGINAL_FEE" FLOAT, 
	"ORIGINAL_CURRENCY" TEXT, 
	"EFFECTIVE_FEE" FLOAT, 
	"EFFECTIVE_CURRENCY" TEXT, 
	"IS_SUSPENSION_TRANSACTION" BIGINT, 
	"CREATED_ON" TIMESTAMP, 
	"MODIFIED_ON" TIMESTAMP, 
	"STATUS" TEXT, 
	"ACCOUNT_ID" BIGINT
)

It seems the statement creates plain `TIMESTAMP` columns, not timestamps with timezone information.

@kychanbp (Author)

Or should I set the timestamp type mapping described here: https://docs.snowflake.com/en/sql-reference/parameters.html#timestamp-type-mapping?

@kychanbp (Author)

I think I solved it by setting the timestamp type mapping and the session timezone:

conn.execute("ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_TZ';")
conn.execute("ALTER SESSION SET TIMEZONE = 'UTC';")

@lightupyiqian

Saw the same problem using pandas to_sql. Looks like a bug, as the docs state:

"Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone."
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

@kychanbp (Author)

@lightupyiqian But it seems Snowflake supports timestamps with timezone? I'd expect pd_writer to handle that.

@lightupyiqian

> @lightupyiqian But it seems Snowflake supports timestamps with timezone? I'd expect pd_writer to handle that.

Yeah, it should, but it has other problems, like not quoting identifiers properly.


Nazarii commented Nov 19, 2020

This bug was introduced in v1.1.18 and also affects sqlalchemy.DateTime(timezone=True) columns: in that case the column is passed to Snowflake as DATETIME, which is an alias for TIMESTAMP_NTZ in Snowflake. The previous v1.1.17 logic created a TIMESTAMP_TZ column.
v1.1.17...v1.1.18
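As a sketch of the expected behavior, here is what DateTime(timezone=True) compiles to under a SQLAlchemy dialect that supports tz-aware timestamps (PostgreSQL is used purely for illustration, since the Snowflake dialect's mapping is the regression under discussion):

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

# On a tz-aware-capable dialect, DateTime(timezone=True) compiles to a
# timezone-aware column type. The Snowflake dialect regression described
# above instead emits DATETIME, which Snowflake treats as TIMESTAMP_NTZ.
ddl_type = sa.DateTime(timezone=True).compile(dialect=postgresql.dialect())
print(ddl_type)  # TIMESTAMP WITH TIME ZONE
```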


toandm commented Sep 8, 2022

Hi guys, this issue still persists with snowflake-connector-python==2.7.11. My DataFrame column dtype is datetime64[ns, UTC], but when I use df.to_sql to write to Snowflake, the resulting datatype is TIMESTAMP_NTZ. Any ideas how to fix this?

@ChuliangXiao

Having the same issue: pandas datetime64[ns, America/New_York] becomes Snowflake TIMESTAMP_NTZ(9) when using Snowpark session.write_pandas.


El-Carverino commented Dec 10, 2022

I am also encountering the same problem.

  • I have a timezone-aware Python datetime object: start_date = datetime.now(tz=timezone.utc).
  • I have a Snowflake procedure, which expects a TIMESTAMP_TZ argument.
  • I am calling that Snowflake procedure, using Snowpark session.call, passing the above start_date variable as the argument.
  • Contrary to the documentation, the Snowflake data type becomes TIMESTAMP_NTZ, and when handled by the procedure, it "converts" the value (back) to TIMESTAMP_TZ, to match the expected type, which incorrectly alters the timestamp value.
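To illustrate why that round trip alters the value, here is a stdlib-only sketch of the coercion described above (an illustration of the arithmetic, not Snowflake's actual internals): dropping the tzinfo keeps only the wall-clock time, so reattaching a different zone shifts the real instant.

```python
from datetime import datetime, timedelta, timezone

hkt = timezone(timedelta(hours=8))
aware = datetime(2020, 11, 9, 10, 30, tzinfo=hkt)   # 10:30 +08:00 == 02:30 UTC
naive = aware.replace(tzinfo=None)                  # wall clock only, like TIMESTAMP_NTZ
reinterpreted = naive.replace(tzinfo=timezone.utc)  # "convert back" by attaching UTC
shift = reinterpreted - aware                       # the instant has drifted
print(shift)  # 8:00:00
```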

@ed-openstore

Has anyone found a true fix for this? It still persists using SQLAlchemy and pd_writer.


sabrinazhengliu commented Jan 5, 2023

I'm having the same issue using a SQLAlchemy engine and df.to_sql(). The source data is a pandas timestamp with local timezone information; after writing to Snowflake it becomes TIMESTAMP_NTZ. Is anyone working on this issue?

Meanwhile, I was able to get the desired data types by explicitly creating the table with column definitions before ingesting data.
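That workaround can be sketched as follows (the table and column names are illustrative, matching the example frame earlier in the thread): create the table with explicit TIMESTAMP_TZ columns first, then load with if_exists='append' so pandas does not recreate the table.

```sql
-- Illustrative DDL; adjust the table/column names to your schema.
CREATE TABLE IF NOT EXISTS target_table (
    "TRANSACTION_DATE" TIMESTAMP_TZ,
    "CREATED_ON"       TIMESTAMP_TZ,
    "MODIFIED_ON"      TIMESTAMP_TZ
);
```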

@jaycgibbs

Also having this issue: a pandas DataFrame with correct timezone offsets is converted to NTZ when sent to Snowflake. I'm using a SQLAlchemy engine and df.to_sql().


JayP718 commented Oct 19, 2023

bump

@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Mar 20, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Mar 20, 2024
@sfc-gh-dszmolka (Contributor)

Hi, and apologies for taking so long to look into this; we're going to change that going forward. Checking this issue now. Thank you @Nazarii for your reproduction on the linked issue!

@sfc-gh-dszmolka (Contributor)

Unfortunately, it looks like this long-standing regression is still there :( and switching to sqlalchemy.TIMESTAMP does not help either; the query is still generated as updated_at TIMESTAMP, the same as the original updated_at datetime; both are mapped to the TIMESTAMP_NTZ Snowflake datatype by default.

We're going to take a look and fix this.

until then, the workarounds already suggested in this issue are available to use:

  • setting TIMESTAMP_TYPE_MAPPING to e.g. TIMESTAMP_TZ prior to creating the table with the timestamp column
  • creating the table with explicit TIMESTAMP_TZ column before ingesting data into it

Thank you for bearing with us!

@sfc-gh-dszmolka sfc-gh-dszmolka added bug Something isn't working status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Mar 20, 2024

Nazarii commented Mar 20, 2024

@sfc-gh-dszmolka Thanks. BTW, I have a PR that fixes this issue; it has been open for a long time and now has merge conflicts that need to be resolved.

@sfc-gh-dszmolka (Contributor)

Thanks @Nazarii for the contribution! I added some reviewers from the connector team in the hope that the review process can be sped up.

However, as you mentioned, the conflicts still need to be fixed eventually; do you think that would be possible, or could you even submit a new PR based on the current main? There have probably been tons of changes since the PR was originally submitted.
