Google Sheet: Unformatted date column not loaded correctly, if the the first row contains no value for that column #553

francescomucio · 2024-08-13T06:59:22Z

dlt version

0.5.2

Source name

google_sheets

Describe the problem

Loading a Google Sheet we discovered that a date column (let's call it column_dt) is not loaded correctly if:

the first cell for that column is empty
the column is not formatted

If both conditions apply the first row had column_dt empty (as expected), but in the second row we will find 1970-01-01 instead of the actual value.

This is a problem because we often load sheet which we do not own.

I tried to use data_type hints, but the only thing that actually works is formatting the column in the sheet.
It happens with duckdb and PG, so I assume is db agnostic.

Expected behavior

The expected behaviour is to have the values loaded correctly.

Steps to reproduce

This is the issue and how to reproduce:

create a new Google Sheet and populate it with the following value, but do not format any column:

Start Date	End Date	Text
	        2027/04/12	blablabla
2027/04/12	2027/04/12	43432

Important: The date columns are not formatted as date. If you start playing with formatting, it is possible that you won't be able to reproduce the issue. Just create a new sheet and copy there only the values.

Use this code:

import dlt

from google_sheets import google_spreadsheet


def load_pipeline_with_sheets(spreadsheet_url_or_id: str) -> None:
    """
    Will load all the sheets in the spreadsheet, but it will not load any of the named ranges in the spreadsheet.
    """
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination="duckdb",
        # full_refresh=True,
        dataset_name="sample_google_sheet_data",
    )
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        get_sheets=True,
        range_names=["Sheet1"],
    )

    data.resources["Sheet1"].apply_hints(
        table_name="test",
        columns={
            "Start Date": {"data_type": "date", "nullable": True},
            "End Date": {"data_type": "date", "nullable": False},
        },
    )

    info = pipeline.run(data)
    print(info)


if __name__ == "__main__":
    url_or_id = "YOUR_SHEET_ID"

    load_pipeline_with_sheets(url_or_id)

How you are using the source?

I run this source in production.

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt destination

duckdb to test, postgres in production

Additional information

Slack thread is here.

The text was updated successfully, but these errors were encountered:

rudolfix · 2024-08-15T14:17:31Z

@francescomucio I assume you do not have any impact on what is in the google sheets? we indeed use first row of data to infer the data types. But that is not happening via the values but via metadata. So to make it work you should just set the data types on the first row of data, even if they are empty.

data_type = val_dict["effectiveFormat"]["numberFormat"]["type"]
                if data_type in ["DATE_TIME", "TIME"]:
                    data_types[idx] = "timestamp"
                elif data_type == "DATE":
                    data_types[idx] = "date"

datetimes are represented as integers in some super convoluted way and it is impossible to infer date type just from content

if you cannot do that then indeed we need to start scanning deeper ie. obtain more rows of metadata.

francescomucio · 2024-08-15T20:25:30Z

Formatting the row is for sure my suggestion, but I am not sure if this is going to work.

I think it will make sense to try to get this metadata information from the first valid cell and not stopping at the first one

github-project-automation bot added this to Verified Sources Aug 13, 2024

github-project-automation bot moved this to Planned in Verified Sources Aug 13, 2024

anuunchin added the community label Aug 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Google Sheet: Unformatted date column not loaded correctly, if the the first row contains no value for that column #553

Google Sheet: Unformatted date column not loaded correctly, if the the first row contains no value for that column #553

francescomucio commented Aug 13, 2024

rudolfix commented Aug 15, 2024

francescomucio commented Aug 15, 2024

Google Sheet: Unformatted date column not loaded correctly, if the the first row contains no value for that column #553

Google Sheet: Unformatted date column not loaded correctly, if the the first row contains no value for that column #553

Comments

francescomucio commented Aug 13, 2024

dlt version

Source name

Describe the problem

Expected behavior

Steps to reproduce

How you are using the source?

Operating system

Runtime environment

Python version

dlt destination

Additional information

rudolfix commented Aug 15, 2024

francescomucio commented Aug 15, 2024