DAY partitioned BQ table data deleted fully even though 'spark.sql.sources.partitionOverwriteMode' is DYNAMIC #1325

Open · soumikdas-oa opened this issue Dec 13, 2024 · 4 comments
@soumikdas-oa

We have a BigQuery table partitioned by DAY on a date (YYYY-MM-DD) column. We want to overwrite the data of a specific partition using PySpark. To do this, I set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC' as per the spark-bigquery-connector documentation, but it still deleted the other partitions' data, which should not happen.

To give more context:

  • The dataframe is filtered on a partition condition beforehand and then written to BigQuery. So the dataframe contains only the data for the partition that is supposed to be overwritten, but that partition is not the only one affected.
  • 'spark.sql.sources.partitionOverwriteMode' is set to 'DYNAMIC' in the dataframe writer options, which did not work (see the code below).
  • The same config was also set in the cluster's advanced Spark config, which did not work either.
  • Even if the 'partitionField' and 'partitionType' options are removed from the code below, the result is still not the expected one, i.e. the whole table's data is deleted instead of only the specific partition's data.

```python
df.write.format("bigquery") \
    .option("table", f"{bq_table}") \
    .option("dataset", f"{bq_dataset}") \
    .option("temporaryGcsBucket", f"{temp_gcs_bucket}") \
    .option("partitionField", f"{partition_date_col}") \
    .option("partitionType", f"{bq_partition_type}") \
    .option("spark.sql.sources.partitionOverwriteMode", "DYNAMIC") \
    .option("writeMethod", "indirect") \
    .mode("overwrite") \
    .save()
```

Databricks Runtime Version: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)

@davidrabinowitz (Member)

Please verify which connector version this Databricks runtime uses.
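
For example, from a notebook you can list the BigQuery jars on the cluster's system classpath. A minimal sketch, assuming the standard /databricks/jars directory:

```python
import os

# List every BigQuery-related jar on the Databricks system classpath.
for jar in sorted(os.listdir("/databricks/jars")):
    if "bigquery" in jar.lower():
        print(jar)
```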

@soumikdas-oa (Author)

> Please verify which connector version this Databricks runtime uses.

Please refer to the entries below, taken from the Databricks cluster's System Classpath, where I found the spark-bigquery-connector jars:

```
/databricks/jars/----ws_3_5--third_party--bigquery-connector--spark-bigquery-connector-hive-2.3__hadoop-3.2_2.12--118181791--fatJar-assembly-0.22.2-SNAPSHOT.jar | System Classpath
/databricks/jars/----ws_3_5--third_party--bigquery-connector--spark-bigquery-connector-upgrade_scala-2.12--118181791--spark-bigquery-with-dependencies_2.12-0.41.0.jar | System Classpath
```

@davidrabinowitz (Member)

It is very strange - usually you can't have two connectors in the same Spark application. Also, version 0.22 is very old. Can you please replace those jars with our latest spark-3.5-bigquery-0.41.0.jar?
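
For reference, a minimal sketch of pinning the released connector through its Maven coordinate when you build the session yourself (on Databricks you would normally attach the jar as a cluster library instead; the coordinate below is based on the published 0.41.0 artifacts):

```python
from pyspark.sql import SparkSession

# Pull the released connector from Maven Central instead of relying on the
# runtime's bundled (and here, conflicting) jars.
spark = (
    SparkSession.builder
    .appName("bq-partition-overwrite")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-3.5-bigquery:0.41.0",
    )
    .getOrCreate()
)
```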

@isha97 (Member) commented Dec 13, 2024

Also, you don't need to use partitionField and partitionType while using dynamic partition overwrite mode.
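
A minimal sketch of the simplified write, reusing the placeholder variables from the snippet above and assuming the target table already exists with its DAY partitioning:

```python
# Enable dynamic partition overwrite at the session level, then write
# without partitionField/partitionType; only the partitions present in df
# should be replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

df.write.format("bigquery") \
    .option("table", f"{bq_table}") \
    .option("dataset", f"{bq_dataset}") \
    .option("temporaryGcsBucket", f"{temp_gcs_bucket}") \
    .option("writeMethod", "indirect") \
    .mode("overwrite") \
    .save()
```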

isha97 added the waiting for information label (Waiting for additional information from the issue opener) on Dec 18, 2024