"spark.sql.sources.partitionOverwriteMode": "DYNAMIC" - creates additional tables #1314

MichalBogoryja · 2024-11-15T14:44:25Z

When writing a Spark DataFrame to an existing partitioned BigQuery table, the target table is modified as expected (partitions are added or overwritten). However, an additional table is also created, and it contains exactly the data of the DataFrame that was being written to the target table.
To reproduce:
database state: empty

from pyspark.sql import SparkSession

# sdf, sdf_2, gcp_project_id, db and table_name are defined elsewhere
spark = (
    SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
    .config("enableReadSessionCaching", "false")
    .getOrCreate()
)

# First write: creates the day-partitioned table
(
    sdf.write.format("bigquery")
    .option("partitionField", "curdate")
    .option("partitionType", "DAY")
    .mode("overwrite")
    .save(f"{gcp_project_id}.{db}.{table_name}")
)

database state:
one table named {table_name} - data as in sdf

sdf_2.write.format("bigquery").mode('overwrite').save(f"{gcp_project_id}.{db}.{table_name}")

database state:
one table named {table_name} - data as in sdf plus the new data from sdf_2 (if sdf_2 contains partitions that were already present in sdf, those partitions are overwritten)
an ADDITIONAL table named {table_name} followed by random digits (e.g. table_name4467706876500)

Can you modify the save function so that it does not leave this additional table behind (or so that it drops the table once the save has finished)?
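
Until that is addressed, a possible stopgap is to clean up the leftover tables after each write. The following is only a minimal sketch, not part of the connector: it assumes the leftover tables are always named as the target table followed only by digits (as in the example above), and that the `google-cloud-bigquery` client library is installed. `find_leftover_tables` and `drop_leftover_tables` are hypothetical helper names introduced here for illustration.

```python
import re
from typing import Iterable, List


def find_leftover_tables(table_ids: Iterable[str], table_name: str) -> List[str]:
    """Return the table IDs that look like `<table_name><digits>`.

    Assumption: leftover tables are named exactly as the target table
    followed by one or more digits, e.g. table_name4467706876500.
    """
    pattern = re.compile(re.escape(table_name) + r"\d+")
    return [t for t in table_ids if pattern.fullmatch(t)]


def drop_leftover_tables(
    project: str, dataset: str, table_name: str, dry_run: bool = True
) -> List[str]:
    """List (and, if dry_run is False, delete) the matching leftover tables."""
    # Imported lazily so the name-matching helper works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    table_ids = [t.table_id for t in client.list_tables(f"{project}.{dataset}")]
    leftovers = find_leftover_tables(table_ids, table_name)
    if not dry_run:
        for table_id in leftovers:
            client.delete_table(
                f"{project}.{dataset}.{table_id}", not_found_ok=True
            )
    return leftovers
```

Calling `drop_leftover_tables(gcp_project_id, db, table_name)` with the default `dry_run=True` only returns the candidate names, so they can be reviewed first. The name pattern would also match a legitimate table such as `events2024` when the target is `events`, so this should not be run with `dry_run=False` in datasets where that can happen.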
