SNOW-1641644: Drop temp table directly at garbage collection instead of using multi-threading #2214
Conversation
@@ -643,11 +639,6 @@ def auto_clean_up_temp_table_enabled(self) -> bool:
        When setting this parameter to ``True``, Snowpark will automatically clean up temporary tables created by
        :meth:`DataFrame.cache_result` in the current session when the DataFrame is no longer referenced (i.e., gets garbage collected).
        The default value is ``False``.

        Note:
We can still provide this guarantee by checking all entries of the count map to see whether any count reaches 0 during garbage collection. But since we're no longer using a watch thread, I don't think this guarantee makes much sense; we only need to make a best effort to clean up temp tables when this parameter is enabled. Alternatively, we can easily add this guarantee later if a customer requests it.
This behavior is unrelated to the threading work; it is a behavior of the cleaner. If we don't have this, what we're saying is that we only clean up temp tables whose reference count reaches 0 after the cleaner is started. Let's make sure this is documented clearly somewhere.
We can say that temporary tables will only be dropped when this parameter is turned on during garbage collection, and that garbage collection in Python is triggered opportunistically with no guaranteed timing. I think that's also clear. Let me know what you think.
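The point about opportunistic garbage collection can be illustrated with a minimal, Snowpark-free sketch (the table name and finalizer class are hypothetical stand-ins, not the actual cleaner code):

```python
import gc

dropped = []

class TempTableRef:
    """Hypothetical stand-in for a cached DataFrame holding a temp table."""
    def __del__(self):
        # The DROP would be issued only when Python actually finalizes the object.
        dropped.append("SNOWPARK_TEMP_TABLE")

ref = TempTableRef()
del ref       # last reference gone; CPython finalizes immediately here, but the
              # language only guarantees finalization happens "eventually"
gc.collect()  # an explicit collection makes the timing deterministic
```

The drop is tied to finalization, not to the moment the last reference disappears, which is exactly why the timing cannot be guaranteed in general.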
@@ -54,53 +46,26 @@ def _delete_ref_count(self, name: str) -> None:
        self.ref_count_map[name] -= 1
        if self.ref_count_map[name] == 0:
            self.ref_count_map.pop(name)
Let's not pop here. If we keep the stop method, we can send telemetry on close for the temp tables we didn't clean up (both the ones whose reference count is 0 and the ones whose count is not 0).
I think we can send telemetry here with 1) the number of temp tables cleaned up (entries whose value is 0) and 2) the total number of temp tables created by cache_result (the length of the dict), when closing the session. Then we can understand what percentage of temp tables created by cache_result get cleaned up.
That's actually a good point; I added such telemetry and corresponding tests.
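A sketch of what such close-time telemetry could compute, assuming entries stay in the map with count 0 instead of being popped (the counter names mirror attributes used in the tests later in this thread, but the map contents here are made up):

```python
# Hypothetical ref_count_map kept by the cleaner: temp table name -> live refs.
# Entries with count 0 were cleaned up; nonzero entries are still referenced.
ref_count_map = {"TMP_A": 0, "TMP_B": 0, "TMP_C": 2}

num_temp_tables_created = len(ref_count_map)  # every cache_result table stays in the map
num_temp_tables_cleaned = sum(1 for count in ref_count_map.values() if count == 0)

telemetry_payload = {
    "num_temp_tables_created": num_temp_tables_created,
    "num_temp_tables_cleaned": num_temp_tables_cleaned,
}
```

The ratio of the two counters gives the cleanup percentage the comment above asks for.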
        except Exception as ex:
            logging.warning(
-                f"Cleanup Thread: Failed to drop {common_log_text}, exception: {ex}"
+                f"Failed to drop {common_log_text}, exception: {ex}"
We can probably record telemetry later for failed table drops.
Actually, we might not need telemetry for this case: we can group by temp table name from the statement params and then find whether the corresponding query succeeded or not.
You can only do this for queries that reached the server, not for ones that don't even have a query id.
I think what we can do here is check whether the async job is not None and whether it has a query_id; if not, we can send a separate telemetry event.
The async job will never be None; only the query id might be None. I added such telemetry.
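A minimal sketch of the query_id check being discussed (the job class and helper function are hypothetical; only the `query_id is None` condition comes from the thread):

```python
class DropJob:
    """Hypothetical stand-in for the async job returned by the DROP query."""
    def __init__(self, query_id=None):
        self.query_id = query_id

def should_send_exception_telemetry(job):
    # The job object itself is always present; a missing query_id means the
    # query never reached the server, so it cannot be found in query history
    # and must be reported through a separate telemetry event.
    return job.query_id is None
```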
src/snowflake/snowpark/session.py (outdated)
-        Even if this parameter is ``False``, Snowpark still records temporary tables when
-        their corresponding DataFrame are garbage collected. Therefore, if you turn it on in the middle of your session or after turning it off,
-        the target temporary tables will still be cleaned up accordingly.
+        The temporary tables will only be dropped when this parameter is turned on during garbage collection. That is, if the
The wording of this note is very hard to read. Can we take another pass at it?
Yes, fixed. To clarify: a table is dropped only if garbage collection runs while this parameter is on. But garbage collection might not happen as soon as the table object is no longer referenced; that timing is controlled by Python. So if a user turns the parameter on, a table object becomes unreferenced but garbage collection doesn't occur, and the user then turns the parameter off, the table will never be dropped.
The most recent write-up is much clearer. Thanks for this update! I would also add a simple example script showing where a temp table gets created, where it becomes unreferenced, and the scope within which it may get dropped, assuming (1) Python garbage collection is triggered and (2) the parameter is on.
If this scenario can play out both because of explicitly calling cache_result and because of other (implicit) reasons for temp table creation, then we can have an example for each case.
yes, added!
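The turn-off scenario described above can be reproduced with a small, Snowpark-free simulation (the class and parameter dict are hypothetical stand-ins; a reference cycle is used so finalization waits for the cyclic collector rather than happening at `del`):

```python
import gc

gc.disable()  # make collection timing explicit for the demonstration
dropped = []

class CachedTable:
    """Hypothetical stand-in for a DataFrame returned by cache_result()."""
    def __init__(self, name, params):
        self.name = name
        self.params = params
        self._cycle = self  # reference cycle delays finalization until gc runs

    def __del__(self):
        # The drop happens only if the parameter is on when GC finally runs.
        if self.params["auto_clean_up_temp_table_enabled"]:
            dropped.append(self.name)

params = {"auto_clean_up_temp_table_enabled": True}
t = CachedTable("TMP_1", params)
del t                                               # unreferenced, but not yet collected
params["auto_clean_up_temp_table_enabled"] = False  # user turns the parameter off
gc.collect()                                        # GC runs with the parameter off
gc.enable()
```

After the collection, `dropped` is still empty: the table became unreferenced while the parameter was on, but because GC only ran after it was turned off, the table is never dropped.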
class TempTableCleanupTelemetryField(Enum):
    TYPE_TEMP_TABLE_CLEANUP = "snowpark_temp_table_cleanup"
That probably doesn't belong in the new compilation stage; it's more suitable for the current telemetry constants file.
Sorry, I didn't realize it was in the new compilation stage; moved.
del df2
gc.collect()
time.sleep(WAIT_TIME)
assert session._table_exists(table_ids)
assert session._temp_table_auto_cleaner.ref_count_map[table_name] == 1
assert session._temp_table_auto_cleaner.num_temp_tables_created == 1
assert session._temp_table_auto_cleaner.num_temp_tables_cleaned == 0

del df3
gc.collect()
time.sleep(WAIT_TIME)
What is the wait time here? A fixed sleep sounds like it will be flaky by default; I think we can just do a while-loop check here with a timeout of 5 minutes.
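A polling helper along those lines (a generic sketch, not the project's actual test utility; the cleaner reference in the usage comment is assumed):

```python
import time

def wait_until(predicate, timeout=300.0, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline

# e.g. wait_until(lambda: cleaner.num_temp_tables_cleaned == 1, timeout=300)
```

Replacing `time.sleep(WAIT_TIME)` with such a loop keeps the test fast in the common case while still tolerating slow garbage collection.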
src/snowflake/snowpark/session.py (outdated)
@@ -609,8 +606,12 @@ def close(self) -> None:
            raise SnowparkClientExceptionMessages.SERVER_FAILED_CLOSE_SESSION(str(ex))
        finally:
            try:
                self._conn._telemetry_client.send_temp_table_cleanup_telemetry(
I would suggest adding back the stop() method, moving the telemetry sending into it, and sending telemetry on every stop.
good idea, added back
self.stop_event.set()
if self.is_alive():
    self.cleanup_thread.join()

def reset_reference_count_map(self) -> None:
Is that only for testing purposes? Can you just access the attribute directly to clear the map in tests?
removed
warning_message = f"Failed to drop {common_log_text}, exception: {ex}"
logging.warning(warning_message)
if query_id is None:
    self.session._conn._telemetry_client.send_temp_table_cleanup_exception_telemetry(
Add a comment here:
"""
If no query_id is available, the query hasn't been accepted by GS, and it won't appear in our job_etl_view; send a separate telemetry event for recording.
"""
done
if self.is_alive():
    self.cleanup_thread.join()
self.session._conn._telemetry_client.send_temp_table_cleanup_telemetry(
    self.session.session_id,
Also record the parameter value, so we know whether this telemetry was sent due to session close or due to the parameter being turned off.
done
}
self.send(message)

def send_temp_table_cleanup_exception_telemetry(
We can call this "abnormal_exception", since we are not sending this for all exceptions.
done
if query_id is None:
    self.session._conn._telemetry_client.send_temp_table_cleanup_exception_telemetry(
        self.session.session_id,
        warning_message,
Can we separate the table name and the exception message, so we don't need to parse the fields for information?
done
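A sketch of the separated payload (the function and field names are hypothetical; the point from the comment above is keeping the table name and the exception text as distinct fields):

```python
def build_cleanup_exception_payload(session_id, table_name, exc):
    # Separate fields mean downstream analysis needs no string parsing.
    return {
        "session_id": session_id,
        "temp_table_name": table_name,
        "exception_message": str(exc),
    }
```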
LGTM.
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1641644
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
see details in discussion