tcat_captured_phrases table retains tweet identifiers of deleted bins #365
Comments
Lines 774 to 781 in e75ca2b
It's worth noting that I feel it may not make sense to keep that (large) table around unless we can define a clear future purpose for it. |
Thanks. The table is not used, and its contents are not exposed either as rows or in aggregates, but data, most notably tweet IDs, is still being collected. This, I think, speaks against the table, or at least against the current design. Joined with |
I've just found this issue because our small collection servers are getting full even though we use the API to move the tweets off them. Looking into it, I've found that most of the space is being held by these files:
We only have 10 GB of disk space on the server. I've used |
That's an interesting artifact. We will look into whether we can include an automated truncate or get rid of captured_phrases altogether. Thanks for the heads-up! |
Is there a reason not to prune this table? Ours has been growing since 2016. |
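For anyone who wants to prune in the meantime, a time-based prune could be sketched roughly as follows. This uses Python with an in-memory SQLite database as a stand-in for MySQL. The column names other than `tweet_id` and `created_at`, the cutoff date, and the helper `prune_captured_phrases` are assumptions for illustration and not part of TCAT.

```python
import sqlite3

def prune_captured_phrases(conn, cutoff):
    """Delete rows captured before `cutoff` ('YYYY-MM-DD HH:MM:SS').

    Hypothetical helper, not a TCAT function. Returns the number of rows removed.
    """
    # ISO-formatted timestamps compare correctly as plain strings.
    cur = conn.execute(
        "DELETE FROM tcat_captured_phrases WHERE created_at < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount

# Stand-in table; the real TCAT schema may differ (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tcat_captured_phrases ("
    "tweet_id INTEGER, phrase_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO tcat_captured_phrases VALUES (?, ?, ?)",
    [
        (1001, 10, "2016-03-01 09:00:00"),
        (1002, 10, "2017-11-15 18:30:00"),
        (1003, 11, "2018-06-01 12:00:00"),
    ],
)

removed = prune_captured_phrases(conn, "2018-01-01 00:00:00")
remaining = conn.execute(
    "SELECT COUNT(*) FROM tcat_captured_phrases"
).fetchone()[0]
print(removed, remaining)
```

Note that on MySQL, deleting rows does not necessarily shrink the files on disk until the table is rebuilt, so a prune alone may not free space right away.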
Inspired by the GDPR-related issue #362 by @frederickjansen, greetings from the IT University of Copenhagen, which is also involved in the VirtEU project (I am not involved directly, but I am in the affiliated Technologies in Practice research group, as well as in the ETHOS Lab). Thanks for opening those discussions, which I believe many share and have been thinking about. I want to open a more detailed sub-topic as a separate issue.
I would like to ask for your opinions or reflections regarding the table `tcat_captured_phrases` when deleting bins. The schema of this particular table is:
mysql> EXPLAIN tcat_captured_phrases;
It maintains a persistent and accumulating record of captured tweets, and more specifically of unique tweet identifiers. The `tcat_query_phrases` and `tcat_query_bins_phrases` tables connect the rows through to query phrases and then to query bins.

The relevant code is inside dmi-tcat/capture/common/functions.php (Line 302 in b75131c).

A comment says: dmi-tcat/capture/common/functions.php (Line 438 in b75131c).

Data is pushed to the table by dmi-tcat/capture/common/functions.php (Line 2374 in b75131c), which is called at the end of dmi-tcat/capture/common/functions.php (Line 2667 in b75131c) and dmi-tcat/capture/common/functions.php (Line 2892 in b75131c).
Finally, there was a perhaps relevant comment about the use of this table in #339 (comment).
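To make the linkage between the tables concrete, here is a minimal sketch of how captured tweet IDs could be traced back to a bin through the phrase tables. It uses Python with an in-memory SQLite database rather than MySQL. Every column name except `tweet_id` and `created_at` is an assumption, the bin and phrase values are made up, and `tweet_ids_for_bin` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified columns; the real TCAT schema may differ.
    CREATE TABLE tcat_query_bins (id INTEGER PRIMARY KEY, querybin TEXT);
    CREATE TABLE tcat_query_phrases (id INTEGER PRIMARY KEY, phrase TEXT);
    CREATE TABLE tcat_query_bins_phrases (querybin_id INTEGER, phrase_id INTEGER);
    CREATE TABLE tcat_captured_phrases (
        tweet_id INTEGER, phrase_id INTEGER, created_at TEXT);

    -- Made-up example data.
    INSERT INTO tcat_query_bins VALUES (1, 'elections');
    INSERT INTO tcat_query_phrases VALUES (10, 'vote'), (11, 'ballot');
    INSERT INTO tcat_query_bins_phrases VALUES (1, 10), (1, 11);
    INSERT INTO tcat_captured_phrases VALUES
        (1001, 10, '2018-05-01 10:15:00'),
        (1002, 11, '2018-05-01 10:20:00'),
        (1003, 10, '2018-05-01 11:05:00');
""")

def tweet_ids_for_bin(conn, bin_name):
    """Return the distinct tweet IDs recorded for any phrase of the given bin."""
    rows = conn.execute(
        """
        SELECT DISTINCT cp.tweet_id
        FROM tcat_captured_phrases AS cp
        JOIN tcat_query_bins_phrases AS bp ON bp.phrase_id = cp.phrase_id
        JOIN tcat_query_bins AS b ON b.id = bp.querybin_id
        WHERE b.querybin = ?
        ORDER BY cp.tweet_id
        """,
        (bin_name,),
    ).fetchall()
    return [row[0] for row in rows]

print(tweet_ids_for_bin(conn, "elections"))
```

The point of the sketch is only that the tweet identifiers remain reachable as long as the phrase and bin rows can still be joined.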
Now, relating to the topic discussed above: having read the function `remove_bin()` in capture/query_manager.php, which the UI invokes (dmi-tcat/capture/query_manager.php, Line 158 in b75131c), I am rather convinced that deleting a bin does not touch `tcat_captured_phrases`. That means that the table serves as a "dehydrated" archive of captures, which can be used to "rehydrate" deleted bins.

Thus my question to you is this: do you think that deleting a query bin should also delete entries of this table? Basically, should deleting a bin also delete the archival record of what was originally captured? I see the following options (disregarding considerations of computational complexity etc.); please suggest others:

1. Keep `tcat_captured_phrases`. This is the status quo (however, see the last item in this list).
2. Modify the `remove_bin()` function, or better, use a foreign key `CASCADE` constraint, to remove each of the bin's tweets from `tcat_captured_phrases` before dropping the tables of the bin.
3. Modify the `remove_bin()` function to drop the value of the `tweet_id` column for each of the tweets in the deleted bin, but keep the row to record that something was captured, perhaps by using a foreign key `SET NULL` constraint. Maybe also round `created_at` to the closest hour, for extra protection.
4. Remove `tcat_captured_phrases` altogether from TCAT by stopping recording to it, and dropping it from existing installations via an update.
5. I have written SQL queries that use the table in question to evaluate, monitor and estimate rates of query bins as well as of individual query phrases. I believe that once a bin is deleted via the user interface, `tcat_captured_phrases` cannot be cleared post facto.

Personally, assuming 5 🙄 is false, I would maybe vote for 3 ✍️.
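To illustrate what option 3 could mean in practice, here is a rough sketch that nulls `tweet_id` and rounds `created_at` to the closest hour for all tweets of one bin. It is written in Python against an in-memory SQLite database, not MySQL, and is not how `remove_bin()` works today. The per-bin table `demo_tweets` with an `id` column, the `phrase_id` column, and the helper `anonymize_bin_captures` are assumptions for illustration.

```python
import sqlite3

def anonymize_bin_captures(conn, bin_tweets_table):
    """Null tweet_id and round created_at for tweets found in the bin's table.

    Hypothetical helper. Returns the number of rows changed.
    """
    # Table names cannot be bound as parameters, so validate before formatting.
    if not bin_tweets_table.replace("_", "").isalnum():
        raise ValueError("unexpected table name")
    cur = conn.execute(
        f"""
        UPDATE tcat_captured_phrases
        SET tweet_id = NULL,
            -- Adding 30 minutes and truncating gives the closest hour.
            created_at = strftime('%Y-%m-%d %H:00:00', created_at, '+30 minutes')
        WHERE tweet_id IN (SELECT id FROM {bin_tweets_table})
        """
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified columns; tweet_id must be nullable for this to work.
    CREATE TABLE tcat_captured_phrases (
        tweet_id INTEGER, phrase_id INTEGER, created_at TEXT);
    CREATE TABLE demo_tweets (id INTEGER PRIMARY KEY);

    INSERT INTO tcat_captured_phrases VALUES
        (1001, 10, '2018-05-01 10:15:00'),
        (1002, 10, '2018-05-01 10:45:00'),
        (2001, 11, '2018-05-01 12:10:00');
    INSERT INTO demo_tweets VALUES (1001), (1002);
""")

changed = anonymize_bin_captures(conn, "demo_tweets")
rows = conn.execute(
    "SELECT tweet_id, phrase_id, created_at "
    "FROM tcat_captured_phrases ORDER BY created_at"
).fetchall()
print(changed, rows)
```

This would have to run before the bin's own tables are dropped, since it relies on them to find the tweet IDs. If the real table has a key that covers `tweet_id`, that key would presumably need to change before NULL values could be stored.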