Skip to content
This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

tcat_captured_phrases table retains tweet identifiers of deleted bins #365

Open
xmacex opened this issue Jun 3, 2019 · 6 comments
Open

Comments

@xmacex
Copy link
Contributor

xmacex commented Jun 3, 2019

Inspired by the GDPR-related issue #362 by @frederickjansen , greetings from IT University of Copenhagen who is also involved in the VirtEU project (however I am not, directly, but I am in the affiliated Technologies in Practice research group, as well as in the ETHOS Lab). Thanks for opening those discussions which I believe many share and have been thinking about, and I want to open a more detailed sub-topic as a separate issue.

I would like to ask what are your opinions or meditations regarding the table tcat_captured_tweets while deleting bins. The schema of this particular table is

mysql> EXPLAIN tcat_captured_phrases;
+------------+------------+------+-----+---------+-------+
| Field      | Type       | Null | Key | Default | Extra |
+------------+------------+------+-----+---------+-------+
| tweet_id   | bigint(20) | NO   | PRI | NULL    |       |
| phrase_id  | bigint(20) | NO   | PRI | NULL    |       |
| created_at | datetime   | NO   | MUL | NULL    |       |
+------------+------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

It maintains a persisting, and accumulating record of captured tweets, and more specifically of unique tweet identifiers. The tcat_query_phrases and tcat_query_bins_phrases tables connects the rows through to query phrases and then to query bins.

inside

function create_admin() {

A comment says

// This is used to estimate in (analysis/mod.ratelimits.php) how many tweets may have been ratelimited for bins associated with the phrase

Data is pushed to the table by

function insert_captured_phrase_ids($captured_phrase_ids) {

which is called at the end of

function processtweets($capturebucket) {

insert_captured_phrase_ids($captured_phrase_ids);

Finally, there was a perhaps relevant comment about the use of this table in #339 (comment).

Now, and relating to the topic discussed above, reading the function remove_bin() in capture/query_manager.php which the UI invokes

function remove_bin($params) {

I am rather convinced deleting a bin does not touch the tcat_captured_tweets. That means that the table serves as a "dehydrated" archive of captures, which can be used to "rehydrate" deleted bins.

Thus my question to you is this: do you think that deleting a query bin should delete also delete entries of this table? Basically, should deleting a bin also delete the archival record of what originally was a captured? I see the following options (disregarding considerations of computational complexity etc), please suggest others

  1. 🗄 Archival fever. Delete the bin, but keep a record of each collection event in tcat_captured_phrases. This is the status quo (however, see the last item in this list).
  2. 🗑 Historical revisionism. Extend the remove_bin() function, or better use foreign key CASCADE constrain to remove each of the tweets mentioned in tcat_captured_phrases from before dropping the tables of the bin.
  3. ✍️ Compromise: feverish revisionism. Extend the remove_bin() function to drop the value of tweet_id column for each of the tweets in the deleted bin, but keep the row to record that something was captured, perhaps by using foreign key SET NULL constrain. Maybe also round the created_at to the closest hour, for extra protection.
  4. 🔒 Privacy by Design, well that's a bit of a misnomer in case of a thing like TCAT. Anyhow, remove the table tcat_captured_phrases altogether from TCAT by stopping recording to it, and dropping it from existing installations via an update.
  5. 🙄 Chill, it is a non-issue and I have misunderstood something.

I have used SQL queries to evaluate, monitor and estimate rates of query bins as well as individual query phrases by writing SQL queries which use the table in question. I believe once a bin is deleted via the user interface, the tcat_captured_bins cannot be cleared post facto.

Personally, assuming 5 🙄 is false, I would maybe vote for 3 ✍️.

@xmacex
Copy link
Contributor Author

xmacex commented Jun 5, 2019

dmi-tcat/analysis/index.php

Lines 774 to 781 in e75ca2b

<!-- Module has been disabled because of https://twittercommunity.com/t/why-are-track-values-in-limit-notices-out-of-order-and-how-to-interpret-them/35729 -->
<!-- <h3>Export an estimation of the number of rate limited tweets in your data</h3> -->
<!-- <div class="txt_desc">Exports a spreadsheet with an estimation of the ammount of non-captured tweets in your query due to ratelimit occurances.</div> -->
<!-- <div class="txt_desc">Use: gain insight in possible missing data due to hitting the Twitter API rate limits.</div> -->
<!-- <div class="txt_link"> &raquo; <a href="" onclick="$('#whattodo').val('ratelimits'+getInterval());sendUrl('mod.ratelimits.php');return false;">launch</a></div> -->
<!-- <hr/> -->

@dentoir
Copy link
Contributor

dentoir commented Jun 6, 2019

It's worth noting the tcat_captured_phrases table is currently not used on the analytical side of things. It was intended to be used to 1) speed-up queries and 2) to produce fine-grained estimates of the nr. of ratelimited tweets per bin. We're working towards the first goal with modernizations such as issue #348 - and the second goal is hampered by Twitter not providing suitable/sufficient data.

I feel it may not make sense to keep that (large) table around unless we can define a clear future purpose for it.

@xmacex
Copy link
Contributor Author

xmacex commented Jun 7, 2019

Thanks. The table is not used and contents not exposed either by rows or in aggragates, but stuff, most notably tweet IDs are still being collected. This, I think, speaks against the table, or at least the current design.

Joined with tcat_query_phrases it however is an interesting historical record of the organization which is running TCAT.

@brendam
Copy link
Contributor

brendam commented Mar 4, 2020

I've just found this issue because found our small collection servers are getting full even though we use the API to move the tweets off them. Looking into it, I've found most of the space is being held by these files:

-rw-r----- 1 mysql mysql 3.0G Mar  4 13:18 tcat_captured_phrases.MYI
-rw-r----- 1 mysql mysql 1.9G Mar  4 13:18 tcat_captured_phrases.MYD

We only have 10Gb of disk space on the server.

I've used TRUNCATE tcat_captured_phrases and the server has lots of space again.

@ErikBorra
Copy link
Member

That's an interesting artifact. Will look into whether we can include an automate truncate or get rid of captured_phrases all together. Thanks for the heads-up!

@xmacex
Copy link
Contributor Author

xmacex commented Sep 21, 2021

Is there reason to not prune this table? Ours has been grown since 2016.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants