You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jan 9, 2025. It is now read-only.
At least one of the visualization functionalities, Bipartite hashtag-mention graph implemented by mod.mention_graph.php fails and produces no output for the user.
To Reproduce
Steps to reproduce the behavior:
Go to the analysis interface and select a bin
Scroll down to Networks section
Locate the Bipartite hashtag-mention graph functionality
Click » launch
Observe that the results page doesn't offer a network to download
Expected behavior
A .gexf network file to download should be offered.
Screenshots
Environment:
OS Ubuntu 20.04.2 LTS.
MySQL Ver 8.0.25-0ubuntu0.20.04.1.
TCAT isn't versioned but we are on commit 9654fe3.
Additional context
Seems to be collation issue. From Apache log:
[Thu Sep 02 12:25:19.771019 2021] [php7:error] [pid 3633248] [client 10.28.48.169:61208] PHP Fatal error: Uncaught PDOException: SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4' in /var/www/tcat-tip/analysis/mod.mention_graph.php:56\nStack trace:\n#0 /var/www/tcat-tip/analysis/mod.mention_graph.php(56): PDOStatement->execute()\n#1 {main}\n thrown in /var/www/tcat-tip/analysis/mod.mention_graph.php on line 56
Forensic analysis via the code to the database schema
So, the collation for the visualization module defaults to utf8_bin if the *_hashtags table does not have collation of utf8mb4_unicode_ci or utf8mb4_general_ci for any of the rows.
Now looking back at the Apache error message 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4', it seems that that utf8_bin default is used, but it's not appropriate for the table and therefore the hickup.
Let's investigate the
SHOW full columns FROM very_interesting_research_topic_hashtags;
describes the table as
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| id | int | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| tweet_id | bigint | NULL | NO | MUL | NULL | | select,insert,update,references | |
| created_at | datetime | NULL | YES | MUL | NULL | | select,insert,update,references | |
| from_user_name | varchar(255) | utf8mb4_0900_ai_ci | YES | MUL | NULL | | select,insert,update,references | |
| from_user_id | bigint | NULL | YES | | NULL | | select,insert,update,references | |
| text | varchar(255) | utf8mb4_0900_ai_ci | YES | MUL | NULL | | select,insert,update,references | |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
6 rows in set (0.00 sec)
Right, either null of utf8mb4_0900_ai_ci collation. TCAT collation defaults to utf8_bin and isn't valid for utf8mb4* as Apache says.
A local fix kludge
If I add utf8mb4_0900_ai_ci to the line 869 of /analysis/common/functions.php by making it
the visualization module works as expected, yielding a .gexf file and no errors in Apache logs. This kludge fixes it for that bin and for that function. I haven't had the change to investigate what are all the possible collations our database has for those %_hashtag columns, and where do they come from originally and if the migration scripts do something to them. There might be various collations, we've been running TCAT for some years.
Suggestions for fixes
The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.
Does someone else have this? Please don't hesitate to ask for clarifications or otherwise collaborate for sorting¹ this out.
Have a good day ✨
¹ pun intended
The text was updated successfully, but these errors were encountered:
The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.
On the current MySQL 8 release utf8 is still an alias for utf8mb3, so utf8_bin doesn't actually cover any of the utf8mb4* collations and the code should default to utf8mb4_bin instead. The problem I see here is that if there's any utf8mb3 collations hanging around, this change will instead break the code for those instances, in which case the best approach would be to modify those columns to utf8mb4.
Describe the bug
At least one of the visualization functionalities, Bipartite hashtag-mention graph implemented by
mod.mention_graph.php
fails and produces no output for the user.To Reproduce
Steps to reproduce the behavior:
Expected behavior
A
.gexf
network file to download should be offered.Screenshots
Environment:
Additional context
Seems to be collation issue. From Apache log:
Forensic analysis via the code to the database schema
Looking at where the error is coming from
dmi-tcat/analysis/mod.mention_graph.php
Line 56 in 9654fe3
we see that the SQL query is
dmi-tcat/analysis/mod.mention_graph.php
Line 46 in 9654fe3
The variable
collation
is retrieved a little bit beforedmi-tcat/analysis/mod.mention_graph.php
Line 33 in 9654fe3
The function
current_collation()
is defined atdmi-tcat/analysis/common/functions.php
Line 853 in 9654fe3
In there we find first that
dmi-tcat/analysis/common/functions.php
Line 863 in 9654fe3
and then on lines 868-873
So, the collation for the visualization module defaults to
utf8_bin
if the*_hashtags
table does not have collation ofutf8mb4_unicode_ci
orutf8mb4_general_ci
for any of the rows.Now looking back at the Apache error message
1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
, it seems that thatutf8_bin
default is used, but it's not appropriate for the table and therefore the hickup.Let's investigate the
SHOW full columns FROM very_interesting_research_topic_hashtags;
describes the table as
Right, either null of utf8mb4_0900_ai_ci collation. TCAT collation defaults to
utf8_bin
and isn't valid forutf8mb4*
as Apache says.A local fix kludge
If I add
utf8mb4_0900_ai_ci
to the line 869 of/analysis/common/functions.php
by making it(reflowed for legibility)
the visualization module works as expected, yielding a .gexf file and no errors in Apache logs. This kludge fixes it for that bin and for that function. I haven't had the change to investigate what are all the possible collations our database has for those
%_hashtag
columns, and where do they come from originally and if the migration scripts do something to them. There might be various collations, we've been running TCAT for some years.Suggestions for fixes
The open question is what are possible, expected and preferred values of that
Collation
column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that$is_utf8mb4 = true
, or should the default on line L863 be something that covers a set of differentutf8mb4*
collations rather thenutf8_bin
? My expertise on MySQL collation is weak.Does someone else have this? Please don't hesitate to ask for clarifications or otherwise collaborate for sorting¹ this out.
Have a good day ✨
¹ pun intended
The text was updated successfully, but these errors were encountered: