Skip to content
This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Visualization module(s?) fail because of unexpected collation #440

Open
xmacex opened this issue Sep 4, 2021 · 3 comments
Open

Visualization module(s?) fail because of unexpected collation #440

xmacex opened this issue Sep 4, 2021 · 3 comments

Comments

@xmacex
Copy link
Contributor

xmacex commented Sep 4, 2021

Describe the bug

At least one of the visualization functionalities, Bipartite hashtag-mention graph implemented by mod.mention_graph.php fails and produces no output for the user.

To Reproduce

Steps to reproduce the behavior:

  1. Go to the analysis interface and select a bin
  2. Scroll down to Networks section
  3. Locate the Bipartite hashtag-mention graph functionality
  4. Click » launch
  5. Observe that the results page doesn't offer a network to download

Expected behavior

A .gexf network file to download should be offered.

Screenshots

Screenshot 2021-09-04 at 17 32 36

Environment:

  • OS Ubuntu 20.04.2 LTS.
  • MySQL Ver 8.0.25-0ubuntu0.20.04.1.
  • TCAT isn't versioned but we are on commit 9654fe3.

Additional context

Seems to be collation issue. From Apache log:

[Thu Sep 02 12:25:19.771019 2021] [php7:error] [pid 3633248] [client 10.28.48.169:61208] PHP Fatal error:  Uncaught PDOException: SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4' in /var/www/tcat-tip/analysis/mod.mention_graph.php:56\nStack trace:\n#0 /var/www/tcat-tip/analysis/mod.mention_graph.php(56): PDOStatement->execute()\n#1 {main}\n  thrown in /var/www/tcat-tip/analysis/mod.mention_graph.php on line 56

Forensic analysis via the code to the database schema

Looking at where the error is coming from

$rec->execute();

we see that the SQL query is

$sql = "SELECT m.from_user_name COLLATE $collation as from_user_name, m.to_user COLLATE $collation as to_user FROM " . $esc['mysql']['dataset'] . "_mentions m, " . $esc['mysql']['dataset'] . "_tweets t ";

The variable collation is retrieved a little bit before

$collation = current_collation();

The function current_collation() is defined at

function current_collation() {

In there we find first that

$collation = 'utf8_bin';

and then on lines 868-873

    while ($res = $rec->fetch(PDO::FETCH_ASSOC)) {
        if (array_key_exists('Collation', $res) && ($res['Collation'] == 'utf8mb4_unicode_ci' || $res['Collation'] == 'utf8mb4_general_ci')) {
            $is_utf8mb4 = true;
            break;
        }
    }

So, the collation for the visualization module defaults to utf8_bin if the *_hashtags table does not have collation of utf8mb4_unicode_ci or utf8mb4_general_ci for any of the rows.

Now looking back at the Apache error message 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4', it seems that that utf8_bin default is used, but it's not appropriate for the table and therefore the hickup.

Let's investigate the

SHOW full columns FROM very_interesting_research_topic_hashtags;

describes the table as

+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field          | Type         | Collation          | Null | Key | Default | Extra          | Privileges                      | Comment |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| id             | int          | NULL               | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| tweet_id       | bigint       | NULL               | NO   | MUL | NULL    |                | select,insert,update,references |         |
| created_at     | datetime     | NULL               | YES  | MUL | NULL    |                | select,insert,update,references |         |
| from_user_name | varchar(255) | utf8mb4_0900_ai_ci | YES  | MUL | NULL    |                | select,insert,update,references |         |
| from_user_id   | bigint       | NULL               | YES  |     | NULL    |                | select,insert,update,references |         |
| text           | varchar(255) | utf8mb4_0900_ai_ci | YES  | MUL | NULL    |                | select,insert,update,references |         |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
6 rows in set (0.00 sec)

Right, either null of utf8mb4_0900_ai_ci collation. TCAT collation defaults to utf8_bin and isn't valid for utf8mb4* as Apache says.

A local fix kludge

If I add utf8mb4_0900_ai_ci to the line 869 of /analysis/common/functions.php by making it

if (array_key_exists('Collation', $res)
    && ($res['Collation']    == 'utf8mb4_unicode_ci'
        || $res['Collation'] == 'utf8mb4_general_ci'
        || $res['Collation'] == 'utf8mb4_0900_ai_ci')) {

(reflowed for legibility)

the visualization module works as expected, yielding a .gexf file and no errors in Apache logs. This kludge fixes it for that bin and for that function. I haven't had the change to investigate what are all the possible collations our database has for those %_hashtag columns, and where do they come from originally and if the migration scripts do something to them. There might be various collations, we've been running TCAT for some years.

Screenshot 2021-09-04 at 18 00 33

Suggestions for fixes

The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.

Does someone else have this? Please don't hesitate to ask for clarifications or otherwise collaborate for sorting¹ this out.

Have a good day ✨

¹ pun intended

@xmacex
Copy link
Contributor Author

xmacex commented Sep 4, 2021

A cool iridescent bug landed on my keyboard just as I finished the debugging session for today

IMG_20210904_183201__01.jpg

@alexandreteles
Copy link

alexandreteles commented Sep 19, 2021

The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.

On the current MySQL 8 release utf8 is still an alias for utf8mb3, so utf8_bin doesn't actually cover any of the utf8mb4* collations and the code should default to utf8mb4_bin instead. The problem I see here is that if there's any utf8mb3 collations hanging around, this change will instead break the code for those instances, in which case the best approach would be to modify those columns to utf8mb4.

@xmacex
Copy link
Contributor Author

xmacex commented Nov 2, 2021

Hi. Should I make a PR out of the A local kludge fix above as a starting point, or something else?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants