Issue with Foreign Languages #496

Closed
ashvinnihalani opened this issue Feb 2, 2022 · 17 comments

Comments

@ashvinnihalani

I don't believe the application handles foreign languages very well.

If you look at the recent Linus Tech Tips video, there are several instances of foreign-language spam comments getting through.

A simple solution would be to use the googletrans Python package to translate the comment text before running the filter.
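A minimal sketch of that idea, assuming the filter is a plain substring check. `SPAM_TERMS`, `is_spam`, `translate_then_filter`, and `fake_translate` are hypothetical names, not part of the project. The translator is passed in as a callable so a googletrans wrapper could be plugged in; the package's exact call signature differs between releases and is deliberately not shown.

```python
from typing import Callable, Iterable

# Hypothetical English spam terms; the project's real filter is not shown in this thread.
SPAM_TERMS = ("free robux", "vbucks giveaway")


def is_spam(text: str, terms: Iterable[str] = SPAM_TERMS) -> bool:
    """Return True if any spam term occurs in the text (case-insensitive)."""
    lowered = text.casefold()
    return any(term in lowered for term in terms)


def translate_then_filter(comment: str, translate: Callable[[str], str]) -> bool:
    """Translate the comment to English first, then run the English filter."""
    return is_spam(translate(comment))


def fake_translate(text: str) -> str:
    """Stand-in translator for illustration; a real one would call a translation service."""
    return {"robux gratis": "free robux"}.get(text, text)
```

With the stand-in translator, `translate_then_filter("robux gratis", fake_translate)` flags the comment, while untranslated text that matches no term passes through.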

@ethnh
Contributor

ethnh commented Feb 2, 2022

I'm not sure support for foreign languages is implemented, though I'd think the foreign words could just be added to the spam detection word list 🤔
+1 on adding support for foreign languages if not yet added

@UnknownCrafts

I feel like making a folder with filter lists and then checking the spam words against those would be way better than translating messages, as Google Translate can sometimes change the original message's meaning.
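A rough sketch of the folder approach, assuming one UTF-8 text file per list with one term per line. The file layout and the names `load_filter_lists` and `matches_any_list` are assumptions for illustration, not the project's actual structure.

```python
from pathlib import Path


def load_filter_lists(folder: Path) -> dict[str, set[str]]:
    """Read every *.txt file in the folder; the file stem names the list."""
    lists: dict[str, set[str]] = {}
    for path in sorted(folder.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Skip blank lines and normalize case so matching is case-insensitive.
        lists[path.stem] = {line.strip().casefold() for line in lines if line.strip()}
    return lists


def matches_any_list(comment: str, lists: dict[str, set[str]]) -> bool:
    """Return True if any term from any list occurs in the comment."""
    lowered = comment.casefold()
    return any(term in lowered for terms in lists.values() for term in terms)
```

Because the comment text is never altered, this avoids the meaning-drift concern, at the cost of maintaining every list by hand.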

@ashvinnihalani
Author

@EthanHindmarsh Created a PR with foreign-language support. Review and testing are appreciated, especially because I will need to spin up a Windows VM to test properly.

@ashvinnihalani
Author

ashvinnihalani commented Feb 2, 2022

@UnknownCrafts How many filter lists can you realistically keep? It is my understanding that bots cycle through thousands of reply combinations, find one that works, and then propagate that one. We can't keep a dictionary of every possible language combo. Are you worried about false positives spiking?

Also, side note: if the bots are truly automated, then they would be using Google Translate to begin with, right? They are trying to drive traffic to their site, and YouTube's translate feature probably uses Google Translate. That way, when English speakers, the majority of YouTube's audience, click the translate button, it gives the best English translation.

@UnknownCrafts

> @UnknownCrafts How many filter lists can you realistically keep? It is my understanding that bots cycle through thousands of reply combinations, find one that works, and then propagate that one. We can't keep a dictionary of every possible language combo. Are you worried about false positives spiking?
>
> Also, side note: if the bots are truly automated, then they would be using Google Translate to begin with, right? They are trying to drive traffic to their site, and YouTube's translate feature probably uses Google Translate. That way, when English speakers, the majority of YouTube's audience, click the translate button, it gives the best English translation.

My only worry was false positives rising, but I understand that we can't just keep adding filter lists. I guess Google Translate is a good option, but again I worry that false positives might rise because of it.

@ethnh
Contributor

ethnh commented Feb 2, 2022

> @UnknownCrafts How many filter lists can you realistically keep? It is my understanding that bots cycle through thousands of reply combinations, find one that works, and then propagate that one. We can't keep a dictionary of every possible language combo. Are you worried about false positives spiking?
>
> Also, side note: if the bots are truly automated, then they would be using Google Translate to begin with, right? They are trying to drive traffic to their site, and YouTube's translate feature probably uses Google Translate. That way, when English speakers, the majority of YouTube's audience, click the translate button, it gives the best English translation.

A decent idea would be to use Google Translate to detect the language and then compare the comment against the spam list for that language.
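One way this could look, as a sketch only. The per-language table `SPAM_BY_LANGUAGE`, the function `is_spam_for_language`, and the stand-in `fake_detect` are hypothetical; a real detector (Google Translate or any other) would be passed in as the `detect_language` callable.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical per-language spam terms, keyed by language code.
SPAM_BY_LANGUAGE: Mapping[str, Sequence[str]] = {
    "en": ("free robux",),
    "es": ("robux gratis",),
}


def is_spam_for_language(
    comment: str,
    detect_language: Callable[[str], str],
    spam_lists: Mapping[str, Sequence[str]] = SPAM_BY_LANGUAGE,
) -> bool:
    """Detect the comment's language, then check only that language's list."""
    language = detect_language(comment)
    terms = spam_lists.get(language, ())  # unknown languages match nothing
    lowered = comment.casefold()
    return any(term in lowered for term in terms)


def fake_detect(text: str) -> str:
    """Stand-in detector for illustration; a real one would call a detection service."""
    return "es" if "gratis" in text.casefold() else "en"
```

Detection still costs one lookup per comment, but the comment text itself is compared unmodified, so translation errors cannot introduce false positives.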

@ethnh
Contributor

ethnh commented Feb 2, 2022

Sending a bunch of requests to the Google Translate API for every message would not be a great solution though 🤔 Would definitely slow down the process greatly

@rcmaehl

rcmaehl commented Feb 2, 2022

> Sending a bunch of requests to the Google Translate API for every message would not be a great solution though 🤔 Would definitely slow down the process greatly

Google Translate has quotas as well. We shouldn't be forcing users to balance so many quotas.

https://cloud.google.com/translate/quotas

@Rairye

Rairye commented Feb 2, 2022

@ashvinnihalani Are there timestamps? Also, are there cases of specific foreign languages getting through or cases of non-ASCII text getting through?

I don't think translation is a good idea because (in addition to it being resource-intensive) the methods that spammers might use to evade spam filters in English are not necessarily the same in other languages.
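To help answer the non-ASCII question when collecting examples, a small check like the following could be run over the comments that got through. `has_non_ascii` and `non_ascii_ratio` are hypothetical helper names; the sketch only reports what a comment contains and makes no claim about how the project's filter treats such text.

```python
def has_non_ascii(text: str) -> bool:
    """Return True if the text contains at least one non-ASCII character."""
    return not text.isascii()


def non_ascii_ratio(text: str) -> float:
    """Return the fraction of characters outside the ASCII range (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)
```

A comment written entirely in another script will have a ratio near 1.0, while English text with a few accented or lookalike characters will have a low but nonzero ratio, which helps tell the two cases apart.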

@KendallDoesCoding
Contributor

As stated by @ThioJoe in #477, this is not possible.

> Yea it's not really feasible for me to create filters for every single language. You'd be better off entering your own search terms using one of the other filtering modes.

@KendallDoesCoding
Contributor

@ashvinnihalani Also, this is more of a discussion than an issue. @ThioJoe Please move this to the discussions page with the ideas tag, thanks.

@KendallDoesCoding
Contributor

KendallDoesCoding commented Feb 3, 2022

> I'm not sure support for foreign languages is implemented, though I'd think the foreign words could just be added to the spam detection word list 🤔
> +1 on adding support for foreign languages if not yet added

Yeah, ThioJoe can add a couple of scam words, either in his spam-lists repo or even directly into the Python script.

Words like:
robux
vbucks

in other languages.

@KendallDoesCoding
Contributor

> > Sending a bunch of requests to the Google Translate API for every message would not be a great solution though 🤔 Would definitely slow down the process greatly
>
> Google Translate has quotas as well. We shouldn't be forcing users to balance so many quotas.
>
> https://cloud.google.com/translate/quotas

Yes, and Google Translate is also not the best at translating certain languages, if you get what I mean.

@KendallDoesCoding
Contributor

> > I'm not sure support for foreign languages is implemented, though I'd think the foreign words could just be added to the spam detection word list 🤔
> > +1 on adding support for foreign languages if not yet added
>
> Yeah, ThioJoe can add a couple of scam words, either in his spam-lists repo or even directly into the Python script.
>
> Words like: robux vbucks
>
> in other languages.

@ThioJoe I can do this in your YT-Spam-Lists repo, if required.

@ashvinnihalani
Author

ashvinnihalani commented Feb 3, 2022

So a couple of follow-up comments:

  1. If you add specific foreign words, what's to stop the spammers from learning to avoid those specifically? YouTube has a built-in feature to block words, and scammers get around it by putting random accents on letters or something similar.
  2. The rationale behind using Google Translate is not to translate foreign comments on non-English channels, but rather to target people using foreign languages to evade spam filters on English channels. These people are probably using Google Translate to translate their spam comments to begin with. As people have mentioned, the spam requirements for people speaking natively in other languages may be different, so while this may be helpful in that scenario, it's not the primary goal.
  3. To mitigate the slowdown while waiting for API requests, we can implement the following:
    • Batch the translation requests to reduce both the number of API requests counted against the limits people mentioned and the slowdown. The library I am using implements this with another method.
    • Add an optional parameter in the initial menu so that people can choose whether to use translation. That way the end user can decide not to use the feature if they are in a low-bandwidth area or something similar.
    • Use async to make the API requests non-blocking.
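The three mitigations above could be combined roughly as follows. This is a sketch under stated assumptions: `BatchTranslator`, `batched`, `translate_all`, the `enabled` flag, and `fake_translate_batch` are hypothetical names, and the real batch call of any particular translation library is not shown.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence

# Hypothetical interface: takes one batch of comments, returns their translations.
BatchTranslator = Callable[[Sequence[str]], Awaitable[List[str]]]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def translate_all(
    comments: Sequence[str],
    translate_batch: BatchTranslator,
    batch_size: int = 50,
    enabled: bool = True,
) -> List[str]:
    """Translate comments in concurrent batches, or pass them through if disabled."""
    if not enabled:
        return list(comments)
    pending = [translate_batch(batch) for batch in batched(comments, batch_size)]
    results = await asyncio.gather(*pending)  # results keep the order of `pending`
    return [text for batch in results for text in batch]


async def fake_translate_batch(batch: Sequence[str]) -> List[str]:
    """Stand-in for a real translation request; it just upper-cases the text."""
    await asyncio.sleep(0)
    return [text.upper() for text in batch]
```

For example, `asyncio.run(translate_all(comments, fake_translate_batch, batch_size=2))` sends one call per pair of comments. Batching cuts the number of requests and concurrency hides latency, but neither reduces the amount of text sent to the service.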

With all of these improvements, I think the slowdown will be negligible. Thoughts, @KendallDoesCoding @ThioJoe @UnknownCrafts?
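On the first point in the list, the accent trick can often be blunted by normalizing text before matching. A minimal sketch using the standard library; `strip_accents` is a hypothetical helper, and nothing in this thread says the project does or does not do this already.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks that carry accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This handles accented Latin letters, but it does not map lookalike characters from other scripts (for example a Cyrillic "а") onto their Latin counterparts, so it covers only part of the evasion described.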

@ThioJoe
Owner

ThioJoe commented Feb 3, 2022

It really wouldn't make a difference because very little of the filter even looks for whole words. I'm just going to close this because tbh I don't really intend to implement any kind of translation functionality.

If there is a pattern of a certain type of spammer in another language I'll take a look and see what I can do. But I'd need actual specific examples

@ThioJoe ThioJoe closed this as completed Feb 3, 2022
@KendallDoesCoding
Contributor

> It really wouldn't make a difference because very little of the filter even looks for whole words. I'm just going to close this because tbh I don't really intend to implement any kind of translation functionality.
>
> If there is a pattern of a certain type of spammer in another language I'll take a look and see what I can do. But I'd need actual specific examples

Fair, but I guess there should be a note somewhere in the README saying this only works for English spam comments.

7 participants