Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect community from confusable homoglyphs #1470

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

MartinDelille
Copy link
Contributor

This PR aims to fix #1469 by protecting the creation of confusable name for community.

I learned how to use the library during the implementation but there is something I don't know very well how to do.

The documentation says that we have to update the data file by executing the two following commands:

pip install confusable_homoglyphs[cli]
confusable_homoglyphs update

I suppose the call the confusable_homoglyphs update has to be added to the Makefile during the make env step, am I right?

@MartinDelille MartinDelille requested a review from Changaco March 16, 2019 20:18
@mattbk
Copy link
Contributor

mattbk commented Mar 18, 2019

How often do they need to be updated? Could this be done in cron, or in the payday script?

I'm not certain make env gets run very often in production unless needed by code changes.

@MartinDelille
Copy link
Contributor Author

I think updating it every time the site is updated would be fair enough. The script update if from this page: http://ftp.unicode.org/Public/security/latest/confusables.txt

Last update was november 2018.

liberapay/models/community.py Outdated Show resolved Hide resolved
for existing_name in all_names:
if cls._unconfusable(name) == cls._unconfusable(existing_name):
raise CommunityAlreadyExists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very inefficient implementation, the more communities there are the slower it becomes. Also you're repeating an identical function call (cls._unconfusable(name)) in every loop iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: 9b496bc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only fixed the repeated call, not the wider issue: this implementation is still inefficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part is the most inefficient:

  • the database query?
  • the unconfusable lookup?
  • the unconfusable_string function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetching of all the existing community names from the database and the subsequent loop. That's linear complexity, O(n), with n being the number of existing communities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably do something like this instead:

                possible_collisions = get_confusables(name)
                possible_collisions.append(name)
                collides_with = cursor.one("""
                    SELECT name
                      FROM communities
                     WHERE name IN %s
                     LIMIT 1
                """, (tuple(possible_collisions),))
                if collides_with:
                    raise CommunityAlreadyExists(collides_with)

liberapay/models/community.py Outdated Show resolved Hide resolved
requirements_base.txt Show resolved Hide resolved
tests/py/test_communities.py Outdated Show resolved Hide resolved
@Changaco
Copy link
Member

I'm not certain make env gets run very often in production unless needed by code changes.

make is not used at all in production.

@Changaco
Copy link
Member

Changaco commented Mar 19, 2019

We should probably have a copy of the confusables.txt file in the repo so that we can fall back to it if fetching the current file from the ftp.unicode.org server fails.

@Changaco
Copy link
Member

Strike that, the confusable_homoglyphs package already ships with a copy of the data transformed into JSON, so we don't need to deal with fetching the confusables.txt file at all.

@MartinDelille
Copy link
Contributor Author

Ok I removed the update in the make env step here: 64033bd

@MartinDelille MartinDelille requested a review from Changaco March 21, 2019 14:41
for existing_name in all_names:
if cls._unconfusable(name) == cls._unconfusable(existing_name):
raise CommunityAlreadyExists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only fixed the repeated call, not the wider issue: this implementation is still inefficient.

@@ -0,0 +1,11 @@
from confusable_homoglyphs import confusables

def unconfusable_string(name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a docstring explaining what this function does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here: a73af92

Tell me if there is a specific format for that.

for existing_name in all_names:
if cls._unconfusable(name) == cls._unconfusable(existing_name):
raise CommunityAlreadyExists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetching of all the existing community names from the database and the subsequent loop. That's linear complexity, O(n), with n being the number of existing communities.

from confusable_homoglyphs import confusables

# Convert an Unicode string to its equivalent replacing all confusable homoglyphs
# to its common/latin equivalent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a comment, not a docstring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Confusable homoglyphs
3 participants