[DuplicateFinder] Consider project-specific similarity adjustments #3475

Open
imnasnainaec opened this issue Dec 9, 2024 · 1 comment

imnasnainaec commented Dec 9, 2024

In some languages, there is a pair (or set) of letters such that substituting one for another commonly yields a different valid word. If the letters in such a pair are also unlikely to be swapped by a typo (e.g., they are not near each other on the language's keyboard), then entries whose vernacular forms differ only by that substitution are unlikely to be duplicates. Perhaps we could add a project setting for specifying letter pairs that should carry a different weight in the duplicate finder algorithm, so that words differing only by those pairs show up later among the sets of potential duplicates.
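To make the idea concrete, here is a minimal sketch in Python (not the project's actual code) of a Levenshtein distance that looks up the substitution cost per letter pair. The function name `weighted_levenshtein`, the `pair_costs` mapping, and the example pair p/b are all hypothetical; the real setting's shape and the language it would be written in are open questions.

```python
from typing import Dict, FrozenSet, Optional


def weighted_levenshtein(
    a: str,
    b: str,
    pair_costs: Optional[Dict[FrozenSet[str], float]] = None,
    default_sub: float = 1.0,
    indel: float = 1.0,
) -> float:
    """Levenshtein distance with per-pair substitution costs.

    pair_costs maps an unordered letter pair, e.g. frozenset("pb"), to the
    cost of substituting one letter for the other (hypothetical setting).
    """
    pair_costs = pair_costs or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                sub = pair_costs.get(frozenset((ca, cb)), default_sub)
            curr.append(
                min(
                    prev[j] + indel,      # delete ca
                    curr[j - 1] + indel,  # insert cb
                    prev[j - 1] + sub,    # substitute ca -> cb (or match)
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical project setting: p/b substitutions usually give a different word.
costs = {frozenset("pb"): 2.0}
print(weighted_levenshtein("pat", "bat", costs))  # penalized pair
print(weighted_levenshtein("pat", "cat", costs))  # ordinary substitution
```

One thing this sketch makes visible: a substitution cost above the cost of one deletion plus one insertion has no further effect, because the algorithm will pick the cheaper delete-then-insert path instead. With unit insert/delete costs, the useful range for a penalized pair is therefore between 1 and 2.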

imnasnainaec commented Dec 9, 2024

Questions:

  • Can the Levenshtein distance be easily modified for this?
  • While we're altering our distance algorithm, should we also consider Damerau–Levenshtein distance (i.e., also allow transpositions of adjacent letters)?
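As a rough illustration of both questions together, the sketch below (Python, not the project's code) extends the usual dynamic-programming table with an adjacent-transposition step and a per-pair substitution lookup. It implements the restricted "optimal string alignment" variant of Damerau–Levenshtein, in which no substring is edited more than once, rather than the unrestricted distance. The names `osa_distance`, `pair_costs`, and `transpose_cost` are hypothetical.

```python
from typing import Dict, FrozenSet, Optional


def osa_distance(
    a: str,
    b: str,
    pair_costs: Optional[Dict[FrozenSet[str], float]] = None,
    transpose_cost: float = 1.0,
) -> float:
    """Optimal string alignment distance with per-pair substitution costs."""
    pair_costs = pair_costs or {}
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            ca, cb = a[i - 1], b[j - 1]
            if ca == cb:
                sub = 0.0
            else:
                sub = pair_costs.get(frozenset((ca, cb)), 1.0)
            best = min(
                d[i - 1][j] + 1.0,      # deletion
                d[i][j - 1] + 1.0,      # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
            # Adjacent transposition: a ends in "xy" where b ends in "yx".
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, d[i - 2][j - 2] + transpose_cost)
            d[i][j] = best
    return d[-1][-1]


print(osa_distance("abcd", "abdc"))  # one transposition instead of two substitutions
```

The transposition step adds one extra comparison per cell, so the change to an existing Levenshtein implementation is small. The trade-off is that the table needs access to row `i - 2`, so a two-row memory optimization has to become a three-row one.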

@imnasnainaec imnasnainaec self-assigned this Dec 9, 2024