[DuplicateFinder] Consider project-specific similarity adjustments #3475

imnasnainaec · 2024-12-09T19:49:35Z

Some languages have a pair (or set) of letters such that substituting one for another commonly produces a different valid word. If those letters are also unlikely to be swapped by a typo (e.g., they are not near each other on the language's keyboard), then entries whose vernacular forms differ only by such a pair are unlikely to be duplicates. Perhaps we could add a project setting for specifying pairs that should have altered weight in the duplicate finder algorithm, so that word pairs with those differences wouldn't show up as early among sets of potential duplicates.
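To make the idea concrete, here is a minimal sketch of a Levenshtein distance with per-pair substitution costs. It is written in Python for illustration only and is not the existing DuplicateFinder code; the function name `weighted_levenshtein`, the `pair_costs` mapping, and the cost values are all hypothetical stand-ins for whatever a project setting would provide.

```python
from typing import Dict, FrozenSet, Optional


def weighted_levenshtein(
    a: str,
    b: str,
    pair_costs: Optional[Dict[FrozenSet[str], float]] = None,
    default_sub: float = 1.0,
    indel: float = 1.0,
) -> float:
    """Levenshtein distance where selected letter pairs get a custom substitution cost.

    pair_costs is a hypothetical project setting: it maps an unordered pair of
    letters, e.g. frozenset("pb"), to the cost of substituting one for the other.
    """
    pair_costs = pair_costs or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                sub = pair_costs.get(frozenset((ca, cb)), default_sub)
            curr.append(
                min(
                    prev[j] + indel,      # delete ca
                    curr[j - 1] + indel,  # insert cb
                    prev[j - 1] + sub,    # substitute ca -> cb (or match)
                )
            )
        prev = curr
    return prev[-1]


# Assumed example: in this project, p/b differences usually mean different words.
costs = {frozenset("pb"): 1.5}
print(weighted_levenshtein("pat", "bat"))         # default substitution cost
print(weighted_levenshtein("pat", "bat", costs))  # raised cost for the p/b pair
```

One design note: a substitution cost above `2 * indel` has no further effect, because deleting one letter and inserting the other becomes the cheaper path. Any setting would need to account for that ceiling.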

imnasnainaec · 2024-12-09T19:52:52Z

Questions:

Can the Levenshtein distance be easily modified for this?
While we're altering our distance algorithm, should we also consider Damerau–Levenshtein distance (i.e., also allow transpositions)?
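For reference on the second question, below is a minimal sketch of the restricted Damerau–Levenshtein distance (often called optimal string alignment), which adds one extra case to the usual recurrence for swapping two adjacent letters. It is an illustrative Python sketch, not the project's implementation, and the name `osa_distance` is hypothetical. The unrestricted Damerau–Levenshtein distance needs a somewhat more involved algorithm and can give smaller values for some inputs.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Allows insertion, deletion, substitution, and transposition of two
    adjacent characters, each at cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Extra case: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))  # a single transposition
```

The transposition branch is a small addition to the standard dynamic-programming table, so per-pair substitution weights could in principle be combined with it by replacing the fixed `cost` with a lookup.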

imnasnainaec added the backend, enhancement, goal: MergeDup, and project labels Dec 9, 2024

imnasnainaec self-assigned this Dec 9, 2024
