Implement phone number analyzer #15915
74429fe to d844ea9
❌ Gradle check result for 74429fe: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for d844ea9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
d844ea9 to 24e60a5
❌ Gradle check result for 24e60a5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
24e60a5 to f7669e2
❕ Gradle check result for f7669e2: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #15915 +/- ##
============================================
+ Coverage 71.88% 71.94% +0.05%
- Complexity 64496 64535 +39
============================================
Files 5291 5296 +5
Lines 301668 301744 +76
Branches 43576 43585 +9
============================================
+ Hits 216863 217094 +231
+ Misses 67031 66764 -267
- Partials 17774 17886 +112
☔ View full report in Codecov by Sentry.
this is a flaky test: #14304 and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it) |
done (let's hope it stays mergeable for longer this time!) |
LGTM, thanks @rursprung ! @msfroh anything left on your side?
❌ Gradle check result for a3ac6dc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Looks good, thanks!
The backport to
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15915-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d1fd47c652b4c6a2c0ec5d0ee574a0ff0d263177
# Push it to GitHub
git push --set-upstream origin backport/backport-15915-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x
Then, create a pull request where the
@rursprung apologies, mind please sending a manual backport to |
no worries, done: #16187
i'm a big fan of having a changelog, but it's causing a lot of merge conflicts here 🙁
on another note: squash-merging destroys my nice atomic commits 🙁
Please feel free to open an issue or kick off discussion!
The clean repo history is useful, but this is a tradeoff for sure |
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389 Signed-off-by: Ralph Ursprung <[email protected]>
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details. resolves opensearch-project#8389 Co-authored-by: Fanit Kolchina <[email protected]> Signed-off-by: Fanit Kolchina <[email protected]> Signed-off-by: Ralph Ursprung <[email protected]>
* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unknown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
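For readers who want a feel for the `Strings#isDigits` helper mentioned in the first commit above, here is a minimal sketch of the idiom it describes (a sequential character stream checked with `allMatch`). The class name `DigitsSketch` and the handling of `null` and empty input are assumptions made for illustration; this is not the code merged in the PR.

```java
// Hypothetical stand-in for the `Strings#isDigits` helper described above.
// The class name and the null/empty handling are assumptions, not the OpenSearch source.
final class DigitsSketch {

    private DigitsSketch() {}

    /**
     * Returns true if every character of {@code s} is a digit.
     * The stream is deliberately sequential: for short inputs, setting up
     * a parallel stream would cost more than the check itself.
     */
    static boolean isDigits(final String s) {
        // Assumption: null counts as "not digits". allMatch on an empty
        // stream returns true, so "" yields true in this sketch.
        // Character.isDigit also accepts non-ASCII Unicode digits.
        return s != null && s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isDigits("0041581234567")); // prints "true"
        System.out.println(isDigits("+41581234567"));  // prints "false" because of the '+'
    }
}
```

The example numbers are made up; they only show that a leading `+` is enough to make the check fail.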
* document the new `analysis-phonenumber` plugin

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details.

resolves #8389

Co-authored-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>

* Minor rewrites

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Update _analyzers/supported-analyzers/phone-analyzers.md

Signed-off-by: kolchfa-aws <[email protected]>

* Update _analyzers/supported-analyzers/phone-analyzers.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be extended with all other optional language analyzers (which are currently also not covered).

Signed-off-by: Ralph Ursprung <[email protected]>
Description
this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list falsehoods programmers believe about phone
numbers.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unknown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).
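
To make the effect of the region concrete, the following toy sketch shows why `+4158...`, `004158...` and `058...` only line up once a region is known. It is a deliberately naive illustration and not how the plugin works: the real parsing is delegated to `libphonenumber`. The class `RegionSketch`, the `normalize` method, the hand-written prefix table for `CH` and the example numbers are all hypothetical.

```java
import java.util.Map;

// Toy illustration only; the plugin itself relies on libphonenumber.
// The table is a hand-written, hypothetical subset that covers just CH.
final class RegionSketch {

    // Per region: { international dialing prefix, national trunk prefix, country calling code }.
    private static final Map<String, String[]> REGIONS = Map.of(
        "CH", new String[] { "00", "0", "41" }
    );

    private RegionSketch() {}

    static String normalize(String input, String region) {
        // Keep only digits and '+' so that spaces or dashes do not matter.
        String s = input.replaceAll("[^0-9+]", "");
        if (s.startsWith("+")) {
            return s; // already international; works for any region, including ZZ
        }
        String[] rules = REGIONS.get(region);
        if (rules == null) {
            return s; // unknown region (e.g. ZZ): local prefixes cannot be interpreted
        }
        if (s.startsWith(rules[0])) {
            // International dialing prefix, e.g. "00" -> "+"
            return "+" + s.substring(rules[0].length());
        }
        if (s.startsWith(rules[1])) {
            // National trunk prefix, e.g. "0" -> "+41"
            return "+" + rules[2] + s.substring(rules[1].length());
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("0041581234567", "CH")); // prints "+41581234567"
        System.out.println(normalize("0581234567", "CH"));    // prints "+41581234567"
        System.out.println(normalize("0041581234567", "ZZ")); // prints "0041581234567"
    }
}
```

With `CH` all three spellings collapse to the same value, while with `ZZ` only the `+`-prefixed form is understood, which mirrors the mismatch described above.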

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes #11326
Signed-off-by: Ralph Ursprung [email protected]
Related Issues
Resolves #11326
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.