Word delimiters #2709
The "words" text extraction variant treats every string that contains no white space as a word. This change allows specifying additional characters that act as word delimiters in the same way. This is useful for separating words from punctuation prefixes or suffixes, identifying the individual components of e-mail addresses, etc.
# Standard words extraction:
# only spaces and line breaks start a new word
words0 = [w[4] for w in page.get_text("words")]
Could we assert that words0 is ['word1,word2', '-', 'word3.', 'word4?word5.'] here?
All looks fine to me apart from the above comment.
Sure, but that would mean we test original functionality. What would that prove?
It would prove that the change to the code hasn't broken the original functionality.
(And also make the test a little clearer.)
Adding additional assertion.
I have added an assertion showing that the delimiters do have an effect on the result.
Add assertion for not having broken old functionality.
This feature introduces the option to define extra word-delimiting characters (beyond the standard white space) for the "words" text extraction variant.
A typical use is to also split strings into words at punctuation. By default, "[email protected]" is returned as one word. Using delimiters=".@" returns its 4 word components.
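To make the intended behavior concrete, here is a minimal plain-Python sketch that only emulates the splitting rule described above. It is not the library's implementation: `split_words` is a hypothetical helper, the e-mail address is a made-up placeholder, and the sketch assumes that delimiter characters are handled exactly like white space (they separate words and are not returned themselves). In the real API, the thread above suggests the characters are passed via a `delimiters` argument when extracting with `page.get_text("words")`.

```python
def split_words(text, delimiters=""):
    """Hypothetical emulation of "words" splitting.

    White space always separates words; every character in
    `delimiters` is treated as if it were white space too.
    """
    # Map each delimiter character to a space, then split on white space.
    table = str.maketrans({ch: " " for ch in delimiters})
    return text.translate(table).split()


text = "word1,word2 - word3. word4?word5."

# Standard behavior: only white space starts a new word.
words0 = split_words(text)

# Extended behavior: punctuation also acts as a word delimiter.
words1 = split_words(text, delimiters=",.-?")

# Placeholder address (an assumption, not taken from the PR).
parts = split_words("john.doe@example.com", delimiters=".@")
```

Under these assumptions, `words0` keeps punctuation attached to the words, `words1` contains only the five bare words, and `parts` holds the four components of the address.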