Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Word delimiters #2709

Merged
merged 4 commits into from
Oct 2, 2023
Merged

Word delimiters #2709

merged 4 commits into from
Oct 2, 2023

Conversation

JorjMcKie
Copy link
Collaborator

This feature introduces the option to define extra word delimiting characters (beyond the standard white space) for text extraction variant "words".
A typical use is splitting strings into words also at punctuations. By default, "[email protected]" will be returned as one word. Using delimiters=".@" will return the 4 word components.

Text extraction variant "words" treats all strings not containing white spaces as "words".
This change allows specifying additional characters serving the same purpose. This is useful for separating words from any punctuation suffixes or prefixes, identifying the single components of e-mail addresses, etc.
Comment on lines +12 to +14
# Standard words extraction:
# only spaces and line breaks start a new word
words0 = [w[4] for w in page.get_text("words")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we assert that words0 is ['word1,word2', '-', 'word3.', 'word4?word5.'] here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks fine to me apart from above comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but that would mean we test original functionality. What would that prove?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would prove that the change to the code hasn't broken the original functionality.

(And also make the test a little clearer.)

Adding additional assertion.
@JorjMcKie
Copy link
Collaborator Author

I have added an assertion showing that delimiters indeed have had an effect on the result.
We are already confirming the expected result.
We are now confirming, that the old functionality cannot / did not deliver this.
We should not check correct functioning of the old functionality (in this test).

Add assertion for not having broken old functionality.
Comment on lines +12 to +14
# Standard words extraction:
# only spaces and line breaks start a new word
words0 = [w[4] for w in page.get_text("words")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks fine to me apart from above comment.

Comment on lines +12 to +14
# Standard words extraction:
# only spaces and line breaks start a new word
words0 = [w[4] for w in page.get_text("words")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would prove that the change to the code hasn't broken the original functionality.

(And also make the test a little clearer.)

@JorjMcKie JorjMcKie merged commit b5fb574 into main Oct 2, 2023
@JorjMcKie JorjMcKie deleted the word-delimiters branch October 2, 2023 17:19
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants