Word delimiters #2709

JorjMcKie · 2023-10-02T15:21:36Z

This feature adds the option to define extra word-delimiting characters (beyond standard whitespace) for the text extraction variant "words".
A typical use is to also split strings into words at punctuation. By default, "[email protected]" is returned as one word. Using delimiters=".@" returns its 4 word components.

Text extraction variant "words" treats every string that contains no whitespace as a "word". This change allows specifying additional characters that serve the same purpose. This is useful for separating words from punctuation prefixes or suffixes, identifying the individual components of e-mail addresses, etc.
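To make the intended behavior concrete, here is a minimal pure-Python sketch of the splitting rule described above. It is not the PyMuPDF implementation: `split_words` is a hypothetical helper that works on a plain string, and it assumes that delimiter characters are discarded in the same way as whitespace. The e-mail address is made up for illustration.

```python
def split_words(text, delimiters=None):
    """Split text at whitespace and at any extra delimiter characters."""
    if delimiters:
        # Map every delimiter character to a space so that it separates
        # words like whitespace does and is dropped from the output.
        text = text.translate(str.maketrans({ch: " " for ch in delimiters}))
    return text.split()


address = "jane.doe@example.com"  # hypothetical address, not from the PR
print(split_words(address))                   # the whole address as one word
print(split_words(address, delimiters=".@"))  # the four components
```

In the PR itself the equivalent call is made on a page, with the extra characters passed via `delimiters` to the "words" extraction variant.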

julian-smith-artifex-com · 2023-10-02T15:31:13Z

tests/test_word_delimiters.py

+    # Standard words extraction:
+    # only spaces and line breaks start a new word
+    words0 = [w[4] for w in page.get_text("words")]


Could we assert that words0 is ['word1,word2', '-', 'word3.', 'word4?word5.'] here?

All looks fine to me apart from above comment.

Sure, but that would mean we test original functionality. What would that prove?

It would prove that the change to the code hasn't broken the original functionality.

(And also make the test a little clearer.)

Adding additional assertion.

JorjMcKie · 2023-10-02T16:35:41Z

I have added an assertion showing that the delimiters did have an effect on the result.
We already confirm the expected result.
We now also confirm that the old functionality cannot / did not deliver it.
We should not check that the old functionality works correctly (in this test).

Add assertion for not having broken old functionality.
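The three kinds of assertion discussed in this thread could be layered roughly as follows. This is a sketch under assumptions, not the test that was merged: `check_word_delimiters` and `stand_in_extract` are hypothetical names, the stand-in replaces the real PyMuPDF extraction so the example runs on its own, and the expected list in step 3 is inferred from the feature description rather than quoted from the PR. The sample text is chosen so that it yields the default word list quoted in the review comment.

```python
import string


def check_word_delimiters(extract):
    """Run the three assertions against extract(text, delimiters) -> list of words."""
    text = "word1,word2 - word3. word4?word5."

    # 1. Regression check: default extraction still splits at whitespace only.
    words0 = extract(text, None)
    assert words0 == ["word1,word2", "-", "word3.", "word4?word5."]

    # 2. Effect check: passing delimiters changes the result.
    words1 = extract(text, string.punctuation)
    assert words1 != words0

    # 3. Expected result (assumed): punctuation separates words and is dropped.
    assert words1 == ["word1", "word2", "word3", "word4", "word5"]
    return words0, words1


def stand_in_extract(text, delimiters):
    """Hypothetical replacement for the page-based call; treats delimiters like spaces."""
    for ch in delimiters or "":
        text = text.replace(ch, " ")
    return text.split()


check_word_delimiters(stand_in_extract)
```

In the actual test the callable would wrap the page extraction shown in the diff, that is `[w[4] for w in page.get_text("words")]`, with the delimiter string passed through the `delimiters` argument.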

julian-smith-artifex-com · 2023-10-02T15:32:07Z

All looks fine to me apart from above comment.

julian-smith-artifex-com · 2023-10-02T15:38:20Z

It would prove that the change to the code hasn't broken the original functionality.

(And also make the test a little clearer.)

JorjMcKie added 2 commits October 2, 2023 09:12

Minor bug removals

e0550da

JorjMcKie requested a review from julian-smith-artifex-com October 2, 2023 15:21

julian-smith-artifex-com reviewed Oct 2, 2023

Update test_word_delimiters.py

96967b4

Adding additional assertion.

Update test_word_delimiters.py

d570cbb

Add assertion for not having broken old functionality.

julian-smith-artifex-com approved these changes Oct 2, 2023

JorjMcKie merged commit b5fb574 into main Oct 2, 2023

JorjMcKie deleted the word-delimiters branch October 2, 2023 17:19

github-actions bot locked and limited conversation to collaborators Oct 2, 2023
