Consistent word swap #752
@k-ivey please add a test function showing that the revised parts work as intended.
@qiyanjun I added two new tests in
Both tests modify an initial text with a duplicated location or name and verify that the transformed text contains no instances of the initial location or name. Since the PyTest action appears to be failing due to memory issues, I'll introduce changes to the workflow to increase the swap size of the machine.
@k-ivey please sync your branch with main.
@qiyanjun I synced my branch and the tests now pass.
The name change seems ok.
But the location change needs some debugging.
for instance:
>>> s = "I am in New York. I love living in San Diego. San Diego is better than New York"
>>> s_augmented = augmenter.augment(s)
>>> print(s_augmented)
['I am in New York. I love living in San Diego. Everett is better than New York']
I've looked a bit into this, and it appears to be a consequence of how the NER tagger handles multi-word locations. In the example sentence above, the first instance of "New York" is tagged as a location, but for the second instance only "York" is tagged. Similarly, the second instance of "San Diego" is tagged as a location, but for the first instance only "San" is tagged. I'll work on this some more, as it will be tricky to handle edge cases such as text that contains both "San Diego" and "San Francisco".
Resolved: Updating my local versions of |
I've made changes to address the issue. In short, the problem was that the NER tagger may not tag all words in a multi-word location (e.g. New York). To fix this, we find all windows that the NER tagger identifies as a location, sort them by the number of words in the window, and then extend the shorter windows to see whether the extended window is a location we've already seen. If so, we update the map from that location to its appearances. As a concrete example:
The locations identified by NER are [[[3, 4], "New York"], [[9], "San"], [[11, 12], "San Diego"], [[17], "York"]], where the indices are word indices in the sentence. The longest window has length 2 and the shortest has length 1, so the windows of length 1 need to be expanded to length 2.
This does not handle all issues with the NER tagger. For instance, in the example sentence above, if the NER tagger tagged only the first instance of "New York" as a location and tagged neither "New" nor "York" in the second instance, then the second instance would never be updated.
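The window-extension idea described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the function name `extend_location_windows` and the `(indices, text)` input format are assumptions made for the example, and the NER output is hard-coded rather than produced by a tagger.

```python
def extend_location_windows(words, tagged):
    """Merge partially tagged locations into longer locations already seen.

    words:  list of word tokens (punctuation removed), 0-indexed.
    tagged: list of (indices, text) pairs, standing in for NER output.
    Returns a dict mapping location text to a sorted list of index windows.
    """
    appearances = {}
    for indices, text in tagged:
        appearances.setdefault(text, []).append(list(indices))

    max_len = max((len(indices) for indices, _ in tagged), default=0)

    # Visit the shortest windows first, as described in the comment above.
    for indices, text in sorted(tagged, key=lambda item: len(item[0])):
        start, end = indices[0], indices[-1]
        matched = False
        # Try every longer window (longest first) that still covers the tag.
        for length in range(max_len, len(indices), -1):
            for lo in range(end - length + 1, start + 1):
                hi = lo + length - 1
                if lo < 0 or hi >= len(words):
                    continue
                candidate = " ".join(words[lo:hi + 1])
                if candidate != text and appearances.get(candidate):
                    window = list(range(lo, hi + 1))
                    if window not in appearances[candidate]:
                        appearances[candidate].append(window)
                    # The short tag is now covered by the longer location.
                    appearances[text].remove(list(indices))
                    matched = True
                    break
            if matched:
                break

    return {text: sorted(wins) for text, wins in appearances.items() if wins}


words = ("I am in New York I love living in San Diego "
         "San Diego is better than New York").split()
tagged = [([3, 4], "New York"), ([9], "San"),
          ([11, 12], "San Diego"), ([17], "York")]
locations = extend_location_windows(words, tagged)
```

With the hard-coded tags from the example, "San" at index 9 and "York" at index 17 are folded into "San Diego" and "New York". An instance that the tagger misses entirely is still not recovered, which matches the limitation noted above.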
This still does not fully solve the location-change issues, but it is better than the current version, so it has been merged.
What does this PR do?
Summary
This PR adds a new consistent parameter to WordSwapChangeLocation and WordSwapChangeName so that all original instances of the same word are updated to the same new word. This keeps the text consistent when the semantics of the words matter. For example, if consistent = True, "Alice likes Bob. Alice does not like Eve." could be transformed into "Kate likes Bob. Kate does not like Eve." Otherwise, the text could be transformed into "Alice likes Bob. Kate does not like Eve.", which alters the semantics of the sentence.
Additions
consistent parameter to WordSwapChangeLocation and WordSwapChangeName.
Changes
WordSwapChangeLocation has been updated to only make one modification to the original text for marginal performance gains.
WordSwapChangeLocation has been updated to correctly capitalize locations consisting of more than one word to match the naming convention used in textattack/shared/data.py. This change allows replacement words for such multi-word locations to be found.
Checklist
.rst file in TextAttack/docs/apidoc.
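To illustrate the difference the consistent parameter makes in the summary's example, here is a small standalone sketch. It does not use TextAttack and is not the implementation in this PR: the helper `swap_word` is hypothetical, and the replacement word is passed in directly instead of being sampled from a name list.

```python
def swap_word(words, index, replacement, consistent=False):
    """Replace the word at `index`; if `consistent`, replace every occurrence.

    Hypothetical helper for illustration only.
    """
    original = words[index]
    if consistent:
        # Every instance of the original word maps to the same new word.
        return [replacement if w == original else w for w in words]
    # Only the chosen position changes; other instances are left as they were.
    return words[:index] + [replacement] + words[index + 1:]


words = "Alice likes Bob . Alice does not like Eve .".split()
inconsistent = " ".join(swap_word(words, 4, "Kate"))
consistent = " ".join(swap_word(words, 4, "Kate", consistent=True))
```

Here `inconsistent` keeps the first "Alice" and changes only the second, while `consistent` replaces both, so the two sentences still refer to the same person.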