Style paraphrasers work best in a two-stage pipeline, can re-use HuggingFace generate(...) APIs #298

Open
martiansideofthemoon opened this issue Sep 21, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@martiansideofthemoon

martiansideofthemoon commented Sep 21, 2021

Hi everyone, I'm the original author of the STRAP paraphrasers (paper link) which were recently accepted to NL-Augmenter (#227), an effort led by @Filco306. Excited to see these models in NL-Augmenter!

After discussing with @Filco306 and looking at the PR, I saw that six variants of the paraphraser are provided: a "Basic" style-agnostic paraphraser and five style-specific paraphrasers (link). The "Basic" paraphraser is implemented fine, but for the style-specific paraphrasers I recommend a two-step pipeline:

(1) normalize the text using the "Basic" paraphraser;
(2) pass the output from (1) through the style-specific paraphraser.

This is important since all style-specific paraphrasers were trained on the outputs of "Basic", so any other text is technically out-of-distribution. In an ablation study (-Inf PP. in Table 3 of the paper) we saw a significant drop in style transfer performance without this step. Moreover, the two-step process helps boost output diversity since the "Basic" paraphraser strips input style. This should be fairly simple to implement.
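
A minimal sketch of the two-step process, assuming each paraphraser can be wrapped as a function from string to string. The names `two_stage_paraphrase`, `toy_basic`, and `toy_shakespeare` are hypothetical stand-ins for illustration, not part of the STRAP or NL-Augmenter code:

```python
from typing import Callable

# Assumption: any paraphraser can be exposed as a str -> str function.
Paraphraser = Callable[[str], str]


def two_stage_paraphrase(text: str, basic: Paraphraser, styled: Paraphraser) -> str:
    """Normalize with the "Basic" paraphraser, then apply the style-specific one."""
    normalized = basic(text)   # step (1): strip the input style
    return styled(normalized)  # step (2): add the target style


# Toy stand-ins so the sketch runs without any model weights.
def toy_basic(text: str) -> str:
    return text.lower().rstrip("!")


def toy_shakespeare(text: str) -> str:
    return text.replace("you", "thou")


result = two_stage_paraphrase("Where are you going!", toy_basic, toy_shakespeare)
print(result)  # where are thou going
```

In a real implementation the two toy functions would be replaced by calls into the "Basic" and style-specific models; the point is only that the style-specific model never sees the raw input.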

Another minor point: the models are fully compatible with the new HuggingFace generate(...) APIs, which provide more functionality than what I originally implemented in my repository (in other words, this import can be avoided). Here's an example of how to do it:

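out = gpt2.generate(
    # prompt: the first init_context_size tokens of each sequence
    input_ids=gpt2_sentences[:, 0:init_context_size],
    # total length limit, prompt included
    max_length=gpt2_sentences.shape[1],
    # return an output object (with per-step scores) instead of a bare tensor
    return_dict_in_generate=True,
    eos_token_id=eos_token_id,
    output_scores=True,
    # sample only when top-k or top-p is enabled; otherwise greedy / beam search
    do_sample=top_k > 0 or top_p > 0.0,
    top_k=top_k,
    top_p=top_p,
    temperature=temperature,
    num_beams=beam_size,
    # extra keyword arguments are forwarded to the model's forward pass
    token_type_ids=segments[:, 0:init_context_size]
)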

Also CCing the NL-Augmenter reviewers for the style paraphraser to keep them in the loop: @sebastianGehrmann @Nickeilf @juand-r @kaustubhdhole

@Filco306
Contributor

I agree fully here, and will do a separate PR with this. I see two options:

  1. Add a boolean argument to StyleParaphraser, such as use_twostep_pipeline, or
  2. Add a separate class/wrapper for the two-step pipeline version.
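
Roughly, the two options could look like the sketch below. Everything in it is hypothetical: `FlagStyleParaphraser`, `TwoStepWrapper`, the constructor arguments, and the `paraphrase` method are illustrative names, and the models are assumed to be plain string-to-string callables rather than the real classes in the PR.

```python
from typing import Callable, Optional

# Assumption: each model is exposed as a str -> str callable.
Model = Callable[[str], str]


class FlagStyleParaphraser:
    """Option 1 (hypothetical): a boolean flag on the paraphraser itself."""

    def __init__(self, style_model: Model, basic_model: Optional[Model] = None,
                 use_twostep_pipeline: bool = False):
        if use_twostep_pipeline and basic_model is None:
            raise ValueError("use_twostep_pipeline=True requires a basic_model")
        self.style_model = style_model
        self.basic_model = basic_model
        self.use_twostep_pipeline = use_twostep_pipeline

    def paraphrase(self, text: str) -> str:
        if self.use_twostep_pipeline:
            text = self.basic_model(text)  # normalize first
        return self.style_model(text)


class TwoStepWrapper:
    """Option 2 (hypothetical): a separate wrapper that chains two paraphrasers."""

    def __init__(self, basic_model: Model, style_model: Model):
        self.basic_model = basic_model
        self.style_model = style_model

    def paraphrase(self, text: str) -> str:
        return self.style_model(self.basic_model(text))


# Toy models so the comparison runs without any weights.
basic = str.lower
style = lambda s: s + ", forsooth"

flagged = FlagStyleParaphraser(style, basic, use_twostep_pipeline=True)
wrapped = TwoStepWrapper(basic, style)
print(flagged.paraphrase("Hello There"), "|", wrapped.paraphrase("Hello There"))
```

Both produce the same output; the difference is whether existing users of the single-step class see a new argument (option 1) or a new class (option 2).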

WDYT?

@Filco306
Contributor

@martiansideofthemoon have a look at this branch; it is not done yet, but it is on the way. Let me know what you think!
