Possible discrepancies between training pipeline in code vs paper #14

redfarg · 2020-03-21T12:41:10Z

I'm trying to reproduce the results of the associated paper, but I have trouble making sense how the code fits the text. In the paper, the pipeline seems fairly straightforward (fig.2, p.3). A QA data set is augmented to obtain ACS-aware datasets. With these, a QG model is trained and, in a third step, its result are refined via filtering.

In the code, various "experiments" exist, some resembling certain steps of the pipeline in the paper. However, what I don't really understand: One of the first experiments, experiments_1_QG_train_seq2seq.sh, trains a QG model, via QG_main.py. This, however, without the augmented data. Is this just for comparison (e.g. as a baseline)?

A later experiment, experiments_3_repeat_da_de.sh, seems closer to the pipeline in the paper. Augmented data is created and then used for another QG model, this one in QG_augment_main.py. However, this model actually doesn't seem to be trained at all. In the code, it only gets tested (see here). I don't really understand then where the training step of a model with augmented data actually takes place? Or am I just missing something?

Apologies if this is not really fitting for an issue. And great work on the paper!

The text was updated successfully, but these errors were encountered:

flackbash · 2020-04-07T13:25:45Z

I think QG_main.py does use augmented data. Here is how I see it:

QG_main.py takes <p, q, a> tuples from the SQuAD dataset (SQuAD1.1-Zhou) and transforms them into <p, q, a, c, s> tuples in a pre-processing step (in FQG_data.py:prepro() and FQG_data.py:get_loader()).
The clue info is generated in FQG_data_augmentor.py:get_clue_info().
The style info is generated in FQG_data_utils:get_question_type().
This data is then used to train the model.
This seems to correspond directly to the figure you mentioned.

I'm currently working on a more self-contained version of the code of that helps: https://github.com/flackbash/ACS-QG

redfarg · 2020-04-17T11:35:06Z

Thanks a lot for your answer @flackbash , that definitely helps! Clue and style indeed seem to be generated right from the start (aka in experiments_1_QG_*). I think what threw me off was the difference for training/inference data, where only for the latter, good <p, a, c, s>-candidates need to be sampled (whereas during training, they are extracted from the dataset via the two algorithms mentioned in the paper).
However, I'm then also not sure where the conditional probabilistic distributions for data sampling are actually learned? I think I'm just still confused about what actually happens in the different experiment steps (ignoring the GPT2 variant for now):

experiments_1_ET_train.sh
Trains the entailment model.
experiments_1_QG_train_seq2seq.sh
Takes SQuAD 1.1, extracts clue and style, and with that trains the QG-model.
experiments_2_DA_file2sents.sh
We are "simulating" data augmentation on SQuAD2.0 and Wiki1000 (instead of completely raw data). First step, get singular sentences.
experiments_3_DA_sents2augsents.sh
Second step, sample augmented data from previously extracted sentences, using probabilities obtained in experiments_1_QG_train_seq2seq.sh (?)
experiments_4_QG_generate_seq2seq.sh
We now generate questions from the resulting augmented data. For some reason, this uses a different model file (QG_augment_main.py), but will use the trained model from experiment_1_QG_train... (?)
experiments_5_uniq_seq2seq.sh
Throw out non-unique results.
experiments_6_postprocess_seq2seq.sh
This just filter out duplicate words from the generated answer? Which seems a bit of a... brute force improvement? ;)
experiments_3_repeat_da_de.sh
This seems to be an alternative, combined pipeline - a combination of experiments_3_DA_sents2augsents.sh, experiments_4_QG_generate_seq2seq.sh, experiments_5_uniq_seq2seq.sh (all for a certain data index range), PLUS entailment score calculation and filtering. The latter two only take place in this experiment and seem therefore not to be part of the other "pipeline"?

Additionally, in the paper, the post-generation data filtering section (3.4) mentions a BERT-based "normal" QA model (in addition to the entailment model) to filter generated questions. This, however does not seem to be part of the code? At least QG_postprocess_seq2seq.py and DE_main.py (used in experiments_3_repeat_da_de.sh) don't contain anything like it as far as I can see (the latter does however apply some other filtering not mentioned in the paper, like a readability score)?

In summary, the pipeline in experiments_3_repeat_da_de.sh seems to represent the more complete path akin to whats described in the paper? I'll definitely also have a look at your repo, too @flackbash , maybe this will also help me understand things a bit more clearly.

flackbash · 2020-04-22T11:45:30Z

Without having examined this thoroughly, it seems to me the probability distribution is learned the first time DA_main.py is executed. That is, when running experiments_2_DA_file2sents.sh here. experiments_3_DA_sents2augsents.sh then seems to use this probability distribution.

Regarding experiments_4_QG_generate_seq2seq.sh: Yes, I think it uses the model trained in experiment_1_QG_train_seq2seq.sh.

Regarding experiments_3_repeat_da_de.sh: I also find this quite confusing, but I agree with you, that this seems to be the pipeline that relates closer to the one described in the paper. More specifically, I think the proper pipeline would be:

experiments_1_ET_train.sh
experiments_1_QG_train_seq2seq.sh
experiments_2-DA_file2sents.sh
experiments_3_DA_sents2augsents.sh
experiments_4_QG_generate_seq2seq.sh
experiments_5_uniq_seq2seq.sh
(maybe experiments_6_postprocess_seq2seq.sh)
and then from the experiments_3_repeat_da_de.sh file run_glue.py (I'm not sure what this does, but I assume this is part of the filtering process mentioned in the paper? I can't find any reference to a BERT-based QA-model either. Maybe they switched to XLNet instead in the code, see here?) and DE_main.py for the index ranges given e.g. in experiments_3_DA_sents2augsents.sh.

redfarg · 2020-04-22T13:35:45Z

You seem to be spot-on with the sampling probabilities, thanks!

Concerning the pipeline:
I agree, that's what I did for my recent tests as well, leaving out experiments_6_postprocess_seq2seq.sh and including run_glue.py and DE_main.py (which also adds values for perplexity and readability, although the metrics used for the latter seem a bit questionable).

I think run_glue.py is actually the entailment model (as it is based on a GLUE benchmark task, MRPC, and they seem to have copy-pasted most of its code with some modifications). It gets trained in experiments_1_ET_train.sh (again via run_glue.py) and then applied to the generated questions (as seen in experiments_3_repeat_da_de.sh). As you mentioned, they switched this model from BERT to XLNet, but that still leaves out the separate QA-model for filtering - unless this is somehow also contained in run_glue.py? I'll have to take a closer look at that. I ultimately want to try training this with German text, but before I start looking for appropriate datasets I definitely have to try understanding it a bit better...

flackbash · 2020-04-22T14:41:43Z

I just finished some experiments and I would now include experiments_6_postprocess_seq2seq.sh into the pipeline after all since I get quite a lot of word duplications in the generated questions (might be brute force, but seems like it gets the job done ;) ).

Ah ok, right, the entailment model is also supposed to be BERT-based (but actually XLNet-based in the code). Thanks for the clarifications :)

Even after a closer look I can't find the QA-model anywhere. Maybe they didn't include it in the code? If you find anything please let me know.

redfarg · 2020-05-05T15:12:41Z

It's true, it does filter out quite a lot! Still feels a bit like "cheating" to me, though, as I feel that an adequate generative model shouldn't even make those kinds of errors. ;) At least not as many...

I experimented around a bit more and I'm now pretty sure the QA-model mentioned in the paper is contained in run_squad.py. At least, you can train a QA model on SQuAD with this. However, it seems the inference for the purpose of filtering is not really there (maybe unfinished?), as there seems to be no implemented way to let a trained model generate answer spans for given inputs. I did some (very ugly) hacking around it to test it on sentence-answer-pairs generated via the other experiments, but the results so far are pretty bad. Might also be bugs on my end, though.

flackbash · 2020-05-12T12:52:46Z

I see... I haven't even had a look at run_squad.py until now as it was not included in any experiment.

I have contacted Bang Liu in the meantime and he said that both the BERT-QA and BERT-based filter modules were implemented using the Huggingface Transformer library but that this part was implemented by the second author of the paper, Haojie Wei, and is therefore not included in this repo. Unfortunately, Haojie Wei has not replied to my email until now. However, Bang also said it should be easy to implement using the Huggingface Transformer library.

This seems to be in line with your experiments and also matches the fact that run_squad.py seems to be mostly just copied from the transformers library. So parts of it are there, but it's not really integrated in the experiments in this repo.

oppasource · 2020-05-16T17:37:00Z

Thanks @flackbash and @redfarg for the discussion. It has been helpful for me to get to know the inner workings of the code.

I tried to regenerate the questions given in the paper as example in Fig. 5. There were a couple of questions generated that were kind of similar but still the quality was bad. Here is what I did....

Trained QG model using experiments_1_QG_train_seq2seq.sh
Used experiments_3_DA_sents2augsents.sh to augment the sentences with <a, c, s>
Generated questions using experiments_4_QG_generate_seq2seq.sh
Removed duplicates and did post-processing using experiments_5_uniq_seq2seq.sh and experiments_6_postprocess_seq2seq.sh

I didn't use experiments_2-DA_file2sents.sh and manually set the format for the two example sentences.

As far as I understand, the above pipeline is suppose to generate the superset of questions which reduces after filtering. So I examine this superset of questions if it contained the example questions from the figure without doing the filtering.

I am yet to experiment with GPT2 model and will do it soon.

Apart from that, Is there anything that I am doing wrong or missing something? Can you try the two example sentences mentioned in the paper and verify the results?

Thanks.

redfarg · 2020-05-18T10:55:30Z

Hi @oppasource, your pipeline seems right to me. You skipped training and applying the entailment model (in experiments_1_ET_train.sh and parts of experiments_3_repeat_da_de.sh) but that would only give you an additional feature to filter with, so the main generation part would be the same nonetheless.

I generated questions for the two mentioned sentences as well, and my results also look pretty bad. There are barely any viable questions among the output (and none that come close to the examples in the paper). However, on my side it seems to be the fault of the input sampler, at least partially: for example, it never sampled "The New York Amsterdam News", "the United States", or "Manhattan" as possible answers (which, from a human perspective, seem like very obvious candidates). Did you observe better sampled answers?

Plus, if I may ask: how long did you train your QG model for? I trained mine for 10 epochs, which also just might be too few?

oppasource · 2020-05-18T11:48:54Z

Good point you rise about sampling. Some answer phrases were not sampled in my case as well. Perhaps its because of the randomness in sampling of <a, c, s>. I ran the code again with same input and got different answer phrases in the output this time.

As far as epochs are concerned, I also trained for 10 epochs. In the paper also they have mentioned that they trained Seq2Seq model for 10 epochs. However I got the best model at 8th epoch, so that is being used.

I guess the GPT2 model would give better results. As seen in Table 2 in the paper, only 40% of questions are reported to be well-formed (for seq2seq) and 74.5% (for GPT2). Which does seem to be the case because I am looking at the questions generated by Seq2Seq model and many questions are syntactically as well as semantically incorrect.

Training GPT2 model is taking some time, I'll update whenever I get the results.

oppasource · 2020-05-26T12:49:17Z

Update: Questions generated from GPT2 model are certainly way better in terms of syntactic structure compared to seq2seq model. It did generate exact same 2 of the 6 questions given in Fig. 5 of the paper. For remaining questions, I did not see those exact answer phrases and question type sampled out. So maybe if the sampling also goes same as the examples, it will probably generate the same questions.

redfarg · 2020-05-31T15:32:31Z

Thanks @oppasource for your update! I managed to train a GTP2-based model now as well, and I can report similar results: its a lot better at generating coherent language and could reproduce some of the questions reported in the paper. Likewise, for the others, the sampler didn't pick the appropriate answers. It might be interesting to query the generation model with pre-made answers to test its capabilities separate from the sampler...

Madhur-1 · 2023-02-07T05:20:14Z

Hey @redfarg @oppasource could you publish the GPT2 based code used for training and if possible the trained model too?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible discrepancies between training pipeline in code vs paper #14

Possible discrepancies between training pipeline in code vs paper #14

redfarg commented Mar 21, 2020

flackbash commented Apr 7, 2020

redfarg commented Apr 17, 2020 •

edited

Loading

flackbash commented Apr 22, 2020

redfarg commented Apr 22, 2020 •

edited

Loading

flackbash commented Apr 22, 2020

redfarg commented May 5, 2020

flackbash commented May 12, 2020

oppasource commented May 16, 2020 •

edited

Loading

redfarg commented May 18, 2020 •

edited

Loading

oppasource commented May 18, 2020

oppasource commented May 26, 2020

redfarg commented May 31, 2020

Madhur-1 commented Feb 7, 2023

Possible discrepancies between training pipeline in code vs paper #14

Possible discrepancies between training pipeline in code vs paper #14

Comments

redfarg commented Mar 21, 2020

flackbash commented Apr 7, 2020

redfarg commented Apr 17, 2020 • edited Loading

flackbash commented Apr 22, 2020

redfarg commented Apr 22, 2020 • edited Loading

flackbash commented Apr 22, 2020

redfarg commented May 5, 2020

flackbash commented May 12, 2020

oppasource commented May 16, 2020 • edited Loading

redfarg commented May 18, 2020 • edited Loading

oppasource commented May 18, 2020

oppasource commented May 26, 2020

redfarg commented May 31, 2020

Madhur-1 commented Feb 7, 2023

redfarg commented Apr 17, 2020 •

edited

Loading

redfarg commented Apr 22, 2020 •

edited

Loading

oppasource commented May 16, 2020 •

edited

Loading

redfarg commented May 18, 2020 •

edited

Loading