Possible discrepancies between training pipeline in code vs paper #14
I think ...

I'm currently working on a more self-contained version of the code, if that helps: https://github.com/flackbash/ACS-QG
Thanks a lot for your answer @flackbash, that definitely helps! Clue and style indeed seem to be generated right from the start (aka in ...).
Additionally, in the paper, the post-generation data filtering section (3.4) mentions a BERT-based "normal" QA model (in addition to the entailment model) to filter generated questions. This, however, does not seem to be part of the code? At least ...

In summary, the pipeline in ...
Without having examined this thoroughly, it seems to me the probability distribution is learned the first time ...

Regarding ...

Regarding ...
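(For anyone trying to follow along: as far as I understand, these sampling distributions are essentially empirical frequencies estimated from the SQuAD training data. Below is a rough sketch of that idea; the feature choice, i.e. conditioning the question style on the NER label of the answer, is my own simplification for illustration, not necessarily what the repo computes.)

```python
# Sketch only: estimate P(question style | answer NER type) from a SQuAD-style
# file. Feature choices here are simplified and not taken from the repo.
import json
from collections import Counter, defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")
WH_WORDS = {"what", "who", "when", "where", "why", "how", "which", "whose"}


def question_style(question):
    # Use the wh-word near the start of the question as a crude "style" label.
    for tok in question.strip().lower().split()[:3]:
        if tok in WH_WORDS:
            return tok
    return "other"


def answer_type(answer_text):
    # NER label of the answer span, or "NONE" if spaCy finds no entity.
    ents = nlp(answer_text).ents
    return ents[0].label_ if ents else "NONE"


def style_given_answer_type(squad_json_path):
    counts = defaultdict(Counter)
    with open(squad_json_path) as f:
        data = json.load(f)["data"]
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible") or not qa["answers"]:
                    continue
                a_type = answer_type(qa["answers"][0]["text"])
                counts[a_type][question_style(qa["question"])] += 1
    # Normalize raw counts into conditional probabilities.
    return {
        a_type: {s: c / sum(styles.values()) for s, c in styles.items()}
        for a_type, styles in counts.items()
    }
```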
You seem to be spot-on with the sampling probabilities, thanks! Concerning the pipeline: I think ...
I just finished some experiments and I would now include ...

Ah ok, right, the entailment model is also supposed to be BERT-based (but it is actually XLNet-based in the code). Thanks for the clarifications :) Even after a closer look I can't find the QA model anywhere. Maybe they didn't include it in the code? If you find anything, please let me know.
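(Side note for readers: an entailment filter of this kind can be sketched with an off-the-shelf NLI model from the Transformers library. The checkpoint and the way the passage and question/answer pair are combined below are assumptions for illustration, not the repo's actual setup.)

```python
# Sketch, not the repo's filter: score generated QA pairs with an off-the-shelf
# NLI model (roberta-large-mnli is just an example checkpoint).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def entailment_score(passage, question, answer):
    # Crude hypothesis: just concatenate question and answer. How the real
    # filter phrases the hypothesis is not visible in this repo.
    hypothesis = f"{question} {answer}"
    inputs = tokenizer(passage, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    entail_id = next(i for i, lab in model.config.id2label.items()
                     if lab.lower() == "entailment")
    return probs[entail_id].item()


def keep_pair(passage, question, answer, threshold=0.5):
    # Keep only QA pairs that the passage is judged to entail.
    return entailment_score(passage, question, answer) >= threshold
```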
It's true, it does filter out quite a lot! Still feels a bit like "cheating" to me, though, as I feel that an adequate generative model shouldn't even make those kinds of errors. ;) At least not as many...

I experimented around a bit more and I'm now pretty sure the QA model mentioned in the paper is contained in ...
I see... I haven't even had a look at ...

I have contacted Bang Liu in the meantime and he said that both the BERT-QA and the BERT-based filter modules were implemented using the Huggingface Transformers library, but that this part was implemented by the second author of the paper, Haojie Wei, and is therefore not included in this repo. Unfortunately, Haojie Wei has not replied to my email so far. However, Bang also said it should be easy to implement using the Huggingface Transformers library. This seems to be in line with your experiments and also matches the fact that ...
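(Building on Bang Liu's hint, the missing QA filter is presumably just a SQuAD-style reader: answer each generated question against its passage and keep the pair only if the predicted span roughly matches the sampled answer. Below is a minimal sketch with the Transformers pipeline API; the checkpoint and the token-F1 matching criterion are my own choices, not necessarily the authors'.)

```python
# Sketch, not the authors' implementation: QA-model-based filtering of
# generated question/answer pairs with a SQuAD-finetuned reader.
from transformers import pipeline

# Any SQuAD-finetuned checkpoint should do; this one is just an example.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")


def token_f1(pred, gold):
    # SQuAD-style token overlap F1 between predicted and sampled answer.
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return 0.0
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(gold_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def passes_qa_filter(passage, question, sampled_answer, min_f1=0.5):
    prediction = qa(question=question, context=passage)
    return token_f1(prediction["answer"], sampled_answer) >= min_f1
```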
Thanks @flackbash and @redfarg for the discussion. It has been helpful for me to get to know the inner workings of the code. I tried to regenerate the questions given as examples in Fig. 5 of the paper. A couple of the generated questions were kind of similar, but the quality was still bad. Here is what I did:
I didn't use ...

As far as I understand, the above pipeline is supposed to generate the superset of questions, which is then reduced by filtering. So I examined this superset of questions to check whether it contained the example questions from the figure, without doing any filtering. I am yet to experiment with the GPT2 model and will do that soon. Apart from that, is there anything I am doing wrong, or am I missing something? Can you try the two example sentences mentioned in the paper and verify the results? Thanks.
Hi @oppasource, your pipeline seems right to me. You skipped training and applying the entailment model (in ...).

I generated questions for the two mentioned sentences as well, and my results also look pretty bad. There are barely any viable questions among the output (and none that come close to the examples in the paper). However, on my side it seems to be the fault of the input sampler, at least partially: for example, it never sampled "The New York Amsterdam News", "the United States", or "Manhattan" as possible answers (which, from a human perspective, seem like very obvious candidates). Did you observe better sampled answers?

Plus, if I may ask: how long did you train your QG model for? I trained mine for 10 epochs, which might just be too few?
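(As an aside, the spans listed above are exactly what a purely deterministic candidate extractor would surface, which is why it is surprising that the sampler misses them. For comparison, a non-learned baseline that simply enumerates named entities and noun chunks, instead of the repo's learned sampling, could look like the sketch below; it is only meant to illustrate the gap, not to replicate the sampler.)

```python
# Illustration only, not the repo's sampler: enumerate answer candidates
# deterministically instead of drawing them from a learned distribution.
import spacy

nlp = spacy.load("en_core_web_sm")


def candidate_answers(sentence):
    doc = nlp(sentence)
    candidates = {ent.text for ent in doc.ents}               # named entities
    candidates |= {chunk.text for chunk in doc.noun_chunks}   # noun phrases
    return sorted(candidates)


# For a sentence mentioning "The New York Amsterdam News", "Manhattan" and
# "the United States", those spans should all show up among the candidates.
```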
Good point you raise about sampling. Some answer phrases were not sampled in my case as well. Perhaps it's because of the randomness in the sampling of ...

As far as epochs are concerned, I also trained for 10 epochs. The paper also mentions that they trained the seq2seq model for 10 epochs. However, I got the best model at the 8th epoch, so that is the one being used.

I guess the GPT2 model would give better results. As seen in Table 2 in the paper, only 40% of questions are reported to be well-formed for seq2seq, versus 74.5% for GPT2. That does seem to be the case: looking at the questions generated by the seq2seq model, many are syntactically as well as semantically incorrect. Training the GPT2 model is taking some time, I'll update whenever I get the results.
Update: questions generated by the GPT2 model are certainly way better in terms of syntactic structure compared to the seq2seq model. It generated exactly the same 2 of the 6 questions given in Fig. 5 of the paper. For the remaining questions, I did not see the corresponding answer phrases and question types being sampled. So if the sampling went the same way as in the examples, it would probably generate the same questions.
Thanks @oppasource for your update! I managed to train a GPT2-based model now as well, and I can report similar results: it's a lot better at generating coherent language and could reproduce some of the questions reported in the paper. Likewise, for the others, the sampler didn't pick the appropriate answers. It might be interesting to query the generation model with pre-made answers to test its capabilities separately from the sampler...
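(In case someone wants to try that experiment: with a GPT2-based QG model it amounts to assembling the input sequence by hand with a fixed answer/clue/style and calling generate. The model path and separator tokens below are placeholders; the real input layout has to match whatever the repo's preprocessing produced during training.)

```python
# Sketch of querying a finetuned GPT2 QG model with a hand-picked answer
# instead of a sampled one. Separators and model path are placeholders and
# must be adapted to the actual training format.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_DIR = "path/to/finetuned-qg-gpt2"  # hypothetical path
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
model.eval()


def generate_question(context, answer, clue, style):
    # Placeholder layout: context plus the ACS triple, then a marker after
    # which the model is expected to continue with the question text.
    prompt = f"{context} <answer> {answer} <clue> {clue} <style> {style} <question>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=40,
            num_beams=4,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens and decode only the generated continuation.
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
```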
Hey @redfarg @oppasource, could you publish the GPT2-based code used for training and, if possible, the trained model too?
I'm trying to reproduce the results of the associated paper, but I have trouble making sense of how the code fits the text. In the paper, the pipeline seems fairly straightforward (Fig. 2, p. 3): a QA dataset is augmented to obtain ACS-aware datasets. With these, a QG model is trained and, in a third step, its results are refined via filtering.
In the code, various "experiments" exist, some resembling certain steps of the pipeline in the paper. However, here is what I don't really understand: one of the first experiments, `experiments_1_QG_train_seq2seq.sh`, trains a QG model via `QG_main.py`. This, however, happens without the augmented data. Is this just for comparison (e.g. as a baseline)?

A later experiment, `experiments_3_repeat_da_de.sh`, seems closer to the pipeline in the paper. Augmented data is created and then used for another QG model, this one in `QG_augment_main.py`. However, this model actually doesn't seem to be trained at all. In the code, it only gets tested (see here). I don't really understand, then, where the training step of a model with augmented data actually takes place. Or am I just missing something?

Apologies if this is not really fitting for an issue. And great work on the paper!