-
May I ask why you prefer to skip that step?
-
Hi Sara, thanks again for your help. Here is what I am trying to do: in
the tutorial for fine-tuning with my own data, I want to use a bunch of
text files instead of the wiki_gameofthrone_txt1.zip file. I have the
text files as "docs", obtained by following
https://haystack.deepset.ai/tutorials/preprocessing:
all_docs = convert_files_to_docs(dir_path=doc_dir)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)
My question: is there a way to use this "docs" (the same format the
wiki_gameofthrone_txt1.zip file is converted to) to "Build a QA System
Without Elasticsearch", as in
https://haystack.deepset.ai/tutorials/without-elasticsearch? I can't seem
to get the Retriever to access "docs".
Thanks,
-George
On Thu, Oct 20, 2022 at 6:06 AM Sara Zan ***@***.***> wrote:
Ok I think I understand it better now (please correct me if I'm wrong).
I think there's a misunderstanding here. The problem is that Documents are
not the expected format for the augment_squad.py script, because
Documents alone can't be used to train a model. What you need is a dataset
in SQuAD format, and to create such a dataset you need Labels
<https://docs.haystack.deepset.ai/docs/documents_answers_labels#label>,
which you can create using the annotation tool:
https://docs.haystack.deepset.ai/docs/annotation
Let me know if this helps 🙂
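For reference, a minimal sketch of the SQuAD-format JSON that augment_squad.py and the fine-tuning tutorial expect. The title, context, question, and answer below are hypothetical placeholders; real entries would come from your own annotated Labels exported by the annotation tool.

```python
import json

# Skeleton of a SQuAD v2.0 dataset: data -> paragraphs -> qas -> answers.
# All strings here are made-up examples, not real annotations.
squad_dataset = {
    "version": "v2.0",
    "data": [
        {
            "title": "example_doc",  # hypothetical document title
            "paragraphs": [
                {
                    "context": "Jon Snow is a character in Game of Thrones.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Who is Jon Snow?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": "a character in Game of Thrones",
                                    # character offset of the answer
                                    # inside "context"
                                    "answer_start": 12,
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

# Serialize it to disk in the shape the training scripts read
with open("my_squad_dataset.json", "w") as f:
    json.dump(squad_dataset, f, indent=2)
```

A file in this shape (with real, annotated question-answer pairs) is what you would pass as --squad_path to augment_squad.py.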
-
🙏 Yes, I was trying to mix tutorials 2 and 3. Instead, I just ran tutorial
3 after document_store.write_documents(docs). Now I am able to use the
TF-IDF Retriever and complete the Q&A.
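For anyone landing here, the resolved flow can be sketched roughly as follows, assuming Haystack v1.x (class names and locations may differ in other versions); the sample document and query are placeholders:

```python
# Minimal sketch of Tutorial 3's inference flow with your own
# preprocessed docs, instead of the Game of Thrones files.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TfidfRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.schema import Document

# Stand-in for the "docs" produced by preprocessor.process(all_docs)
docs = [Document(content="Ned Stark is the father of Arya Stark.")]

# Write the docs into the store so the Retriever can see them --
# this was the missing step in the thread above.
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
```

The key point is that the Retriever only reads from the document store, never from a Python list directly, so write_documents(docs) must happen before retrieval.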
Haystack is great; I am still exploring how to use all aspects of it. Next
I am going to do fine-tuning after labeling some question-answer pairs.
I'll also do a bit more preprocessing, as my documents come from different
decades.
Thank you so much, Sara.
-George
On Thu, Oct 20, 2022 at 11:54 AM Sara Zan ***@***.***> wrote:
You're mentioning two different tutorials performing two very different
tasks. It's unclear to me, at this point, if you're trying to do
fine-tuning (tutorial 2) or inference (tutorial 3).
So:
- For fine-tuning (tutorial 2) it's not sufficient to have the docs.
You need labels, as explained above.
- For inference (tutorial 3), the docs need to be written into the
document store to be read by the Retriever. This is covered in Tutorial 3
in the Preprocessing of Documents section:
document_store.write_documents(docs)
If I'm still not answering your question, can you share your non-working
code? That will clarify most of my doubts.
Answer selected by gavirapp
-
Sorry, this might be a naive question, but I can't get past this (even after looking at the relevant tutorials):
From my documents, I ran preprocessing to create "docs", a list of haystack.schema.Documents. I am unable to use it to fine-tune a model (for example, in a teacher/student pair where the student is trained with my data). I could replace the files in "tutorial2" with mine, but I did many preprocessing steps on those files to get to "docs". Here is how I got the "docs":
all_docs = convert_files_to_docs(dir_path=folder)
preprocessor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=False,
split_by="word",
split_length=100,
split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)
My question: is there a way/need to use:
!python augment_squad.py --squad_path file_path --output_path augmented_dataset.json --multiplication_factor 2 --glove_path glove.6B.300d.txt
and have "docs" to produce the augmented_dataset.json?
I am using Google Colab.
Thanks much,
-George