This is a Streamlit app for a simple English improvement and summarisation agent.
Ideally, create a separate virtual environment and then run
pip install -r requirements.txt
Essentially, all you need is
pip install langchain streamlit openai
After that, from the directory containing app.py, run:
streamlit run app.py
Note: you need to provide your own OpenAI API key. You can create one by following this link: https://platform.openai.com/docs/quickstart?context=python
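For example, one common way to make the key available is to set the OPENAI_API_KEY environment variable before the OpenAI/LangChain clients are created (this is the conventional variable name those libraries look for; how app.py actually reads the key may differ):

```python
# Illustrative only: set the key before the openai/langchain clients are created.
# OPENAI_API_KEY is the variable name these libraries conventionally read.
import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, replace with your own key
```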
The English improvement agent is based on OpenAI's GPT-3.5 model; GPT-4 is used only for evaluation.
The three tasks are completed using prompt engineering alone; no RAG is needed. Reasons for not using RAG are:
- No external data, knowledge, or facts are needed to perform the three tasks.
- Summarisation could use some reference summaries, but that is more a matter of few-shot prompting than RAG.
- Style improvement could instead have used fine-tuning to teach the agent a particular style, but that was not allowed by the assignment description.
Basic zero-shot prompt engineering is used for all three tasks. The prompt always mentions that the text was written by a native English speaker and may contain ambiguity and inconsistency, and that the agent needs to keep this in mind while performing the task.
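As an illustration, a zero-shot prompt of this kind could be wired up as follows. This is a minimal sketch using the legacy LangChain API implied by the pip command above; the exact prompt wording and chain setup in app.py may differ.

```python
# Minimal sketch of a zero-shot prompt for the improvement task
# (illustrative only; the actual prompt in app.py may differ).
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

improve_prompt = PromptTemplate.from_template(
    "The following text was written by a native English speaker and may "
    "contain ambiguity and inconsistency. Keep that in mind and rewrite it "
    "with improved grammar and clarity, preserving the original meaning.\n\n"
    "Text:\n{text}\n\nImproved text:"
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
improve_chain = LLMChain(llm=llm, prompt=improve_prompt)
# improved = improve_chain.run(text=user_input)
```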
In the summarisation task, when the input text is longer than 500 words (split on whitespace), I use the 'refine' prompting method, which sends chunks of the text over multiple LLM calls and iteratively refines the summary, instead of sending the entire text in a single prompt.
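A minimal sketch of that branching logic, assuming LangChain's built-in summarisation chains are used (the chunk sizes and exact chain setup in app.py may differ):

```python
# Sketch of the 'refine' summarisation path for long inputs
# (assumed to use LangChain's load_summarize_chain; app.py may differ).
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

def summarise(text: str) -> str:
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    if len(text.split()) > 500:
        # Long input: split into chunks and refine the summary chunk by chunk.
        splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
        docs = [Document(page_content=chunk) for chunk in splitter.split_text(text)]
        chain = load_summarize_chain(llm, chain_type="refine")
        return chain.run(docs)
    # Short input: summarise in a single call with the default 'stuff' chain.
    chain = load_summarize_chain(llm, chain_type="stuff")
    return chain.run([Document(page_content=text)])
```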
Evaluation is generally done by comparing the output text of an LLM with a reference text or through human feedback.
Since we lack both of these, a third, relatively new method of using an LLM to evaluate another LLM is used here.
Specifically, I use the G-Eval evaluation criteria, with code inspired by this example: https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization.
Essentially, for all three tasks there are four criteria to evaluate, mapped to the metrics listed below. These criteria and their associated evaluation steps are fed as a prompt to a superior model (GPT-4 in this case) to evaluate the performance of the model used in the three tasks; a sketch of the evaluation call follows the list.
- Grammar is evaluated as part of the fluency metric.
- Summaries are evaluated as part of the relevance and coherence metrics.
- Style is evaluated as part of the consistency metric.
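For illustration, a G-Eval style scoring call along the lines of the linked cookbook example might look like this. The prompt wording, the geval_score helper, and the use of the legacy (pre-1.0) openai API are assumptions for the sketch, not a copy of app.py.

```python
# Sketch of a G-Eval style scoring call with GPT-4 as the evaluator,
# adapted from the linked OpenAI cookbook example (illustrative only).
import openai  # legacy pre-1.0 openai API assumed

EVALUATION_PROMPT = """You will be given a source text and an output produced by a model.
Your task is to rate the output on one metric.

Evaluation Criteria:
{criteria}

Evaluation Steps:
{steps}

Source Text:
{source}

Model Output:
{output}

Evaluation Form (scores ONLY, 1-5):
- {metric_name}:"""

def geval_score(criteria, steps, source, output, metric_name):
    # GPT-4 acts as the superior evaluator model; temperature 0 for stable scores.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": EVALUATION_PROMPT.format(
                criteria=criteria, steps=steps,
                source=source, output=output, metric_name=metric_name,
            ),
        }],
    )
    return response["choices"][0]["message"]["content"]
```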