-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
a18e964
commit f16200b
Showing
12 changed files
with
709 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,324 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "39207008-8ea1-4a0e-bbe1-a8b5e73f0de0", | ||
"metadata": {}, | ||
"source": [ | ||
"# Zyte Serp Reader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d8503fec-8c5f-49cd-acb9-a128b701ddc6", | ||
"metadata": {}, | ||
"source": [ | ||
"Zyte Serp Reader allows you to access the organic results from google search. Given a query string, it provides the URLs of the top search results and the text string associated with those." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "98850676-106a-484e-85df-eaf6e1ea52d0", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# %pip install llama-index llama-index-readers-zyte-serp" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "50861c50-b62c-4c13-8d38-9562a4c4093e", | ||
"metadata": {}, | ||
"source": [ | ||
"In this notebook we show how Zyte Serp Reader (along with web reader) can be used collect information about a particular topic. Given these documents we can perform queries on this topic. \n", | ||
"\n", | ||
"Recently the Govt. of Ireland announced fiscal budget for 2024 and here we show how we can query information regarding the budget. First we get the relevant information using the Zyte Serp Reader, then the information from these URLs is extracted using web reader and finally a queries are answered using a openai chatgpt model. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "3a5f674f-5be8-48da-b447-63edd84a6779", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"from llama_index.readers.zyte_serp import ZyteSerpReader\n", | ||
"from llama_index.readers.web.zyte_web.base import ZyteWebReader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5d787224-245e-41fa-aa6e-eca68d06d92c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# This is needed to run it in juypter notebook\n", | ||
"# import nest_asyncio\n", | ||
"# nest_asyncio.apply()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "86685d68-8b88-4883-afae-4767c41b1b88", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"zyte_api_key = os.environ.get(\"ZYTE_API_KEY\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "4a45e4ae-2ee5-41e4-aa23-a5b927154cdb", | ||
"metadata": {}, | ||
"source": [ | ||
"### Get relevant resources (using ZyteSerp)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7889631c-4650-423c-93fa-dd08473c7848", | ||
"metadata": {}, | ||
"source": [ | ||
"Given a topic, we use the search results from google to get the links to the relevant pages. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "52ce02e0-9ed0-452a-876d-ecaab50e94dd", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"topic = \"Ireland Budget 2025\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "1eaf5a9c-c597-4cff-9446-287529b9e15a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"serp_reader = ZyteSerpReader(api_key=zyte_api_key)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "a6c5e727-9d1c-470c-a839-690d433ca3cf", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"search_results = serp_reader.load_data(topic)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5c838258-219a-4ffb-96a3-263a0f08829e", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"7" | ||
] | ||
}, | ||
"execution_count": null, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"len(search_results)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "03ef22f3-295a-4e46-820b-357559d8324c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"https://www.gov.ie/en/publication/e8315-budget-2025/\n", | ||
"{'name': 'Budget 2025', 'rank': 1}\n", | ||
"https://www.citizensinformation.ie/en/money-and-tax/budgets/budget-2025/\n", | ||
"{'name': 'Budget 2025', 'rank': 2}\n", | ||
"https://www.gov.ie/en/publication/cb193-your-guide-to-budget-2025/\n", | ||
"{'name': 'Your guide to Budget 2025', 'rank': 3}\n", | ||
"https://www.irishtimes.com/your-money/2024/10/01/budget-2025-ireland-main-points/\n", | ||
"{'name': 'Budget 2025 main points: Energy credits, bonus welfare ...', 'rank': 4}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"for r in search_results[:4]:\n", | ||
" print(r.text)\n", | ||
" print(r.metadata)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "7f02511f-6edc-4022-a5b4-ca4b5dbbd9d8", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"urls = [r.text for r in search_results]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ec61510f-3717-4805-b9b2-5af7405bc4f1", | ||
"metadata": {}, | ||
"source": [ | ||
"Seems we have a list of relevant URL with regard to our topic (\"Ireland budget 2024\"). Metadata also shows the text and rank associated with the search result entry. Next we get the content of these webpages using web reader." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "edb07727-2111-4aa7-b52f-51943a42f4bd", | ||
"metadata": {}, | ||
"source": [ | ||
"### Get topic content" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f0249f60-3262-46f3-bfea-d8299e55faa9", | ||
"metadata": {}, | ||
"source": [ | ||
"Given the urls of the webpages which contain information about the topic, we get the content. Since the webpages contain a lot of non-relevant content, we can obtain the filtered content using the \"article\" mode of the ZyteWebReader which returns only the article content of from the webpage." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "20426fd7-5b8d-40e0-aeb9-b16ee58ea036", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"web_reader = ZyteWebReader(api_key=zyte_api_key, mode=\"article\")\n", | ||
"documents = web_reader.load_data(urls)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "8e7f2b75-d583-4f86-9d9d-9bb7cfc58d1d", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Budget 2025 - Tax Highlights Ireland\n", | ||
"\n", | ||
"Budget 2025 announced on 1 October 2024 included a substantial \"cost-of-living\" package including many one-off payments, as well as outlining a framework to direc\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(documents[0].text[:200])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "4f4abb80-9838-4335-b384-304abce0fde9", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"7" | ||
] | ||
}, | ||
"execution_count": null, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"len(documents)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "377d4120-f21a-4003-82cd-c225892b595f", | ||
"metadata": {}, | ||
"source": [ | ||
"### Query engine" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e635e402-8ce8-4ebe-8144-948647ff6d5a", | ||
"metadata": {}, | ||
"source": [ | ||
"Here a very basic query is performed using VectorStoreIndex. Please make sure that the OPENAI_API_KEY env variable is set before running the following code. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e870bde0-f956-4d2a-a922-1653f0ab1a06", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from llama_index.core import VectorStoreIndex\n", | ||
"\n", | ||
"index = VectorStoreIndex.from_documents(documents)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6aece266-f08a-42eb-91ba-8d6c2a0237d0", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Two €125 electricity credits will be provided - one this year and one in 2025.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"query_engine = index.as_query_engine()\n", | ||
"response = query_engine.query(\n", | ||
" \"What kind of energy credits are provided in the budget?\"\n", | ||
")\n", | ||
"print(response)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "forked-llama", | ||
"language": "python", | ||
"name": "forked-llama" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.