Added new use case docs for Web Scraping, Chromium loader, BS4 transformer #8732
Conversation
trancethehuman commented on Aug 4, 2023
- Description: Added a new use case category called "Web Scraping", and a tutorial to scrape websites using OpenAI Functions Extraction chain to the docs.
- Tag maintainer: @baskaryan @hwchase17
- Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on LinkedIn mostly)
i think the structure should be - keep index.mdx in docs/docs_skeleton/docs/use_cases/web_scraping but then move the notebook to /docs/extras/use_cases/web_scraping (and move the .mdx file, it will get built at build time)
I guess I don't need to run
Great use case!
"source": [
"# Web scraping using OpenAI Functions Extraction chain\n",
"\n",
"Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.\n",
Great use case!
"id": "5ef7f514",
"metadata": {},
"source": [
"## Create a simple scraper function\n",
OK, this is cool.
1/ Let's create a new web loader for Chromium that wraps this logic.
2/ It can follow what we did here:
In particular, see.
Simply move this code to create a new loader (e.g., chromium_loader.py or similar).
Will launch a headless instance of Chromium to scrape.
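To make the proposal concrete, here is a minimal, self-contained sketch of what such a chromium_loader.py could look like. The class name, the local Document stand-in, and the lazy Playwright import are my assumptions for illustration, not the final API:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for langchain's Document, so the sketch is self-contained."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ChromiumLoader:
    """Hypothetical loader: scrape pages with headless Chromium via Playwright."""

    def __init__(self, urls):
        self.urls = urls

    async def ascrape(self, url: str) -> str:
        # Lazy import: the loader can be constructed even if Playwright
        # is not installed; the browser is only needed at scrape time.
        from playwright.async_api import async_playwright

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(url)
                html = await page.content()
            finally:
                await browser.close()
        return html

    def load(self) -> list:
        # Scrape each URL and wrap the raw HTML as a Document.
        return [
            Document(page_content=asyncio.run(self.ascrape(url)),
                     metadata={"source": url})
            for url in self.urls
        ]
```

Usage would then be `docs = ChromiumLoader(["https://example.com"]).load()`, mirroring the other web loaders.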
I ran what you have and compared it to this (see docs here):
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer

loader = AsyncHtmlLoader(url)
docs = loader.load()
html2text = Html2TextTransformer()
docs = html2text.transform_documents(docs)
html_content = docs[0].page_content
I found that Chromium is better in this case. For some reason, html2text is losing the news article summaries.
We should add it as a new web loader and simply import here.
Nice. Will change mine
Just took care of this for you!
woo hoo
"\n",
"openai_api_key = \"OPENAI_API_KEY\"\n",
"\n",
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\", openai_api_key=openai_api_key)\n",
Functions work w/ default gpt3.5 / 4 now, AFAIK.
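For context on why the model pin may be unnecessary: function calling just attaches a JSON-schema "function" to the request, which current default gpt-3.5 / gpt-4 snapshots understand. A pure-Python sketch of the payload an extraction chain builds; the `information_extraction` wrapper name and the field names here are illustrative assumptions, not the chain's verbatim internals:

```python
def extraction_function(schema: dict) -> dict:
    """Wrap an extraction schema in the OpenAI function-calling format,
    asking the model to return a list of objects matching the schema."""
    return {
        "name": "information_extraction",
        "description": "Extracts the relevant information from the passage.",
        "parameters": {
            "type": "object",
            "properties": {
                # The model fills "info" with one object per extracted item.
                "info": {"type": "array", "items": schema},
            },
            "required": ["info"],
        },
    }


# Hypothetical schema for the news-scraping use case.
schema = {
    "type": "object",
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}

fn = extraction_function(schema)
```

The schema is all the model needs, so `ChatOpenAI(temperature=0)` with no explicit `model=` pin should work the same way.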
}
],
"source": [
"pip install -q openai langchain playwright beautifulsoup4"
Also need to add:
! playwright install
to download the necessary browser binaries (Chromium, Firefox, WebKit).
I cleaned this up a bit more. Main issue: the extraction is sensitive to the transformation of raw HTML (HTML2Text vs BS4). Have a look at the notebook. Also, title / summary extraction doesn't look quite right.
"id": "97f7de42",
"metadata": {},
"source": [
"# Run the web scraper w/ BeautifulSoup\n",
Can you update the notebook to use the BeautifulSoupTransformer here?
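As a rough illustration of what a BS4-style transformer does differently from html2text — keep text only from selected tags rather than converting the whole page — here is a stdlib-only sketch (the class and function names are made up, and the real transformer uses bs4 rather than html.parser):

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Minimal stand-in for a BS4 transformer: keep text inside chosen tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # nesting depth inside wanted tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside (possibly nested in) a wanted tag.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_tags(html: str, tags=("p", "span")) -> str:
    """Return the concatenated text of the selected tags only."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    return " ".join(parser.chunks)
```

Filtering by tag like this keeps article paragraphs while dropping navigation and boilerplate, which is likely why it preserves the summaries that html2text drops.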
…ormer (langchain-ai#8732)

Co-authored-by: Lance Martin <[email protected]>