Added new use case docs for Web Scraping, Chromium loader, BS4 transformer #8732

Merged

trancethehuman
Contributor

@dosubot dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Aug 4, 2023
@trancethehuman trancethehuman changed the title added docs for web scraping Added new use case category (Web Scraping) and a tutorial for using OpenAI Functions extraction chain for that Aug 4, 2023
Contributor

@hwchase17 hwchase17 left a comment

i think the structure should be - keep index.mdx in docs/docs_skeleton/docs/use_cases/web_scraping but then move the notebook to /docs/extras/use_cases/web_scraping (and move the .mdx file, it will get built at build time)

@rlancemartin rlancemartin self-assigned this Aug 4, 2023
@trancethehuman
Contributor Author

> i think the structure should be - keep index.mdx in docs/docs_skeleton/docs/use_cases/web_scraping but then move the notebook to /docs/extras/use_cases/web_scraping (and move the .mdx file, it will get built at build time)

I guess I don't need to run yarn build and push the markdown file that was generated from the notebook on here because that's in the build CI/CD?

Collaborator

@rlancemartin rlancemartin left a comment

Great use case!

"source": [
"# Web scraping using OpenAI Functions Extraction chain\n",
"\n",
"Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.\n",
Collaborator

Great use case!

"id": "5ef7f514",
"metadata": {},
"source": [
"## Create a simple scraper function\n",
Collaborator

@rlancemartin rlancemartin Aug 4, 2023

OK, this is cool.

1/ Let's create a new web loader for Chromium that wraps this logic.

2/ It can follow what we did here:

#8036

In particular, see.

Simply move this code to create a new loader (e.g., chromium_loader.py or similar).

Will launch a headless instance of Chromium to scrape.

I ran what you have and compared to this. (See docs here):

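from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer

# Fetch the raw HTML, then convert it to plain text with html2text
loader = AsyncHtmlLoader(url)
docs = loader.load()
html2text = Html2TextTransformer()
docs = html2text.transform_documents(docs)
html_content = docs[0].page_content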

I found that Chromium is better in this case.

For some reason, html2text is losing the news article summaries.

We should add it as a new web loader and simply import here.
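
A minimal sketch of what such a loader could look like, assuming Playwright's async API. The class name `ChromiumLoaderSketch` and the dict-shaped records are placeholders for illustration; this is not the loader that was merged.

```python
import asyncio
from typing import Dict, List


class ChromiumLoaderSketch:
    """Hypothetical loader: renders each URL in headless Chromium and returns its HTML."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = list(urls)

    async def _ascrape(self, url: str) -> str:
        # Imported lazily so this module can be imported without Playwright installed.
        from playwright.async_api import async_playwright

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(url)
                return await page.content()  # HTML after JavaScript has run
            finally:
                await browser.close()

    def load(self) -> List[Dict[str, str]]:
        # One record per URL; a real loader would return Document objects instead.
        return [
            {"source": url, "page_content": asyncio.run(self._ascrape(url))}
            for url in self.urls
        ]
```

Note that `asyncio.run` raises an error when an event loop is already running, as in a Jupyter notebook, so a notebook version would likely need to await the coroutine directly.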

Contributor Author

Nice. Will change mine

Collaborator

Just took care of this for you!

Contributor Author

woo hoo

"\n",
"openai_api_key = \"OPENAI_API_KEY\"\n",
"\n",
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\", openai_api_key=openai_api_key)\n",
Collaborator

Functions work w/ default gpt3.5 / 4 now, AFAIK.
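
If that holds, the pinned `-0613` suffix could be dropped. A sketch under that assumption, using the `langchain.chains` and `langchain.chat_models` imports available around the time of this PR; the schema field names and the `extract` helper are hypothetical, not taken from the notebook.

```python
from typing import Any, Dict, List

# Hypothetical schema: the field names are illustrative only.
schema: Dict[str, Any] = {
    "properties": {
        "article_title": {"type": "string"},
        "article_summary": {"type": "string"},
    },
    "required": ["article_title", "article_summary"],
}


def extract(content: str, model: str = "gpt-3.5-turbo") -> List[Dict[str, Any]]:
    # Imported lazily so the schema can be inspected without LangChain installed.
    from langchain.chains import create_extraction_chain
    from langchain.chat_models import ChatOpenAI

    # Relies on OPENAI_API_KEY being set in the environment rather than hard-coding it.
    llm = ChatOpenAI(temperature=0, model=model)
    return create_extraction_chain(schema=schema, llm=llm).run(content)
```

Passing `model="gpt-4"` to `extract` would exercise the other default mentioned in the comment.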

}
],
"source": [
"pip install -q openai langchain playwright beautifulsoup4"
Collaborator

Also need to add:

! playwright install

to download the necessary browser binaries (Chromium, Firefox, WebKit).

@rlancemartin
Collaborator

I cleaned this up a bit more.

Main issue: the extraction is sensitive to the transformation of raw HTML (HTML2Text vs BS4).

Have a look at the notebook.

Also title / summary extraction doesn't look quite right.
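
To illustrate why the transformation step matters, here is a small standard-library sketch in which the choice of tags decides which text reaches the extraction chain. It is a stand-in for tag-based filtering in general, not the BS4 transformer from this PR; `TagTextExtractor`, `extract_text`, and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser
from typing import List, Set


class TagTextExtractor(HTMLParser):
    """Collects text that appears inside any of the chosen tags."""

    def __init__(self, keep: Set[str]) -> None:
        super().__init__()
        self.keep = keep
        self.depth = 0  # > 0 while inside at least one kept tag
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.keep and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html: str, keep: Set[str]) -> str:
    parser = TagTextExtractor(keep)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


html = (
    "<nav><a href='/'>Home</a></nav>"
    "<h3><span>Headline</span></h3>"
    "<p>Summary of the article.</p>"
)
spans_only = extract_text(html, {"span"})  # keeps the headline, drops the summary
spans_and_paragraphs = extract_text(html, {"span", "p"})  # keeps both
```

With only `span` kept, the summary paragraph never reaches the model, which is one way summaries could go missing downstream.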

"id": "97f7de42",
"metadata": {},
"source": [
"# Run the web scraper w/ BeautifulSoup\n",
Collaborator

Can you update the notebook to use the BeautifulSoupTransformer here?

@rlancemartin rlancemartin changed the title Added new use case category (Web Scraping) and a tutorial for using OpenAI Functions extraction chain for that Added new use case docs for Web Scraping, Chromium loader, BS4 transformer Aug 9, 2023
@rlancemartin rlancemartin merged commit e4418d1 into langchain-ai:master Aug 11, 2023
22 checks passed
danielchalef pushed a commit to danielchalef/langchain that referenced this pull request Aug 11, 2023
…ormer (langchain-ai#8732)

- Description: Added a new use case category called "Web Scraping", and
a tutorial to scrape websites using OpenAI Functions Extraction chain to
the docs.
  - Tag maintainer:@baskaryan @hwchase17 ,
- Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on
LinkedIn mostly)

---------

Co-authored-by: Lance Martin <[email protected]>