Merge branch 'refactoring-root-folder'
VinciGit00 committed Feb 15, 2024
2 parents a60e157 + 5d85760 commit c0e1b5c
Showing 37 changed files with 307 additions and 194 deletions.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Contributing to YOSO-ai
# Contributing to ScrapeGraphAI

Thank you for your interest in contributing to **YOSO-ai**! We welcome contributions from the community to help improve and grow the project. This document outlines the guidelines and steps for contributing.
Thank you for your interest in contributing to **ScrapeGraphAI**! We welcome contributions from the community to help improve and grow the project. This document outlines the guidelines and steps for contributing.

## Table of Contents

226 changes: 57 additions & 169 deletions README.md
@@ -1,203 +1,82 @@
# 🤖 YOSO-ai: You Only Scrape Once
# 🕷️ ScrapeGraphAI: You Only Scrape Once

YOSO-ai is a Python **Open Source** library that uses LLM and Langchain for faster and efficient web scraping. Just say which information you want to extract and the library will do it for you.
ScrapeGraphAI is a *web scraping* Python library based on LangChain that uses LLMs and direct graph logic to create scraping pipelines.
Just say which information you want to extract and the library will do it for you!

Official documentation page: [yoso-ai.readthedocs.io](https://yoso-ai.readthedocs.io/en/latest/index.html)

# 🔍 Demo

Try out YOSO-ai in your browser:

[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/VinciGit00/YOSO-ai)

# 🔧 Quick Setup
<p align="center">
<img src="docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
</p>

Follow these steps:

1.
## 🚀 Quick install

```bash
git clone https://github.com/VinciGit00/yoso-ai.git
pip install scrapegraphai
```
## 🔍 Demo

2. (Optional)
Try out ScrapeGraphAI in your browser:

```bash
python -m venv venv
source ./venv/bin/activate
```
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/VinciGit00/Scrapegraph-ai)

3.
## 📖 Documentation

```bash
pip install -r requirements.txt
# if you want to install it as a library
pip install .

# or if you plan on developing new features it is best to also install the extra dependencies using
The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).

pip install -r requirements-dev.txt
# if you want to install it as a library
pip install .[dev]
```
## 💻 Usage

4. Create your personal OpenAI API key from [here](https://platform.openai.com/api-keys)
5. (Optional) Create a `.env` file inside the main folder and paste the API key
### Case 1: Extracting information using a prompt

```config
API_KEY="your openai.com api key"
```
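
The key stored in `.env` can then be read from the process environment before it is handed to the library. The following is a minimal sketch using only the standard library. It assumes the variable is named `API_KEY` as in the snippet above and has already been exported into the environment (for example by `load_dotenv()` from `python-dotenv`). The helper name `read_api_key` is hypothetical and not part of the project.

```python
import os
from typing import Mapping, Optional


def read_api_key(env: Optional[Mapping[str, str]] = None, name: str = "API_KEY") -> str:
    """Return the API key stored under `name`, or raise if it is missing.

    `read_api_key` is a hypothetical helper used for illustration only.
    """
    # Fall back to the real process environment when no mapping is given.
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} not found in environment variables.")
    return key


if __name__ == "__main__":
    # Demonstration with an explicit mapping instead of the real environment.
    print(read_api_key({"API_KEY": "sk-example"}))
```

Failing early with an exception, rather than printing and returning, keeps a missing key from surfacing later as an opaque authentication error.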
You can use the `SmartScraper` class to extract information from a website using a prompt.

6. You are ready to go! 🚀
7. Try running the examples using:

```bash
python -m examples.html_scraping
# or if you are outside of the project folder
python -m yoso-ai.examples.html_scraping
```

# 📖 Examples
The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
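
To picture what a graph of nodes means here: each node can be seen as a function that reads and updates a shared state, and the edges fix the order in which nodes run. The sketch below only illustrates that idea, reading "direct graph" as a directed chain of nodes. The node names (`fetch`, `parse`, `answer`) and the `run_pipeline` helper are hypothetical and are not ScrapeGraphAI's actual classes or API.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]
Node = Callable[[State], State]


def fetch(state: State) -> State:
    # Stand-in for downloading the page: uses a fixed HTML string instead.
    return {**state, "html": "<h1>Rotary Pendulum RL</h1>"}


def parse(state: State) -> State:
    # Stand-in for HTML parsing: strip the surrounding <h1> tags.
    html = str(state["html"])
    return {**state, "text": html.replace("<h1>", "").replace("</h1>", "")}


def answer(state: State) -> State:
    # Stand-in for the LLM call: echo the parsed text as the answer.
    return {**state, "answer": {"titles": [state["text"]]}}


def run_pipeline(nodes: Dict[str, Node], edges: Dict[str, str], entry: str, state: State) -> State:
    """Follow the edges from `entry`, applying each node to the state in turn."""
    current: Optional[str] = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)  # None once a node has no outgoing edge
    return state


NODES = {"fetch": fetch, "parse": parse, "answer": answer}
EDGES = {"fetch": "parse", "parse": "answer"}  # "answer" is the last node

if __name__ == "__main__":
    result = run_pipeline(NODES, EDGES, "fetch", {"prompt": "List me all the titles"})
    print(result["answer"])
```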

```python
import os
from dotenv import load_dotenv
from yosoai import _get_function, send_request

load_dotenv()

def main():
    # Get OpenAI API key from environment variables
    openai_key = os.getenv("API_KEY")
    if not openai_key:
        print("Error: OpenAI API key not found in environment variables.")
        return

    # Example values for the request
    request_settings = [
        {
            "title": "title_news",
            "type": "str",
            "description": "Give me the name of the news"
        }
    ]

    # Choose the desired model and other parameters
    selected_model = "gpt-3.5-turbo"
    temperature_value = 0.7

    # Mockup World URL
    mockup_world_url = "https://sport.sky.it/nba?gr=www"

    # Invoke send_request function
    result = send_request(openai_key, _get_function(mockup_world_url), request_settings, selected_model, temperature_value, 'cl100k_base')

    # Print or process the result as needed
    print("Result:", result)

if __name__ == "__main__":
    main()
```
from scrapegraphai.graphs import SmartScraper

### Case 2: Passing your own HTML code
OPENAI_API_KEY = "YOUR_API_KEY"

```python
import os
from dotenv import load_dotenv
from yosoai import send_request

load_dotenv()

# Example using a HTML code
query_info = '''
Given this code extract all the information in a json format about the news.
<article class="c-card__wrapper aem_card_check_wrapper" data-cardindex="0">
<div class="c-card__content">
<h2 class="c-card__title">Booker show with 52 points, whoever has the most games over 50</h2>
<div class="c-card__label-wrapper c-label-wrapper">
<span class="c-label c-label--article-heading">Standings</span>
</div>
<p class="c-card__abstract">The Suns' No. 1 dominated the match won in New Orleans, scoring 52 points. It's about...</p>
<div class="c-card__info">
<time class="c-card__date" datetime="20 gen - 07:54">20 gen - 07:54</time>
...
</div>
</div>
<div class="c-card__img-wrapper">
<figure class="o-aspect-ratio o-aspect-ratio--16-10 ">
<img crossorigin="anonymous" class="c-card__img j-lazyload" alt="Partite con 50+ punti: Booker in Top-20" data-srcset="..." sizes="..." loading="lazy" data-src="...">
<noscript>
<img crossorigin="anonymous" class="c-card__img" alt="Partite con 50+ punti: Booker in Top-20" srcset="..." sizes="..." src="...">
</noscript>
</figure>
<i class="icon icon--media icon--gallery icon--medium icon--c-primary">
</i>
</div>
</article>
'''
def main():
    # Get OpenAI API key from environment variables
    openai_key = os.getenv("API_KEY")
    if not openai_key:
        print("Error: OpenAI API key not found in environment variables.")
        return

    # Example values for the request
    request_settings = [
        {
            "title": "title",
            "type": "str",
            "description": "Title of the news"
        }
    ]

    # Choose the desired model and other parameters
    selected_model = "gpt-3.5-turbo"
    temperature_value = 0.7

    # Invoke send_request function
    result = send_request(openai_key, query_info, request_settings, selected_model, temperature_value, 'cl100k_base')

    # Print or process the result as needed
    print("Result:", result)

if __name__ == "__main__":
    main()
```
llm_config = {
    "api_key": OPENAI_API_KEY,
    "model_name": "gpt-3.5-turbo",
}

Note: all the models are listed at the following link: [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models). Make sure your API key has access to the model you choose.
smart_scraper = SmartScraper("List me all the titles and project descriptions",
                             "https://perinim.github.io/projects/", llm_config)

# Example of output

Given the following input
answer = smart_scraper.run()
print(answer)
```

```python
[
    {
        "title": "title",
        "type": "str",
        "description": "Title of the news"
    }
]
The output will be a dictionary with the extracted information, for example:

```bash
{
    'titles': [
        'Rotary Pendulum RL'
    ],
    'descriptions': [
        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
    ]
}
```
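
Since the result is a plain dictionary of lists, it can be post-processed with ordinary Python. The sketch below assumes the `titles` and `descriptions` lists line up by position, as in the sample output above. The actual keys depend on the prompt and on the model's answer, and `pair_projects` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List


def pair_projects(answer: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Zip the parallel `titles` and `descriptions` lists into one record per project."""
    titles = answer.get("titles", [])
    descriptions = answer.get("descriptions", [])
    # zip stops at the shorter list, so unmatched entries are dropped.
    return [{"title": t, "description": d} for t, d in zip(titles, descriptions)]


# Sample data copied from the example output above.
sample = {
    "titles": ["Rotary Pendulum RL"],
    "descriptions": [
        "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms"
    ],
}

for project in pair_projects(sample):
    print(f"{project['title']}: {project['description']}")
```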

using as input the website [https://sport.sky.it/nba?gr=www](https://sport.sky.it/nba?gr=www)
## 🤝 Contributing

Contributions are welcome! Please check out the todos below, and feel free to open a pull request.
For more information, please see the [contributing guidelines](CONTRIBUTING.md).

The output format is a dict, as follows:
After installing and activating the virtual environment, please remember to install the library using the "dev" extra parameter to have the extra dependencies for development.

```bash
{
    'title': 'Booker show with 52 points, whoever has the most games over 50'
}
pip install -e .[dev]
```

# Credits
Thanks to:
- [nicolapiazzalunga](https://github.com/nicolapiazzalunga): for inspiring yosoai/convert_to_csv.py and yosoai/convert_to_json.py functions
## Contributors

# Developed by
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)

## Authors

<p align="center">
<a href="https://vincigit00.github.io/">
@@ -210,3 +89,12 @@ Thanks to:
<img src="docs/assets/logo_perinilab.png" alt="PeriniLab Logo" style="width: 30%;">
</a>
</p>

## 📜 License

ScrapeGraphAI is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more information.

## Acknowledgements

- We would like to thank all the contributors to the project and the open-source community for their support.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
Binary file added docs/assets/scrapegraphai_logo.png
7 changes: 4 additions & 3 deletions docs/source/conf.py
@@ -8,12 +8,13 @@

# -- Path setup --------------------------------------------------------------

import os, sys
import os
import sys

# import all the modules
sys.path.insert(0, os.path.abspath('../../'))

project = 'yosoai'
project = 'scrapegraphai'
copyright = '2024, Marco Vinciguerra'
author = 'Marco Vinciguerra'

@@ -29,4 +30,4 @@
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
html_static_path = ['_static']
14 changes: 14 additions & 0 deletions docs/source/introduction/overview.rst
@@ -5,6 +5,20 @@ This is an open source project aimed at developing a scraping library using LLM
The goal is to be able to scrape data using natural language queries and store them in a structured format.

.. image:: ../../assets/apikey_1.png
   :align: center
   :width: 400px
   :alt: OpenAI Key

.. image:: ../../assets/apikey_2.png
   :align: center
   :width: 400px
   :alt: OpenAI Key

.. image:: ../../assets/apikey_3.png
   :align: center
   :width: 400px
   :alt: OpenAI Key

.. image:: ../../assets/apikey_4.png
   :align: center
   :width: 400px
   :alt: OpenAI Key
7 changes: 7 additions & 0 deletions docs/source/modules/modules.rst
@@ -0,0 +1,7 @@
scrapegraphai
=============

.. toctree::
   :maxdepth: 4

   scrapegraphai
29 changes: 29 additions & 0 deletions docs/source/modules/yosoai.graphs.rst
@@ -0,0 +1,29 @@
scrapegraphai.graphs package
============================

Submodules
----------

scrapegraphai.graphs.base\_graph module
---------------------------------------

.. automodule:: scrapegraphai.graphs.base_graph
   :members:
   :undoc-members:
   :show-inheritance:

scrapegraphai.graphs.smart\_scraper\_graph module
-------------------------------------------------

.. automodule:: scrapegraphai.graphs.smart_scraper_graph
   :members:
   :undoc-members:
   :show-inheritance:

Module contents
---------------

.. automodule:: scrapegraphai.graphs
   :members:
   :undoc-members:
   :show-inheritance: