From a20f005d08f4c87cabedadd1ded00ab8491ad663 Mon Sep 17 00:00:00 2001
From: Tyler
Date: Thu, 10 Aug 2023 15:21:38 -0700
Subject: [PATCH] edited pdf transform documentation for consistency (#523)

* edited pdf transform documentation for consistency

* updated webpage for beautiful soup 4
---
 docs/guide/transforms/pdf_transform.md     | 58 +++++++++++-----------
 docs/guide/transforms/webpage_transform.md |  9 ++--
 2 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/docs/guide/transforms/pdf_transform.md b/docs/guide/transforms/pdf_transform.md
index 946d4341..b42993a6 100644
--- a/docs/guide/transforms/pdf_transform.md
+++ b/docs/guide/transforms/pdf_transform.md
@@ -1,8 +1,8 @@
 The PDF transform allows users to extract text from pdf files. Autolabel offers both direct text extraction, useful for extracting text from pdfs that contain text, and optical character recognition (OCR) text extraction, useful for extracting text from pdfs that contain images. To use this transform, follow these steps:
 
-1. Install dependencies
-For direct text extraction, install the pdfplumber package:
+## Installation
+
+For direct text extraction, install the pdfplumber package:
 
 ```bash
 pip install pdfplumber
@@ -14,10 +14,32 @@ For OCR text extraction, install the pdf2image and pytesserac
 pip install pdf2image pytesseract
 ```
 
-2. Add the transform to your config file
+## Parameters for this transform
+
+1. file_path_column: the name of the column containing the file paths of the pdf files to extract text from
+2. ocr_enabled: a boolean indicating whether to use OCR text extraction or not
+3. page_format: a string containing the format to use for each page of the pdf file. The following fields can be used in the format string:
+   * page_num: the page number of the page
+   * page_content: the content of the page
+4. page_sep: a string containing the separator to use between each page of the pdf file
+
+### Output Format
+
+The page_format and page_sep parameters define how the text extracted from the pdf will be formatted. For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a page_format of {page_num} - {page_content} and a page_sep of \n would result in the following output:
+
+```python
+"1 - Hello,\n2 - World!"
+```
+
+The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.
+
+## Using the transform
 
-below is an example of a pdf transform to extract text from a pdf file:
+Below is an example of a pdf transform to extract text from a pdf file:
 
 ```json
 {
@@ -41,29 +63,7 @@ below is an example of a pdf transform to extract text from a pdf file:
 }
 ```
 
-The `params` field contains the following parameters:
-
-* file_path_column: the name of the column containing the file paths of the pdf files to extract text from
-* ocr_enabled: a boolean indicating whether to use OCR text extraction or not
-* page_format: a string containing the format to use for each page of the pdf file. The following fields can be used in the format string:
-   * page_num: the page number of the page
-   * page_content: the content of the page
-* page_sep: a string containing the separator to use between each page of the pdf file
-
-For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a page_format of {page_num} - {page_content} and a page_sep of \n would result in the following output:
-
-```python
-"1 - Hello,\n2 - World!"
-```
-
-The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.
-
-3. Run the transform
+## Run the transform
 
 ```python
 from autolabel import LabelingAgent, AutolabelDataset
diff --git a/docs/guide/transforms/webpage_transform.md b/docs/guide/transforms/webpage_transform.md
index 639ca138..a8742fe8 100644
--- a/docs/guide/transforms/webpage_transform.md
+++ b/docs/guide/transforms/webpage_transform.md
@@ -1,4 +1,4 @@
-The Webpage transform supports loading and processing webpage urls. Given a url, this transform will send the request to load the webpage and then parse the webpage returned to collect the text to send to the LLM.
+The Webpage transform supports loading and processing webpage urls. Given a url, this transform will send the request to load the webpage and then parse the webpage returned to collect the text to send to the LLM.
 
 Use this transform yourself here in a Colab - [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PwrdBUUX1u4X2SWjgKYNxB11Gb7XEIZs#scrollTo=1f17f05a)
 
@@ -6,10 +6,12 @@ In order to use this transform, use the following steps:
 
 ## Installation
 
-Use the following command to download all dependencies for the webpage transform.
+Use the following command to download all dependencies for the webpage transform. `beautifulsoup4` must be version `4.12.2` or higher.
+
 ```bash
-pip install bs4 httpx fake_useragent
+pip install beautifulsoup4 httpx fake_useragent
 ```
+
 Make sure to do this before running the transform.
 
 ## Parameters for this transform
@@ -41,6 +43,7 @@ Below is an example of a webpage transform to extract text from a webpage:
 ```
 
 ## Run the transform
+
 ```python
 from autolabel import LabelingAgent, AutolabelDataset
 agent = LabelingAgent(config)
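
Note on the Output Format text in the pdf_transform.md hunk: the way page_format and page_sep combine can be illustrated with a small standalone sketch. This is not Autolabel's implementation; `format_pages` is a hypothetical helper, and it assumes that page_format is filled in with Python's `str.format` and that page numbers start at 1, both of which match the "Hello," / "World!" example in the patch.

```python
def format_pages(pages, page_format="{page_num} - {page_content}", page_sep="\n"):
    """Render each page with page_format, then join the pages with page_sep.

    Hypothetical helper for illustration only; assumes str.format-style
    placeholders and 1-based page numbering.
    """
    rendered = [
        page_format.format(page_num=page_num, page_content=page_content)
        for page_num, page_content in enumerate(pages, start=1)
    ]
    return page_sep.join(rendered)


# Two pages, as in the documentation example.
pages = ["Hello,", "World!"]
output = format_pages(pages)

# The docs describe a metadata dict with a "num_pages" field.
metadata = {"num_pages": len(pages)}

print(repr(output))  # '1 - Hello,\n2 - World!'
print(metadata)      # {'num_pages': 2}
```

Passing a different page_sep, for example `"\n\n"`, changes only the text between pages and leaves the per-page format untouched.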