This project provides a tool that extracts tables from PDF files using the Docling library and saves them as CSV or XLSX files, optionally reversing text for right-to-left languages.
- PDF Input: Provide the path to the PDF file you want to convert.
- Table Extraction: The tool uses Docling's DocumentConverter to extract tables from the PDF.
- DataFrame Conversion: Each extracted table is converted into a pandas DataFrame.
- Optional Text Reversal: If the rtl option is enabled, text in the DataFrame is reversed.
- CSV/XLSX Output: The DataFrames are saved as CSV or XLSX files in the specified output directory.
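The reversal and output steps can be sketched roughly as follows. This is a minimal standard-library illustration, not the package's actual implementation: plain lists of rows stand in for the pandas DataFrames, and the helper names reverse_cell, reverse_table, and table_to_csv are hypothetical. It assumes that "reversed" means each string cell is reversed character by character.

```python
import csv
import io


def reverse_cell(value):
    """Reverse strings character by character; leave other values untouched."""
    return value[::-1] if isinstance(value, str) else value


def reverse_table(rows):
    """Apply reverse_cell to every cell of a table given as a list of rows."""
    return [[reverse_cell(cell) for cell in row] for row in rows]


def table_to_csv(rows):
    """Serialize a table to CSV text, standing in for the final output step."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted table whose text came out in reversed order.
# ASCII is used here only to keep the example easy to read.
table = [["eman", "ega"], ["ecilA", 30]]
fixed = reverse_table(table)
print(table_to_csv(fixed))
```

Note that plain character reversal also flips any digits or left-to-right words embedded in a string cell, so real right-to-left text may need more careful handling than this sketch shows.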
This project depends heavily on the Docling library for PDF table extraction. Docling is installed automatically when you install this package.
You can install the package from PyPI using pip:
pip install pdf2csv
You can use the CLI tool to convert PDF files to CSV or XLSX:
pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
Example:
pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
You can also run the CLI tool with uvx:
uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose
Example:
uvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
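To convert several PDFs in one go, the CLI invocation shown above can be scripted. The following is a minimal sketch that only uses the flags documented here; the helper names build_command and convert_all are hypothetical, and it assumes the pdf2csv executable is on your PATH.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir, output_format="csv", rtl=False, verbose=False):
    """Build the argument list for one `pdf2csv convert-cli` call."""
    cmd = [
        "pdf2csv", "convert-cli", str(pdf_path),
        "--output-dir", str(output_dir),
        "--output-format", output_format,
    ]
    if rtl:
        cmd.append("--rtl")
    if verbose:
        cmd.append("--verbose")
    return cmd


def convert_all(input_dir, output_dir, **options):
    """Run the CLI once per PDF in input_dir; stop on the first failure."""
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        subprocess.run(build_command(pdf_path, output_dir, **options), check=True)


# Show the command that would be run for a single file.
print(build_command("example.pdf", "./output", "xlsx", rtl=True))
```

Calling convert_all("./pdfs", "./output", output_format="xlsx") would then process every PDF in the hypothetical ./pdfs directory.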
You can also use the converter directly in your Python code:
from pdf2csv.converter import convert
pdf_path = "example.pdf"
output_dir = "./output"
rtl = True
output_format = "xlsx"
dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)
for df in dfs:
    print(df)
- Convert datatype to numeric
- Support for XLSX output