PDF to CSV Converter

This project converts tables in PDF files to CSV or XLSX using the Docling library. It can optionally reverse cell text for right-to-left languages.

How It Works

PDF Input: Provide the path to the PDF file you want to convert.
Table Extraction: The tool uses Docling's DocumentConverter to extract tables from the PDF.
DataFrame Conversion: Each extracted table is converted into a pandas DataFrame.
Optional Text Reversal: If the rtl option is enabled, text in the DataFrame is reversed.
CSV/XLSX Output: The DataFrames are saved as CSV or XLSX files in the specified output directory.
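
The reversal and output steps above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: it assumes that "reversing" means reversing the characters of each string cell, and it uses plain lists of rows instead of pandas DataFrames to stay dependency-free. The helper names (reverse_cell, reverse_table, table_to_csv) are hypothetical.

```python
import csv
import io


def reverse_cell(value):
    """Reverse the characters of string cells; leave other types untouched."""
    return value[::-1] if isinstance(value, str) else value


def reverse_table(rows):
    """Apply reverse_cell to every cell of a table given as a list of rows."""
    return [[reverse_cell(cell) for cell in row] for row in rows]


def table_to_csv(rows):
    """Serialize a table to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# A table whose text was extracted in reversed character order.
extracted = [["eman", "ytic"], ["ilA", "narheT"]]
print(table_to_csv(reverse_table(extracted)))
```

Numeric cells pass through unchanged here, so only text is affected by the reversal.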

Dependencies

This project depends on the Docling library for PDF table extraction. Docling is installed automatically when you install this package.

Installation

You can install the package from PyPI using pip:

pip install pdf2csv

CLI Usage

You can use the CLI tool to convert PDF files to CSV or XLSX:

pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose

Example:

pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
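
For batch jobs it can help to assemble the same command from Python. The sketch below only builds the argument list from the flags shown above; build_command is a hypothetical helper, and its default values are this example's own choices, not documented defaults of the CLI.

```python
import shlex


def build_command(pdf_path, output_dir="./output", output_format="csv",
                  rtl=False, verbose=False):
    """Build the pdf2csv argument list; defaults here are illustrative only."""
    if output_format not in ("csv", "xlsx"):
        raise ValueError(f"unsupported output format: {output_format!r}")
    command = [
        "pdf2csv", "convert-cli", pdf_path,
        "--output-dir", output_dir,
        "--output-format", output_format,
    ]
    # Boolean options are passed as bare flags only when enabled.
    if rtl:
        command.append("--rtl")
    if verbose:
        command.append("--verbose")
    return command


command = build_command("example.pdf", output_format="xlsx", rtl=True, verbose=True)
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=True).
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.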

With uvx

You can use the CLI tool with uvx:

uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose

Example:

uvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose

Python Usage

You can also use the converter directly in your Python code:

from pdf2csv.converter import convert

pdf_path = "example.pdf"
output_dir = "./output"
rtl = True
output_format = "xlsx"

dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)
for df in dfs:
    print(df)

TODO:

Convert datatype to numeric
Support for XLSX output

ghodsizadeh/pdf2csv

Folders and files

Latest commit

History

Repository files navigation

PDF to CSV Converter

How It Works

Dependencies

Installation

CLI Usage

With uvx

Python Usage

TODO:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages