Figure Extractor API

Extract figures and tables from PDF documents using this Flask-based service. The Figure Extractor API provides a straightforward HTTP interface for PDFFigures 2.0, a robust figure extraction system developed by the Allen Institute for AI.

The API wrapper makes PDFFigures 2.0 easy to integrate into applications and workflows, particularly Retrieval-Augmented Generation (RAG) pipelines.

About PDFFigures 2.0

This API service is built on top of PDFFigures 2.0, a Scala-based project by the Allen Institute for AI. PDFFigures 2.0 is specifically designed to extract figures, captions, tables, and section titles from scholarly documents in the computer science domain. The original work is described in the paper "PDFFigures 2.0: Mining Figures from Research Papers" (Clark and Divvala, 2016). See the paper and the PDFFigures 2.0 website for details.

┌─────────────────┐      ┌──────────────────┐      ┌────────────────┐
│   Your App      │ HTTP │ Figure Extractor │ JNI  │  PDFFigures    │
│  (Any Language) │──────►      API         │──────►     2.0        │
│                 │      │  Python(Flask)   │      │  (Scala/JVM)   │
└─────────────────┘      └──────────────────┘      └────────────────┘
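
The HTTP hop in the diagram can be sketched from the client side as follows. This is a minimal sketch, not the project's own client: it only builds the request, using the `http://localhost:5001/extract` URL that appears in the CLI examples in this README. The multipart field name `file` and the helper name `build_extract_request` are assumptions; check `app/routes.py` for the field the service actually expects.

```python
import urllib.request
import uuid


def build_extract_request(
    pdf_bytes: bytes,
    filename: str,
    url: str = "http://localhost:5001/extract",
    field_name: str = "file",  # assumption: the upload field name is not documented here
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one PDF."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + pdf_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send the request against a running container, pass the result to `urllib.request.urlopen`. The shape of the response is not specified in this README, so inspect it before parsing.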

Features

PDF figure and table extraction
Support for local and remote PDF files
Batch processing capabilities for directories
Statistics of the extracted tables and figures
Docker support for easy deployment
Visualization options for PDF parsing
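
As an illustration of the statistics feature, the sketch below counts extracted items by type. It assumes each record carries a `figType` field with the value `Figure` or `Table`, as in the JSON that PDFFigures 2.0 itself writes. Whether this API returns records in that exact shape is an assumption, and `count_by_type` is a hypothetical helper, not part of the project.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_by_type(records: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count records per `figType`; records without the field count as 'Unknown'."""
    return dict(Counter(str(r.get("figType", "Unknown")) for r in records))


# Hand-written sample records, not real extraction output.
sample = [
    {"figType": "Figure", "name": "1", "page": 2},
    {"figType": "Table", "name": "1", "page": 3},
    {"figType": "Figure", "name": "2", "page": 5},
]
summary = count_by_type(sample)
```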

Use Cases

Machine Learning Dataset Creation: Extract visual data from clinical trial reports and research papers to build training datasets for medical image analysis and AI models, enabling researchers to efficiently aggregate figures for training machine learning algorithms in healthcare diagnostics.
Clinical Research Data Mining: Automatically extract and catalog figures from medical research articles, capturing key visualizations like treatment effect graphs, patient outcome charts, and experimental result diagrams to support systematic reviews and meta-analysis.
Academic Literature Review and Education: Quickly compile comprehensive visual libraries from academic publications, allowing researchers and educators to create teaching resources, compare research methodologies, and track visual trends across scientific disciplines.

Setup

Step 1: Build and Run the Docker Container

Clone the repository:

git clone https://github.com/Huang-lab/figure-extractor.git
cd figure-extractor

Build the Docker image:
```
docker build -t pdf-extraction .
```
Run the Docker container:
```
docker run -p 5001:5001 pdf-extraction
```
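
Once the container is running, a quick way to confirm that the published port accepts connections is a plain TCP check. The sketch below uses only the standard library and the port from the `docker run` command above. It does not verify that the Flask app is healthy, only that something is listening. `is_service_reachable` is a hypothetical helper name.

```python
import socket


def is_service_reachable(host: str = "localhost", port: int = 5001, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `is_service_reachable()` with the defaults checks `localhost:5001`. Pass a different `port` if you mapped the container to another host port.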
API Documentation

For detailed API documentation, visit API Docs

Usage: `how-to.ipynb`

Extract Figures and Tables from a PDF

The service processes a document and performs various operations on each page. Average processing time per page: about 1.06 to 1.55 seconds (based on a 29-page document with a total processing time of about 45 seconds).
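
The per-page figures above give a rough way to budget time for larger documents. The sketch below simply scales the quoted range linearly by page count. Real timings depend on hardware and on how figure-heavy the document is, so treat the result as a ballpark. `estimate_processing_seconds` is a hypothetical helper.

```python
from typing import Tuple


def estimate_processing_seconds(
    num_pages: int,
    low_per_page: float = 1.06,   # lower bound quoted in this README
    high_per_page: float = 1.55,  # upper bound quoted in this README
) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds, assuming linear scaling with pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return (round(num_pages * low_per_page, 2), round(num_pages * high_per_page, 2))


low, high = estimate_processing_seconds(29)
```

For the 29-page reference document, the upper bound lands close to the 45 seconds reported above.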

Using the Module in Python Code

For example code snippets, please refer to the how-to.ipynb notebook.

Using the CLI

Default behavior

python figure_extractor.py 2404.18021v1.pdf

This saves the extracted figures to ./output.

Specifying output directory:

python figure_extractor.py path/to/pdf/file --output-dir ./figures

This saves the extracted figures to ./figures, creating the directory if it does not exist.

Processing a folder:

python figure_extractor.py /path/to/pdf/folder --output-dir ./custom_output
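
Before sending a whole folder, it can help to check which files it actually contains. The sketch below lists the PDF files directly inside a folder. It is a pre-flight check only: how `figure_extractor.py` itself selects files (for example, whether it recurses into subfolders) is not stated here, and `find_pdfs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_pdfs(folder: str) -> List[Path]:
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"  # match .pdf and .PDF
    )
```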

Specifying a custom URL, for example if you run the Docker service on another host or port:

    python figure_extractor.py path/to/pdf/file --url http://localhost:5001/extract --output-dir ./output

    python figure_extractor.py /path/to/pdf/folder --url http://localhost:5001/extract_batch --output-dir ./output
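
The two commands above differ only in the route: `/extract` for a single file and `/extract_batch` for a folder. A small helper can pick the right URL from the input path. This is a sketch built on those two routes as they appear in this README; `choose_endpoint` and the `base_url` parameter are hypothetical names, not part of the project.

```python
from pathlib import Path


def choose_endpoint(path: str, base_url: str = "http://localhost:5001") -> str:
    """Return the batch route for a directory, the single-file route otherwise."""
    route = "extract_batch" if Path(path).is_dir() else "extract"
    return f"{base_url.rstrip('/')}/{route}"
```

The returned string can be passed as the `--url` value when the service runs on a non-default host or port.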

App Structure

project/
├── Dockerfile                # Defines the Docker image for the Flask web service
├── .dockerignore
├── app/                      # Contains the Flask web service code
│   ├── __init__.py           # Initializes the Flask app
│   ├── routes.py             # Defines the API endpoints
│   ├── service.py            # Contains the logic for running `pdffigures2`
│   └── utils.py              # Utility functions for file handling
├── figure_extractor.py       # CLI & Module for extracting figures and tables from a PDF file
├── how-to.ipynb 
└── README.md

License

This project is licensed under the Apache License 2.0.
