This repo contains a PDF parsing toolkit for preparing text corpora by converting PDF to Markdown. It builds on PDF Parser ToolKits, which gathers the most-used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. On this basis we developed sciparser 1.0 for text corpus pre-processing. In recent works such as K2 and GeoGalactica, we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is also publicly available.
- Try DEMO
In this repo and demo, we share only the secondary-processing solution built on Grobid. In the near future, we will also share the multiple-backend combination solution for PDF parsing.
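To make "secondary processing on Grobid" more concrete, here is a minimal sketch of turning Grobid's TEI-XML output into Markdown using only the Python standard library. It is not sciparser's actual implementation: the function name `tei_to_markdown` is hypothetical, and the sketch assumes the usual Grobid layout (a title under `titleStmt`, and body sections as `div` elements with a `head` and `p` children in the TEI namespace).

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_markdown(tei_xml: str) -> str:
    """Convert a TEI-XML string into a simple Markdown document.

    Hypothetical helper for illustration only; it handles the title,
    top-level section headings, and paragraphs, and ignores everything else.
    """
    root = ET.fromstring(tei_xml)
    blocks = []

    # Document title becomes a level-1 heading.
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is not None and title.text:
        blocks.append("# " + title.text.strip())

    # Each top-level <div> in the body is treated as one section.
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is not None and head.text:
            blocks.append("## " + head.text.strip())
        for p in div.findall("tei:p", TEI_NS):
            # itertext() also collects text inside inline tags such as <ref>;
            # split/join collapses runs of whitespace into single spaces.
            text = " ".join("".join(p.itertext()).split())
            if text:
                blocks.append(text)

    return "\n\n".join(blocks)
```

A fuller converter would also need to handle authors, references, figures, tables, and formulas, which this sketch deliberately leaves out.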
git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install
git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt
- python
First, clone the whole repo.
git clone https://github.com/davendw49/sciparser.git
Then import the pipeline file to do the parsing.
from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')
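For a whole folder of papers, the call above can be wrapped in a small batch loop. The following is a sketch under stated assumptions, not part of sciparser: `parse_directory` is a hypothetical helper, and it assumes the parsing function accepts the path to a single PDF file and returns something that can be dumped to JSON. The parsing function is passed in as an argument so the helper itself does not depend on the repo's internals.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Union

PathLike = Union[str, Path]


def parse_directory(
    pdf_dir: PathLike,
    parse_fn: Callable[[str], Any],
    out_dir: PathLike,
) -> Dict[str, str]:
    """Run parse_fn on every *.pdf in pdf_dir and save each result as JSON.

    Returns a mapping from PDF file name to "ok" or an error message,
    so that one unreadable file does not abort the whole batch.
    """
    pdf_dir = Path(pdf_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    status: Dict[str, str] = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            result = parse_fn(str(pdf_path))
            out_file = out_dir / (pdf_path.stem + ".json")
            # default=str is a fallback for values json cannot serialize;
            # this is an assumption, since the return type is not documented here.
            out_file.write_text(
                json.dumps(result, ensure_ascii=False, indent=2, default=str),
                encoding="utf-8",
            )
            status[pdf_path.name] = "ok"
        except Exception as exc:  # record the failure and continue
            status[pdf_path.name] = f"error: {exc}"
    return status
```

With the repo on the import path, a call would look like `parse_directory('/path/to/your/pdf/', pipeline, 'parsed/')`, where `pipeline` is the function imported above and `'parsed/'` is an output folder of your choice.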
- gradio
python main.py
@misc{sciparser,
author = {Cheng Deng},
title = {Sciparser: PDF parsing toolkit for preparing text corpus},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/davendw49/sciparser}},
}
- PDF Parser ToolKits: https://github.com/Acemap/pdf_parser
- TEI-XML Parser (grobid_tei_xml): https://gitlab.com/internetarchive/grobid_tei_xml