Dataset of (mostly German) PDFs used to develop pd3f.
This repository contains the code to scrape and download some public documents (PDFs). The files can be downloaded here:
- Downloaded "Stellungnahmen zu Referententwürfen" from the BMJV, around 02.04.2022
- Prepend filenames with numbers
- OCR'd for German and English with OCRmyPDF
- Sort / group by language
- Redo broken OCR (manually detecting errors while working on the PDFs)
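
The numbering and OCR steps above could be sketched roughly as follows. This is only an illustration, not the script used to build the dataset: the helper names (`numbered_names`, `ocr_command`, `process`), the four-digit prefix, and the directory layout are assumptions. The `ocrmypdf` command line uses the `-l deu+eng` language option and, for redoing broken OCR, `--redo-ocr`.

```python
import subprocess
from pathlib import Path


def numbered_names(names, width=4):
    """Pair each filename with a copy prefixed by a zero-padded index.

    The prefix width and the underscore separator are assumptions.
    """
    return [(n, f"{i:0{width}d}_{n}") for i, n in enumerate(sorted(names), start=1)]


def ocr_command(src, dst, redo=False):
    """Build an OCRmyPDF command line for German and English text."""
    cmd = ["ocrmypdf", "-l", "deu+eng"]
    if redo:
        # Replace an existing (broken) OCR text layer.
        cmd.append("--redo-ocr")
    cmd += [str(src), str(dst)]
    return cmd


def process(in_dir, out_dir):
    """OCR every PDF in in_dir and write it, numbered, to out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = [p.name for p in in_dir.glob("*.pdf")]
    for old, new in numbered_names(names):
        subprocess.run(ocr_command(in_dir / old, out_dir / new), check=True)
```

Calling `process("raw", "ocr")` would need OCRmyPDF and the Tesseract language packs for `deu` and `eng` installed. A file whose OCR turned out broken could be rerun on its own with `ocr_command(src, dst, redo=True)`.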