Dataset of (mostly German) PDFs used to develop pd3f.
This repository contains the code to scrape and download some public documents (PDFs). The files can be downloaded here: https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip.
- Downloaded "Stellungnahmen zu Referentenentwürfen" (statements on ministerial draft bills) from the BMJV, around 02.04.2022
- Prepended numbers to the filenames
- Ran OCR for German and English with OCRmyPDF
- Sorted / grouped the files by language
- Redid broken OCR (errors were detected manually while working on the PDFs)
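
The numbering and OCR steps above could be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the helper names (`numbered_name`, `prepend_numbers`, `ocr_command`), the four-digit prefix format and the directory layout are all assumptions. It also assumes that the broken OCR was redone with OCRmyPDF's `--force-ocr` flag, which the steps above do not state. Only the `ocrmypdf` command with `-l deu+eng` corresponds to the German and English OCR step.

```python
import shutil
from pathlib import Path
from typing import List, Sequence


def numbered_name(index: int, name: str, width: int = 4) -> str:
    """Return `name` with a zero-padded running number prepended (hypothetical format)."""
    return f"{index:0{width}d}_{name}"


def prepend_numbers(src_dir: Path, dst_dir: Path) -> List[Path]:
    """Copy every PDF from src_dir into dst_dir, prefixing each filename with a number."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Sort first so the numbering is reproducible between runs.
    for index, pdf in enumerate(sorted(src_dir.glob("*.pdf")), start=1):
        target = dst_dir / numbered_name(index, pdf.name)
        shutil.copy2(pdf, target)
        copied.append(target)
    return copied


def ocr_command(
    src: Path,
    dst: Path,
    languages: Sequence[str] = ("deu", "eng"),
    redo: bool = False,
) -> List[str]:
    """Build an OCRmyPDF command line; `redo` adds --force-ocr to replace an existing text layer."""
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if redo:
        cmd.append("--force-ocr")
    return cmd + [str(src), str(dst)]
```

To actually run OCR, the returned list can be passed to `subprocess.run(cmd, check=True)`, which requires OCRmyPDF and the Tesseract language data for `deu` and `eng` to be installed.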
GPLv3