Address and fix the PDF processing issues in Harmony that were identified in the Kaggle competition. Improve PDF handling for more seamless integration. Handle Excel and Word documents as well.
Harmony lets users upload a PDF document, which the system then processes to identify and extract the text of the questionnaire questions. You can try Harmony at harmonydata.ac.uk.
Demo of Harmony's Functionality: For a better understanding of what we aim to achieve, participants can view a demo of Harmony's current PDF processing functionality on YouTube.
The objective is to build upon Harmony's existing technology to create a more efficient, accurate, and robust tool for extracting questionnaire questions from a variety of documents. Participants are encouraged to innovate and develop solutions that can handle a wide range of document formats and structures.
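As one illustration of handling a non-PDF format, the sketch below pulls paragraph text out of a Word .docx file using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main body lives in word/document.xml. The function name docx_paragraphs is hypothetical and is not part of Harmony; this is a minimal starting point, not how Harmony currently reads Word files.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the non-empty paragraph texts of a .docx file.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # <w:p> is a paragraph; its visible text is split across <w:t> runs.
    # Paragraphs inside table cells are also <w:p> elements, so they are included.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

Each returned paragraph is a candidate line that a question extractor could then classify. Headers, footers and footnotes are stored in other parts of the archive and are not read here.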
We have lots of example PDFs, together with the ground truths (what questions should be extracted), here:
https://github.com/harmonydata/pdf-questionnaire-extraction/tree/main/data
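To compare a solution's output against the ground truths, something like the following sketch could be used. It assumes the extracted questions and the expected questions are both available as lists of strings, and it scores exact matches after light normalisation. The function names normalise and score are hypothetical, and this is not necessarily the metric the Kaggle competition uses.

```python
import re


def normalise(question):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", question.lower()).split())


def score(predicted, expected):
    """Return (precision, recall, f1) for exact matches after normalisation.

    Hypothetical helper: both arguments are lists of question strings.
    """
    pred = {normalise(q) for q in predicted} - {""}
    gold = {normalise(q) for q in expected} - {""}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    extracted = ["I feel nervous.", "Page 2"]
    truth = ["I feel nervous", "I worry a lot"]
    print(score(extracted, truth))
```

Exact matching is strict: a question that is extracted with a missing word counts as both a false positive and a false negative. A fuzzy match could be swapped in if that proves too harsh.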
Issue: harmonydata/harmony#11
Try our Kaggle competition: https://www.kaggle.com/competitions/harmony-pdf-and-word-questionnaires-extract-v2
GitHub repo for PDF parsing: https://github.com/harmonydata/pdf-questionnaire-extraction
Code Repository: Participants may find it useful to explore Harmony's existing code for PDF processing, which can serve as a starting point or reference for their solutions. It is available in the Harmony GitHub repository and at https://github.com/harmonydata/pdf-questionnaire-extraction
You might also find lists like this useful: https://ipip.ori.org/AlphabeticalItemList.htm
We have a partially completed branch with an updated PDF model: https://github.com/Notysoty/harmony/tree/updated_files_for_forntend. It needs more work before it is complete.
The training scripts are here: https://github.com/harmonydata/pdf-text-models-amol
You can try our non-spaCy branch here:
git clone -b nospacy --recurse-submodules https://github.com/harmonydata/harmonyapi
Then, to make sure you have the correct Harmony library, check that the submodule is this one:
git clone -b updated_files_for_forntend https://github.com/Notysoty/harmony.git