Task 1. PDF parsing

Address and fix the issues with PDF processing in Harmony that were identified in the Kaggle competition. Improve PDF handling so that it integrates more seamlessly, and extend support to Excel and Word documents.

About Harmony's Functionality

Harmony lets users upload a PDF document, which the system then processes to identify and extract the text of questionnaire questions. You can try Harmony at harmonydata.ac.uk.

Demo of Harmony's Functionality: For a better understanding of what we aim to achieve, participants can view a demo of Harmony's current PDF processing functionality on YouTube.

Objective

The objective is to build upon Harmony's existing technology to create a more efficient, accurate, and robust tool for extracting questionnaire questions from a variety of documents. Participants are encouraged to innovate and develop solutions that can handle a wide range of document formats and structures.

Data

We have lots of example PDFs, together with the ground truths (what questions should be extracted), here:

https://github.com/harmonydata/pdf-questionnaire-extraction/tree/main/data
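Since each example comes with a ground truth, one way to check progress locally is to compare the questions your code extracts with the expected ones. The following is a minimal sketch, not the official competition metric: it assumes you have already loaded both sides as plain lists of strings (the on-disk format of the ground truth files is not assumed here), and the function names `normalise` and `score_extraction` are hypothetical.

```python
import re
from typing import Iterable


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace
    so that trivial formatting differences do not count as errors."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score_extraction(predicted: Iterable[str], expected: Iterable[str]) -> dict:
    """Compare extracted questions with ground-truth questions using
    exact match after normalisation (an assumption, not Harmony's metric).
    Returns precision, recall, and F1."""
    pred = {normalise(q) for q in predicted if normalise(q)}
    gold = {normalise(q) for q in expected if normalise(q)}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative strings only; these are not taken from the dataset.
    extracted = ["I feel nervous.", "Page 2", "i worry  a lot"]
    truth = ["I feel nervous", "I worry a lot."]
    print(score_extraction(extracted, truth))
```

Exact matching is deliberately strict. If your extractor keeps item numbers or answer options attached to the question text, you may prefer a fuzzy comparison instead.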

Resources and References

Issue: harmonydata/harmony#11

Try our Kaggle competition: https://www.kaggle.com/competitions/harmony-pdf-and-word-questionnaires-extract-v2

GitHub repo for PDF parsing: https://github.com/harmonydata/pdf-questionnaire-extraction

Code Repository: Participants may find it useful to explore Harmony's existing code for PDF processing, which can serve as a starting point or reference for their solutions. The code is available in the Harmony GitHub repository and at https://github.com/harmonydata/pdf-questionnaire-extraction

You might also find lists like this useful: https://ipip.ori.org/AlphabeticalItemList.htm
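One possible use of such a list is as a lookup: a line of extracted text that closely resembles a known questionnaire item is more likely to be a question than a page header or footer. The sketch below illustrates that idea with Python's standard `difflib`. It is not how Harmony works; the helper names, the 0.8 threshold, and the two sample items are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def best_match(line: str, known_items: list[str]) -> tuple[str, float]:
    """Return the known item most similar to `line`, with a similarity
    ratio between 0 and 1 (case-insensitive)."""
    cleaned = line.strip().lower()
    best_item, best_ratio = "", 0.0
    for item in known_items:
        ratio = SequenceMatcher(None, cleaned, item.lower()).ratio()
        if ratio > best_ratio:
            best_item, best_ratio = item, ratio
    return best_item, best_ratio


def looks_like_known_item(line: str, known_items: list[str],
                          threshold: float = 0.8) -> bool:
    """True if `line` is at least `threshold` similar to some known item.
    The threshold is an arbitrary starting point, not a tuned value."""
    return best_match(line, known_items)[1] >= threshold


if __name__ == "__main__":
    # A tiny illustrative list; in practice you would load a full item list.
    known = ["Am the life of the party.", "Get stressed out easily."]
    for candidate in ["am the life of the party", "Page 3 of 12"]:
        print(candidate, "->", looks_like_known_item(candidate, known))
```

Comparing every line against every item is slow for long lists, so a real solution would probably index the items first or use this signal only as one feature among several.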

Work done so far

We have a partially completed branch with an updated PDF model, which still needs work to finish: https://github.com/Notysoty/harmony/tree/updated_files_for_forntend

The training scripts are here: https://github.com/harmonydata/pdf-text-models-amol

You can try our non-spaCy branch by cloning it:

 git clone -b nospacy --recurse-submodules https://github.com/harmonydata/harmonyapi

Then, to ensure you have the correct Harmony library, check that the submodule is this one:

git clone -b updated_files_for_forntend https://github.com/Notysoty/harmony.git