pip install spicejack
Currently, SpiceJack only supports PDF files. Support for more file types is planned; create an issue to request one.
To use SpiceJack, first import the processor:
from spicejack.pdf import PDFprocessor
And then create a processor:
processor = PDFprocessor(
    filepath,
    filters,
    use_legitimate,
    model
)
filepath
Path of the PDF file.
filters
List of extra custom filters. See Custom Filters.
use_legitimate
Whether to use the official OpenAI API.
model
Model to use for generation.
Next, run the processor:
processor.run(
    thread,
    process,
    logging,
    autosave
)
thread
Whether to run the processor in a child thread.
process
Whether to run the processor in a child process.
logging
Whether to print the JSON responses from the LLM.
autosave
Whether to save the result to result.json every time a sentence is parsed.
processor.run also returns the result.
Now you can save the result to a file:
processor.save(
    jsonpath
)
jsonpath
Path of the JSON file to save the result to.
SpiceJack reads the PDF file, cleans the text with a few filters, and splits it into sentences. It then uses an LLM to convert the sentences into JSON questions and answers.
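The cleaning and splitting steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not SpiceJack's actual implementation: the helper names (`strip_whitespace`, `drop_empty`, `split_sentences`, `preprocess`) are hypothetical, the regex-based sentence splitter is a simple stand-in, and the LLM step is left out entirely.

```python
import re
from typing import Callable, List

# A filter is assumed to be a function from a list of strings to a list of strings.
Filter = Callable[[List[str]], List[str]]


def strip_whitespace(items: List[str]) -> List[str]:
    """Collapse runs of whitespace and trim each item."""
    return [" ".join(item.split()) for item in items]


def drop_empty(items: List[str]) -> List[str]:
    """Remove items that are empty after cleaning."""
    return [item for item in items if item]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(pages: List[str], filters: List[Filter]) -> List[str]:
    """Apply each filter in order, then split every page into sentences."""
    for apply_filter in filters:
        pages = apply_filter(pages)
    sentences: List[str] = []
    for page in pages:
        sentences.extend(split_sentences(page))
    return sentences


pages = ["Growth was 5 percent.  Costs fell!", "   ", "Is it done? Yes."]
sentences = preprocess(pages, [strip_whitespace, drop_empty])
print(sentences)
```

In the real tool, each resulting sentence would then be sent to the LLM to produce a question and answer.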
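You can also create custom filters. A filter is a function that takes a list of strings and returns a new list:
from spicejack.pdf import PDFprocessor

def filter1(items):
    # Replace " percent" with "%" in every item.
    # The parameter is not named "list" to avoid shadowing the built-in.
    return [
        i.replace(" percent", "%")
        for i in items
    ]

processor = PDFprocessor(
    filters=[filter1],
)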
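Since a filter like filter1 is an ordinary function, it can be checked on sample data before it is passed to PDFprocessor. The sketch below is a hedged illustration: `drop_page_numbers` and `apply_filters` are hypothetical helpers that are not part of SpiceJack, and applying filters in list order is an assumption about how the library behaves.

```python
def percent_to_symbol(items):
    # Same transformation as filter1 above.
    return [i.replace(" percent", "%") for i in items]


def drop_page_numbers(items):
    # Hypothetical filter: discard items that contain only digits.
    return [i for i in items if not i.strip().isdigit()]


def apply_filters(items, filters):
    # Assumption: filters run one after another, in list order.
    for f in filters:
        items = f(items)
    return items


sample = ["Sales rose 12 percent", " 7 ", "Costs fell 3 percent"]
cleaned = apply_filters(sample, [percent_to_symbol, drop_page_numbers])
print(cleaned)
```

Once the filters behave as expected on sample input, the same functions can be passed in the filters list.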