I put my PDFs on Reor #102
-
After importing the files, they all appear with the correct name.
-
Great @Arcovoltaire! So everything worked as expected, except for the slow indexing?
-
Result: From what I studied, the Xenova models do their processing in the browser, and from my tests, the larger the notes and the more of them there are, the slower everything is to process. Some Xenova models process more than others, and they understand and deliver context to the LLM in completely different ways depending on how they handle the language. For Portuguese, and for my type of material, the best one was Xenova/multilingual-e5-large. I tested about 10 models to get to it, watching what was processed both in the LM Studio terminal and in the left sidebar that shows which pieces of notes will be sent.

All this to say that since RAG is working well, I put 12 in RAG. As a notes program it became too heavy, and I can't even touch it. But for my purpose, which is extracting information from notes, I'm loving it.
-
I'm not a programmer, just someone who really wants to use the program. To use its full capacity, I needed to convert my database, which is also in PDF (1500 files), to md, with file names free of accents and special characters. After searching the internet for something that could do this in bulk, I decided to try a script from ChatGPT. It told me to install the PyMuPDF library.
It then wrote this script for me, and it worked: it converted the 1500 PDFs in one go, so I put the resulting files in the Reor folder and Reor imported them.
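import fitz  # PyMuPDF
import os
import unicodedata
import re

def normalize_filename(filename):
    # Normalize to NFD (canonical decomposition) so accents become separate
    # combining characters, which the regex below then removes.
    normalized = unicodedata.normalize('NFD', filename)
    # Keep only valid characters (letters, digits, spaces, and a few symbols)
    return re.sub(r'[^\w\s-]', '', normalized).replace(' ', '_')

def pdf_to_markdown(pdf_path, markdown_path):
    doc = fitz.open(pdf_path)
    markdown_content = ""
    # Assumed completion (the post is truncated here): append the plain
    # text of each page, then write the result as UTF-8.
    for page in doc:
        markdown_content += page.get_text() + "\n\n"
    doc.close()
    with open(markdown_path, "w", encoding="utf-8") as md_file:
        md_file.write(markdown_content)

# Path to the directory with the PDFs
input_directory = "./pdf"
output_directory = "./md"
os.makedirs(output_directory, exist_ok=True)  # assumed: create it if missing

for pdf_file in os.listdir(input_directory):
    if pdf_file.endswith(".pdf"):
        # Assumed completion (the loop body is also truncated in the post):
        # clean the name without its extension, then convert the file.
        stem = os.path.splitext(pdf_file)[0]
        markdown_path = os.path.join(output_directory, normalize_filename(stem) + ".md")
        pdf_to_markdown(os.path.join(input_directory, pdf_file), markdown_path)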
So far it has been indexing the files for 1 hour.
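For anyone who wants to try the renaming step by itself before converting anything, here is a minimal sketch of just the filename clean-up, with no PyMuPDF needed. It is not the exact script from the post: the helper name `markdown_name` and the sample file name are made up for illustration, and it assumes the extension is split off before cleaning, since the regex would otherwise also remove the dot in `.pdf`.

```python
import os
import re
import unicodedata


def normalize_filename(stem: str) -> str:
    """Return an accent-free, underscore-separated version of a name (no extension)."""
    # NFD splits an accented letter such as "é" into "e" plus a combining mark.
    decomposed = unicodedata.normalize("NFD", stem)
    # Drop the combining marks explicitly, leaving the base letters.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove anything that is not a word character, whitespace, or a hyphen.
    cleaned = re.sub(r"[^\w\s-]", "", no_accents)
    # Collapse each run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", cleaned.strip())


def markdown_name(pdf_file: str) -> str:
    """Hypothetical helper: map a PDF file name to its cleaned .md file name."""
    stem, _ext = os.path.splitext(pdf_file)
    return normalize_filename(stem) + ".md"


print(markdown_name("Relatório Anual (2023).pdf"))  # prints Relatorio_Anual_2023.md
```

Running this on a few real file names first is a cheap way to spot two PDFs that would end up with the same cleaned name, in which case one would overwrite the other during conversion.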