Hello @kaumanns! I think the most flexible way would be to make a small custom node and add it to the pipeline between the PreProcessor and the DocumentStore. Custom nodes are surprisingly easy to make! Let me make a quick example here below:

```python
from typing import Tuple, Dict, Any, Optional, List

from haystack import Pipeline, BaseComponent, Document, MultiLabel
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, TextConverter

#
# Your custom language classifier node
#
class MyLanguageDetectorNode(BaseComponent):
    outgoing_edges: int = 1

    def run(self,
        query: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        labels: Optional[MultiLabel] = None,
        documents: Optional[List[Document]] = None,
        meta: Optional[Any] = None,
    ) -> Tuple[Dict, str]:
        # Tag every passage with the language detected on its content
        for document in documents:
            document.meta["language"] = SomeLanguageClassifier(text=document.content)
        return {"documents": documents}, "output_1"

    def run_batch(self):
        raise NotImplementedError()  # Or you can implement it, if you want to use Pipeline.run_batch()

#
# An indexing pipeline using the node above
#
document_store = InMemoryDocumentStore()
preprocessor = PreProcessor()
converter = TextConverter()
language_detector = MyLanguageDetectorNode()

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="converter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=["converter"])
indexing_pipeline.add_node(component=language_detector, name="language_detector", inputs=["preprocessor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=["language_detector"])
```

The code should be almost runnable (I haven't run it, there might be minor typos or missing imports): just replace `SomeLanguageClassifier` with your actual language detection function.
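For the `SomeLanguageClassifier` placeholder, real projects usually reach for a dedicated library such as `langdetect` or fastText's language-identification model. Purely as an illustration of the expected interface (a callable taking `text` and returning a language code), here is a hypothetical, dependency-free toy stand-in based on stopword counting; the function name, word lists, and fallback value are my own choices, not part of Haystack:

```python
# Toy stand-in for SomeLanguageClassifier: a stopword-counting heuristic.
# For real use, prefer a proper detector such as langdetect or fastText.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a"},
    "de": {"der", "die", "das", "und", "ist", "zu", "ein"},
    "fr": {"le", "la", "les", "et", "est", "de", "un"},
}

def SomeLanguageClassifier(text: str) -> str:
    """Return the language code whose stopwords appear most often in `text`."""
    tokens = text.lower().split()
    # Count how many tokens match each language's stopword set
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "unknown" if no stopword matched at all
    return best if scores[best] > 0 else "unknown"
```

Anything with this signature will slot into the `run()` method above; swapping in a real library later only changes this one function.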
SOLVED. See the solution above.

Hi,
my original documents are multilingual and are split into passages. What is the best way to plug language detection in between the PreProcessor and the DocumentStore, so that each passage gets a language flag as metadata? Do I have to subclass PreProcessor?
Thanks!