Hello @kaumanns! I think the most flexible way would be to make a small custom node and add it to the pipeline between the PreProcessor and the DocumentStore. Custom nodes are surprisingly easy to make! Let me make a quick example here below:

```python
from typing import Tuple, Dict, Any, Optional, List

from haystack import Pipeline, BaseComponent, Document, MultiLabel
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, TextConverter

#
# Your custom language classifier node
#
class MyLanguageDetectorNode(BaseComponent):
    outgoing_edges: int = 1

    def run(self,
        query: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        labels: Optional[MultiLabel] = None,
        documents: Optional[List[Document]] = None,
        meta: Optional[Any] = None,
    ) -> Tuple[Dict, str]:
        # Tag every passage with the language detected on its content
        for document in documents:
            document.meta["language"] = SomeLanguageClassifier(text=document.content)
        return {"documents": documents}, "output_1"

    def run_batch(self):
        raise NotImplementedError()  # Or you can implement it, if you want to use Pipeline.run_batch()

#
# An indexing pipeline using the node above
#
document_store = InMemoryDocumentStore()
preprocessor = PreProcessor()
converter = TextConverter()
language_detector = MyLanguageDetectorNode()

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="converter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=["converter"])
indexing_pipeline.add_node(component=language_detector, name="language_detector", inputs=["preprocessor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=["language_detector"])
```

The code should be almost runnable (I haven't run it, there might be minor typos or missing imports): just replace `SomeLanguageClassifier` with your actual language detection function.
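For the `SomeLanguageClassifier` placeholder, real projects usually reach for a dedicated library such as `langdetect` or fastText's language-identification model. Purely as an illustration of the expected interface (a callable taking `text` and returning a language code), here is a hypothetical, dependency-free toy stand-in based on stopword counting; the function name, word lists, and fallback value are my own choices, not part of Haystack:

```python
# Toy stand-in for SomeLanguageClassifier: a stopword-counting heuristic.
# For real use, prefer a proper detector such as langdetect or fastText.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a"},
    "de": {"der", "die", "das", "und", "ist", "zu", "ein"},
    "fr": {"le", "la", "les", "et", "est", "de", "un"},
}

def SomeLanguageClassifier(text: str) -> str:
    """Return the language code whose stopwords appear most often in `text`."""
    tokens = text.lower().split()
    # Count how many tokens match each language's stopword set
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "unknown" if no stopword matched at all
    return best if scores[best] > 0 else "unknown"
```

Anything with this signature will slot into the `run()` method above; swapping in a real library later only changes this one function.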
SOLVED. See the solution above.

Hi,
my original documents are multilingual and are split into passages. What is the best way to plug language detection in between the PreProcessor and the DocumentStore, so that each passage gets a language flag as metadata? Do I have to subclass PreProcessor?
Thanks!