Problem Description
When a set of documents passes through a component such as DocumentSplitter, especially in a pipeline, the component's parameters (split_by, split_length, etc.) are applied identically to every document. That does not always fit; in my use case, different documents need different settings.
Suggested Solution
The suggestion is to let the document's meta properties carry dynamic, per-document parameters. If a document's meta contains a key with the same name as a component parameter (e.g. "split_by"), that value is used for that document instead of the component-level value. Since components already read each document's "content" field while processing, the same mechanism could be extended to the meta fields when they exist.
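To make the idea concrete, here is a minimal, library-free sketch of the lookup I have in mind. `Doc`, `MetaAwareSplitter`, and the `SEPARATORS` table are hypothetical names invented for this illustration; they are not the actual DocumentSplitter implementation, and only the parameter names split_by and split_length come from the real component.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical stand-in for a pipeline document: text plus free-form metadata.
@dataclass
class Doc:
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)

# Separator used for each split_by value in this sketch (an assumption,
# not the library's own table).
SEPARATORS = {"word": " ", "passage": "\n\n"}

class MetaAwareSplitter:
    """Toy splitter whose parameters can be overridden per document via meta."""

    def __init__(self, split_by: str = "word", split_length: int = 200):
        # Component-level values act as defaults for every document.
        self.defaults = {"split_by": split_by, "split_length": split_length}

    def _params_for(self, doc: Doc) -> Dict[str, Any]:
        # A meta key with the same name as a component parameter wins;
        # otherwise the component-level default is used.
        return {name: doc.meta.get(name, default)
                for name, default in self.defaults.items()}

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        chunks: List[Doc] = []
        for doc in documents:
            params = self._params_for(doc)
            sep = SEPARATORS[params["split_by"]]
            units = [u for u in doc.content.split(sep) if u]
            size = params["split_length"]
            # Group consecutive units into chunks of at most `size` units.
            for start in range(0, len(units), size):
                text = sep.join(units[start:start + size])
                chunks.append(Doc(content=text, meta=dict(doc.meta)))
        return {"documents": chunks}
```

With this, a single run over a mixed batch would split a document with no overrides using the component defaults, while a document whose meta is `{"split_by": "passage", "split_length": 1}` would be split passage by passage.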
Current Alternative Solution
My use case is a typical RAG pipeline. Because each file may need a different splitting strategy, I am constrained to treat each document, together with its pre- and post-processing, as its own batch and loop through the documents. So instead of one pipeline run over a batch of documents, I have a batch of pipeline runs with one document each.
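Roughly, the workaround looks like the sketch below. It is simplified and library-free: `build_pipeline` is a hypothetical stand-in for constructing a full pipeline with fixed splitter parameters, and the input dictionaries are an assumed shape for my files, not a real API.

```python
from typing import Callable, Dict, List

# Separator used for each split_by value in this sketch (an assumption).
SEPARATORS = {"word": " ", "passage": "\n\n"}

def build_pipeline(split_by: str, split_length: int) -> Callable[[List[str]], List[str]]:
    """Hypothetical factory: returns a 'pipeline' with fixed splitter parameters."""
    sep = SEPARATORS[split_by]

    def run(texts: List[str]) -> List[str]:
        chunks: List[str] = []
        for text in texts:
            units = [u for u in text.split(sep) if u]
            for start in range(0, len(units), split_length):
                chunks.append(sep.join(units[start:start + split_length]))
        return chunks

    return run

def process_one_by_one(files: List[Dict]) -> List[str]:
    """One pipeline per document, because parameters are fixed per pipeline."""
    chunks: List[str] = []
    for f in files:
        # A new pipeline is built for every file just to change the parameters.
        pipeline = build_pipeline(f["split_by"], f["split_length"])
        chunks.extend(pipeline([f["content"]]))  # a "batch" of exactly one
    return chunks
```

Every file pays the cost of building its own pipeline, and the batch interface is only ever called with a single document.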
Additional context
Some data scientists told me that the choice of splitting strategy depends on the contents of the document, and that in their opinion this is standard practice.
Thanks.