Releases: deepset-ai/haystack
v2.8.0-rc1
Release Notes
⬆️ Upgrade Notes
- Removed the deprecated `is_greedy` argument from the `@component` decorator. Change the `Variadic` input of your Component to `GreedyVariadic` instead.
🚀 New Features
- We've added a new `DALLEImageGenerator` component, bringing image generation with OpenAI's DALL-E to Haystack. It's easy to use: just a few lines of code to get started:

  ```python
  from haystack.components.generators import DALLEImageGenerator

  image_generator = DALLEImageGenerator()
  response = image_generator.run("Show me a picture of a black cat.")
  print(response)
  ```
- Add warning logs to the `PDFMinerToDocument` and `PyPDFToDocument` to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the `DocumentSplitter` that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
- We have added a new `MetaFieldGroupingRanker` component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
- Added a new `store_full_path` parameter to the `__init__` methods of the following converters: `JSONConverter`, `CSVToDocument`, `DOCXToDocument`, `HTMLToDocument`, `MarkdownToDocument`, `PDFMinerToDocument`, `PPTXToDocument`, `TikaDocumentConverter` and `TextFileToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- When making function calls via `OpenAPI`, allow both switching SSL verification off and specifying a certificate authority to use for it.
- Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option for the `required_variables` parameter of `PromptBuilder` and `ChatPromptBuilder`. By passing `required_variables="*"` you can automatically set all variables in the prompt to be required, as shown in the sketch below.
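A minimal sketch of the new option; the template text and variable names here are illustrative:

```python
from haystack.components.builders import PromptBuilder

# required_variables="*" marks every variable in the template as required,
# so running the builder without one of them raises an error.
builder = PromptBuilder(
    template="Answer the question {{ question }} using {{ documents }}.",
    required_variables="*",
)
result = builder.run(question="What is Haystack?", documents="...")
print(result["prompt"])
```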
⚡️ Enhancement Notes
- Added the Maximum Margin Relevance (MMR) strategy to the `SentenceTransformersDiversityRanker`. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
- Introduces optional parameters in the `ConditionalRouter` component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters (see the sketch after this list).
- Added split by line to `DocumentSplitter`, which splits the document at `\n` characters.
- Changed `OpenAIDocumentEmbedder` to keep running if a batch fails embedding. Now if OpenAI returns an error, we log that error and keep processing the following batches.
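For the `ConditionalRouter` change above, a minimal sketch of fallback routing; the route definitions, the `path` variable, and the `optional_variables` usage shown here are illustrative assumptions:

```python
from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": '{{ path == "rag" }}',
        "output": "{{ question }}",
        "output_name": "rag_route",
        "output_type": str,
    },
    {
        "condition": "{{ True }}",  # fallback route
        "output": "{{ question }}",
        "output_name": "default_route",
        "output_type": str,
    },
]

# Declaring "path" optional lets the router run even when it is not provided;
# the always-true fallback condition then selects the default route.
router = ConditionalRouter(routes, optional_variables=["path"])
result = router.run(question="What is Haystack?")
print(result)  # {"default_route": "What is Haystack?"}
```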
⚠️ Deprecation Notes
- The default value of the `store_full_path` parameter will change to `False` in Haystack 2.9.0 to enhance privacy.
🐛 Bug Fixes
- Fix `DocumentCleaner` not preserving all `Document` fields when run.
- Fix `DocumentJoiner` failing when run with an empty list of Documents.
- For the `NLTKDocumentSplitter` we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap, i.e. we want to avoid cases of `Doc1 = [s1, s2], Doc2 = [s1, s2, s3]`.
- Finished adding function support for this component by updating the `_split_into_units` function and adding the `splitting_function` init parameter.
- Add a specific `to_dict` method to overwrite the underlying one from `DocumentSplitter`. This is needed to properly save the settings of the component to YAML.
- Fix `OpenAIChatGenerator` and `OpenAIGenerator` crashing when using a `streaming_callback` and `generation_kwargs` contain `{"stream_options": {"include_usage": True}}` (see the sketch after this list).
- Fix tracing a `Pipeline` with cycles to correctly track component execution.
- When `meta` is passed into `AnswerBuilder.run()`, it is now merged into the `GeneratedAnswer` meta.
- Fix `DocumentSplitter` to handle a custom `splitting_function` without requiring `split_length`. Previously the `splitting_function` provided would not override other settings.
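For the streaming fix above, a minimal sketch of the configuration that previously crashed (assumes `OPENAI_API_KEY` is set in the environment; the prompt is illustrative):

```python
from haystack.components.generators import OpenAIGenerator

# Streaming together with stream_options no longer crashes; with
# include_usage=True, token usage is reported in the final chunk.
generator = OpenAIGenerator(
    model="gpt-4o-mini",
    streaming_callback=lambda chunk: print(chunk.content, end=""),
    generation_kwargs={"stream_options": {"include_usage": True}},
)
result = generator.run(prompt="Summarize Haystack in one sentence.")
```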
v2.7.0
Release Notes
✨ Highlights
🚅 Rework Pipeline.run() logic to better handle cycles
The internal logic of `Pipeline.run()` has been heavily reworked to be more robust and reliable than before. This new implementation makes it easier to run `Pipeline`s that have cycles in their graph. It also fixes some corner cases in `Pipeline`s that don't have any cycles.
📝 Introduce LoggingTracer
With the new `LoggingTracer`, users can inspect the logs in real time to see everything that is happening in their Pipelines. This feature aims to improve the user experience during experimentation and prototyping.
```python
import logging

from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer())
```
⬆️ Upgrade Notes
- Removed `Pipeline` init argument `debug_path`. We do not support this anymore.
- Removed `Pipeline` init argument `max_loops_allowed`. Use `max_runs_per_component` instead.
- Removed the `PipelineMaxLoops` exception. Use `PipelineMaxComponentRuns` instead.
- The deprecated default converter class `haystack.components.converters.pypdf.DefaultConverter` used by `PyPDFToDocument` has been removed.

  Pipeline YAMLs from `haystack<2.7.0` that use the default converter must be updated in the following manner:

  ```yaml
  # Old
  components:
    Comp1:
      init_parameters:
        converter:
          type: haystack.components.converters.pypdf.DefaultConverter
      type: haystack.components.converters.pypdf.PyPDFToDocument

  # New
  components:
    Comp1:
      init_parameters:
        converter: null
      type: haystack.components.converters.pdf.PDFToTextConverter
  ```

  Pipeline YAMLs from `haystack<2.7.0` that use custom converter classes can be upgraded by simply loading them with `haystack==2.6.x` and saving them to YAML again.
- `Pipeline.connect()` will now raise a `PipelineConnectError` if `sender` and `receiver` are the same Component. We do not support this use case anymore.
🚀 New Features
- Added component `StringJoiner` to join strings from different components into a list of strings.
- Improved serialization/deserialization errors to provide extra context about the offending components when possible.
- Enhanced the DOCX converter to support table extraction in addition to paragraph content. The converter supports both CSV and Markdown table formats, providing flexible options for representing tabular data extracted from DOCX documents.
- Added a new parameter `additional_mimetypes` to the `FileTypeRouter` component. This allows users to specify additional MIME type mappings, ensuring correct file classification across different runtime environments and Python versions (see the sketch after this list).
- Introduced a `LoggingTracer` that sends all traces to the logs. It can be enabled as follows:

  ```python
  import logging

  from haystack import tracing
  from haystack.tracing.logging_tracer import LoggingTracer

  logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
  logging.getLogger("haystack").setLevel(logging.DEBUG)

  tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
  tracing.enable_tracing(LoggingTracer())
  ```
- Fundamentally reworked the internal logic of `Pipeline.run()`. The rework makes it more reliable and covers more use cases. We fixed some issues that made `Pipeline`s with cycles unpredictable, with unclear component execution order.
- Each tracing span of a component run is now attached to the pipeline run span object. This allows users to trace the execution of multiple pipeline runs concurrently.
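For the `additional_mimetypes` feature above, a minimal sketch, assuming you need to map the .docx MIME type on a system whose `mimetypes` database lacks it (the file name is illustrative):

```python
from haystack.components.routers import FileTypeRouter

docx_mime = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Register the MIME type -> extension mapping so .docx files are classified
# correctly even on minimal environments such as AWS Lambda.
router = FileTypeRouter(
    mime_types=[docx_mime],
    additional_mimetypes={docx_mime: ".docx"},
)
result = router.run(sources=["report.docx"])
```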
⚡️ Enhancement Notes
- Add a `streaming_callback` run parameter to `HuggingFaceAPIGenerator` and `HuggingFaceLocalGenerator` to allow users to pass a callback function that will be called after each chunk of the response is generated.
- The `SentenceWindowRetriever` now supports the `window_size` parameter at run time, overwriting the value set in the constructor.
- Add output type validation in `ConditionalRouter`. Setting `validate_output_type` to `True` enables a check that verifies whether the actual output of a route matches the declared type. If it doesn't match, a `ValueError` is raised (see the sketch after this list).
- Reduced `numpy` usage to speed up imports.
- Improved file type detection in `FileTypeRouter`, particularly for Microsoft Office file formats like .docx and .pptx. This enhancement ensures more consistent behavior across different environments, including AWS Lambda functions and systems without pre-installed office suites.
- The `FileTypeRouter` now supports passing metadata (`meta`) in the `run` method. When metadata is provided, the sources are internally converted to `ByteStream` objects and the metadata is added. This new parameter simplifies working with preprocessing/indexing pipelines.
- `SentenceTransformersDocumentEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- `SentenceTransformersTextEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- Previously, `numpy` was pinned to `<2.0` to avoid compatibility issues in several core integrations. This pin has been removed, and Haystack can work with both `numpy` `1.x` and `2.x`. If necessary, we will pin the `numpy` version in specific core integrations that require it.
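For the `validate_output_type` check above, a minimal sketch with illustrative route definitions:

```python
from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{ streams|length > 2 }}",
        "output": "{{ streams }}",
        "output_name": "enough_streams",
        "output_type": list[int],
    },
    {
        "condition": "{{ streams|length <= 2 }}",
        "output": "{{ streams }}",
        "output_name": "insufficient_streams",
        "output_type": list[int],
    },
]

# With validation enabled, a route whose actual output does not match its
# declared output_type raises a ValueError instead of passing silently.
router = ConditionalRouter(routes=routes, validate_output_type=True)
print(router.run(streams=[1, 2, 3]))  # {'enough_streams': [1, 2, 3]}
```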
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
🐛 Bug Fixes
- Serialized data of components are now explicitly enforced to be one of the following basic Python datatypes: `str`, `int`, `float`, `bool`, `list`, `dict`, `set`, `tuple` or `None`.
- Addressed an issue where certain file types (e.g., .docx, .pptx) were incorrectly classified as 'unclassified' in environments with limited MIME type definitions, such as AWS Lambda functions.
- Fixes logs containing JSON data getting lost due to string interpolation.
- Use forward references for Hugging Face Hub types in the `HuggingFaceAPIGenerator` component to prevent import errors.
- Fix the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Revert a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v1.26.4
Release Notes
v1.26.4
⚡️ Enhancement Notes
- Upgrade the `transformers` dependency requirement to `transformers>=4.46,<5.0`.
- Updated the `tokenizer.json` URL for Anthropic models as the old URL was no longer available.
v2.7.0-rc1
Release Notes
✨ Highlights
🚅 Rework Pipeline.run() logic to better handle cycles
The internal logic of `Pipeline.run()` has been heavily reworked to be more robust and reliable than before. This new implementation makes it easier to run `Pipeline`s that have cycles in their graph. It also fixes some corner cases in `Pipeline`s that don't have any cycles.
📝 Introduce LoggingTracer
With the new `LoggingTracer`, users can inspect the logs in real time to see everything that is happening in their Pipelines. This feature aims to improve the user experience during experimentation and prototyping.
⬆️ Upgrade Notes
- Removed `Pipeline` init argument `debug_path`. We do not support this anymore.
- Removed `Pipeline` init argument `max_loops_allowed`. Use `max_runs_per_component` instead.
- Removed the `PipelineMaxLoops` exception. Use `PipelineMaxComponentRuns` instead.
- The deprecated default converter class `haystack.components.converters.pypdf.DefaultConverter` used by `PyPDFToDocument` has been removed.

  Pipeline YAMLs from `haystack<2.7.0` that use the default converter must be updated in the following manner:

  ```yaml
  # Old
  components:
    Comp1:
      init_parameters:
        converter:
          type: haystack.components.converters.pypdf.DefaultConverter
      type: haystack.components.converters.pypdf.PyPDFToDocument

  # New
  components:
    Comp1:
      init_parameters:
        converter: null
      type: haystack.components.converters.pdf.PDFToTextConverter
  ```

  Pipeline YAMLs from `haystack<2.7.0` that use custom converter classes can be upgraded by simply loading them with `haystack==2.6.x` and saving them to YAML again.
- `Pipeline.connect()` will now raise a `PipelineConnectError` if `sender` and `receiver` are the same Component. We do not support this use case anymore.
🚀 New Features
- Added component `StringJoiner` to join strings from different components into a list of strings.
- Improved serialization/deserialization errors to provide extra context about the offending components when possible.
- Enhanced the DOCX converter to support table extraction in addition to paragraph content. The converter supports both CSV and Markdown table formats, providing flexible options for representing tabular data extracted from DOCX documents.
- Added a new parameter `additional_mimetypes` to the `FileTypeRouter` component. This allows users to specify additional MIME type mappings, ensuring correct file classification across different runtime environments and Python versions.
- Introduced a `LoggingTracer` that sends all traces to the logs. It can be enabled as follows:

  ```python
  import logging

  from haystack import tracing
  from haystack.tracing.logging_tracer import LoggingTracer

  logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
  logging.getLogger("haystack").setLevel(logging.DEBUG)

  tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
  tracing.enable_tracing(LoggingTracer())
  ```
- Fundamentally reworked the internal logic of `Pipeline.run()`. The rework makes it more reliable and covers more use cases. We fixed some issues that made `Pipeline`s with cycles unpredictable, with unclear component execution order.
- Each tracing span of a component run is now attached to the pipeline run span object. This allows users to trace the execution of multiple pipeline runs concurrently.
⚡️ Enhancement Notes
- Add a `streaming_callback` run parameter to `HuggingFaceAPIGenerator` and `HuggingFaceLocalGenerator` to allow users to pass a callback function that will be called after each chunk of the response is generated.
- The `SentenceWindowRetriever` now supports the `window_size` parameter at run time, overwriting the value set in the constructor.
- Add output type validation in `ConditionalRouter`. Setting `validate_output_type` to `True` enables a check that verifies whether the actual output of a route matches the declared type. If it doesn't match, a `ValueError` is raised.
- Reduced `numpy` usage to speed up imports.
- Improved file type detection in `FileTypeRouter`, particularly for Microsoft Office file formats like .docx and .pptx. This enhancement ensures more consistent behavior across different environments, including AWS Lambda functions and systems without pre-installed office suites.
- The `FileTypeRouter` now supports passing metadata (`meta`) in the `run` method. When metadata is provided, the sources are internally converted to `ByteStream` objects and the metadata is added. This new parameter simplifies working with preprocessing/indexing pipelines.
- `SentenceTransformersDocumentEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- `SentenceTransformersTextEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- Previously, `numpy` was pinned to `<2.0` to avoid compatibility issues in several core integrations. This pin has been removed, and Haystack can work with both `numpy` `1.x` and `2.x`. If necessary, we will pin the `numpy` version in specific core integrations that require it.
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
🐛 Bug Fixes
- Serialized data of components are now explicitly enforced to be one of the following basic Python datatypes: `str`, `int`, `float`, `bool`, `list`, `dict`, `set`, `tuple` or `None`.
- Addressed an issue where certain file types (e.g., .docx, .pptx) were incorrectly classified as 'unclassified' in environments with limited MIME type definitions, such as AWS Lambda functions.
- Fixes logs containing JSON data getting lost due to string interpolation.
- Use forward references for Hugging Face Hub types in the `HuggingFaceAPIGenerator` component to prevent import errors.
- Fix the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Revert a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v2.6.1
Release Notes
v2.6.1
🐛 Bug Fixes
- Revert a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v2.6.1-rc1
Release Notes
v2.6.1-rc1
🐛 Bug Fixes
- Revert a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v2.6.0
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- Support for the legacy filter syntax and operators (e.g., "$and", "$or", "$eq", "$lt", etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax. See the docs for more details.
🚀 New Features
- Added a new component `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground-truth relevant documents and the order in which they are retrieved is important.
- Add a new `CSVToDocument` component. It loads the file as a bytes object and adds the loaded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Adds support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialise `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function` (see the sketch after this list).
- Add a new `JSONConverter` component to convert JSON files to Documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts:
```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)
results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updates `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
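For the custom splitting function above, a minimal sketch using a hypothetical `---` delimiter as the split criterion:

```python
from haystack.components.preprocessors import DocumentSplitter
from haystack.dataclasses import Document

def split_on_delimiter(text: str) -> list[str]:
    # Illustrative splitter: each "---"-separated section becomes one unit.
    return text.split("---")

splitter = DocumentSplitter(split_by="function", splitting_function=split_on_delimiter)
result = splitter.run(documents=[Document(content="part one---part two")])
print([doc.content for doc in result["documents"]])  # ['part one', 'part two']
```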
⚡️ Enhancement Notes
- Adapts how `ChatPromptBuilder` creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.
- Expose `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Add an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Add the ability to include the current date inside a template in `PromptBuilder` using the following syntax:
  - `{% now 'UTC' %}`: Get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: Add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: Subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: Display only the number of hours after adding two hours to the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: Change the date format to AM/PM for the GMT-4 timezone.

  Note that if no date format is provided, the default will be `%Y-%m-%d %H:%M:%S`. Please refer to the tz database for a list of timezones.
- Adds a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Add a new `GreedyVariadic` input type. It behaves similarly to the `Variadic` input type, as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument in the `@component` decorator. If you had a Component with a `Variadic` input type and `@component(is_greedy=True)`, you need to change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component` (see the sketch after this list).
- Add a new Pipeline init argument `max_runs_per_component`; it has the same behaviour as the existing `max_loops_allowed` argument but is more descriptive of its actual effects.
- Add a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- We added batching at inference time to the `TransformersSimilarityRanker` to help prevent OOMs when ranking large amounts of Documents.
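For the `GreedyVariadic` migration above, a minimal before/after sketch (the component itself is illustrative):

```python
from haystack import component
from haystack.core.component.types import GreedyVariadic

# Before (deprecated):
#     @component(is_greedy=True)
#     class SumJoiner:
#         @component.output_types(total=int)
#         def run(self, numbers: Variadic[int]):
#             return {"total": sum(numbers)}

# After: the greediness is expressed by the input type itself.
@component
class SumJoiner:
    @component.output_types(total=int)
    def run(self, numbers: GreedyVariadic[int]):
        return {"total": sum(numbers)}
```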
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a Component to itself when calling `Pipeline.connect()` is deprecated; it will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Fix the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Add constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not contain a variadic keyword argument.
- Prevent `set_output_types` from being called when the `output_types` decorator is used.
- Update the `CHAT_WITH_WEBSITE` Pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
- Fix the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this was causing an `AttributeError`.
- Make the `from_dict` method of `PyPDFToDocument` more robust to cases when the converter is not provided in the dictionary.
v2.6.0-rc3
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- Support for the legacy filter syntax and operators (e.g., "$and", "$or", "$eq", "$lt", etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax. See the docs for more details.
🚀 New Features
- Added a new component `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground-truth relevant documents and the order in which they are retrieved is important.
- Add a new `CSVToDocument` component. It loads the file as a bytes object and adds the loaded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Adds support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialise `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function`.
- Add a new `JSONConverter` component to convert JSON files to Documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts:
```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)
results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updates `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
⚡️ Enhancement Notes
- Adapts how `ChatPromptBuilder` creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.
- Expose `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Add an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Add the ability to include the current date inside a template in `PromptBuilder` using the following syntax:
  - `{% now 'UTC' %}`: Get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: Add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: Subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: Display only the number of hours after adding two hours to the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: Change the date format to AM/PM for the GMT-4 timezone.

  Note that if no date format is provided, the default will be `%Y-%m-%d %H:%M:%S`. Please refer to the tz database for a list of timezones.
- Adds a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Add a new `GreedyVariadic` input type. It behaves similarly to the `Variadic` input type, as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument in the `@component` decorator. If you had a Component with a `Variadic` input type and `@component(is_greedy=True)`, you need to change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Add a new Pipeline init argument `max_runs_per_component`; it has the same behaviour as the existing `max_loops_allowed` argument but is more descriptive of its actual effects.
- Add a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- We added batching at inference time to the `TransformersSimilarityRanker` to help prevent OOMs when ranking large amounts of Documents.
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a Component to itself when calling `Pipeline.connect()` is deprecated; it will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Fix the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Add constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not contain a variadic keyword argument.
- Prevent `set_output_types` from being called when the `output_types` decorator is used.
- Update the `CHAT_WITH_WEBSITE` Pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
- Fix the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this was causing an `AttributeError`.
- Make the `from_dict` method of `PyPDFToDocument` more robust to cases when the converter is not provided in the dictionary.
v2.6.0-rc2
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- The legacy filter syntax support has been completely removed. Users need to use the new filter syntax. See the docs for more details.
🚀 New Features
- Add a new `CSVToDocument` component. It loads the file as a bytes object and adds the loaded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Adds support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialise `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function`.
- Add a new `JSONConverter` component to convert JSON files to Documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts:
```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)
results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updates `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
⚡️ Enhancement Notes
- Adapts how `ChatPromptBuilder` creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.
- Expose `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Add an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Add the ability to include the current date inside a template in `PromptBuilder` using the following syntax:
  - `{% now 'UTC' %}`: Get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: Add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: Subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: Display only the number of hours after adding two hours to the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: Change the date format to AM/PM for the GMT-4 timezone.

  Note that if no date format is provided, the default will be `%Y-%m-%d %H:%M:%S`. Please refer to the tz database for a list of timezones.
- Adds a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Add a new `GreedyVariadic` input type. It behaves similarly to the `Variadic` input type, as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument in the `@component` decorator. If you had a Component with a `Variadic` input type and `@component(is_greedy=True)`, you need to change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Add a new Pipeline init argument `max_runs_per_component`; it has the same behaviour as the existing `max_loops_allowed` argument but is more descriptive of its actual effects.
- Add a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- We added batching at inference time to the `TransformersSimilarityRanker` to help prevent OOMs when ranking large amounts of Documents.
⚠️ Deprecation Notes
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a Component to itself when calling `Pipeline.connect()` is deprecated; it will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Add constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not contain a variadic keyword argument.
- Prevent `set_output_types` from being called when the `output_types` decorator is used.
- Update the `CHAT_WITH_WEBSITE` Pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
- Fix the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this was causing an `AttributeError`.
- Make the `from_dict` method of `PyPDFToDocument` more robust to cases when the converter is not provided in the dictionary.
v2.5.1
Release Notes
⚡️ Enhancement Notes
- Add a `default_headers` init argument to `AzureOpenAIGenerator` and `AzureOpenAIChatGenerator` (see the sketch below).
🐛 Bug Fixes
- Fix the Pipeline visualization issue due to changes in the new release of Mermaid.
- Fix `Pipeline` not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this was causing an `AttributeError`.