Replace pickle with safer alternatives (#13067)
* Update slack release notification step

* [ENG-1424] Use `pickle` alternatives (#1453)

* use json.dump and json.load in count_vectors_featurizer and lexical_syntactic_featurizer instead of pickle

* update load and persist in sklearn intent classifier

* update persist and load in dietclassifier

* update load and persist in sklearn intent classifier

* use json.dump and json.load in tracker featurizers

* update persist and load of TEDPolicy

* update persist and load of model utilities in the unexpected intent policy

* save and load fake features

* rename patterns.pkl to patterns.json

* update poetry.lock

* ruff formatting

* move skops import

* add comments

* clean up save_features and load_features

* WIP: update model data saving and loading

* add tests for save and load features

* update tests for test_tracker_featurizer

* update tests for test_tracker_featurizer

* WIP: serialization of feature arrays.

* update serialization and deserialization for feature array

* remove not needed tests/utils/tensorflow/test_model_data_storage.py

* start writing tests for feature array

* update feature array tests

* update tests

* fix linting

* add changelog

* add new dependencies to .github/dependabot.yml

* fix some tests

* fix loading and saving of unexpected intent ted policy

* fix linting issue

* fix converting of features in cvf and lsf

* fix lint issues

* convert vocab in cvf

* fix linting

* update crf entity extractor

* fix to_dict of crf_token

* addressed type issues

* ruff formatting

* fix typing and lint issues

* remove cloudpickle dependency

* update logistic_regression_classifier and remove joblib as dependency

* update formatting of pyproject.toml

* next try: update formatting of pyproject.toml

* update logging

* update poetry.lock

* refactor loading of lexical_syntactic_featurizer

* rename FeatureMetadata.type -> FeatureMetadata.data_type

* clean up tests test_features.py and test_crf_entity_extractor.py

* update test_feature_array.py

* check for type when loading tracker featurizer.

* update changelog

* fix line too long

* move import of skops

* Prepared release of version 3.10.9.dev1 (#1496)

* prepared release of version 3.10.9.dev1

* update minimum model version

* Check for 'step_id' and 'active_flow' keys in the metadata when adding 'ActionExecuted' event to flows paths stack.

* fix parsing of commands

* improve logging

* formatting

* add changelog

* fix parse commands for multi step

* [ATO-2985] - Windows model loading test (#1537)

* Add test for model loading on windows

* Improve the error message logged when handling the user message

* Add a changelog

* Fix Code Quality - line too long

* Rasa-sdk-update (#1546)

* all rasa-sdk micro updates

* update poetry lock

* update rasa-sdk in lock file

* Remove trailing white space

* Prepared release of version 3.10.11 (#1570)

* prepared release of version 3.10.11

* add comments again in pyproject.toml

* update poetry.lock

* revert changes in github workflows

* undo changes in pyproject.toml

* update changelog

* revert changes in github workflows

* update poetry.lock

* update poetry.lock

* update pyproject.toml

* update poetry.lock

* update setuptools = '>=65.5.1,<75.6.0'

* update setuptools = '~75.3.0'

* reformat code

* undo deleting of ping_slack_about_package_release.sh

* fix formatting and type issues

* downgrade setuptools to 70.3.0

* fixing logging issues (?)

---------

Co-authored-by: sancharigr <[email protected]>
tabergma and sanchariGr authored Jan 10, 2025
1 parent 66296b2 commit 2bb1d77
Showing 27 changed files with 1,942 additions and 616 deletions.
1 change: 0 additions & 1 deletion .github/workflows/continous-integration.yml
Original file line number Diff line number Diff line change
@@ -1307,7 +1307,6 @@ jobs:
with:
args: "💥 New *Rasa Open Source * version `${{ github.ref_name }}` has been released!"


send_slack_notification_for_release_on_failure:
name: Notify Slack & Publish Release Notes
runs-on: ubuntu-24.04
19 changes: 19 additions & 0 deletions changelog/1424.bugfix.md
@@ -0,0 +1,19 @@
Replace `pickle` and `joblib` with safer alternatives, e.g. `json`, `safetensors`, and `skops`, for
serializing components.

**Note**: This is a model-breaking change. Please retrain your model.

If you have a custom component that inherits from one of the components listed below and modified the `persist` or
`load` method, make sure to update your code. Please contact us if you encounter any problems.

Affected components:

- `CountVectorsFeaturizer`
- `LexicalSyntacticFeaturizer`
- `LogisticRegressionClassifier`
- `SklearnIntentClassifier`
- `DIETClassifier`
- `CRFEntityExtractor`
- `TrackerFeaturizer`
- `TEDPolicy`
- `UnexpectedIntentTEDPolicy`
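The core of the change above is replacing `pickle`-based persistence with plain JSON. A minimal sketch of the pattern, using a hypothetical featurizer vocabulary (the function names and file layout are illustrative, not Rasa's actual code):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def persist_vocabulary(vocab: dict, path: Path) -> None:
    # pickle.load can execute arbitrary code embedded in the file;
    # json only round-trips plain data, so loading it is safe.
    path.write_text(json.dumps(vocab), encoding="utf-8")


def load_vocabulary(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


with TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "vocabulary.json"
    vocab = {"hello": 0, "world": 1}
    persist_vocabulary(vocab, vocab_file)
    restored = load_vocabulary(vocab_file)

assert restored == vocab
```

The trade-off is that only JSON-serializable data (dicts, lists, strings, numbers) round-trips directly; anything richer, such as NumPy arrays or fitted estimators, is handled by `safetensors` and `skops` instead.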
454 changes: 342 additions & 112 deletions poetry.lock

(Large diff not rendered.)

9 changes: 5 additions & 4 deletions pyproject.toml
@@ -120,7 +120,6 @@ sanic-cors = "~2.0.0"
sanic-jwt = "^1.6.0"
sanic-routing = "^0.7.2"
websockets = ">=10.0,<11.0"
cloudpickle = ">=1.2,<2.3"
aiohttp = ">=3.9.0,<3.10"
questionary = ">=1.5.1,<1.11.0"
prompt-toolkit = "^3.0,<3.0.29"
@@ -133,10 +132,9 @@ psycopg2-binary = ">=2.8.2,<2.10.0"
python-dateutil = "~2.8"
protobuf = ">=4.23.3,< 4.23.4"
tensorflow_hub = "^0.13.0"
setuptools = ">=65.5.1"
setuptools = "~70.3.0"
ujson = ">=1.35,<6.0"
regex = ">=2020.6,<2022.11"
joblib = ">=0.15.1,<1.3.0"
sentry-sdk = ">=0.17.0,<1.15.0"
aio-pika = ">=6.7.1,<8.2.4"
aiogram = "<2.26"
@@ -156,6 +154,9 @@ dnspython = "2.3.0"
wheel = ">=0.38.1"
certifi = ">=2023.7.22"
cryptography = ">=41.0.7"
skops = "0.9.0"
safetensors = "~0.4.5"

[[tool.poetry.dependencies.tensorflow-io-gcs-filesystem]]
version = "==0.31"
markers = "sys_platform == 'win32'"
@@ -285,7 +286,7 @@ version = "~3.2.0"
optional = true

[tool.poetry.dependencies.transformers]
version = ">=4.13.0, <=4.26.0"
version = "~4.36.2"
optional = true

[tool.poetry.dependencies.sentencepiece]
23 changes: 22 additions & 1 deletion rasa/core/featurizers/single_state_featurizer.py
@@ -1,7 +1,8 @@
import logging
from typing import List, Optional, Dict, Text, Set, Any

import numpy as np
import scipy.sparse
from typing import List, Optional, Dict, Text, Set, Any

from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
from rasa.nlu.extractors.extractor import EntityTagSpec
@@ -362,6 +363,26 @@ def encode_all_labels(
for action in domain.action_names_or_texts
]

def to_dict(self) -> Dict[str, Any]:
return {
"action_texts": self.action_texts,
"entity_tag_specs": self.entity_tag_specs,
"feature_states": self._default_feature_states,
}

@classmethod
def create_from_dict(
cls, data: Dict[str, Any]
) -> Optional["SingleStateFeaturizer"]:
if not data:
return None

featurizer = SingleStateFeaturizer()
featurizer.action_texts = data["action_texts"]
featurizer._default_feature_states = data["feature_states"]
featurizer.entity_tag_specs = data["entity_tag_specs"]
return featurizer


class IntentTokenizerSingleStateFeaturizer(SingleStateFeaturizer):
"""A SingleStateFeaturizer for use with policies that predict intent labels."""
133 changes: 115 additions & 18 deletions rasa/core/featurizers/tracker_featurizers.py
@@ -1,11 +1,9 @@
from __future__ import annotations
from pathlib import Path
from collections import defaultdict
from abc import abstractmethod
import jsonpickle
import logging

from tqdm import tqdm
import logging
from abc import abstractmethod
from collections import defaultdict
from pathlib import Path
from typing import (
Tuple,
List,
@@ -18,25 +16,30 @@
Set,
DefaultDict,
cast,
Type,
Callable,
ClassVar,
)

import numpy as np
from tqdm import tqdm

from rasa.core.featurizers.single_state_featurizer import SingleStateFeaturizer
from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
from rasa.core.exceptions import InvalidTrackerFeaturizerUsageError
import rasa.shared.core.trackers
import rasa.shared.utils.io
from rasa.shared.nlu.constants import TEXT, INTENT, ENTITIES, ACTION_NAME
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.core.trackers import DialogueStateTracker
from rasa.shared.core.domain import State, Domain
from rasa.shared.core.events import Event, ActionExecuted, UserUttered
from rasa.core.exceptions import InvalidTrackerFeaturizerUsageError
from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
from rasa.core.featurizers.single_state_featurizer import SingleStateFeaturizer
from rasa.shared.core.constants import (
USER,
ACTION_UNLIKELY_INTENT_NAME,
PREVIOUS_ACTION,
)
from rasa.shared.core.domain import State, Domain
from rasa.shared.core.events import Event, ActionExecuted, UserUttered
from rasa.shared.core.trackers import DialogueStateTracker
from rasa.shared.exceptions import RasaException
from rasa.shared.nlu.constants import TEXT, INTENT, ENTITIES, ACTION_NAME
from rasa.shared.nlu.training_data.features import Features
from rasa.utils.tensorflow.constants import LABEL_PAD_ID
from rasa.utils.tensorflow.model_data import ragged_array_to_ndarray

@@ -64,6 +67,10 @@ def __str__(self) -> Text:
class TrackerFeaturizer:
"""Base class for actual tracker featurizers."""

# Class registry to store all subclasses
_registry: ClassVar[Dict[str, Type["TrackerFeaturizer"]]] = {}
_featurizer_type: str = "TrackerFeaturizer"

def __init__(
self, state_featurizer: Optional[SingleStateFeaturizer] = None
) -> None:
@@ -74,6 +81,36 @@ def __init__(
"""
self.state_featurizer = state_featurizer

@classmethod
def register(cls, featurizer_type: str) -> Callable:
"""Decorator to register featurizer subclasses."""

def wrapper(subclass: Type["TrackerFeaturizer"]) -> Type["TrackerFeaturizer"]:
cls._registry[featurizer_type] = subclass
# Store the type identifier in the class for serialization
subclass._featurizer_type = featurizer_type
return subclass

return wrapper

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "TrackerFeaturizer":
"""Create featurizer instance from dictionary."""
featurizer_type = data.pop("type")

if featurizer_type not in cls._registry:
raise ValueError(f"Unknown featurizer type: {featurizer_type}")

# Get the correct subclass and instantiate it
subclass = cls._registry[featurizer_type]
return subclass.create_from_dict(data)

@classmethod
@abstractmethod
def create_from_dict(cls, data: Dict[str, Any]) -> "TrackerFeaturizer":
"""Each subclass must implement its own creation from dict method."""
pass

@staticmethod
def _create_states(
tracker: DialogueStateTracker,
@@ -465,9 +502,7 @@ def persist(self, path: Union[Text, Path]) -> None:
self.state_featurizer.entity_tag_specs = []

# noinspection PyTypeChecker
rasa.shared.utils.io.write_text_file(
str(jsonpickle.encode(self)), featurizer_file
)
rasa.shared.utils.io.dump_obj_as_json_to_file(featurizer_file, self.to_dict())

@staticmethod
def load(path: Union[Text, Path]) -> Optional[TrackerFeaturizer]:
@@ -481,7 +516,17 @@ def load(path: Union[Text, Path]) -> Optional[TrackerFeaturizer]:
"""
featurizer_file = Path(path) / FEATURIZER_FILE
if featurizer_file.is_file():
return jsonpickle.decode(rasa.shared.utils.io.read_file(featurizer_file))
data = rasa.shared.utils.io.read_json_file(featurizer_file)

if "type" not in data:
logger.error(
f"Couldn't load featurizer for policy. "
f"File '{featurizer_file}' does not contain all "
f"necessary information. 'type' is missing."
)
return None

return TrackerFeaturizer.from_dict(data)

logger.error(
f"Couldn't load featurizer for policy. "
@@ -508,7 +553,16 @@ def _remove_action_unlikely_intent_from_events(events: List[Event]) -> List[Even
)
]

def to_dict(self) -> Dict[str, Any]:
return {
"type": self.__class__._featurizer_type,
"state_featurizer": (
self.state_featurizer.to_dict() if self.state_featurizer else None
),
}


@TrackerFeaturizer.register("FullDialogueTrackerFeaturizer")
class FullDialogueTrackerFeaturizer(TrackerFeaturizer):
"""Creates full dialogue training data for time distributed architectures.
@@ -646,7 +700,20 @@ def prediction_states(

return trackers_as_states

def to_dict(self) -> Dict[str, Any]:
return super().to_dict()

@classmethod
def create_from_dict(cls, data: Dict[str, Any]) -> "FullDialogueTrackerFeaturizer":
state_featurizer = SingleStateFeaturizer.create_from_dict(
data["state_featurizer"]
)
return cls(
state_featurizer,
)


@TrackerFeaturizer.register("MaxHistoryTrackerFeaturizer")
class MaxHistoryTrackerFeaturizer(TrackerFeaturizer):
"""Truncates the tracker history into `max_history` long sequences.
@@ -887,7 +954,25 @@ def prediction_states(

return trackers_as_states

def to_dict(self) -> Dict[str, Any]:
data = super().to_dict()
data.update(
{
"remove_duplicates": self.remove_duplicates,
"max_history": self.max_history,
}
)
return data

@classmethod
def create_from_dict(cls, data: Dict[str, Any]) -> "MaxHistoryTrackerFeaturizer":
state_featurizer = SingleStateFeaturizer.create_from_dict(
data["state_featurizer"]
)
return cls(state_featurizer, data["max_history"], data["remove_duplicates"])


@TrackerFeaturizer.register("IntentMaxHistoryTrackerFeaturizer")
class IntentMaxHistoryTrackerFeaturizer(MaxHistoryTrackerFeaturizer):
"""Truncates the tracker history into `max_history` long sequences.
@@ -1166,6 +1251,18 @@ def prediction_states(

return trackers_as_states

def to_dict(self) -> Dict[str, Any]:
return super().to_dict()

@classmethod
def create_from_dict(
cls, data: Dict[str, Any]
) -> "IntentMaxHistoryTrackerFeaturizer":
state_featurizer = SingleStateFeaturizer.create_from_dict(
data["state_featurizer"]
)
return cls(state_featurizer, data["max_history"], data["remove_duplicates"])


def _is_prev_action_unlikely_intent_in_state(state: State) -> bool:
prev_action_name = state.get(PREVIOUS_ACTION, {}).get(ACTION_NAME)
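The `register` decorator and `from_dict` dispatch in the diff above follow a standard class-registry pattern: each subclass stores a stable string identifier, `to_dict` embeds it, and `from_dict` uses it to pick the right class at load time. A self-contained sketch with simplified names (not Rasa's actual classes):

```python
from typing import Any, Callable, ClassVar, Dict, Type


class Base:
    # Maps a stable string identifier to the subclass that handles it.
    _registry: ClassVar[Dict[str, Type["Base"]]] = {}
    _type_id: str = "Base"

    @classmethod
    def register(cls, type_id: str) -> Callable:
        def wrapper(subclass: Type["Base"]) -> Type["Base"]:
            cls._registry[type_id] = subclass
            subclass._type_id = type_id  # embedded by to_dict for serialization
            return subclass

        return wrapper

    def to_dict(self) -> Dict[str, Any]:
        return {"type": self._type_id}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Base":
        type_id = data.pop("type")
        if type_id not in cls._registry:
            raise ValueError(f"Unknown type: {type_id}")
        # Dispatch to the registered subclass's own constructor logic.
        return cls._registry[type_id].create_from_dict(data)

    @classmethod
    def create_from_dict(cls, data: Dict[str, Any]) -> "Base":
        return cls()


@Base.register("Child")
class Child(Base):
    pass


restored = Base.from_dict(Child().to_dict())
assert isinstance(restored, Child)
```

Because only registered, known classes can be instantiated, this replaces `jsonpickle`'s ability to reconstruct arbitrary types, which is exactly the attack surface the commit removes.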