Merge branch 'NannyML:main' into main

NannyML · Sep 6, 2024 · 8ad2cfd
2 parents 63b3a4d + 0a35cf2
commit 8ad2cfd
Showing 8 changed files with 141 additions and 36 deletions.
diff --git a/README.md b/README.md
@@ -263,6 +263,21 @@ figure.show()
 
 We want to build NannyML together with the community! The easiest way to contribute at the moment is to propose new features or log bugs under [issues](https://github.com/NannyML/nannyml/issues). For more information, have a look at [how to contribute](CONTRIBUTING.rst).
 
+Thanks to all of our contributors!
+
+[<img alt="CoffiDev" src="https://avatars.githubusercontent.com/u/6456756?v=4&s=117" width="117">](https://github.com/CoffiDev)[<img alt="smetam" src="https://avatars.githubusercontent.com/u/17511767?v=4&s=117" width="117">](https://github.com/smetam)[<img alt="amrit110" src="https://avatars.githubusercontent.com/u/8986523?v=4&s=117" width="117">](https://github.com/amrit110)[<img alt="bgalvao" src="https://avatars.githubusercontent.com/u/17158288?v=4&s=117" width="117">](https://github.com/bgalvao)[<img alt="SoyGema" src="https://avatars.githubusercontent.com/u/24204714?v=4&s=117" width="117">](https://github.com/SoyGema)
+
+[<img alt="sebasmos" src="https://avatars.githubusercontent.com/u/31293221?v=4&s=117" width="117">](https://github.com/sebasmos)[<img alt="shezadkhan137" src="https://avatars.githubusercontent.com/u/1761188?v=4&s=117" width="117">](https://github.com/shezadkhan137)[<img alt="highstepper" src="https://avatars.githubusercontent.com/u/22987068?v=4&s=117" width="117">](https://github.com/highstepper)[<img alt="WojtekNML" src="https://avatars.githubusercontent.com/u/100422459?v=4&s=117" width="117">](https://github.com/WojtekNML)[<img alt="YYYasin19" src="https://avatars.githubusercontent.com/u/26421646?v=4&s=117" width="117">](https://github.com/YYYasin19)
+
+[<img alt="giodavoli" src="https://avatars.githubusercontent.com/u/79570860?v=4&s=117" width="117">](https://github.com/giodavoli)[<img alt="mireiar" src="https://avatars.githubusercontent.com/u/105557052?v=4&s=117" width="117">](https://github.com/mireiar)[<img alt="baskervilski" src="https://avatars.githubusercontent.com/u/7703701?v=4&s=117" width="117">](https://github.com/baskervilski)[<img alt="rfrenoy" src="https://avatars.githubusercontent.com/u/12834432?v=4&s=117" width="117">](https://github.com/rfrenoy)[<img alt="jrggementiza" src="https://avatars.githubusercontent.com/u/30363148?v=4&s=117" width="117">](https://github.com/jrggementiza)
+
+[<img alt="PieDude12" src="https://avatars.githubusercontent.com/u/86422883?v=4&s=117" width="117">](https://github.com/PieDude12)[<img alt="hakimelakhrass" src="https://avatars.githubusercontent.com/u/100148105?v=4&s=117" width="117">](https://github.com/hakimelakhrass)[<img alt="maciejbalawejder" src="https://avatars.githubusercontent.com/u/47450700?v=4&s=117" width="117">](https://github.com/maciejbalawejder)[<img alt="dependabot[bot]" src="https://avatars.githubusercontent.com/in/29110?v=4&s=117" width="117">](https://github.com/apps/dependabot)[<img alt="Dbhasin1" src="https://avatars.githubusercontent.com/u/56479884?v=4&s=117" width="117">](https://github.com/Dbhasin1)
+
+[<img alt="alexnanny" src="https://avatars.githubusercontent.com/u/124191512?v=4&s=117" width="117">](https://github.com/alexnanny)[<img alt="santiviquez" src="https://avatars.githubusercontent.com/u/10890881?v=4&s=117" width="117">](https://github.com/santiviquez)[<img alt="cartgr" src="https://avatars.githubusercontent.com/u/86645043?v=4&s=117" width="117">](https://github.com/cartgr)[<img alt="BobbuAbadeer" src="https://avatars.githubusercontent.com/u/94649276?v=4&s=117" width="117">](https://github.com/BobbuAbadeer)[<img alt="jnesfield" src="https://avatars.githubusercontent.com/u/23704688?v=4&s=117" width="117">](https://github.com/jnesfield)
+
+[<img alt="NeoKish" src="https://avatars.githubusercontent.com/u/66986430?v=4&s=117" width="117">](https://github.com/NeoKish)[<img alt="michael-nml" src="https://avatars.githubusercontent.com/u/124588413?v=4&s=117" width="117">](https://github.com/michael-nml)[<img alt="jakubnml" src="https://avatars.githubusercontent.com/u/100147443?v=4&s=117" width="117">](https://github.com/jakubnml)[<img alt="nikml" src="https://avatars.githubusercontent.com/u/89025229?v=4&s=117" width="117">](https://github.com/nikml)[<img alt="nnansters" src="https://avatars.githubusercontent.com/u/94110348?v=4&s=117" width="117">](https://github.com/nnansters)
+
+
 # 🙋 Get help
 
 The best place to ask for help is in the [community slack](https://join.slack.com/t/nannymlbeta/shared_invite/zt-16fvpeddz-HAvTsjNEyC9CE6JXbiM7BQ). Feel free to join and ask questions or raise issues. Someone will definitely respond to you.

diff --git a/docs/how_it_works/business_value.rst b/docs/how_it_works/business_value.rst
@@ -55,6 +55,13 @@ observations in that cell of the :term:`confusion matrix<Confusion Matrix>`. Usi
 matrix notation the element on the i-th row and j-th column of the business value matrix tells us the value
 of the i-th target when we have predicted the j-th value.
 
+.. note::
+    In multiclass classification the classes are ordered alphanumerically,
+    and the confusion matrix is built in that order. The rows of the confusion matrix
+    represent target values and its columns represent predicted classes,
+    both in the same alphanumerical order. The elements of the business value matrix
+    must therefore be arranged in that same order.
+
 For binary classification this formula is easier to manage, hence we will use it as an example. Classification problems
 with more classes follow the same pattern.
 Using the `sklearn confusion matrix convention`_ we designate label 0 as negative and label 1 as positive.
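
A minimal sketch of the ordering rule described in the note above, using only NumPy. The class names and business values are hypothetical, and the total is taken as the sum of the element-wise product of the two matrices, which is an assumption made for illustration rather than a copy of NannyML's implementation:

```python
import numpy as np

# Hypothetical targets and predictions for a three-class problem.
y_true = ["prepaid_card", "highstreet_card", "upmarket_card", "prepaid_card"]
y_pred = ["prepaid_card", "highstreet_card", "prepaid_card", "upmarket_card"]

# Classes are sorted; rows (targets) and columns (predictions) both follow this order.
classes = sorted(set(y_true) | set(y_pred))
index = {name: position for position, name in enumerate(classes)}

# Build the confusion matrix: row = target class, column = predicted class.
cm = np.zeros((len(classes), len(classes)), dtype=int)
for target, prediction in zip(y_true, y_pred):
    cm[index[target], index[prediction]] += 1

# Hypothetical business value matrix, written in the same sorted class order.
business_value_matrix = np.array(
    [
        [10, -2, -5],  # target: highstreet_card
        [-1, 4, -3],   # target: prepaid_card
        [-6, -2, 8],   # target: upmarket_card
    ]
)

# Assumed combination rule: sum of the element-wise product.
total_value = int((cm * business_value_matrix).sum())
print(classes)
print(cm)
print(total_value)
```

If the rows of the business value matrix were written in a different class order than the sorted one, each value would be applied to the wrong confusion matrix cell without any error being raised, which is why the ordering matters.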

diff --git a/...e_calculation/multiclass_performance_calculation/business_value_calculation.rst b/...e_calculation/multiclass_performance_calculation/business_value_calculation.rst
@@ -88,7 +88,10 @@ the following parameter specifications:
     The format of the business value matrix must be specified so that each element represents the business
     value of its respective confusion matrix element. Hence the element on the i-th row and j-th column of the
     business value matrix tells us the value of the i-th target when we have predicted the j-th value.
-    It can be provided as a list of lists or a numpy array.
+    The target values that each column and row refer to are sorted alphanumerically for both
+    the confusion matrix and the business value matrix.
+
+    The business value matrix can be provided as a list of lists or a numpy array.
     For more information about the business value matrix,
     check out the :ref:`Business Value "How it Works" page<business-value-deep-dive>`.
 

diff --git a/...ance_estimation/multiclass_performance_estimation/business_value_estimation.rst b/...ance_estimation/multiclass_performance_estimation/business_value_estimation.rst
@@ -80,7 +80,10 @@ parameters:
     The format of the business value matrix must be specified so that each element represents the business
     value of its respective confusion matrix element. Hence the element on the i-th row and j-th column of the
     business value matrix tells us the value of the i-th target when we have predicted the j-th value.
-    It can be provided as a list of lists or a numpy array.
+    The target values that each column and row refer to are sorted alphanumerically for both
+    the confusion matrix and the business value matrix.
+
+    The business value matrix can be provided as a list of lists or a numpy array.
     For more information about the business value matrix,
     check out the :ref:`Business Value "How it Works" page<business-value-deep-dive>`.
 

diff --git a/docs/usage_logging.rst b/docs/usage_logging.rst
@@ -65,7 +65,7 @@ What about personal data
 Apart from the hardware ID, there is nothing to link back to your machine, let alone to your identity.
 You have our word on this: we will never collect any Personally Identifiable Information.
 And don't just take our word: verify it! We invite you to review the implementation at
-https://github.com/NannyML/nannyml/blob/feature/usage_logging/nannyml/usage_logging.py.
+https://github.com/NannyML/nannyml/blob/main/nannyml/usage_logging.py.
 
 What about my dataset?
 ######################
@@ -113,7 +113,7 @@ How usage logging works
 We'll give a very brief overview of how we've implemented usage analytics.
 
 1. We've created a `usage_logging` module within the library. It contains all the functionality related to usage analytics.
-   Feel free to browse the source code at https://github.com/NannyML/nannyml/blob/feature/usage_logging/nannyml/usage_logging.py.
+   Feel free to browse the source code at https://github.com/NannyML/nannyml/blob/main/nannyml/usage_logging.py.
 2. We instrument our library by adding a `log_usage` decorator to our key functions, sometimes also providing some additional data (e.g. metric names).
 3. Upon calling one of these key functions, the decorator will capture the required information. Our `usage_logging` module will then try to send it over
    to **Segment**, a third-party service provider specializing in customer data.
@@ -129,7 +129,7 @@ We'll give a very brief overview of how we've implemented usage analytics.
 Whilst our team at NannyML saw the need for usage analytics, we did have some deeper discussions about how to present
 this to you, the end user.
 
-Do we disable usage analytics collection by default and have the end user explicitly opt in? ]
+Do we disable usage analytics collection by default and have the end user explicitly opt in? 
 Whilst it felt very intuitive and "correct" to do so, we asked ourselves the following question.
 "Would I go through the trouble of explicitly enabling this every time I use NannyML?"
 Our answer was no, we probably wouldn't bother. And if we wouldn't, it is only fair we don't expect you to.
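
The decorator-based instrumentation described in the "How usage logging works" hunk above can be sketched roughly as follows. This is an illustrative stand-in, not NannyML's code: `log_usage_sketch`, `captured_events` and the event fields are hypothetical names, and events are appended to a local list instead of being sent to a third-party service.

```python
import functools
from typing import Any, Callable, Dict, List, Optional

# Hypothetical in-memory sink standing in for a remote analytics service.
captured_events: List[Dict[str, Any]] = []


def log_usage_sketch(event_name: str, metadata: Optional[Dict[str, Any]] = None) -> Callable:
    """Return a decorator that records an event each time the wrapped function is called."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                # Capture the event name plus any additional data (e.g. metric names).
                captured_events.append({"event": event_name, **(metadata or {})})
            except Exception:
                # Logging problems must never break the instrumented function.
                pass
            return func(*args, **kwargs)

        return wrapper

    return decorator


@log_usage_sketch("calculate", {"metric": "business_value"})
def calculate(x: int) -> int:
    return x * 2


result = calculate(21)
print(result, captured_events)
```

Wrapping the capture step in `try`/`except` reflects one possible design choice: the instrumented function should behave the same whether or not the logging step succeeds.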

diff --git a/nannyml/data_quality/unseen/calculator.py b/nannyml/data_quality/unseen/calculator.py
@@ -35,6 +35,8 @@ def __init__(
         self,
         column_names: Union[str, List[str]],
         normalize: bool = True,
+        y_pred_column_name: Optional[str] = None,
+        y_true_column_name: Optional[str] = None,
         timestamp_column_name: Optional[str] = None,
         chunk_size: Optional[int] = None,
         chunk_number: Optional[int] = None,
@@ -96,6 +98,10 @@ def __init__(
                 "column_names should be either a column name string or a list of column name strings, "
                 f"found\n{column_names}"
             )
+
+        self.y_pred_column_name = y_pred_column_name
+        self.y_true_column_name = y_true_column_name
+
         self.result: Optional[Result] = None
         # Threshold strategy is the same across all columns
         # By default for unseen values there is no lower threshold or threshold limit.
@@ -135,6 +141,12 @@ def _fit(self, reference_data: pd.DataFrame, *args, **kwargs):
         # Included columns of dtype=int should be considered categorical. We'll try converting those explicitly.
         reference_data = _convert_int_columns_to_categorical(reference_data, self.column_names, self._logger)
 
+        # y_true and y_pred columns are treated as categorical for the purpose of this calculator
+        if self.y_pred_column_name:
+            reference_data[self.y_pred_column_name] = reference_data[self.y_pred_column_name].astype('category')
+        if self.y_true_column_name:
+            reference_data[self.y_true_column_name] = reference_data[self.y_true_column_name].astype('category')
+
         # All provided columns must be categorical
         continuous_column_names, categorical_column_names = _split_features_by_type(reference_data, self.column_names)
         if not set(self.column_names) == set(categorical_column_names):
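
To illustrate why the change above casts `y_pred` and `y_true` to categorical, here is a small pandas-only sketch. It assumes that an "unseen" value is one that appears in the analysis data but not among the categories observed in the reference data; the helper `unseen_rate` and the sample data are hypothetical and are not part of NannyML.

```python
import pandas as pd

# Hypothetical reference and analysis data with integer-encoded class labels.
reference = pd.DataFrame({"y_true": [0, 1, 2, 1], "y_pred": [0, 1, 1, 1]})
analysis = pd.DataFrame({"y_true": [0, 3, 2], "y_pred": [1, 4, 4]})


def unseen_rate(reference_data: pd.DataFrame, analysis_data: pd.DataFrame, column: str) -> float:
    """Fraction of analysis rows whose value never occurred in the reference column."""
    # Casting to 'category' makes the set of known labels explicit,
    # even when the labels are stored as integers.
    seen = set(reference_data[column].astype("category").cat.categories)
    return float((~analysis_data[column].isin(seen)).mean())


print(unseen_rate(reference, analysis, "y_true"))
print(unseen_rate(reference, analysis, "y_pred"))
```

Without the cast, integer label columns could be interpreted as continuous and fail the "all provided columns must be categorical" check that follows in `_fit`.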