diff --git a/README.md b/README.md index 460b98b86..31b215108 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,28 @@ # julearn +![PyPI](https://img.shields.io/pypi/v/julearn?style=flat-square) +![PyPI - Python Version](https://img.shields.io/pypi/pyversions/julearn?style=flat-square) +![PyPI - Wheel](https://img.shields.io/pypi/wheel/julearn?style=flat-square) +![GitHub](https://img.shields.io/github/license/juaml/julearn?style=flat-square) +![Codecov](https://img.shields.io/codecov/c/github/juaml/julearn?style=flat-square) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit) +## About + The Forschungszentrum Jülich Machine Learning Library Check our full documentation here: https://juaml.github.io/julearn/index.html -Check our video tutorial here: [Julearn Playlist](https://youtube.com/playlist?list=PLvb39y5Ge21CUjccmY_0kXRCwBBaikGf_) +Check our video tutorial here: [julearn Playlist](https://youtube.com/playlist?list=PLvb39y5Ge21CUjccmY_0kXRCwBBaikGf_) + +It is currently being developed and maintained at the [Applied Machine Learning](https://www.fz-juelich.de/en/inm/inm-7/research-groups/applied-machine-learning-aml) group at [Forschungszentrum Juelich](https://www.fz-juelich.de/en), Germany. + +## Installation +Use `pip` to install from PyPI like so: +``` +pip install julearn +``` ## Licensing @@ -28,8 +43,8 @@ GNU Affero General Public License for more details. You should have received a copy of the GNU Affero General Public License along with this program. If not, see . - ## Citing + We still do not have a publication that you can use to cite julearn in your manuscript. However, julearn realies heavily on scikit-learn. diff --git a/docs/api/index.rst b/docs/api/index.rst index ce4a85385..31d1f1f17 100644 --- a/docs/api/index.rst +++ b/docs/api/index.rst @@ -1,5 +1,3 @@ -.. include:: ../links.inc - .. _api: API Reference diff --git a/docs/available_pipeline_steps.rst b/docs/available_pipeline_steps.rst index 053ba22d4..2370764e6 100644 --- a/docs/available_pipeline_steps.rst +++ b/docs/available_pipeline_steps.rst @@ -9,19 +9,23 @@ The following is a list of all available steps that can be used to create a pipeline by name. The overview is sorted based on the type of the step: :ref:`available_transformers` or :ref:`available_models`. -The column 'Name (str)' refers to the string-name of -the respective step, i.e. how it should be specified when passed to e.g. the -``PipelineCreator``. The column 'Description' gives a short -description of what the step is doing. The column 'Class' either indicates the -underlying `scikit-learn`_ class of the respective pipeline-step together with -a link to the class in the `scikit-learn`_ documentation (follow the link to -see the valid parameters) or indicates the class in -the Julearn code, so one can have a closer look at it in Julearn's -:ref:`api`. - -For feature transformations the :ref:`available_transformers` have to be used -with the ``PipelineCreator`` and for target transformation with the -``TargetPipelineCreator``. +* The column ``Name`` refers to the string-name of + the respective step, i.e. how it should be specified when passed to e.g., the + :class:`.PipelineCreator`. + +* The column ``Description`` gives a short + description of what the step is doing. 
+ +* The column ``Class`` either indicates the underlying `scikit-learn`_ class of + the respective pipeline step together with a link to the class in the + `scikit-learn`_ documentation (follow the link to see the valid parameters) or + indicates the class in ``julearn``, so one can have a closer look at it in + ``julearn``'s :ref:`api`. + +For feature transformations, the :ref:`available_transformers` are to be used +with the :class:`.PipelineCreator` and for target transformations, the +:ref:`available_transformers` are to be used with the +:class:`.TargetPipelineCreator`. .. _available_transformers: @@ -34,10 +38,10 @@ Scalers ~~~~~~~ .. list-table:: - :widths: 30 80 40 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class * - ``zscore`` @@ -62,15 +66,14 @@ Scalers - *Gaussianise* data - :class:`~sklearn.preprocessing.PowerTransformer` - Feature Selection ~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 30 80 40 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class * - ``select_univariate`` @@ -95,31 +98,30 @@ Feature Selection - Remove low variance features - :class:`~sklearn.feature_selection.VarianceThreshold` - DataFrame operations ~~~~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 30 80 40 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class * - ``confound_removal`` - - Removing confounds from features, - by subtracting the prediction of each feature given all confounds. - By default this is equal to "independently regressing out - the confounds from the features" + - | Removing confounds from features, + | by subtracting the prediction of each feature given all confounds. + | By default this is equal to "independently regressing out + | the confounds from the features" - :class:`.ConfoundRemover` * - ``drop_columns`` - - Drop columns from the dataframe + - Drop columns from the DataFrame - :class:`.DropColumns` * - ``change_column_types`` - - Change the type of a column in a dataframe + - Change the type of a column in a DataFrame - :class:`.ChangeColumnTypes` * - ``filter_columns`` - - Filter columns in a dataframe + - Filter columns in a DataFrame - :class:`.FilterColumns` .. _available_decompositions: @@ -128,10 +130,10 @@ Decomposition ~~~~~~~~~~~~~ .. list-table:: - :widths: 30 80 40 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class * - ``pca`` @@ -142,10 +144,10 @@ Custom ~~~~~~ .. list-table:: - :widths: 30 80 40 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class * - ``cbpm`` @@ -161,10 +163,10 @@ Support Vector Machines ~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 30 80 40 20 20 20 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -172,7 +174,8 @@ Support Vector Machines - Regression * - ``svm`` - Support Vector Machine - - :class:`~sklearn.svm.SVC` and :class:`~sklearn.svm.SVR` + - | :class:`~sklearn.svm.SVC` and + | :class:`~sklearn.svm.SVR` - Y - Y - Y @@ -181,10 +184,10 @@ Ensemble ~~~~~~~~ .. 
list-table:: - :widths: 30 30 70 20 20 20 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -192,37 +195,43 @@ Ensemble - Regression * - ``rf`` - Random Forest - - :class:`~sklearn.ensemble.RandomForestClassifier` and :class:`~sklearn.ensemble.RandomForestRegressor` + - | :class:`~sklearn.ensemble.RandomForestClassifier` and + | :class:`~sklearn.ensemble.RandomForestRegressor` - Y - Y - Y * - ``et`` - Extra-Trees - - :class:`~sklearn.ensemble.ExtraTreesClassifier` and :class:`~sklearn.ensemble.ExtraTreesRegressor` + - | :class:`~sklearn.ensemble.ExtraTreesClassifier` and + | :class:`~sklearn.ensemble.ExtraTreesRegressor` - Y - Y - Y * - ``adaboost`` - AdaBoost - - :class:`~sklearn.ensemble.AdaBoostClassifier` and :class:`~sklearn.ensemble.AdaBoostRegressor` + - | :class:`~sklearn.ensemble.AdaBoostClassifier` and + | :class:`~sklearn.ensemble.AdaBoostRegressor` - Y - Y - Y * - ``bagging`` - Bagging - - :class:`~sklearn.ensemble.BaggingClassifier` and :class:`~sklearn.ensemble.BaggingRegressor` + - | :class:`~sklearn.ensemble.BaggingClassifier` and + | :class:`~sklearn.ensemble.BaggingRegressor` - Y - Y - Y * - ``gradientboost`` - Gradient Boosting - - :class:`~sklearn.ensemble.GradientBoostingClassifier` and :class:`~sklearn.ensemble.GradientBoostingRegressor` + - | :class:`~sklearn.ensemble.GradientBoostingClassifier` and + | :class:`~sklearn.ensemble.GradientBoostingRegressor` - Y - Y - Y * - ``stacking`` - Stacking - - :class:`~sklearn.ensemble.StackingClassifier` and :class:`~sklearn.ensemble.StackingRegressor` + - | :class:`~sklearn.ensemble.StackingClassifier` and + | :class:`~sklearn.ensemble.StackingRegressor` - Y - Y - Y @@ -231,10 +240,10 @@ Gaussian Processes ~~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 30 30 70 20 20 20 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -242,7 +251,8 @@ Gaussian Processes - Regression * - ``gauss`` - Gaussian Process - - :class:`~sklearn.gaussian_process.GaussianProcessClassifier` and :class:`~sklearn.gaussian_process.GaussianProcessRegressor` + - | :class:`~sklearn.gaussian_process.GaussianProcessClassifier` and + | :class:`~sklearn.gaussian_process.GaussianProcessRegressor` - Y - Y - Y @@ -251,10 +261,10 @@ Linear Models ~~~~~~~~~~~~~ .. list-table:: - :widths: 30 50 70 10 10 10 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -280,19 +290,22 @@ Linear Models - Y * - ``ridge`` - Linear least squares with l2 regularization. - - :class:`~sklearn.linear_model.RidgeClassifier` and :class:`~sklearn.linear_model.Ridge` + - | :class:`~sklearn.linear_model.RidgeClassifier` and + | :class:`~sklearn.linear_model.Ridge` - Y - Y - Y * - ``ridgecv`` - Ridge regression with built-in cross-validation. - - :class:`~sklearn.linear_model.RidgeClassifierCV` and :class:`~sklearn.linear_model.RidgeCV` + - | :class:`~sklearn.linear_model.RidgeClassifierCV` and + | :class:`~sklearn.linear_model.RidgeCV` - Y - Y - Y * - ``sgd`` - Linear model fitted by minimizing a regularized empirical loss with SGD - - :class:`~sklearn.linear_model.SGDClassifier` and :class:`~sklearn.linear_model.SGDRegressor` + - | :class:`~sklearn.linear_model.SGDClassifier` and + | :class:`~sklearn.linear_model.SGDRegressor` - Y - Y - Y @@ -301,10 +314,10 @@ Naive Bayes ~~~~~~~~~~~ .. 
list-table:: - :widths: 30 50 70 10 10 10 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -345,10 +358,10 @@ Dynamic Selection ~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 30 50 70 10 10 10 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -365,10 +378,10 @@ Dummy ~~~~~ .. list-table:: - :widths: 30 50 70 10 10 10 + :widths: auto :header-rows: 1 - * - Name (str) + * - Name - Description - Class - Binary @@ -376,7 +389,8 @@ Dummy - Regression * - ``dummy`` - Use simple rules (without features). - - :class:`~sklearn.dummy.DummyClassifier` and :class:`~sklearn.dummy.DummyRegressor` + - | :class:`~sklearn.dummy.DummyClassifier` and + | :class:`~sklearn.dummy.DummyRegressor` - Y - Y - Y diff --git a/docs/changes/latest.inc b/docs/changes/latest.inc deleted file mode 100644 index 074aff14d..000000000 --- a/docs/changes/latest.inc +++ /dev/null @@ -1,28 +0,0 @@ -.. NOTE: we are now using links to highlight new functions and classes. - Please follow the examples below like - :func:`julearn.api.run_cross_validation`, so the - whats_new page will have a link to the function/class documentation. - -.. NOTE: there are 3 separate sections for changes, based on type: - - "Enhancements" for new features - - "Bugs" for bug fixes - - "API changes" for backward-incompatible changes - -.. _current: - -Current (0.2.6.dev) -------------------- - -Enhancements -~~~~~~~~~~~~ - -- Make CBPM use sum by default (:gh:`170` by `Sami Hamdan`_). - -Bugs -~~~~ - - -- ADD BaseEstimator to ConfoundRemoval for Target (:gh:`151` by `Sami Hamdan`_). - -API changes -~~~~~~~~~~~ diff --git a/docs/changes/newsfragments/235.doc b/docs/changes/newsfragments/235.doc new file mode 100644 index 000000000..6982bed24 --- /dev/null +++ b/docs/changes/newsfragments/235.doc @@ -0,0 +1 @@ +Improve documentation language, fix typos and code snippets by `Synchon Mandal`_ \ No newline at end of file diff --git a/docs/changes/v0.2.5.inc b/docs/changes/v0.2.5.inc deleted file mode 100644 index a7fd94de5..000000000 --- a/docs/changes/v0.2.5.inc +++ /dev/null @@ -1,78 +0,0 @@ -.. NOTE: we are now using links to highlight new functions and classes. - Please follow the examples below like - :func:`julearn.api.run_cross_validation`, so the - whats_new page will have a link to the function/class documentation. - -.. NOTE: there are 3 separate sections for changes, based on type: - - "Enhancements" for new features - - "Bugs" for bug fixes - - "API changes" for backward-incompatible changes - -.. _0.2.5: - -0.2.5 ------ - -Enhancements -~~~~~~~~~~~~ -- Bump minimum python version to 3.7 (by `Fede Raimondo`_). - -- Add *What's new* section in DOC to document changes (by `Fede Raimondo`_). - -- Add information on updating the *What's new* section before releasing (by `Fede Raimondo`_). - -- Update docs to make it more uniform (by `Kaustubh Patil`_). - -- Refactor scoring to allow for registering and callable scorers (by `Sami Hamdan`_). - -- Update :mod:`julearn.model_selection` and add capabilities to register searchers (by `Sami Hamdan`_). - -- Add user facing `create_pipeline` function (by `Sami Hamdan`_). - -- Update default behavior of setting inner cv according to scikit-learn instead of using outer cv as default (by `Sami Hamdan`_). - -- Add tests and more algorithms to `DynamicSelection` (by `Sami Hamdan`_ and `Shammi More`_). - -- Add CV schemes for stratifying based on the grouping variables, useful for regression problems. 
Check :class:`.ContinuousStratifiedGroupKFold` and :class:`.RepeatedContinuousStratifiedGroupKFold` (by `Fede Raimondo`_ and `Shammi More`_). - -- Add example for `tranform_until` (:gh:`63` by `Shammi More`_). - -- Add `CBPM` transformer (by `Sami Hamdan`_). - -- ADD `register_model` (:gh:`105` by `Sami Hamdan`_). - -- Add documentation/example for parallelization (by `Sami Hamdan`_). - -Bugs -~~~~ - -- Fix a hyperparameters setting issue where the parameter had an iterable of only one element (:gh:`96` by `Sami Hamdan`_). - -- Fix installations instruction for latest development version (add ``--pre`` by `Fede Raimondo`_). - -- Fix target transformers that only normal transformers are wrapped (:gh:`94` by `Sami Hamdan`_). - -- Fix compatibility with new scikit-learn release 0.24 (:gh:`#108` by `Sami Hamdan`_). - -- Fix compatibility with multiprocessing in scikit-learn (by `Sami Hamdan`_). - -- Raise error message when columns in the dataframe are nos strings (:gh:`77` by `Fede Raimondo`_). - -- Fix not implemented bug for decision_function in ExtendedDataFramePipeline (:gh:`135` by `Sami Hamdan`_). - -- Fix Bug in the transformer wrapper implementation (:gh:`122` by `Sami Hamdan`_). - -- Fix Bug of showing Warnings when using confound removal (:gh:`152` by `Sami Hamdan`_). - -- Fix Bug registered scorer not working in dictionary for scoring ( by `Sami Hamdan`_). - - -API changes -~~~~~~~~~~~ -- Make api surrounding registering consistently use overwrite (by `Sami Hamdan`_). - -- Fix Bug Target Transformer missing BaseEstimator (:gh:`151` by `Sami Hamdan`_). - -- Inner `cv` needs to be provided using `search_params`. Deprecating `cv` in `model_params` (:gh:`146` by `Sami Hamdan`_). - -- Add `n_jobs` and `verbose` to `run_cross_validation` (by `Sami Hamdan`_). diff --git a/docs/configuration.rst b/docs/configuration.rst index e2ec3707d..3469f74bd 100644 --- a/docs/configuration.rst +++ b/docs/configuration.rst @@ -2,47 +2,48 @@ .. _configuration: -Configuring julearn -=================== +Configuring ``julearn`` +======================= -While julearn is meant to be a user-friendly tool, this also comes with a cost. -For example, in order to provide the user with information as well as to be -able to detect potential errors, we have implemented several checks. These +While ``julearn`` is meant to be a user-friendly tool, this also comes with a +cost. For example, in order to provide the user with information as well as to +be able to detect potential errors, we have implemented several checks. These checks, however, might yield high computational costs. Therefore, we have -implemented a global configuration module in julearn that allows to set +implemented a global configuration module in ``julearn`` that allows to set flags to enable or disable certain extra functionality. This module is called -``julearn.config`` and it has a single function called ``set_config`` that -given a configuration flag name and a value, it sets the flag to the given +``julearn.config`` and it has a single function called ``set_config`` +that given a configuration flag name and a value, it sets the flag to the given value. Here you can find the comprehensive list of flags that can be set: .. list-table:: - :widths: 30 80 80 + :widths: auto :header-rows: 1 * - Flag - Description - Potential problem(s) * - ``disable_x_check`` - - Disable checking for unmatched column names in ``X``. If set to - ``True``, any element in ``X`` that is not present in the dataframe will - not result in an error. 
- - The user might think that a certain feature is used in the model when - it is not. + - | Disable checking for unmatched column names in ``X``. + | If set to ``True``, any element in ``X`` that is not present in the + | dataframe will not result in an error. + - | The user might think that a certain feature is used in the model when + | it is not. * - ``disable_xtypes_check`` - - Disable checking for missing/present ``X_types`` in the ``X`` parameter - of the :meth`.run_cross_validation` method. If set to ``True``, the - ``X_types`` parameter will not be checked for consistency with the - ``X`` parameter, including undefined columns in ``X``, missing types - in ``X_types`` or duplicated columns in ``X_types``. - - The user might think that a certain feature is considered in the model - when it is not. + - | Disable checking for missing/present ``X_types`` in the ``X`` parameter + | of the :func:`.run_cross_validation` method. + | If set to ``True``, the ``X_types`` parameter will not be checked for + | consistency with the ``X`` parameter, including undefined columns in + | ``X``, missing types in ``X_types`` or duplicated columns in + | ``X_types``. + - | The user might think that a certain feature is considered in the model + | when it is not. * - ``disable_x_verbose`` - - Disable printing the list of expanded column names in ``X``. If set - to ``True``, the list of column names will not be printed. + - | Disable printing the list of expanded column names in ``X``. + | If set to ``True``, the list of column names will not be printed. - The user will not see the expanded column names in ``X``. * - ``disable_xtypes_verbose`` - - Disable printing the list of expanded column names in ``X_types``. If - set to ``True``, the list of types of X will not be printed. + - | Disable printing the list of expanded column names in ``X_types``. + | If set to ``True``, the list of types of X will not be printed. - The user will not see the expanded ``X_types`` column names. diff --git a/docs/contributing.rst b/docs/contributing.rst index d44a61db7..ee11aa8af 100644 --- a/docs/contributing.rst +++ b/docs/contributing.rst @@ -2,8 +2,8 @@ .. _contribution_guidelines: -Contributing to julearn -======================= +Contributing +============ Setting up the local development environment -------------------------------------------- diff --git a/docs/examples.rst b/docs/examples.rst index 0626860d3..f285cc67e 100644 --- a/docs/examples.rst +++ b/docs/examples.rst @@ -1,7 +1,7 @@ Examples ======== -The following are a set of examples that use julearn. +The following are a set of examples that use ``julearn``. .. this needs to be done manually to avoid TOC issues .. see https://github.com/sphinx-gallery/sphinx-gallery/pull/944/ diff --git a/docs/faq.rst b/docs/faq.rst index e34aa3769..38d945e3c 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -11,19 +11,19 @@ plots. These packages are not installed by default when you install ``julearn``. This libraries are also under development and they might not be as robust as we want. -Usually, installing julearn with the ``[viz]`` option will install the -necessary dependencies using pip. However, if you have issues with the +Usually, installing ``julearn`` with the ``[viz]`` option will install the +necessary dependencies using ``pip``. However, if you have issues with the installation or you want to install them through other package managers, you can install them manually. -Using pip: +Using ``pip``: .. 
code-block:: bash pip install panel pip install bokeh -Using conda: +Using ``conda``: .. code-block:: bash @@ -37,39 +37,39 @@ How do I use the :mod:`.viz` interactive plots? The interactive plots are based on `bokeh`_ and `panel`_. You can use them in different ways: -1. As a standalone application, in a browser. +#. As a standalone application, in a browser: -To do so, you need to call the function ``show`` on the plot object. For -example: + To do so, you need to call the function ``show`` on the plot object. For + example: -.. code-block:: python + .. code-block:: python - panel = plot_scores(scores1, scores2, scores3) - panel.show() + panel = plot_scores(scores1, scores2, scores3) + panel.show() -2. As part of a Jupyter notebook. +#. As part of a Jupyter notebook: -You will need to install the ``jupyter_bokeh`` package. + You will need to install the ``jupyter_bokeh`` package. -Using conda: + Using ``pip``: -.. code-block:: bash + .. code-block:: bash - conda install -c bokeh jupyter_bokeh + pip install jupyter_bokeh -Using pip: + Using ``conda``: -.. code-block:: bash + .. code-block:: bash - pip install jupyter_bokeh + conda install -c bokeh jupyter_bokeh -This will allow you to see the plots interactively in the notebook. To do so, -you need to call the function ``servable`` on the plot object. For example: + This will allow you to see the plots interactively in the notebook. To do so, + you need to call the function ``servable`` on the plot object. For example: -.. code-block:: python + .. code-block:: python - panel = plot_scores(scores1, scores2, scores3) - panel.servable() + panel = plot_scores(scores1, scores2, scores3) + panel.servable() .. TODO: As part of a Binder notebook to share with colleagues. diff --git a/docs/getting_started.rst b/docs/getting_started.rst index d2a37de69..b34e6b352 100644 --- a/docs/getting_started.rst +++ b/docs/getting_started.rst @@ -7,29 +7,31 @@ Getting started Requirements ------------ -Julearn requires the following packages: +``julearn`` is compatible with `Python`_ >= 3.8 and requires the following +packages: -* `Python`_ >= 3.8 -* `pandas`_ >= 1.4.0, < 1.6 -* `scikit-learn`_ == 1.2.0rc1 +* ``numpy>=1.24,<1.26`` +* ``pandas>=1.5.0,<2.1`` +* ``scikit-learn>=1.2.0`` +* ``statsmodels>=0.13,<0.15`` -Running the examples requires: +Running the examples require: -* `seaborn`_ >= 0.11.2, < 0.12 -* `bokeh`_ >= 3.0.2 -* `panel`_ >= 1.0.0b1 -* `param`_ >= 1.12.0 +* ``seaborn>=0.12.2,<0.13`` +* ``bokeh>=3.0.0`` +* ``panel>=1.0.0b1`` +* ``param>=1.11.0`` Depending on the installation method (e.g. the `pip install` option below), these packages might be installed automatically. It is nevertheless good to be -aware of these dependencies as installing Julearn might lead to changes in +aware of these dependencies as installing ``julearn`` might lead to changes in these packages. Setup suggestion ================ Although not required, we strongly recommend using **virtual environments** and -installing Julearn into a virtual environment. This helps to keep the setup +installing ``julearn`` into a virtual environment. This helps to keep the setup clean. The most prominent options are: * pip: `venv`_ @@ -39,31 +41,31 @@ Installing ========== .. note:: - Julearn keeps on being updated and improved. The latest stable release and - the developer version therefore oftentimes differ quite a bit. - If you want the newest updates it might make more sense for you to use the - developer version until we release the next stable julearn version. 
+ ``julearn`` keeps on being updated and improved. The latest stable release + and the developer version therefore often differ quite a bit. + If you want the newest updates, it might make more sense for you to use the + developer version until we release the next stable ``julearn`` version. -Depending on your aimed usage of Julearn you have two different options -how to install Julearn: +Depending on your aimed usage of ``julearn`` you have two different options +how to install ``julearn``: -1. Install the *latest release*: Likely most suitable for most +#. Install the *latest release*: Likely most suitable for most **end users**. This is done by installing the latest stable release from - PyPi. + PyPI. -.. code-block:: bash + .. code-block:: bash - pip install -U julearn + pip install -U julearn -2. Install the *latest pre-relase*: This version will have the +#. Install the *latest pre-relase*: This version will have the **latest updates**. However, it is still under development and not yet officially released. Some features might still change before the next stable release. -.. code-block:: bash + .. code-block:: bash - pip install -U julearn --pre + pip install -U julearn --pre .. _install_optional_dependencies: @@ -71,9 +73,9 @@ how to install Julearn: Optional Dependencies ===================== -Some functionality of Julearn requires additional packages. These are not +Some functionality of ``julearn`` requires additional packages. These are not installed by default. If you want to use these features, you need to specify -them during installation. For example, if you want to use the `:mod:.viz` +them during installation. For example, if you want to use the :mod:`.viz` module, you need to install the ``viz`` optional dependencies as follows: .. code-block:: bash @@ -82,6 +84,6 @@ module, you need to install the ``viz`` optional dependencies as follows: The following optional dependencies are available: -* ``viz``: Visualization tools for Julearn. This includes the - `:mod:.viz` module. +* ``viz``: Visualization tools for ``julearn``. This includes the + :mod:`.viz` module. * ``deslib``: The :mod:`.dynamic` module requires the `deslib`_ package. diff --git a/docs/index.rst b/docs/index.rst index fa3e2af8d..d1350e46b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -9,52 +9,52 @@ Welcome to julearn's documentation! =================================== .. image:: images/julearn_logo_it.png - :width: 300 - :alt: julearn - + :width: 300px + :alt: julearn logo ... a user-oriented machine-learning library. -What is Julearn? ----------------- +What is ``julearn``? +-------------------- At the Applied Machine Learning (`AML`_) group, as part of the Institute of Neuroscience and Medicine - Brain and Behaviour (`INM-7`_), we thought that using ML in research could be simpler. In the same way as `seaborn`_ provides an abstraction of `matplotlib`_'s -built julearn on top of `scikit-learn`_. functionality aiming for powerful data visualization with minor coding, we +built ``julearn`` on top of `scikit-learn`_. -Julearn is a library that provides users with the possibility of easy -testing ML models directly from `pandas`_ dataframes, while keeping the -flexibiliy of using `scikit-learn`_'s models. +``julearn`` is a library that provides users with the possibility of easy +testing ML models directly from `pandas`_ DataFrames, while keeping the +flexibility of using `scikit-learn`_'s models. -To get started with Julearn just keep reading here. 
Additionally You can +To get started with ``julearn`` just keep reading here. Additionally you can check out our `video tutorial`_. -Why Julearn? ------------- +Why ``julearn``? +---------------- -Why not just using `scikit-learn`? Julearn offers **three essential benefits**: +Why not just use ``scikit-learn``? ``julearn`` offers **three essential benefits**: -1. You can do machine learning with **less amount of code** than in - `scikit-learn` -2. Julearn helps you to build and evaluate pipelines in an easy way and thereby +#. You can do machine learning with **less amount of code** than in + ``scikit-learn``. +#. ``julearn`` helps you build and evaluate pipelines in an easy way and thereby helps you **avoid data leakage**! -3. It offers you nice **additional functionality**: - - * Easy to implement **confound removal**: Julearn offers you a simple way +#. It offers you nice **additional functionality**: + + * Easy to implement **confound removal**: ``julearn`` offers you a simple way to remove confounds from your data in a cross-validated way. - * Data **typing**: Julearn provides a system to specify **data types** for - your features, and then provides you with the possibility to - filter and transform your data according to these types. - * Model **inspection**: Julearn provides you with a simple way to **inspect** - your models and pipelines, and thereby helps you to understand what is - going on in your pipeline. - * Model **comparison**: Julearn provides out-of-the-box interactive + * Data **typing**: ``julearn`` provides a system to specify **data types** + for your features, and then provides you with the possibility to filter and + transform your data according to these types. + * Model **inspection**: ``julearn`` provides you with a simple way to + **inspect** your models and pipelines, and thereby helps you to understand + what is going on in your pipeline. + * Model **comparison**: ``julearn`` provides out-of-the-box interactive **visualizations** and **statistics** to compare your models. + Table of Contents ================= @@ -63,15 +63,10 @@ Table of Contents :numbered: 2 getting_started - what_really_need_know/index.rst - selected_deeper_topics/index.rst - available_pipeline_steps.rst - examples.rst - api/index.rst configuration contributing @@ -86,4 +81,3 @@ Indices and tables * :ref:`genindex` * :ref:`modindex` * :ref:`search` - diff --git a/docs/maintaining.rst b/docs/maintaining.rst index 95a83c941..2f0122b05 100644 --- a/docs/maintaining.rst +++ b/docs/maintaining.rst @@ -18,7 +18,7 @@ pre-release. The CI scripts will publish every tag with the format *v.X.Y.Z* to PyPI as version "X.Y.Z". Additionally, for every push to main, it will be published -as pre-release to TestPyPI (for now). +as pre-release to PyPI. Releasing a new version ----------------------- @@ -30,58 +30,58 @@ before proceeding. #. Make sure you are in sync with the main branch. -.. code-block:: bash + .. code-block:: bash - git checkout main - git pull --rebase origin main + git checkout main + git pull --rebase origin main #. Run the following to check changelog is properly generated: -.. code-block:: bash + .. code-block:: bash - towncrier build --draft + towncrier build --draft #. Then, run: -.. code-block:: bash + .. code-block:: bash - towncrier + towncrier -to generate the proper changelog that should be reflected in -``docs/whats_new.rst``. + to generate the proper changelog that should be reflected in + ``docs/whats_new.rst``. #. Commit the changes, make a PR and merge via a merge commit. #. 
Make sure you are in sync with the main branch. -.. code-block:: bash + .. code-block:: bash - git checkout main - git pull --rebase origin main + git checkout main + git pull --rebase origin main #. Create tag (replace ``X.Y.Z`` with the proper version) on the merged PR's merge commit. -.. code-block:: bash + .. code-block:: bash - git tag -a vX.Y.Z -m "Release X.Y.Z" + git tag -a vX.Y.Z -m "Release X.Y.Z" #. Check that the build system is creating the proper version -.. code-block:: bash + .. code-block:: bash - SETUPTOOLS_SCM_DEBUG=1 python -m build --source --binary --out-dir dist/ . + SETUPTOOLS_SCM_DEBUG=1 python -m build --outdir dist/ . #. Push the tag -.. code-block:: bash + .. code-block:: bash - git push origin --follow-tags + git push origin --follow-tags #. Optional: bump the *MAJOR* or *MINOR* segment of next release (replace ``D.E.0`` with the proper version). -.. code-block:: bash + .. code-block:: bash - git tag -a vD.E.0.dev -m "Set next release to D.E.0" - git push origin --follow-tags + git tag -a vD.E.0.dev -m "Set next release to D.E.0" + git push origin --follow-tags diff --git a/docs/what_really_need_know/cross_validation.rst b/docs/what_really_need_know/cross_validation.rst index 1512334a1..af8136e88 100644 --- a/docs/what_really_need_know/cross_validation.rst +++ b/docs/what_really_need_know/cross_validation.rst @@ -49,26 +49,26 @@ The essence of :func:`.run_cross_validation` Building pipelines (see :ref:`pipeline_usage`) within a (nested) cross-validation scheme, without accidentally leaking some information between steps can quickly become complicated and errors are often not-obvious to -detect. Julearn's :func:`.run_cross_validation` provides a simple and -straightforward way to do cross-validation less prone for such accidental +detect. ``julearn``'s :func:`.run_cross_validation` provides a simple and +straightforward way to do cross-validation less prone to such accidental mistakes and more transparent for debugging. The user only needs to specify the model to be used, the data to be used and the evaluation scheme to be used. -Julearn then builds the pipeline, splits the data into training and validation -sets accordingly, and most importantly, does all specified steps in a +``julearn`` then builds the pipeline, splits the data into training and +validation sets accordingly, and most importantly, does all specified steps in a cross-validation consistent manner. The main parameters needed for :func:`.run_cross_validation` include the specification of: -1. *data*: the data, including features, labels and feature types +#. ``data``: the data, including features, labels and feature types (see :ref:`data_usage`) -2. *model*: the model to evaluate, including the data transformation steps - and the learning algorithm to use(see :ref:`pipeline_usage`). -3. *model evaluation*: how the model performance should be estimated, +#. ``model``: the model to evaluate, including the data transformation steps + and the learning algorithm to use (see :ref:`pipeline_usage`). +#. ``model evaluation``: how the model performance should be estimated, like the cross validation scheme or the metrics to be computed (see :ref:`model_evaluation_usage`) -The :func:`.run_cross_validation` function will then output of dataframe with +The :func:`.run_cross_validation` function will then output the DataFrame with the fold-wise metrics, which can then be used to visualize and evaluate the estimation of the models' performance. 
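As an illustration of the three parameter groups described in the `cross_validation.rst` hunk above (data, model, model evaluation), here is a minimal sketch of a `run_cross_validation` call. The `iris` column names, the `svm` model string, the scoring names and the `return`-value handling are taken from the examples elsewhere in this changeset; the `problem_type` value and the integer `cv` shorthand are assumptions and not shown in the diff itself.

```python
# Minimal sketch of the three parameter groups of run_cross_validation():
# data, model and model evaluation.
from seaborn import load_dataset

from julearn import run_cross_validation

df_iris = load_dataset("iris")

# data: features, target and the DataFrame holding them
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"

scores = run_cross_validation(
    X=X,
    y=y,
    data=df_iris,
    # model: the learning algorithm, referenced by its string name
    model="svm",
    problem_type="classification",  # assumed value for a classification task
    # model evaluation: metrics and cross-validation scheme
    scoring=["accuracy", "balanced_accuracy"],
    cv=5,  # assumed integer shorthand, as in scikit-learn
)

# scores holds one row per fold/repeat with the requested metrics
print(scores["test_accuracy"].mean())
```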
diff --git a/docs/what_really_need_know/data.rst b/docs/what_really_need_know/data.rst index 67651c4f7..2b1817dc6 100644 --- a/docs/what_really_need_know/data.rst +++ b/docs/what_really_need_know/data.rst @@ -1,6 +1,7 @@ .. include:: ../links.inc -.. to edit the contents of this file, edit examples/99_docs/run_data_docs.py +.. to edit the contents of this file, edit + examples/99_docs/run_data_docs.py .. _data_usage: diff --git a/docs/what_really_need_know/index.rst b/docs/what_really_need_know/index.rst index 461bcded2..171bf399d 100644 --- a/docs/what_really_need_know/index.rst +++ b/docs/what_really_need_know/index.rst @@ -5,13 +5,12 @@ What you really need to know ============================ - -The backbone of Julearn is the function :func:`.run_cross_validation`, which +The backbone of ``julearn`` is the function :func:`.run_cross_validation`, which allows you to do all the *magic*. All important information needed to estimate your machine learning workflow's performance goes into this function, specified via its parameters. -But why is basically everything based on one `cross_validation` function? Well, +But why is basically everything based on one *cross-validation* function? Well, because doing proper cross-validation is of utmost importance in machine learning and it is not as easy as it might seem at first glance. If you want to understand why, reading the sub-chapter :ref:`cross_validation` is a good @@ -19,8 +18,8 @@ starting point. Once you are familiar with the basics of *cross-validation*, you can follow along the other sub-chapters to learn how to setup a basic workflow using -Julearn's :func:`.run_cross_validation`. There you can find out more about the -required data, building a basic pipeline and how to evaluate your model's +``julearn``'s :func:`.run_cross_validation`. There you can find out more about +the required data, building a basic pipeline and how to evaluate your model's performance. .. toctree:: @@ -36,5 +35,6 @@ If you are just interested in seeing all parameters of :func:`.run_cross_validation`, click on the function link to have a look at all its parameters in the :ref:`api`. -If you are already familiar with how to set up a basic workflow using Julearn -and want to do more fancy stuff, go to :ref:`selected_deeper_topics`. +If you are already familiar with how to set up a basic workflow using +``julearn`` and want to do more fancy stuff, go to +:ref:`selected_deeper_topics`. diff --git a/examples/00_starting/README.rst b/examples/00_starting/README.rst index caca58ed8..cd3675722 100644 --- a/examples/00_starting/README.rst +++ b/examples/00_starting/README.rst @@ -1,4 +1,4 @@ -Starting with julearn -===================== +Starting with ``julearn`` +========================= -Examples showing how to use basic julearn functionality. \ No newline at end of file +Examples showing how to use basic ``julearn`` functionality. diff --git a/examples/00_starting/plot_cm_acc_multiclass.py b/examples/00_starting/plot_cm_acc_multiclass.py index 12e69bb1f..2df333de0 100644 --- a/examples/00_starting/plot_cm_acc_multiclass.py +++ b/examples/00_starting/plot_cm_acc_multiclass.py @@ -2,7 +2,7 @@ Multiclass Classification ========================= -This example uses the 'iris' dataset and performs multiclass +This example uses the ``iris`` dataset and performs multiclass classification using a Support Vector Machine classifier and plots heatmaps for cross-validation accuracies and plots confusion matrix for the test data. 
@@ -10,7 +10,6 @@ """ # Authors: Shammi More # Federico Raimondo -# # License: AGPL import pandas as pd @@ -61,7 +60,7 @@ ############################################################################### # The scores dataframe has all the values for each CV split. -print(scores.head()) +scores.head() ############################################################################### # Now we can get the accuracy per fold and repetition: @@ -69,7 +68,7 @@ df_accuracy = scores.set_index(["repeat", "fold"])["test_accuracy"].unstack() df_accuracy.index.name = "Repeats" df_accuracy.columns.name = "K-fold splits" -print(df_accuracy) +df_accuracy ############################################################################### # Plot heatmap of accuracy over all repeats and CV splits @@ -86,6 +85,7 @@ cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true)) print(cm) + ############################################################################### # Now that we have our confusion matrix, let's build another matrix with # annotations. @@ -102,12 +102,14 @@ else: s = cm_sum[i] annot[i, j] = "%.1f%%\n%d/%d" % (p, c, s) + ############################################################################### # Finally we create another dataframe with the confusion matrix and plot # the heatmap with annotations. cm = pd.DataFrame(cm, index=np.unique(y_true), columns=np.unique(y_true)) cm.index.name = "Actual" cm.columns.name = "Predicted" + fig, ax = plt.subplots(1, 1, figsize=(10, 7)) sns.heatmap(cm, cmap="YlGnBu", annot=annot, fmt="", ax=ax) plt.title("Confusion matrix") diff --git a/examples/00_starting/plot_example_regression.py b/examples/00_starting/plot_example_regression.py index 49ade565d..87bebc032 100644 --- a/examples/00_starting/plot_example_regression.py +++ b/examples/00_starting/plot_example_regression.py @@ -2,13 +2,12 @@ Regression Analysis =================== -This example uses the 'diabetes' data from sklearn datasets and performs +This example uses the ``diabetes`` data from ``sklearn datasets`` and performs a regression analysis using a Ridge Regression model. """ # Authors: Shammi More # Federico Raimondo -# # License: AGPL import pandas as pd @@ -23,15 +22,15 @@ from julearn.utils import configure_logging ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# load the diabetes data from sklearn as a pandas dataframe +# Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``. features, target = load_diabetes(return_X_y=True, as_frame=True) ############################################################################### -# Dataset contains ten variables age, sex, body mass index, average blood +# Dataset contains ten variables age, sex, body mass index, average blood # pressure, and six blood serum measurements (s1-s6) diabetes patients and # a quantitative measure of disease progression one year after baseline which # is the target we are interested in predicting. @@ -48,8 +47,9 @@ y = "target" ############################################################################### -# calculate correlations between the features/variables and plot it as heat map +# Calculate correlations between the features/variables and plot it as heat map. 
corr = data_diabetes.corr() + fig, ax = plt.subplots(1, 1, figsize=(10, 7)) sns.set(font_scale=1.2) sns.heatmap( @@ -60,14 +60,13 @@ fmt="0.1f", ) - ############################################################################### -# Split the dataset into train and test +# Split the dataset into train and test. train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3) ############################################################################### # Train a ridge regression model on train dataset and use mean absolute error -# for scoring +# for scoring. scores, model = run_cross_validation( X=X, y=y, @@ -82,7 +81,7 @@ ############################################################################### # The scores dataframe has all the values for each CV split. -print(scores.head()) +scores.head() ############################################################################### # Mean value of mean absolute error across CV @@ -95,17 +94,16 @@ df_mae.index.name = "Repeats" df_mae.columns.name = "K-fold splits" -print(df_mae) +df_mae ############################################################################### -# Plot heatmap of mean absolute error (MAE) over all repeats and CV splits +# Plot heatmap of mean absolute error (MAE) over all repeats and CV splits. fig, ax = plt.subplots(1, 1, figsize=(10, 7)) sns.heatmap(df_mae, cmap="YlGnBu") plt.title("Cross-validation MAE") ############################################################################### -# Let's plot the feature importance using the coefficients of the trained model - +# Let's plot the feature importance using the coefficients of the trained model. features = pd.DataFrame({"Features": X, "importance": model["ridge"].coef_}) features.sort_values(by=["importance"], ascending=True, inplace=True) features["positive"] = features["importance"] > 0 @@ -117,10 +115,9 @@ ) ax.set(xlabel="Importance", title="Variable importance for Ridge Regression") - ############################################################################### # Use the final model to make predictions on test data and plot scatterplot -# of true values vs predicted values +# of true values vs predicted values. y_true = test_diabetes[y] y_pred = model.predict(test_diabetes[X]) diff --git a/examples/00_starting/plot_stratified_kfold_reg.py b/examples/00_starting/plot_stratified_kfold_reg.py index b0cc38108..6f5c0e419 100644 --- a/examples/00_starting/plot_stratified_kfold_reg.py +++ b/examples/00_starting/plot_stratified_kfold_reg.py @@ -2,18 +2,16 @@ Stratified K-fold CV for regression analysis ============================================ -This example uses the 'diabetes' data from sklearn datasets to +This example uses the ``diabetes`` data from ``sklearn datasets`` to perform stratified Kfold CV for a regression problem, .. include:: ../../links.inc """ - # Authors: Shammi More # Federico Raimondo # Leonard Sasse # License: AGPL -import math import pandas as pd import seaborn as sns import matplotlib.pyplot as plt @@ -25,15 +23,15 @@ from julearn.model_selection import ContinuousStratifiedKFold ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# load the diabetes data from sklearn as a pandas dataframe +# Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``. 
features, target = load_diabetes(return_X_y=True, as_frame=True) ############################################################################### -# Dataset contains ten variables age, sex, body mass index, average blood +# Dataset contains ten variables age, sex, body mass index, average blood # pressure, and six blood serum measurements (s1-s6) diabetes patients and # a quantitative measure of disease progression one year after baseline which # is the target we are interested in predicting. @@ -44,7 +42,7 @@ ############################################################################### # Let's combine features and target together in one dataframe and create some # outliers to see the difference in model performance with and without -# stratification +# stratification. data_df = pd.concat([features, target], axis=1) @@ -54,7 +52,7 @@ data_df = pd.concat([data_df, new_df], axis=0) data_df = data_df.reset_index(drop=True) -# define X, y +# Define X, y X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"] y = "target" @@ -80,12 +78,10 @@ # represented in each fold. Lets continue with 40 bins which gives a good # granularity. -cv_stratified = ContinuousStratifiedKFold( - n_bins=40, n_splits=5, shuffle=False -) +cv_stratified = ContinuousStratifiedKFold(n_bins=40, n_splits=5, shuffle=False) ############################################################################### -# Train a linear regression model with stratification on target +# Train a linear regression model with stratification on target. scores_strat, model = run_cross_validation( X=X, @@ -100,7 +96,7 @@ ) ############################################################################### -# Train a linear regression model without stratification on target +# Train a linear regression model without stratification on target. cv = KFold(n_splits=5, shuffle=False, random_state=None) scores, model = run_cross_validation( @@ -117,7 +113,7 @@ ############################################################################### # Now we can compare the test score for model trained with and without -# stratification. We can combine the two outputs as pandas dataframes +# stratification. We can combine the two outputs as ``pandas.DataFrame``. scores_strat["model"] = "With stratification" scores["model"] = "Without stratification" @@ -126,7 +122,7 @@ ############################################################################### # Plot a boxplot with test scores from both the models. We see here that -# the test score is higher when CV splits were not stratified +# the test score is higher when CV splits were not stratified. fig, ax = plt.subplots(1, 1, figsize=(10, 7)) sns.set_style("darkgrid") diff --git a/examples/00_starting/run_combine_pandas.py b/examples/00_starting/run_combine_pandas.py index ec2bea5c6..8cf7a18a3 100644 --- a/examples/00_starting/run_combine_pandas.py +++ b/examples/00_starting/run_combine_pandas.py @@ -1,53 +1,53 @@ """ -Working with pandas -=================== +Working with ``pandas`` +======================= -This example uses the 'fmri' dataset to transform and combine data in order -to prepare it to bse used by julearn. +This example uses the ``fmri`` dataset to transform and combine data in order +to prepare it to be used by ``julearn``. References ---------- -Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of -cognitive control in context-dependent decision-making. Cerebral Cortex. + Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). 
Adaptive engagement of + cognitive control in context-dependent decision-making. Cerebral Cortex. .. include:: ../../links.inc """ # Authors: Federico Raimondo # # License: AGPL + from seaborn import load_dataset import pandas as pd ############################################################################### -# One of the key elements that make julearn easy to use, is the possibility to -# work directly with pandas data frames. Also known as excel spreadsheets or -# csv files. +# One of the key elements that make ``julearn`` easy to use, is the possibility +# to work directly with ``pandas.DataFrame``, similar to MS Excel spreadsheets +# or csv files. # -# Ideally, we will have everything tabulated and organised for julearn, but it -# might not be your case. You might have some files with the fMRI values, some -# others with demographics, some other with diagnostic metrics or behavioural +# Ideally, we will have everything tabulated and organised for ``julearn``, but +# it might not be your case. You might have some files with the fMRI values, some +# others with demographics, some other with diagnostic metrics or behavioral # results. # -# You need to prepare this files for julearn. +# You need to prepare these files for ``julearn``. # # One option is to manually edit the files and make sure that everything is -# ready to do some machine-learning. However, this is prune to introduce -# errors. +# ready to do some machine-learning. However, this is error-prone. # # Fortunately, `pandas`_ provides several tools to deal with this task. # -# This example is a collection on some of this useful methods +# This example is a collection of some of these useful methods. # -# Lets start with the fmri dataset. +# Let's start with the ``fmri`` dataset. df_fmri = load_dataset("fmri") ############################################################################### -# Lets see what this dataset has. +# Let's see what this dataset has. # -print(df_fmri.head()) +df_fmri.head() ############################################################################### # From long to wide format @@ -63,7 +63,7 @@ ) ############################################################################### -# This method reshapes the table, keeping a the specified elements as index, +# This method reshapes the table, keeping the specified elements as index, # columns and values. # # In our case, the values are extracted from the *signal* column. The columns @@ -72,36 +72,35 @@ # # The index is what identifies each sample. As a rule, the index can't be # duplicated. If each subject has more than one timepoint, and each timepoint -# has more than one event, then this 3 elements are needed as the index. +# has more than one event, then these 3 elements are needed as the index. # -# Lets see what we have here: -print(df_fmri.head()) - +# Let's see what we have here: +df_fmri.head() ############################################################################### # Now this is in the format we want. However, in order to access the index -# as columns ``df_fmri['subject']`` we need to reset the index. +# as columns ``df_fmri["subject"]`` we need to reset the index. # -# Check the sutil but important difference: +# Check the subtle but important difference: df_fmri = df_fmri.reset_index() -print(df_fmri.head()) +df_fmri.head() ############################################################################### # Merging or joining ``DataFrame`` # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # -# So now we have our fMRI data tabulated for julearn. 
However, it might be the -# case that we have some important information in another file. For example, +# So now we have our fMRI data tabulated for ``julearn``. However, it might be +# the case that we have some important information in another file. For example, # the subjects' age and the place where they were scanned. # -# For the purpose of the example, I will create the dataframe here. +# For the purpose of the example, we'll create the dataframe here. metadata = { "subject": [f"s{i}" for i in range(14)], "age": [23, 21, 31, 29, 43, 23, 43, 28, 48, 29, 35, 23, 34, 25], "scanner": ["a"] * 6 + ["b"] * 8, } df_meta = pd.DataFrame(metadata) -print(df_meta) +df_meta ############################################################################### # We will use the ``join`` method. This method will join the two dataframes, @@ -112,10 +111,10 @@ df_fmri = df_fmri.set_index("subject") df_meta = df_meta.set_index("subject") df_fmri = df_fmri.join(df_meta) -print(df_fmri) +df_fmri ############################################################################### -# Finally, lets reset the index and have it ready for julearn +# Finally, let's reset the index and have it ready for ``julearn``. df_fmri = df_fmri.reset_index() ############################################################################### @@ -135,18 +134,19 @@ columns="event", values=["frontal", "parietal"], ) +df_fmri -print(df_fmri) ############################################################################### -# Since the columns names are combinations of the values in the *event* column +# Since the column names are combinations of the values in the *event* column # and the previous *frontal* and *parietal* columns, it is now a multi-level # column name. -print(df_fmri.columns) +df_fmri.columns + ############################################################################### # The following trick will join the different levels using an underscore (*_*) df_fmri.columns = ["_".join(x) for x in df_fmri.columns] +df_fmri -print(df_fmri) ############################################################################### -# We have finally the information we want. We can now reset the index +# We have finally the information we want. We can now reset the index. df_fmri = df_fmri.reset_index() diff --git a/examples/00_starting/run_grouped_cv.py b/examples/00_starting/run_grouped_cv.py index b9767f91f..9363c4ef8 100644 --- a/examples/00_starting/run_grouped_cv.py +++ b/examples/00_starting/run_grouped_cv.py @@ -2,14 +2,14 @@ Grouped CV ========== -This example uses the 'fMRI' dataset and performs GroupKFold +This example uses the ``fMRI`` dataset and performs GroupKFold Cross-Validation for classification using Random Forest Classifier. References ---------- -Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of -cognitive control in context-dependent decision-making. Cerebral Cortex. + Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of + cognitive control in context-dependent decision-making. Cerebral Cortex. .. 
include:: ../../links.inc @@ -18,10 +18,8 @@ # Authors: Federico Raimondo # Shammi More # Kimia Nazarzadeh - - -# # License: AGPL + # Importing the necessary Python libraries import numpy as np @@ -43,7 +41,6 @@ ############################################################################### # First, lets get some information on what the dataset has: -# print(df_fmri.head()) diff --git a/examples/00_starting/run_simple_binary_classification.py b/examples/00_starting/run_simple_binary_classification.py index 3a1b19abf..3d2d2a168 100644 --- a/examples/00_starting/run_simple_binary_classification.py +++ b/examples/00_starting/run_simple_binary_classification.py @@ -2,7 +2,7 @@ Simple Binary Classification ============================ -This example uses the 'iris' dataset and performs a simple binary +This example uses the ``iris`` dataset and performs a simple binary classification using a Support Vector Machine classifier. .. include:: ../../links.inc @@ -10,6 +10,7 @@ # Authors: Federico Raimondo # # License: AGPL + from seaborn import load_dataset from julearn import run_cross_validation from julearn.utils import configure_logging @@ -54,8 +55,8 @@ ############################################################################### # If we compute the `accuracy`, we might not account for this imbalance. A more -# suitable metric is the `balanced_accuracy`. More information in scikit-learn: -# :func:`~sklearn.metrics.balanced_accuracy_score`. +# suitable metric is the `balanced_accuracy`. More information in +# ``scikit-learn``: :func:`~sklearn.metrics.balanced_accuracy_score`. # # We will also set the random seed so we always split the data in the same way. scores = run_cross_validation( @@ -72,7 +73,6 @@ print(scores["test_accuracy"].mean()) print(scores["test_balanced_accuracy"].mean()) - ############################################################################### # Other kind of metrics allows us to evaluate how good our model is to detect # specific targets. Suppose we want to create a model that correctly identifies @@ -80,7 +80,7 @@ # # Now we might want to evaluate the precision score, or the ratio of true # positives (tp) over all positives (true and false positives). More -# information in scikit-learn: :func:`~sklearn.metrics.precision_score`. +# information in ``scikit-learn``: :func:`~sklearn.metrics.precision_score`. # # For this metric to work, we need to define which are our `positive` values. # In this example, we are interested in detecting `versicolor`. diff --git a/examples/01_model_comparison/plot_simple_model_comparison.py b/examples/01_model_comparison/plot_simple_model_comparison.py index b48348bd3..358eee805 100644 --- a/examples/01_model_comparison/plot_simple_model_comparison.py +++ b/examples/01_model_comparison/plot_simple_model_comparison.py @@ -2,7 +2,7 @@ Simple Model Comparison ======================= -This example uses the 'iris' dataset and performs binary classifications +This example uses the ``iris`` dataset and performs binary classifications using different models. At the end, it compares the performance of the models using different scoring functions and performs a statistical test to assess whether the difference in performance is significant. @@ -10,7 +10,6 @@ .. 
include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL from seaborn import load_dataset @@ -20,7 +19,7 @@ from julearn.stats.corrected_ttest import corrected_ttest ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### @@ -59,7 +58,7 @@ ############################################################################### # So we will choose to use the `balanced_accuracy` and `roc_auc` metrics. -# + scoring = ["balanced_accuracy", "roc_auc"] ############################################################################### @@ -115,7 +114,7 @@ scores3["model"] = "svm_linear" ############################################################################### -# We can now compare the performance of the models using corrected statistics +# We can now compare the performance of the models using corrected statistics. stats_df = corrected_ttest(scores1, scores2, scores3) print(stats_df) @@ -135,10 +134,11 @@ # sphinx_gallery_end_ignore ############################################################################### -# We can also plot the performance of the models using the Julearn Score Viewer -# +# We can also plot the performance of the models using the ``julearn`` Score +# Viewer. from julearn.viz import plot_scores + panel = plot_scores(scores1, scores2, scores3) # panel.show() # uncomment the previous line show the plot diff --git a/examples/02_inspection/plot_groupcv_inspect_svm.py b/examples/02_inspection/plot_groupcv_inspect_svm.py index 129173f92..bd13bf7a0 100644 --- a/examples/02_inspection/plot_groupcv_inspect_svm.py +++ b/examples/02_inspection/plot_groupcv_inspect_svm.py @@ -2,20 +2,19 @@ Inspecting SVM models ===================== -This example uses the 'fmri' dataset, performs simple binary classification +This example uses the ``fmri`` dataset, performs simple binary classification using a Support Vector Machine classifier and analyse the model. References ---------- -Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of -cognitive control in context-dependent decision-making. Cerebral Cortex. + Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of + cognitive control in context-dependent decision-making. Cerebral Cortex. .. include:: ../../links.inc """ # Authors: Federico Raimondo # Shammi More -# # License: AGPL import numpy as np @@ -32,7 +31,7 @@ from julearn.inspect import preprocess ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") @@ -44,7 +43,7 @@ ############################################################################### # First, lets get some information on what the dataset has: -# + print(df_fmri.head()) ############################################################################### @@ -126,12 +125,12 @@ # To test for unseen subject, we need to make sure that all the data from each # subject is either on the training or the testing set, but not in both. # -# We can use scikit-learn's :class:`sklearn.model_selection.GroupShuffleSplit`. -# And specify which is the grouping column using the `group` parameter. 
-# -# By setting `return_estimator='final'`, the :func:`.run_cross_validation` -# function return the estimator fitted with all the data. We will use this -# later to do some analysis. +# We can use ``scikit-learn``'s +# :class:`sklearn.model_selection.GroupShuffleSplit` and specify which is the +# grouping column using the ``group`` parameter. +# By setting ``return_estimator="final"``, the :func:`.run_cross_validation` +# function returns the estimator fitted with all the data. We will use this +# later to do some analyses. cv = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=42) scores, model = run_cross_validation( @@ -152,7 +151,7 @@ # After testing on independent subjects, we can now claim that given a new # subject, we can predict the kind of event. # -# Lets do some visualization on how these two features interact and what +# Let's do some visualization on how these two features interact and what # the preprocessing part of the model does. # Plot the raw features @@ -184,10 +183,10 @@ # In this case, the preprocessing is nothing more than a # :class:`sklearn.preprocessing.StandardScaler`. # -# It seems that the data is not quite linearly separable. Lets now visualize +# It seems that the data is not quite linearly separable. Let's now visualize # how the SVM does this complex task. -# get the model from the pipeline +# Get the model from the pipeline clf = model[2] fig = plt.figure() ax = sns.scatterplot( @@ -201,13 +200,13 @@ xlim = ax.get_xlim() ylim = ax.get_ylim() -# create grid to evaluate model +# Create grid to evaluate model xx = np.linspace(xlim[0], xlim[1], 30) yy = np.linspace(ylim[0], ylim[1], 30) YY, XX = np.meshgrid(yy, xx) xy = np.vstack([XX.ravel(), YY.ravel()]).T -# Create pandas dataframe +# Create pandas.DataFrame xy_df = pd.DataFrame( data=xy, columns=["parietal__:type:__continuous", "frontal__:type:__continuous"], diff --git a/examples/02_inspection/plot_inspect_random_forest.py b/examples/02_inspection/plot_inspect_random_forest.py index fa7a6bae8..ddfeb70b4 100644 --- a/examples/02_inspection/plot_inspect_random_forest.py +++ b/examples/02_inspection/plot_inspect_random_forest.py @@ -2,14 +2,14 @@ Inspecting Random Forest models =============================== -This example uses the 'iris' dataset, performs simple binary classification +This example uses the ``iris`` dataset, performs simple binary classification using a Random Forest classifier and analyse the model. .. include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL + import pandas as pd import matplotlib.pyplot as plt @@ -20,7 +20,7 @@ from julearn.utils import configure_logging ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### @@ -55,8 +55,8 @@ ############################################################################### # This type of classifier has an internal variable that can inform us on how -# _important_ is each of the features. Caution: read the proper scikit-learn -# documentation :class:`~sklearn.ensemble.RandomForestClassifier` to understand\ +# *important* is each of the features. Caution: read the proper ``scikit-learn`` +# documentation :class:`~sklearn.ensemble.RandomForestClassifier` to understand # how this learning algorithm works. 
rf = model_iris["rf"] @@ -73,7 +73,7 @@ fig.tight_layout() ############################################################################### -# However, some reviewers (including myself), might wander about the +# However, some reviewers (including us), might wander about the # variability of the importance of these features. In the previous example # all the feature importances were obtained by fitting on the entire dataset, # while the performance was estimated using cross validation. @@ -92,7 +92,7 @@ ) ############################################################################### -# Now we can obtain the feature importance for each estimator (CV fold) +# Now we can obtain the feature importance for each estimator (CV fold). to_plot = [] for i_fold, estimator in enumerate(scores["estimator"]): this_importances = pd.DataFrame( @@ -107,7 +107,7 @@ to_plot = pd.concat(to_plot) ############################################################################### -# Finally, we can plot the variable importances for each fold +# Finally, we can plot the variable importances for each fold. fig, ax = plt.subplots(1, 1, figsize=(6, 4)) sns.swarmplot(x="importance", y="variable", data=to_plot, ax=ax) diff --git a/examples/02_inspection/plot_preprocess.py b/examples/02_inspection/plot_preprocess.py index 3f4710036..1f10c8029 100644 --- a/examples/02_inspection/plot_preprocess.py +++ b/examples/02_inspection/plot_preprocess.py @@ -2,13 +2,12 @@ Preprocessing with variance threshold, zscore and PCA ===================================================== -This example uses the 'make_regression' function to create a simple dataset, -performs a simple regression after the pre-processing of the features +This example uses the ``make_regression`` function to create a simple dataset, +performs a simple regression after the preprocessing of the features including removal of low variance features, feature normalization for only two features using zscore and feature reduction using PCA. We will check the features after each preprocessing step. """ - # Authors: Shammi More # Leonard Sasse # License: AGPL @@ -23,13 +22,12 @@ from julearn.pipeline import PipelineCreator from julearn.utils import configure_logging - ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# Create a dataset using sklearn's make_regression +# Create a dataset using ``sklearn`` ``make_regression``. df = pd.DataFrame() X, y = [f"Feature {x}" for x in range(1, 5)], "y" df[X], df[y] = make_regression( @@ -45,7 +43,7 @@ X_types = {"X_to_zscore": first_two} ############################################################################### -# Let's look at the summary statistics of the raw features +# Let's look at the summary statistics of the raw features. print("Summary Statistics of the raw features : \n", df.describe()) ############################################################################### @@ -54,13 +52,13 @@ # features. We will zscore the target and then train a random forest model. # Since we use the PipelineCreator object we have to explicitly declare which # `X_types` each preprocessing step should be applied to. 
If we do not declare -# the type in the 'add' method using the 'apply_to' keyword argument, -# the step will default to 'continuous' or to another type that can be declared -# in the 'init' method of the 'PipelineCreator'. -# To transform the target we could set 'apply_to='target'', which is a special +# the type in the ``add`` method using the ``apply_to`` keyword argument, +# the step will default to ``"continuous"`` or to another type that can be +# declared in the constructor of the ``PipelineCreator``. +# To transform the target we could set ``apply_to="target"``, which is a special # type, that cannot be user-defined. Please note also that if a step is added # to transform the target, you also have to explicitly add the model that is -# to be used in the regression to the 'PipelineCreator'. +# to be used in the regression to the ``PipelineCreator``. # Define model parameters and preprocessing steps first # The hyperparameters for each step can be added as a keyword argument and @@ -68,7 +66,7 @@ # search. # Setting the threshold for variance to 0.15, number of PCA components to 2 -# and number of trees for random forest to 200 +# and number of trees for random forest to 200. # By setting "apply_to=*", we can apply the preprocessing step to all features. pipeline_creator = PipelineCreator(problem_type="regression") @@ -79,9 +77,9 @@ pipeline_creator.add("rf", apply_to="*", n_estimators=200) # Because we have already added the model to the pipeline creator, we can -# simply drop in the pipeline_creator as a model. If we did not add a model -# here, we could add the pipeline_creator using the keyword argument -# 'preprocess' and hand over a model separately. +# simply drop in the ``pipeline_creator`` as a model. If we did not add a model +# here, we could add the ``pipeline_creator`` using the keyword argument +# ``preprocess`` and hand over a model separately. scores, model = run_cross_validation( X=X, @@ -114,12 +112,12 @@ print(X_after_zscore) # However, to make this less confusing you can also simply use the high-level -# function 'preprocess' to explicitly refer to a pipeline step by name: +# function ``preprocess`` to explicitly refer to a pipeline step by name: X_after_pca = preprocess(model, X=X, data=df, until="pca") X_after_zscore = preprocess(model, X=X, data=df, until="zscore") -# Let's plot scatter plots for raw features and the PCA components +# Let's plot scatter plots for raw features and the PCA components. fig, axes = plt.subplots(1, 2, figsize=(12, 6)) sns.scatterplot(x=X[0], y=X[1], data=df, ax=axes[0]) axes[0].set_title("Raw features") @@ -133,5 +131,3 @@ "Summary Statistics of the zscored features : \n", X_after_zscore.describe(), ) - -############################################################################### diff --git a/examples/02_inspection/run_binary_inspect_folds.py b/examples/02_inspection/run_binary_inspect_folds.py index 81c6deb06..e59129daa 100644 --- a/examples/02_inspection/run_binary_inspect_folds.py +++ b/examples/02_inspection/run_binary_inspect_folds.py @@ -2,7 +2,7 @@ Inspecting the fold-wise predictions ==================================== -This example uses the 'iris' dataset and performs a simple binary +This example uses the ``iris`` dataset and performs a simple binary classification using a Support Vector Machine classifier. We later inspect the predictions of the model for each fold. @@ -10,8 +10,8 @@ .. 
include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL + from seaborn import load_dataset from sklearn.model_selection import RepeatedStratifiedKFold, ShuffleSplit @@ -21,7 +21,7 @@ from julearn.utils import configure_logging ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### @@ -44,7 +44,7 @@ creator.add("zscore") creator.add("svm") -cv = ShuffleSplit(n_splits=5, train_size=.7, random_state=200) +cv = ShuffleSplit(n_splits=5, train_size=0.7, random_state=200) cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=200) scores, model, inspector = run_cross_validation( diff --git a/examples/03_complex_models/run_apply_to_target.py b/examples/03_complex_models/run_apply_to_target.py index 94b2c15c0..a24ccc42a 100644 --- a/examples/03_complex_models/run_apply_to_target.py +++ b/examples/03_complex_models/run_apply_to_target.py @@ -2,7 +2,7 @@ Transforming target variable with z-score ========================================= -This example uses the sklearn "diabetes" regression dataset, and transforms the +This example uses the sklearn ``diabetes`` regression dataset, and transforms the target variable, in this case, using z-score. Then, we perform a regression analysis using Ridge Regression model. @@ -13,28 +13,22 @@ # License: AGPL import pandas as pd -import seaborn as sns -import matplotlib.pyplot as plt from sklearn.datasets import load_diabetes from sklearn.model_selection import train_test_split from julearn import run_cross_validation from julearn.utils import configure_logging -# this is crucial for creating the model in the new version from julearn.pipeline import PipelineCreator, TargetPipelineCreator - ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") - ############################################################################### -# Load the diabetes dataset from sklearn as a pandas dataframe +# Load the diabetes dataset from ``sklearn`` as a ``pandas.DataFrame``. features, target = load_diabetes(return_X_y=True, as_frame=True) - ############################################################################### # Dataset contains ten variables age, sex, body mass index, average blood # pressure, and six blood serum measurements (s1-s6) diabetes patients and @@ -43,20 +37,18 @@ print("Features: \n", features.head()) print("Target: \n", target.describe()) - ############################################################################### # Let's combine features and target together in one dataframe and define X -# and y +# and y. data_diabetes = pd.concat([features, target], axis=1) X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"] y = "target" ############################################################################### -# Split the dataset into train and test +# Split the dataset into train and test. train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3) - ############################################################################### # Let's create the model. Since we will be transforming the target variable # we will first need to create a TargetPipelineCreator for this. 
@@ -64,7 +56,6 @@
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
-
##############################################################################
# Now we can create the pipeline using a PipelineCreator.
creator = PipelineCreator(problem_type="regression")
diff --git a/examples/03_complex_models/run_example_pca_featsets.py b/examples/03_complex_models/run_example_pca_featsets.py
index 414eef3b5..14571735e 100644
--- a/examples/03_complex_models/run_example_pca_featsets.py
+++ b/examples/03_complex_models/run_example_pca_featsets.py
@@ -2,10 +2,10 @@
Regression Analysis
===================
-This example uses the 'diabetes' data from sklearn datasets and performs
-a regression analysis using a Ridge Regression model. I will use the Julearn
-PipelineCreator to create a pipeline with two different PCA steps are used
-to reduce the dimensionality of the data, each one computed on a different
+This example uses the ``diabetes`` data from ``sklearn.datasets`` and performs
+a regression analysis using a Ridge Regression model. We'll use ``julearn``'s
+``PipelineCreator`` to create a pipeline with two different PCA steps to
+reduce the dimensionality of the data, each one computed on a different
subset of features.
"""
@@ -13,7 +13,6 @@
#          Kaustubh R. Patil
#          Shammi More
#          Federico Raimondo
-#
# License: AGPL
import pandas as pd
@@ -30,11 +29,11 @@
from julearn.inspect import preprocess
###############################################################################
-# Set the logging level to info to see extra information
+# Set the logging level to info to see extra information.
configure_logging(level="INFO")
###############################################################################
-# load the diabetes data from sklearn as a pandas dataframe
+# Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.
features, target = load_diabetes(return_X_y=True, as_frame=True)
###############################################################################
@@ -55,9 +54,8 @@
y = "target"
###############################################################################
-# Assign types to the features
-# and create feature groups for PCA
-# we will keep 1 component per PCA group
+# Assign types to the features and create feature groups for PCA.
+# We will keep 1 component per PCA group.
X_types = {
    "pca1": ["age", "bmi", "bp"],
    "pca2": ["s1", "s2", "s3", "s4", "s5", "s6"],
}
###############################################################################
# Create a pipeline to process the data and the fit a model. We must specify
-# how each X_type will be used. For example if in the last step we do not
-# specify `apply_to=['continuous', 'categorical']`, then the pipeline will not
+# how each ``X_type`` will be used. For example if in the last step we do not
+# specify ``apply_to=["continuous", "categorical"]``, then the pipeline will not
# know what to do with the categorical features.
creator = PipelineCreator(problem_type="regression")
creator.add("pca", apply_to="pca1", n_components=1, name="pca_feats1")
@@ -75,12 +73,12 @@
creator.add("ridge", apply_to=["continuous", "categorical"])
###############################################################################
-# Split the dataset into train and test
+# Split the dataset into train and test.
train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3) ############################################################################### # Train a ridge regression model on train dataset and use mean absolute error -# for scoring/ +# for scoring. scores, model = run_cross_validation( X=X, y=y, @@ -96,23 +94,23 @@ print(scores.head()) ############################################################################### -# Mean value of mean absolute error across CV +# Mean value of mean absolute error across CV. print(scores["test_score"].mean()) - ############################################################################### -# Let's see how the data looks like after preprocessing -# We will process the data until the first PCA step -# We should get the first -# PCA component for ['age', 'bmi', 'bp'] and other features untouched +# Let's see how the data looks like after preprocessing. We will process the +# data until the first PCA step. We should get the first PCA component for +# ["age", "bmi", "bp"] and leave other features untouched. data_processed1 = preprocess(model, X, data=train_diabetes, until="pca_feats1") print("Data after preprocessing until PCA step 1") -print(data_processed1.head()) -# We will process the data until the second PCA step -# We should now also get one PCA component for ['s1', 's2', 's3', 's4', 's5', 's6'] +data_processed1.head() + +############################################################################### +# We will process the data until the second PCA step. We should now also get +# one PCA component for ["s1", "s2", "s3", "s4", "s5", "s6"]. data_processed2 = preprocess(model, X, data=train_diabetes, until="pca_feats2") print("Data after preprocessing until PCA step 2") -print(data_processed2.head()) +data_processed2.head() ############################################################################### # Now we can get the MAE fold and repetition: @@ -123,14 +121,14 @@ print(df_mae) ############################################################################### -# Plot heatmap of mean absolute error (MAE) over all repeats and CV splits +# Plot heatmap of mean absolute error (MAE) over all repeats and CV splits. fig, ax = plt.subplots(1, 1, figsize=(10, 7)) sns.heatmap(df_mae, cmap="YlGnBu") plt.title("Cross-validation MAE") ############################################################################### # Use the final model to make predictions on test data and plot scatterplot -# of true values vs predicted values +# of true values vs predicted values. y_true = test_diabetes[y] y_pred = model.predict(test_diabetes[X]) mae = format(mean_absolute_error(y_true, y_pred), ".2f") diff --git a/examples/03_complex_models/run_hyperparameter_multiple_grids.py b/examples/03_complex_models/run_hyperparameter_multiple_grids.py index c3d8a658b..db656be6f 100644 --- a/examples/03_complex_models/run_hyperparameter_multiple_grids.py +++ b/examples/03_complex_models/run_hyperparameter_multiple_grids.py @@ -2,22 +2,21 @@ Tuning Multiple Hyperparameters Grids ===================================== -This example uses the 'fmri' dataset, performs simple binary classification +This example uses the ``fmri`` dataset, performs simple binary classification using a Support Vector Machine classifier while tuning multiple hyperparameters grids at the same time. - References ---------- -Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of -cognitive control in context-dependent decision-making. Cerebral Cortex. 
+ Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of + cognitive control in context-dependent decision-making. Cerebral Cortex. .. include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL + import numpy as np from seaborn import load_dataset @@ -26,33 +25,33 @@ from julearn.pipeline import PipelineCreator ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# Set the random seed to always have the same example +# Set the random seed to always have the same example. np.random.seed(42) ############################################################################### -# Load the dataset +# Load the dataset. df_fmri = load_dataset("fmri") -print(df_fmri.head()) +df_fmri.head() ############################################################################### -# Set the dataframe in the right format +# Set the dataframe in the right format. df_fmri = df_fmri.pivot( index=["subject", "timepoint", "event"], columns="region", values="signal" ) df_fmri = df_fmri.reset_index() -print(df_fmri.head()) - -X = ["frontal", "parietal"] -y = "event" +df_fmri.head() ############################################################################### # Lets do a first attempt and use a linear SVM with the default parameters. +X = ["frontal", "parietal"] +y = "event" + creator = PipelineCreator(problem_type="classification") creator.add("zscore") creator.add("svm", kernel="linear") @@ -62,10 +61,10 @@ print(scores["test_score"].mean()) ############################################################################### -# Now lets tune a bit this SVM. We will use a grid search to tune the -# regularization parameter C and the kernel. We will also tune the gamma. -# But since the gamma is only used for the rbf kernel, we will use a -# different grid for the rbf kernel. +# Now let's tune a bit this SVM. We will use a grid search to tune the +# regularization parameter ``C`` and the kernel. We will also tune the ``gamma``. +# But since the ``gamma`` is only used for the rbf kernel, we will use a +# different grid for the ``"rbf"`` kernel. # # To specify two different sets of parameters for the same step, we can # explicitly specify the name of the step. This is done by passing the @@ -96,6 +95,7 @@ ) print(scores["test_score"].mean()) + ############################################################################### # It seems that we might have found a better model, but which one is it? print(estimator.best_params_) diff --git a/examples/03_complex_models/run_hyperparameter_tuning.py b/examples/03_complex_models/run_hyperparameter_tuning.py index 6d1d3231b..82e206fa5 100644 --- a/examples/03_complex_models/run_hyperparameter_tuning.py +++ b/examples/03_complex_models/run_hyperparameter_tuning.py @@ -2,21 +2,20 @@ Tuning Hyperparameters ======================= -This example uses the 'fmri' dataset, performs simple binary classification -using a Support Vector Machine classifier and analyse the model. - +This example uses the ``fmri`` dataset, performs simple binary classification +using a Support Vector Machine classifier and analyze the model. References ---------- -Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of -cognitive control in context-dependent decision-making. Cerebral Cortex. 
+ Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of + cognitive control in context-dependent decision-making. Cerebral Cortex. .. include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL + import numpy as np from seaborn import load_dataset @@ -25,34 +24,32 @@ from julearn.pipeline import PipelineCreator ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# Set the random seed to always have the same example +# Set the random seed to always have the same example. np.random.seed(42) - ############################################################################### -# Load the dataset +# Load the dataset. df_fmri = load_dataset("fmri") -print(df_fmri.head()) +df_fmri.head() ############################################################################### -# Set the dataframe in the right format +# Set the dataframe in the right format. df_fmri = df_fmri.pivot( index=["subject", "timepoint", "event"], columns="region", values="signal" ) df_fmri = df_fmri.reset_index() -print(df_fmri.head()) +df_fmri.head() +############################################################################### +# Let's do a first attempt and use a linear SVM with the default parameters. X = ["frontal", "parietal"] y = "event" -############################################################################### -# Lets do a first attempt and use a linear SVM with the default parameters. - creator = PipelineCreator(problem_type="classification") creator.add("zscore") creator.add("svm", kernel="linear") @@ -62,9 +59,9 @@ print(scores["test_score"].mean()) ############################################################################### -# The score is not so good. Lets try to see if there is an optimal +# The score is not so good. Let's try to see if there is an optimal # regularization parameter (C) for the linear SVM. -# We will use a grid search to find the best C. +# We will use a grid search to find the best ``C``. creator = PipelineCreator(problem_type="classification") creator.add("zscore") @@ -113,9 +110,7 @@ creator = PipelineCreator(problem_type="classification") creator.add("zscore") -creator.add( - "svm", kernel="rbf", C=[0.01, 0.1], gamma=[1e-2, 1e-3] -) +creator.add("svm", kernel="rbf", C=[0.01, 0.1], gamma=[1e-2, 1e-3]) scores, estimator = run_cross_validation( X=X, @@ -135,9 +130,7 @@ creator = PipelineCreator(problem_type="classification") creator.add("zscore") -creator.add( - "svm", kernel="rbf", C=[0.01, 0.1], gamma=[1e-2, 1e-3, "scale"] -) +creator.add("svm", kernel="rbf", C=[0.01, 0.1], gamma=[1e-2, 1e-3, "scale"]) X = ["frontal", "parietal"] y = "event" diff --git a/examples/03_complex_models/run_stacked_models.py b/examples/03_complex_models/run_stacked_models.py index c695c381f..241d42e69 100644 --- a/examples/03_complex_models/run_stacked_models.py +++ b/examples/03_complex_models/run_stacked_models.py @@ -2,7 +2,7 @@ Stacking Classification ======================= -This example uses the 'iris' dataset and performs a complex stacking +This example uses the ``iris`` dataset and performs a complex stacking classification. We will use two different classifiers, one applied to petal features and one applied to sepal features. 
A final logistic regression classifier will be applied on the predictions of the two classifiers. @@ -10,15 +10,15 @@ .. include:: ../../links.inc """ # Authors: Federico Raimondo -# # License: AGPL + from seaborn import load_dataset from julearn import run_cross_validation from julearn.pipeline import PipelineCreator from julearn.utils import configure_logging ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### @@ -42,7 +42,6 @@ "petal": ["petal_length", "petal_width"], } - # Create the pipeline for the sepal features, by default will apply to "sepal" model_sepal = PipelineCreator(problem_type="classification", apply_to="sepal") model_sepal.add("filter_columns", apply_to="*", keep="sepal") diff --git a/examples/04_confounds/README.rst b/examples/04_confounds/README.rst index a705ee873..c96958f5e 100644 --- a/examples/04_confounds/README.rst +++ b/examples/04_confounds/README.rst @@ -1,4 +1,4 @@ Confounds ========= -Examples that show the confound-related functionality of Julearn. \ No newline at end of file +Examples that show the confound-related functionality of ``julearn``. diff --git a/examples/04_confounds/plot_confound_removal_classification.py b/examples/04_confounds/plot_confound_removal_classification.py index 81257cb10..024238978 100644 --- a/examples/04_confounds/plot_confound_removal_classification.py +++ b/examples/04_confounds/plot_confound_removal_classification.py @@ -2,11 +2,10 @@ Confound Removal (model comparison) =================================== -This example uses the 'iris' dataset, performs simple binary classification +This example uses the ``iris`` dataset, performs simple binary classification with and without confound removal using a Random Forest classifier. """ - # Authors: Shammi More # Federico Raimondo # Leonard Sasse @@ -22,13 +21,12 @@ from julearn.pipeline import PipelineCreator from julearn.utils import configure_logging - ############################################################################### -# Set the logging level to info to see extra information +# Set the logging level to info to see extra information. configure_logging(level="INFO") ############################################################################### -# Load the iris data from seaborn +# Load the iris data from seaborn. df_iris = load_dataset("iris") ############################################################################### @@ -36,7 +34,6 @@ # classification. df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])] - ############################################################################### # As features, we will use the sepal length, width and petal length and use # petal width as confound. @@ -57,15 +54,15 @@ # difference. If the 95% CI is above 0 (or below), we can claim that the models # are different with p < 0.05. # -# Lets use a bootstrap CV. In the interest of time we do 20 iterations, +# Let's use a bootstrap CV. In the interest of time we do 20 iterations, # change the number of bootstrap iterations to at least 2000 for a valid test. 
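+###############################################################################
+# As a minimal, self-contained sketch of the percentile idea (using made-up
+# numbers here, not the scores of this example), the 95% CI is simply the
+# 2.5th and 97.5th percentiles of the bootstrapped differences; the same
+# computation is later applied to the per-split difference of the two models:
+import numpy as np
+
+rng = np.random.default_rng(0)
+toy_differences = rng.normal(loc=0.02, scale=0.01, size=2000)
+print(np.percentile(toy_differences, [2.5, 97.5]))
+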
n_bootstrap = 20 n_elements = len(df_iris) cv = StratifiedBootstrap(n_splits=n_bootstrap, test_size=0.3, random_state=42) ############################################################################### -# First, we will train a model without performing confound removal on features -# Note: confounds by default +# First, we will train a model without performing confound removal on features. +# Note: confounds by default. scores_ncr = run_cross_validation( X=X, y=y, @@ -79,18 +76,15 @@ seed=200, ) - ############################################################################### -# Next, we train a model after performing confound removal on the features -# Note: we initialize the CV again to use the same folds as before +# Next, we train a model after performing confound removal on the features. +# Note: we initialize the CV again to use the same folds as before. cv = StratifiedBootstrap(n_splits=n_bootstrap, test_size=0.3, random_state=42) - -# In order to tell 'run_cross_validation' which columns are confounds, +# In order to tell ``run_cross_validation`` which columns are confounds, # and which columns are features, we have to define the X_types: X_types = {"features": X, "confound": confounds} - ############################################################################## # We can now define a pipeline creator and add a confound removal step. # The pipeline creator should apply all the steps, by default, to the @@ -102,7 +96,7 @@ # "features". # # Finally, a random forest will be trained. -# Given the default apply_to in the pipeline creator, +# Given the default ``apply_to`` in the pipeline creator, # the random forest will only be trained using "features". creator = PipelineCreator(problem_type="classification", apply_to="features") creator.add("zscore", apply_to=["features", "confound"]) @@ -123,13 +117,13 @@ ############################################################################### # Now we can compare the accuracies. We can combine the two outputs as -# pandas dataframes +# ``pandas.DataFrame``. scores_ncr["confounds"] = "Not Removed" scores_cr["confounds"] = "Removed" ############################################################################### # Now we convert the metrics to a column for easier seaborn plotting (convert -# to long format) +# to long format). index = ["fold", "confounds"] scorings = ["test_accuracy", "test_roc_auc"] @@ -145,10 +139,10 @@ df_metrics = pd.concat((df_ncr_metrics, df_cr_metrics)) df_metrics = df_metrics.reset_index() -# print(df_metrics.head()) +df_metrics.head() ############################################################################### -# And finally plot the results +# And finally plot the results. sns.catplot( x="confounds", y="value", col="metric", data=df_metrics, kind="swarm" ) @@ -160,9 +154,8 @@ # difference, we need to check the distribution of differeces between the # the models. # -# First we remove the column "confounds" from the index and make the difference -# between the metrics - +# First, we remove the column "confounds" from the index and make the difference +# between the metrics. df_cr_metrics = df_cr_metrics.reset_index().set_index(["fold", "metric"]) df_ncr_metrics = df_ncr_metrics.reset_index().set_index(["fold", "metric"]) @@ -186,8 +179,7 @@ # Maybe the percentiles will be more accuracy with the proper amount of # bootstrap iterations? # -# -# But the main point of confound removal is for interpretability. Lets see +# But the main point of confound removal is for interpretability. 
Let's see # if there is a change in the feature importances. # # First, we need to collect the feature importances for each model, for each @@ -222,7 +214,7 @@ feature_importance = pd.concat([cr_fi, ncr_fi]) ############################################################################### -# We can now plot the importances +# We can now plot the importances. sns.catplot( x="feature", y="importance", diff --git a/examples/04_confounds/run_return_confounds.py b/examples/04_confounds/run_return_confounds.py index 45996e076..f1ddc57fb 100644 --- a/examples/04_confounds/run_return_confounds.py +++ b/examples/04_confounds/run_return_confounds.py @@ -5,34 +5,32 @@ In most cases confound removal is a simple operation. You regress out the confound from the features and only continue working with these new confound removed features. This is also the default setting for -julearn's `remove_confound` step. But sometimes you want to work with the +``julearn``'s ``remove_confound`` step. But sometimes you want to work with the confound even after removing it from the features. In this example, we will discuss the options you have. .. include:: ../../links.inc """ # Authors: Sami Hamdan -# # License: AGPL + from sklearn.datasets import load_diabetes # to load data from julearn.pipeline import PipelineCreator from julearn import run_cross_validation from julearn.inspect import preprocess -# load in the data +# Load in the data df_features, target = load_diabetes(return_X_y=True, as_frame=True) - ############################################################################### # First, we can have a look at our features. # You can see it includes Age, BMI, average blood pressure (bp) and 6 other -# measures from s1 to s6 Furthermore, it includes sex which will be considered +# measures from s1 to s6. Furthermore, it includes sex which will be considered # as a confound in this example. -# print("Features: ", df_features.head()) ############################################################################### -# Second, we can have a look at the target +# Second, we can have a look at the target. print("Target: ", target.describe()) ############################################################################### @@ -42,24 +40,22 @@ ############################################################################### # In the following we will explore different settings of confound removal -# using Julearns pipeline functionalities. +# using ``julearn``'s pipeline functionalities. # # Confound Removal Typical Use Case # --------------------------------- # Here, we want to deconfound the features and not include the confound as a -# feature into our last model. We will use the `remove_confound` step for this. -# Then we will use the `pca` step to reduce the dimensionality of the features. +# feature into our last model. We will use the ``remove_confound`` step for this. +# Then we will use the ``pca`` step to reduce the dimensionality of the features. # Finally, we will fit a linear regression model. - creator = PipelineCreator(problem_type="regression", apply_to="continuous") creator.add("confound_removal") creator.add("pca") creator.add("linreg") - ############################################################################### -# Now we need to set the `X_types` argument of the `run_cross_validation` +# Now we need to set the ``X_types`` argument of the ``run_cross_validation`` # function. This argument is a dictionary that maps the names of the different # types of X to the features that belong to this type. 
In this example, we # have two types of features: `continuous` and `confound`. The `continuous` @@ -83,29 +79,34 @@ ) ############################################################################### -# We can use the `preprocess` method of the `inspect` module to inspect the +# We can use the ``preprocess`` method of the ``inspect`` module to inspect the # transformations steps of the returned estimator. -# By providing a step name to the `until` argument of the -# `preprocess` method we return the transformed X and y up to +# By providing a step name to the ``until`` argument of the +# ``preprocess`` method we return the transformed X and y up to # the provided step (inclusive). -df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal") -print(df_deconfounded.head()) +df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal") +df_deconfounded.head() -# As you can see the confound `sex` was dropped and only the confound removed -# features are used in the following pca. +############################################################################### +# As you can see the confound ``sex`` was dropped and only the confound removed +# features are used in the following PCA. # # But what if you want to keep the confound after removal for # other transformations? # -# For example, let's assume that you want to do a pca on the confound removed +# For example, let's assume that you want to do a PCA on the confound removed # feature, but want to keep the confound for the actual modelling step. # Let us have a closer look to the confound remover in order to understand # how we could achieve such a task: # -# .. autoclass:: julearn.transformers.ConfoundRemover +# .. autoclass:: julearn.transformers.confound_remover.ConfoundRemover +# :noindex: +# :exclude-members: transform, get_support, get_feature_names_out, +# filter_columns, fit, fit_transform, get_apply_to, +# get_needed_types, get_params, set_output, set_params ############################################################################### -# In this example, we will set the `keep_confounds` argument to True. +# In this example, we will set the ``keep_confounds`` argument to True. # This will keep the confounds after confound removal. creator = PipelineCreator(problem_type="regression", apply_to="continuous") @@ -113,7 +114,6 @@ creator.add("pca") creator.add("linreg") - ############################################################################### # Now we can run the cross validation and get the scores. scores, model = run_cross_validation( @@ -126,18 +126,19 @@ ) ############################################################################### -# As you can see this kept the confound variable `sex` in the data. -df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal") -print(df_deconfounded.head()) +# As you can see this kept the confound variable ``sex`` in the data. +df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal") +df_deconfounded.head() ############################################################################### -# Even after the pca, the confound will still be present. +# Even after the PCA, the confound will still be present. # This is the case because by default transformers only transform continuous -# features (including features without a specified type) -# and ignore confounds and categorical variables. +# features (including features without a specified type) and ignore confounds +# and categorical variables. 
df_transformed = preprocess(model, X=X, data=data)
-print(df_transformed.head())
+df_transformed.head()
+###############################################################################
# This means that the resulting Linear Regression can use the deconfounded
# features together with the confound to predict the target. However, in the
# pipeline creator, the model is only applied to the continuous features.
@@ -162,12 +163,12 @@
    model=creator,
    return_estimator="final",
)
-print(scores)
+scores
###############################################################################
# As you can see the confound is now used in the linear regression model.
-# This is the case because we set the `apply_to` argument of the `linreg`
-# step to `*`. This means that the step will be applied to all features
+# This is the case because we set the ``apply_to`` argument of the ``linreg``
+# step to ``*``. This means that the step will be applied to all features
# (including confounds and categorical variables).
# Here we can see that the model is using 10 features (9 deconfounded features
# and the confound).
diff --git a/examples/05_customization/README.rst b/examples/05_customization/README.rst
index c7887d48f..c0eb8c8d5 100644
--- a/examples/05_customization/README.rst
+++ b/examples/05_customization/README.rst
@@ -1,4 +1,4 @@
Customization
=============
-Examples that show to extend and control various aspects of Julearn.
\ No newline at end of file
+Examples that show how to extend and control various aspects of ``julearn``.
diff --git a/examples/05_customization/run_custom_scorers_regression.py b/examples/05_customization/run_custom_scorers_regression.py
index 5717727cb..6c53cf9d0 100644
--- a/examples/05_customization/run_custom_scorers_regression.py
+++ b/examples/05_customization/run_custom_scorers_regression.py
@@ -2,14 +2,13 @@
Custom Scoring Function for Regression
======================================
-This example uses the 'diabetes' data from sklearn datasets and performs
+This example uses the ``diabetes`` data from ``sklearn.datasets`` and performs
a regression analysis using a Ridge Regression model. As scorers, it uses
-scikit-learn, julearn and a custom metric defined by the user.
+``scikit-learn``, ``julearn`` and a custom metric defined by the user.
"""
# Authors: Shammi More
#          Federico Raimondo
-#
# License: AGPL
import pandas as pd
@@ -23,34 +22,32 @@
from julearn.utils import configure_logging
###############################################################################
-# Set the logging level to info to see extra information
+# Set the logging level to info to see extra information.
configure_logging(level="INFO")
###############################################################################
-# load the diabetes data from sklearn as a pandas dataframe
+# Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.
features, target = load_diabetes(return_X_y=True, as_frame=True)
###############################################################################
-# Dataset contains ten variables age, sex, body mass index, average blood
+# Dataset contains ten variables age, sex, body mass index, average blood
# pressure, and six blood serum measurements (s1-s6) diabetes patients and
# a quantitative measure of disease progression one year after baseline which
# is the target we are interested in predicting. 
-
-print("Features: \n", features.head())  # type: ignore
-print("Target: \n", target.describe())  # type: ignore
+print("Features: \n", features.head())
+print("Target: \n", target.describe())
###############################################################################
# Let's combine features and target together in one dataframe and define X
-# and y
+# and y.
data_diabetes = pd.concat([features, target], axis=1)  # type: ignore
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"
-
###############################################################################
# Train a ridge regression model on train dataset and use mean absolute error
-# for scoring
+# for scoring.
scores, model = run_cross_validation(
    X=X,
    y=y,
@@ -64,16 +61,15 @@
###############################################################################
# The scores dataframe has all the values for each CV split.
-
-print(scores.head())
+scores.head()
###############################################################################
-# Mean value of mean absolute error across CV
-print(scores["test_score"].mean() * -1)  # type: ignore
+# Mean value of mean absolute error across CV.
+print(scores["test_score"].mean() * -1)
###############################################################################
# Now do the same thing, but use mean absolute error and Pearson product-moment
-# correlation coefficient (squared) as scoring functions
+# correlation coefficient (squared) as scoring functions.
scores, model = run_cross_validation(
    X=X,
    y=y,
@@ -87,11 +83,10 @@
###############################################################################
# Now the scores dataframe has all the values for each CV split, but two scores
-# unders the column names 'test_neg_mean_absolute_error' and
-# 'test_r2_corr'.
+# under the column names ``"test_neg_mean_absolute_error"`` and
+# ``"test_r2_corr"``.
print(scores[["test_neg_mean_absolute_error", "test_r2_corr"]].mean())
-
###############################################################################
# If we want to define a custom scoring metric, we need to define a function
# that takes the predicted and the actual values as input and returns a value.
@@ -99,14 +94,13 @@
def pearson_scorer(y_true, y_pred):
-    return scipy.stats.pearsonr(  # type: ignore
-        y_true.squeeze(), y_pred.squeeze()
-    )[0]
+    return scipy.stats.pearsonr(y_true.squeeze(), y_pred.squeeze())[0]
###############################################################################
-# Before using it, we need to convert it to a sklearn scorer and register it
-# with julearn.
+# Before using it, we need to convert it to a ``scikit-learn`` scorer and
+# register it with ``julearn``.
+
register_scorer(scorer_name="pearsonr", scorer=make_scorer(pearson_scorer))
###############################################################################
diff --git a/examples/99_docs/run_cbpm_docs.py b/examples/99_docs/run_cbpm_docs.py
index a4f936502..ba19ef6fa 100644
--- a/examples/99_docs/run_cbpm_docs.py
+++ b/examples/99_docs/run_cbpm_docs.py
@@ -1,5 +1,3 @@
-# Authors: Leonard Sasse
-# License: AGPL
"""
Connectome-based Predictive Modeling (CBPM)
===========================================
@@ -14,9 +12,9 @@
In a nutshell, CBPM consists of:
-1. feature selection
-2. feature aggregation
-3. model building
+1. Feature selection
+2. Feature aggregation
+3. Model building
In CBPM, features are selected if their correlation to the target is
significant according to some specified significance threshold alpha. 
These @@ -25,18 +23,21 @@ approach a linear model is used for this, but in principle it could be any other machine learning model. -CBPM in Julearn ---------------- +CBPM in ``julearn`` +------------------- -Julearn implements a simple, scikit-learn compatible transformer ("cbpm"), that -performs the first two parts of this approach, i.e. the feature selection and -feature aggregation. Leveraging julearn's PipelineCreator, one can therefore -easily apply the "cbpm" transformer as a preprocessing step, and then apply any -sklearn-compatible estimator for the model building part. +``julearn`` implements a simple, ``scikit-learn`` compatible transformer +("cbpm"), that performs the first two parts of this approach, i.e., the feature +selection and feature aggregation. Leveraging ``julearn``'s ``PipelineCreator``, +one can therefore easily apply the ``"cbpm"`` transformer as a preprocessing +step, and then apply any ``scikit-learn``-compatible estimator for the model +building part. For example, to build a simple CBPM workflow, you can create a pipeline and run a cross-validation as follows: """ +# Authors: Leonard Sasse +# License: AGPL from julearn import run_cross_validation from julearn.pipeline import PipelineCreator @@ -44,23 +45,21 @@ from sklearn.datasets import make_regression import pandas as pd - -# prepare some data: -# prepare data +# Prepare data X, y = make_regression(n_features=20, n_samples=200) -# make dataframe +# Make dataframe X_names = [f"feature_{x}" for x in range(1, 21)] data = pd.DataFrame(X) data.columns = X_names data["target"] = y -# prepare a pipeline creator: +# Prepare a pipeline creator cbpm_pipeline_creator = PipelineCreator(problem_type="regression") cbpm_pipeline_creator.add("cbpm") cbpm_pipeline_creator.add("linreg") -# cross-validate the cbpm pipeline +# Cross-validate the cbpm pipeline scores, final_model = run_cross_validation( data=data, X=X_names, @@ -70,7 +69,7 @@ ) ############################################################################### -# By default the "cbpm" transformer will perform feature selection using the +# By default the ``"cbpm"`` transformer will perform feature selection using the # Pearson correlation between each feature and the target, and select the # features for which the p-value of the correlation falls below the default # significance threshold of 0.01. It will then group the features into @@ -78,15 +77,15 @@ # each of these groups using :func:`numpy.sum`. That is, the linear model in # this case is fitted on two features: # -# 1. sum of features that are positively correlated to the target -# 2. sum of features that are negatively correlated to the target +# 1. Sum of features that are positively correlated to the target +# 2. Sum of features that are negatively correlated to the target # # The pipeline creator also allows easily customising these parameters of the -# "cbpm" transformer according to your needs. For example, to use a different +# ``"cbpm"`` transformer according to your needs. 
For example, to use a different # significance threshold during feature selection one may set the -# `significance_threshold` keyword to increase it to 0.05 as follows: +# ``significance_threshold`` keyword to increase it to 0.05 as follows: -# prepare a pipeline creator +# Prepare a pipeline creator cbpm_pipeline_creator = PipelineCreator(problem_type="regression") cbpm_pipeline_creator.add("cbpm", significance_threshold=0.05) cbpm_pipeline_creator.add("linreg") @@ -94,10 +93,10 @@ print(cbpm_pipeline_creator) ############################################################################### -# Julearn also allows this to be tuned as a hyperparameter in a nested +# ``julearn`` also allows this to be tuned as a hyperparameter in a nested # cross-validation. Simply hand over an iterable of values: -# prepare a pipeline creator: +# Prepare a pipeline creator cbpm_pipeline_creator = PipelineCreator(problem_type="regression") cbpm_pipeline_creator.add("cbpm", significance_threshold=[0.01, 0.05]) cbpm_pipeline_creator.add("linreg") @@ -105,17 +104,16 @@ print(cbpm_pipeline_creator) ############################################################################### -# In addition, it may be noteworthy, that you can customise the correlation -# method, the aggregation method, as well as the sign (`"pos"`, `"neg"`, -# or `"posneg"`) of the feature-target correlations that should be selected. +# In addition, it may be noteworthy, that you can customize the correlation +# method, the aggregation method, as well as the sign (``"pos"``, ``"neg"``, +# or ``"posneg"``) of the feature-target correlations that should be selected. # For example, a pipeline that specifies each of these parameters may look as # follows: - import numpy as np from scipy.stats import spearmanr -# prepare a pipeline creator: +# Prepare a pipeline creator cbpm_pipeline_creator = PipelineCreator(problem_type="regression") cbpm_pipeline_creator.add( "cbpm", diff --git a/examples/99_docs/run_confound_removal_docs.py b/examples/99_docs/run_confound_removal_docs.py index bfa9d4710..f70eed986 100644 --- a/examples/99_docs/run_confound_removal_docs.py +++ b/examples/99_docs/run_confound_removal_docs.py @@ -20,14 +20,14 @@ training and testing data jointly in order to prevent test-to-train data leakage [#4]_, [#5]_. -Confound Removal in Julearn ---------------------------- +Confound Removal in ``julearn`` +------------------------------- -Julearn implements cross-validation consistent confound regression for both of -the scenarios laid out above (i.e. either confound regression on the features -or on the target) allowing the user to implement complex machine learning -pipelines with relatively little code while avoiding test-to-train leakage -during confound removal. +``julearn`` implements cross-validation consistent confound regression for both +of the scenarios laid out above (i.e., either confound regression on the +features or on the target) allowing the user to implement complex machine +learning pipelines with relatively little code while avoiding test-to-train +leakage during confound removal. Let us initially consider removing a confounding variable from the features. @@ -36,9 +36,9 @@ The first scenario involves confound regression on the features. In order to do this we can simply configure an instance of a :class:`.PipelineCreator` -by adding the "confound_removal" step. +by adding the ``"confound_removal"`` step. 
-We can create some data using scikit-learn's +We can create some data using ``scikit-learn``'s :func:`~sklearn.datasets.make_regression` and then simulate a normally distributed random variable that has a linear relationship with the target that we can use as a confound. @@ -61,7 +61,7 @@ X, y = make_regression(n_features=20) # create two normally distributed random variables with the same mean -# and standard deviation as the y +# and standard deviation as y normal_dist_conf_one = np.random.normal(y.mean(), y.std(), y.size) normal_dist_conf_two = np.random.normal(y.mean(), y.std(), y.size) @@ -75,8 +75,7 @@ ############################################################################### # Let's organise these data as a :class:`pandas.DataFrame`, which is the -# preferred data format when using julearn: - +# preferred data format when using ``julearn``: # put the features into a dataframe data = pd.DataFrame(X) @@ -97,19 +96,18 @@ ############################################################################### # In this example, we only distinguish between two types of variables in the -# "X". That is, we have 1.) our features (or predictors) and 2.) our confounds. -# Let's prepare the "X_types" dictionary that we hand over to +# ``X``. That is, we have 1. our features (or predictors) and 2. our confounds. +# Let's prepare the ``X_types`` dictionary that we hand over to # :func:`.run_cross_validation` accordingly: - X_types = {"features": features, "confounds": confounds} ############################################################################### -# Now, that we have all the data prepared, and we have defined our "X_types", +# Now, that we have all the data prepared, and we have defined our ``X_types``, # we can think about creating the pipeline that we want to run. Now, this is -# the crucial point at which we parametrise the confound removal. We initialise -# the :class:`.PipelineCreator` and add to it as a step the -# "confound_removal" transformer (the underlying transformer object is the +# the crucial point at which we parametrize the confound removal. We initialize +# the :class:`.PipelineCreator` and add to it as a step using the +# ``"confound_removal"`` transformer (the underlying transformer object is the # :class:`.ConfoundRemover`). pipeline_creator = PipelineCreator( @@ -122,17 +120,17 @@ ############################################################################### # As you can see, we tell the :class:`.PipelineCreator` that we want to work on -# a "regression" problem when we initialise the class. We also tell that by -# default each "step" of the pipeline should be applied to the features which -# type is "features". In the first step that we add, we specify we want to -# perform "confound_removal", and that the features that have the type -# "confounds" should be used as confounds in the confound regression. -# Note, that because we already specified apply_to="features" -# during the initialisation, we do not need to explicitly state this again. -# In short, the "confounds" will be removed from the "features". +# a "regression" problem when we initialize the class. We also tell that by +# default each "step" of the pipeline should be applied to the features whose +# type is ``"features"``. In the first step that we add, we specify we want to +# perform ``"confound_removal"``, and that the features that have the type +# ``"confounds"`` should be used as confounds in the confound regression. 
+# Note, that because we already specified ``apply_to="features"`` +# during the initialization, we do not need to explicitly state this again. +# In short, the ``"confounds"`` will be removed from the ``"features"``. # -# As a second and last step, we simply add a linear regression ("linreg") to -# fit a predictive model to the de-confounded X and the y. +# As a second and last step, we simply add a linear regression (``"linreg"``) to +# fit a predictive model to the de-confounded ``X`` and the ``y``. # # Lastly, we only need to apply this pipeline in the :func:`.run_cross_validation` # function to perform confound removal on the features in a cross-validation @@ -157,10 +155,10 @@ # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # # If we want to remove the confounds from the target rather than from the -# features, we need to create a slightly different pipeline. Julearn has a +# features, we need to create a slightly different pipeline. ``julearn`` has a # specific :class:`.TargetPipelineCreator` to perform transformations on the -# target. We first configure this pipeline and add the "confound_removal" step. - +# target. We first configure this pipeline and add the ``"confound_removal"`` +# step. target_pipeline_creator = TargetPipelineCreator() target_pipeline_creator.add("confound_removal", confounds="confounds") @@ -170,7 +168,7 @@ ############################################################################### # Now we insert the target pipeline into the main pipeline that will be used # to do the prediction. Importantly, we specify that the target pipeline should -# be applied to the "target". +# be applied to the ``"target"``. pipeline_creator = PipelineCreator( problem_type="regression", apply_to="features" @@ -190,13 +188,14 @@ y="my_target", X_types=X_types, model=pipeline_creator, - scoring="r2" + scoring="r2", ) + print(scores) ############################################################################### # As you can see, applying confound regression in your machine learning -# pipeline in a cross-validated fashion is reasonably easy using julearn. +# pipeline in a cross-validated fashion is reasonably easy using ``julearn``. # If you are considering whether or not to use confound regression, however, # there are further important considerations: # diff --git a/examples/99_docs/run_cv_splitters_docs.py b/examples/99_docs/run_cv_splitters_docs.py index e03acbab5..baa4a0787 100644 --- a/examples/99_docs/run_cv_splitters_docs.py +++ b/examples/99_docs/run_cv_splitters_docs.py @@ -1,20 +1,21 @@ # Authors: Fede Raimondo # License: AGPL + """ Cross-validation splitters ========================== As mentioned in the :ref:`why_cv`, cross-validation is a *must use* technique -to evaluate the performance of a model when we don't have _almost_ infinite -data. However, there are several ways to split the data into training and +to evaluate the performance of a model when we don't have *almost* infinite +data. However, there are several ways to split the data into training and testing sets, and each of them has different properties. In this section we will see why it is important to choose the right cross-validation splitter for your evaluation. The most important argument is because we want to have an unbiased estimate of the performance of our model, mostly avoiding overestimation. Remember, we are -evaluating how well a model will predict _unseen_ data. 
So if later in the -future someone uses our model, we want to be sure that it will perform as well +evaluating how well a model will predict *unseen* data. So if in the future +someone uses our model, we want to be sure that it will perform as well as we estimated. If we overestimate the performance, we might be in for a surprise. @@ -26,9 +27,9 @@ Grandvalet [#1]_, it is simply not possible to have an unbiased estimate of the variance of the generalization error. -We will not repeat what our colleagues from scikit-learn have already +We will not repeat what our colleagues from ``scikit-learn`` have already explained in their excellent documentation [#2]_. So we will just add a few -words about some topics, assuming you have already read the scikit-learn +words about some topics, assuming you have already read the ``scikit-learn`` documentation. As a rule of thumb, K-fold cross-validation is a good compromise between bias @@ -39,15 +40,14 @@ this estimate is lower than the variance of the leave-one-out cross-validation (LOOCV), but higher than the variance of the holdout method. The bias is higher than the bias of LOOCV, but lower than the bias of the holdout method. -But this claims must be taken with caution. There has been intense research on +But these claims must be taken with caution. There has been intense research on this topic, and there are still unconclusive results. While intuition points in -one direction, empirical evidence points in others. If you want to know more +one direction, empirical evidence points in other. If you want to know more about this topic, we suggest you start with this thread on Cross -Validated [#3]_. Emirical evidence shows that choosing any K between the +Validated [#3]_. Empirical evidence shows that choosing any K between the extremes [n, 2] is a good compromise between bias and variance. In practice, `K=10` is a good choice [#4]_. - Now the fun part begins, which of the many variants of K-fold shall we choose? The answer is: it depends. It depends on the data and the problem you are trying to solve. In this section we will shed some light on two important @@ -56,7 +56,7 @@ Stratification -------------- -Stratification is a technique used to ensure that the distribution of the +It is a technique used to ensure that the distribution of the target variable is the same in the training and testing sets. This is important because if the distribution of the target variable is different in the training and testing sets, the model will learn a different distribution @@ -68,26 +68,27 @@ That is, you can ensure that the distribution of the target variable is the same in the training and testing sets. -Fortunately, scikit-learn already implements stratification (e.g., stratified -K-fold in the :class:`sklearn.model_selection.StratifiedKFold`). However, this -implementation is only valid for discrete target variables. In the case of -continuous target variables, Julearn comes to rescue you with the +Fortunately, ``scikit-learn`` already implements stratification +(e.g., stratified K-fold in the +:class:`sklearn.model_selection.StratifiedKFold`). However, this implementation +is only valid for discrete target variables. In the case of continuous target +variables, ``julearn`` comes to rescue you with the :class:`.ContinuousStratifiedKFold` splitter. The main issue with continuous target variables is that it is not just a simple matter of counting the number of samples of each class. 
In this case, we need to ensure that the distribution of the target variable is the same in the training and testing sets. This is a more complex problem, and there are -several ways to solve it. In Julearn, we have implemented two ways of doing +several ways to solve it. In ``julearn``, we have implemented two ways of doing this: *binning* and *quantizing*. Binning is a technique that consists of dividing the target variable into discrete bins, each of equal size, and then ensuring that the distribution of the target variable is the same in the training and testing sets in terms of -samples per bin. Let's see an example using a uniform distribution, creating +samples per bin. Let's see an example using a uniform distribution, creating 200 samples and 10 bins. + """ -# %% import numpy as np import matplotlib.pyplot as plt import seaborn as sns @@ -111,7 +112,6 @@ n_bins=n_bins, n_splits=3, shuffle=True, random_state=42 ) -# %% fig, axis = plt.subplots(1, 2, figsize=(20, 4)) train_sets = [] test_sets = [] @@ -129,10 +129,11 @@ ) ############################################################################### -# Now lets see how K-fold would have split this data. +# Now let's see how K-fold would have split this data. from sklearn.model_selection import KFold cv = KFold(n_splits=3, shuffle=True, random_state=42) + fig, axis = plt.subplots(1, 2, figsize=(20, 4)) train_sets = [] test_sets = [] @@ -156,7 +157,6 @@ # differences between the distributions of the training and testing sets can be # much more evident. Let's take a look at the same analysis but using a # Gaussian distribution. -# %% y = np.random.normal(size=200) sns.histplot(y, bins=n_bins) @@ -183,9 +183,8 @@ ############################################################################### # Now lets see how K-fold would have split this data. -from sklearn.model_selection import KFold - cv = KFold(n_splits=3, shuffle=True, random_state=42) + fig, axis = plt.subplots(1, 2, figsize=(20, 4)) train_sets = [] test_sets = [] @@ -212,7 +211,6 @@ # target variable. Instead of fixing the size of the bins, we can split the # data into bins with the same number of samples. This is called *quantizing*. # Let's see how this works on the same data. -# %% bins = np.quantile(y, np.linspace(0, 1, n_bins + 1)) discrete_y = np.digitize(y, bins=bins[:-1]) sns.histplot(discrete_y, bins=n_bins) @@ -221,7 +219,6 @@ # In this case, each quantile of the target variable is equally represented in # each "bin". To use this approach, we can simply set ``method="quantile"`` in # the :class:`.ContinuousStratifiedKFold`. -# %% cv = ContinuousStratifiedKFold( n_bins=n_bins, method="quantile", n_splits=3, shuffle=True, random_state=42 ) @@ -250,8 +247,10 @@ # importantly, due to how the bins are defined (dashed lines), each quantile is # now equally represented in each fold. # -# .. note:: Julearn provides :class:`.RepeatedContinuousStratifiedKFold` as -# the repeated version of :class:`.ContinuousStratifiedKFold`. +# .. note:: +# +# ``julearn`` provides :class:`.RepeatedContinuousStratifiedKFold` as +# the repeated version of :class:`.ContinuousStratifiedKFold`. # # # Grouping @@ -265,7 +264,7 @@ # measured multiple times, we might want to ensure that the model is not # evaluated on data from the same subject that was used to train it. 
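+#
+# As a small illustration of the idea (a minimal sketch using
+# ``scikit-learn``'s :class:`~sklearn.model_selection.GroupKFold` and a
+# hypothetical ``subjects`` array of repeated subject IDs, not part of the
+# data generated above), grouping guarantees that all samples sharing a group
+# label end up on the same side of each split:
+#
+# .. code-block:: python
+#
+#     from sklearn.model_selection import GroupKFold
+#
+#     # 200 samples coming from 50 hypothetical subjects (4 samples each)
+#     subjects = np.repeat(np.arange(50), 4)
+#
+#     group_cv = GroupKFold(n_splits=3)
+#     dummy_X = np.zeros((200, 1))
+#     for train_idx, test_idx in group_cv.split(dummy_X, y, groups=subjects):
+#         # no subject ends up in both the training and the testing set
+#         assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])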
 #
-# To this matter, Julearn provides :class:`.ContinuousStratifiedGroupKFold`,
+# To this matter, ``julearn`` provides :class:`.ContinuousStratifiedGroupKFold`,
 # which provides support for a grouping variable and
 # :class:`.RepeatedContinuousStratifiedGroupKFold` as the repeated version of
 # it.
@@ -288,5 +287,3 @@
 #    estimation and model selection" \
 #    `_, IJCAI'95, pages
 #    1137-1145.
-
-# %%
diff --git a/examples/99_docs/run_data_docs.py b/examples/99_docs/run_data_docs.py
index 9753132be..6036ce4d7 100644
--- a/examples/99_docs/run_data_docs.py
+++ b/examples/99_docs/run_data_docs.py
@@ -1,6 +1,7 @@
 # Authors: Vera Komeyer
 #          Fede Raimondo
 # License: AGPL
+
 """
 Data
 ====
@@ -8,29 +9,29 @@
 Data input to :func:`.run_cross_validation`
 -------------------------------------------
 
-Julearn deals with data in the form of pandas DataFrames. This is the kind of
-data structure that the :func:`.run_cross_validation` uses to input the data
-and output some of the results.
+``julearn`` deals with data in the form of ``pandas.DataFrames``. This is the
+kind of data structure that the :func:`.run_cross_validation` uses to input the
+data and output some of the results.
 
 The input DataFrame must contain the features and the target or label. This
 will be communicated to :func:`.run_cross_validation` by specifying the
 following parameters:
 
-- ``data``: Name of the dataframe containing the features and the target or
+- ``data``: Name of the DataFrame containing the features and the target or
   label.
-- ``X``: List of strings containing the column names of the features.
+- ``X``: List of strings containing the column names of the features.
 - ``y``: String containing the name of the column with the target or label.
 
-For example, using the well known *iris* dataset, we can specify the data input
-as follows:
+For example, using the well known ``iris`` dataset, we can specify the data
+input as follows:
 
-First, we load the data into a pandas dataframe called ``df`` and specify
+First, we load the data into a ``pandas.DataFrame`` called ``df`` and specify
 ``X`` and ``y``:
 """
 from seaborn import load_dataset
 
-df = load_dataset('iris')
+df = load_dataset("iris")
 
 ##############################################################################
 # Let's inspect what our dataframe looks like.
@@ -44,20 +45,21 @@
 y = "species"
 
 ##############################################################################
-# Julearn's :func:`.run_cross_validation` function so far would look like this:
+# ``julearn``'s :func:`.run_cross_validation` function so far would look like
+# this:
 #
 # .. code-block:: python
 #
 #     run_cross_validation(X=X, y=y, data=df)
 #
-# This is not yet very useful to do machine learning, but we will come to it step
-# by step.
+# This is not yet very useful to do machine learning, but we will come to it
+# step by step.
 
 ##############################################################################
-# Giving *types* to features
-# --------------------------
+# Giving ``types`` to features
+# ----------------------------
 #
-# A nice add-on that Julearn offers is the capacity to specify colum-based
+# A nice add-on that ``julearn`` offers is the capacity to specify column-based
 # types for the features. This comes in handy if within the pipeline, one
 # wants to manipulate only certain columns.
 #
@@ -69,9 +71,9 @@
 #    Every column can only have **one type**!
# # -# In the case of the iris dataset, we could specify the type of the columns -# related to the _sepal_ and _petal_ information as ``"sepal"`` and ``"petal"`` -# respectively. +# In the case of the ``iris dataset``, we could specify the type of the columns +# related to the ``sepal`` and ``petal`` information as ``"sepal"`` and +# ``"petal"`` respectively. X_types = { "petal": ["petal_length", "petal_width"], @@ -79,10 +81,10 @@ } ############################################################################## -# Importantly, Julearn also allows to specify the column names as regular +# Importantly, ``julearn`` also allows to specify the column names as regular # expressions. This comes in handy when we are dealing with hundreds or # thousands of features and we do not want to specify all the names by hand. -# For example, we could specify the type of the _sepal_ columns +# For example, we could specify the type of the ``sepal`` columns # as follows: X_types = { @@ -106,11 +108,11 @@ # ``"continuous"`` and a warning will be raised. # # -# Up to now, we saw how to parametrize :func:`.run_cross_validation` in terms -# of the input data. In the next section we will see how to specify the output +# Until now we saw how to parametrize :func:`.run_cross_validation` in terms +# of the input data. In the next section we will see how to specify the output. # In the next section we will focus on basic options to use -# :func:`.run_cross_validation` to evaluate different pipelines in a +# :func:`.run_cross_validation` to evaluate different pipelines in a # cross-validation consistent manner. # -# Advanced uses cases regarding X_types selective processing are covered in -# :ref:`apply_to_feature_types` +# Advanced uses cases regarding ``X_types`` selective processing are covered in +# :ref:`apply_to_feature_types`. diff --git a/examples/99_docs/run_hyperparameters_docs.py b/examples/99_docs/run_hyperparameters_docs.py index 1be96a19e..086a800aa 100644 --- a/examples/99_docs/run_hyperparameters_docs.py +++ b/examples/99_docs/run_hyperparameters_docs.py @@ -1,5 +1,6 @@ # Authors: Fede Raimondo # License: AGPL + """ Hyperparameter Tuning ===================== @@ -21,10 +22,9 @@ parameter ``C``. We will use the ``iris`` dataset, which is a dataset of measurements of flowers. -Lets start by loading the dataset and setting the features and target +We start by loading the dataset and setting the features and target variables. """ - from seaborn import load_dataset from pprint import pprint # To print in a pretty way @@ -98,7 +98,7 @@ # try more values. And since this is only one hyperparameter, it is not that # difficult. But what if we have more hyperparameters? And what if we have # several steps in the pipeline (e.g. feature selection, PCA, etc.)? -# This has a main problem: the more hyperparameters we have, the more +# This is a major problem: the more hyperparameters we have, the more # times we use the same data for training and testing. This usually gives an # optimistic estimation of the performance of the model. # @@ -110,7 +110,7 @@ # (outer loop), and then we split the training set into two sets to tune the # hyperparameters (inner loop). # -# Julearn has a simple way to do hyperparameter tuning using nested cross- +# ``julearn`` has a simple way to do hyperparameter tuning using nested cross- # validation. When we use a :class:`.PipelineCreator` to create a pipeline, # we can set the hyperparameters we want to tune and the values we want to try. 
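+#
+# As a minimal sketch (assuming the ``iris`` data loaded above; the fully
+# worked examples follow below), passing a list of values instead of a single
+# value marks that hyperparameter for tuning:
+#
+# .. code-block:: python
+#
+#     from julearn.pipeline import PipelineCreator
+#
+#     creator = PipelineCreator(problem_type="classification")
+#     creator.add("zscore")
+#     creator.add("svm", C=[0.01, 0.1, 1])  # C will be tuned in the inner CV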
# @@ -152,7 +152,7 @@ # hyperparameter was not on the boundary of the values we tried, we can # conclude that our search for the best ``C`` value was successful. # -# However, by checking on the :class:`~sklearn.svm.SVC` documentation, we can +# However, by checking the :class:`~sklearn.svm.SVC` documentation, we can # see that there are more hyperparameters that we can tune. For example, for # the default ``rbf`` kernel, we can tune the ``gamma`` hyperparameter: @@ -179,7 +179,7 @@ # But since ``gamma`` was on the boundary of the values we tried, we should # try more values to be sure that we are using the best hyperparameter set. # -# We can even give a mixture of different variable types, like the words +# We can even give a combination of different variable types, like the words # ``"scale"`` and ``"auto"`` for the ``gamma`` hyperparameter: creator = PipelineCreator(problem_type="classification") creator.add("zscore") @@ -231,27 +231,26 @@ print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}") pprint(model_tuned.best_params_) - ############################################################################### -# But how will Julearn find the optimal hyperparameter set? +# But how will ``julearn`` find the optimal hyperparameter set? # # Searchers # --------- # -# Julearn uses the same concept as `scikit-learn`_ to tune hyperparameters: it -# uses a *searcher* to find the best hyperparameter set. A searcher is an +# ``julearn`` uses the same concept as `scikit-learn`_ to tune hyperparameters: +# it uses a *searcher* to find the best hyperparameter set. A searcher is an # object that receives a set of hyperparameters and their values, and then # tries to find the best combination of values for the hyperparameters using # cross-validation. # -# By default, Julearn uses a :class:`~sklearn.model_selection.GridSearchCV`. +# By default, ``julearn`` uses a :class:`~sklearn.model_selection.GridSearchCV`. # This searcher is very simple. First, it construct the "grid" of # hyperparameters to try. As we see above, we have 3 hyperparameters to tune. # So it constructs a 3-dimentional grid with all the possible combinations of # the hyperparameters values. The second step is to perform cross-validation # on each of the possible combinations of hyperparameters values. # -# Another searcher that Julearn provides is the +# Another searcher that ``julearn`` provides is the # :class:`~sklearn.model_selection.RandomizedSearchCV`. This searcher is # similar to the :class:`~sklearn.model_selection.GridSearchCV`, but instead # of trying all the possible combinations of hyperparameters values, it tries @@ -290,9 +289,9 @@ # We can avoid this by using multiple *grids*. One grid for the ``linear`` # kernel and one grid for the ``rbf`` kernel. # -# Julearn allows to specify multiple *grid* using two different approaches. +# ``julearn`` allows to specify multiple *grid* using two different approaches. # -# 1) Repeating the step name with different hyperparameters: +# 1. Repeating the step name with different hyperparameters: creator = PipelineCreator(problem_type="classification") creator.add("zscore") @@ -312,7 +311,6 @@ print(creator) - scores1, model1 = run_cross_validation( X=X, y=y, @@ -328,12 +326,12 @@ ############################################################################### # .. important:: # Note that the ``name`` parameter is required when repeating a step name. -# If we do not specify the ``name`` parameter, julearn will auto-determine -# the step name in an unique way. 
The only way to force repated names is
-#    to do so explicitly.
+#    If we do not specify the ``name`` parameter, ``julearn`` will
+#    auto-determine the step name in a unique way. The only way to force
+#    repeated names is to do so explicitly.

###############################################################################
-# 2) Using multiple pipeline creators:
+# 2. Using multiple pipeline creators:

 creator1 = PipelineCreator(problem_type="classification")
 creator1.add("zscore")
@@ -370,8 +368,6 @@
 # All the pipeline creators must have the same problem type and steps names
 # in order for this approach to work.

-###############################################################################
-
 ###############################################################################
 # Indeed, if we compare both approaches, we can see that they are equivalent.
 # They both produce the same *grid* of hyperparameters:
@@ -426,7 +422,6 @@
     name="model",
 )

-
 scores3, model3 = run_cross_validation(
     X=X,
     y=y,
diff --git a/examples/99_docs/run_model_comparison_docs.py b/examples/99_docs/run_model_comparison_docs.py
index 3858a6cd6..9d4108ee8 100644
--- a/examples/99_docs/run_model_comparison_docs.py
+++ b/examples/99_docs/run_model_comparison_docs.py
@@ -1,26 +1,26 @@
 # Authors: Vera Komeyer
 #          Fede Raimondo
 # License: AGPL
+
 """
-Model comparison
+Model Comparison
 ================

 In the previous section, we saw how to evaluate a single model using
 cross-validation. The example model seems to perform decently well. However,
 how do we know that it can't be better? Building machine-learning models is
 always a matter of *benchmarking*. We want to know how well our model performs,
-compared to other models. In the previous section we saw how to evaluate a
-model's performance using cross-validation. This is a good start, but it is not
-enough. We can use cross-validation to evaluate the performance of a single
-model, but we can't use it to compare different models. We could build
-different models and evaluate them using cross-validation, but then we would
-have to compare the results manually. This is not only tedious, but also
-error-prone. We need a way to compare different models in a statistically
-sound way.
-
-To statistically compare different models, Julearn provides a built-in
-corrected t-test. To see how to apply it, we will first build three different
-models, each with another learning algorithm.
+compared to other models. We already saw how to evaluate a model's performance
+using cross-validation. This is a good start, but it is not enough. We can use
+cross-validation to evaluate the performance of a single model, but we can't use
+it to compare different models. We could build different models and evaluate them
+using cross-validation, but then we would have to compare the results manually.
+This is not only tedious, but also error-prone. We need a way to compare
+different models in a statistically sound way.
+
+To statistically compare different models, ``julearn`` provides a built-in
+corrected ``t-test``. To see how to apply it, we will first build three
+different models, each with different learning algorithms.
To perform a binary classification (and not a multi-class classification) we will switch to the :func:`breast cancer dataset from scikit-learn @@ -39,6 +39,10 @@ y = "target" X_types = {"continuous": [".*"]} +# sphinx_gallery_start_ignore +pd.set_option("display.max_columns", 9) +# sphinx_gallery_end_ignore + df.head() ############################################################################### @@ -51,7 +55,6 @@ scoring = ["accuracy", "roc_auc"] - ############################################################################### # We use three different learning algorithms to build three different models. # We will use the default hyperparameters for each of them. @@ -113,7 +116,7 @@ ) ############################################################################### -# We will add a column to each scores dataframes to be able to use names to +# We will add a column to each scores DataFrames to be able to use names to # identify the models later on. scores1["model"] = "svm" @@ -141,7 +144,7 @@ stats_df = corrected_ttest(scores1, scores2, scores3) ############################################################################### -# This gives us a dataframe with the corrected t-test results for each pairwise +# This gives us a DataFrame with the corrected t-test results for each pairwise # comparison of the three models' test scores: # # We can see, that none of the models performed better with respect to @@ -155,9 +158,9 @@ # # Visualizations can help to get a better intuitive understanding of the # differences between the models. To get a better overview of the performances -# of our three models, we can make use of Julearn's visualization tool to plot -# the scores in an interactive manner. As visualizations are not part of the -# core functionality of Julearn, you will need to first manually +# of our three models, we can make use of ``julearn``'s visualization tool to +# plot the scores in an interactive manner. As visualizations are not part of the +# core functionality of ``julearn``, you will need to first manually # **install the additional visualization dependencies**. # # From here we can create the interactive plot. Interactive, because you can diff --git a/examples/99_docs/run_model_evaluation_docs.py b/examples/99_docs/run_model_evaluation_docs.py index f6250244f..1c7b4f96d 100644 --- a/examples/99_docs/run_model_evaluation_docs.py +++ b/examples/99_docs/run_model_evaluation_docs.py @@ -1,6 +1,7 @@ # Authors: Vera Komeyer # Fede Raimondo # License: AGPL + """ Model Evaluation ================ @@ -15,8 +16,8 @@ Cross-validation scores ~~~~~~~~~~~~~~~~~~~~~~~ -We consider the _iris_ data example and one of the pipelines from the previous -section (feature z-scoring and a ``svm``). +We consider the ``iris`` data example and one of the pipelines from the previous +section (feature z-scoring and a ``svm``). 
""" from julearn import run_cross_validation from julearn.pipeline import PipelineCreator @@ -25,6 +26,7 @@ # sphinx_gallery_start_ignore from sklearn import set_config + set_config(display="diagram") # sphinx_gallery_end_ignore @@ -40,12 +42,12 @@ ] } -# create a pipeline +# Create a pipeline creator = PipelineCreator(problem_type="classification") creator.add("zscore") creator.add("svm") -# run cross-validation +# Run cross-validation scores = run_cross_validation( X=X, y=y, @@ -55,12 +57,11 @@ ) ############################################################################### -# The ``scores`` variable is an pandas DataFrame object which contains the +# The ``scores`` variable is a ``pandas.DataFrame`` object which contains the # cross-validated metrics for each fold as columns and rows respectively. print(scores) - ############################################################################### # We see that for example the ``test_score`` for the third fold is 0.933. This # means that the model achieved a score of 0.933 on the validation set @@ -95,7 +96,7 @@ # # The column ``cv_mdsum`` on the first glance might appear a bit cryptic. # This column is used in internal checks, to verify that the same CV was used -# when results are compared using julearn's provided statistical tests. +# when results are compared using ``julearn``'s provided statistical tests. # This is nothing you need to worry about at this point. # # Returning a model (estimator) @@ -104,23 +105,22 @@ # Now that we saw that our model doesn't seem to overfit, we might be # interested in checking how our model parameters look like. By setting the # parameter ``return_estimator``, we can tell :func:`.run_cross_validation` to -# give us models. It can have three different values: +# give us the models. It can have three different values: # # 1. ``"cv"``: This option indicates that we want to get the model that was # trained on the entire training data of each CV fold. This means that we # get as many models as we have CV folds. They will be returned within the -# scores dataframe. +# scores DataFrame. # # 2. ``"final"``: With this setting, an additional model will be trained on the # entire input dataset. This model will be returned as a separate variable. # -# 3. ``"all"``: In this scenario, all the estimators (final and cv) will be -# returned. +# 3. ``"all"``: In this scenario, all the estimators (``"final"`` and ``"cv"``) +# will be returned. # # For demonstration purposes we will have a closer look at the ``"final"`` # estimator option. - scores, model = run_cross_validation( X=X, y=y, @@ -134,7 +134,7 @@ print(scores) ############################################################################### -# As we see, the scores dataframe is the same as before. However, we now have +# As we see, the scores DataFrame is the same as before. However, we now have # an additional variable ``model``. This variable contains the final estimator # that was trained on the entire training dataset. @@ -150,36 +150,36 @@ # # When performing a cross-validation, we need to split the data into training # and validation sets. This is done by a *cross-validation splitter*, that -# defines how the data should be split, how many folds should be used weather -# to repeat the process several times. 
For example, we might want to shuffle -# the data before splitting, stratify the splits so the distribution of targets -# are always represented in the individual folds, or consider certain grouping -# variables in the splitting process, so that samples from the same group are -# always in the same fold and not split across folds. +# defines how the data should be split, how many folds should be used and +# whether to repeat the process several times. For example, we might want to +# shuffle the data before splitting, stratify the splits so the distribution of +# targets are always represented in the individual folds, or consider certain +# grouping variables in the splitting process, so that samples from the same +# group are always in the same fold and not split across folds. # # So far, however, we didn't specify anything in that regard and still the # cross-validation was performed and we got five folds (see the five rows above # in the scores dataframe). This is because the default behaviour in -# :func:`.run_cross_validation` falls back to the scikit-learn defaults, -# which is a :class:`~sklearn.model_selection.StratifiedKFold` (with ``k=5``) -# for classification and :class:`~sklearn.model_selection.KFold` (with ``k=5``) +# :func:`.run_cross_validation` falls back to the ``scikit-learn`` defaults, +# which is a :class:`sklearn.model_selection.StratifiedKFold` (with ``k=5``) +# for classification and :class:`sklearn.model_selection.KFold` (with ``k=5``) # for regression. # # .. note:: -# These defaults will change when they are changed in scikit-learn as here -# Julearn just uses scikit-learn's defaults. +# These defaults will change when they are changed in ``scikit-learn`` as here +# ``julearn`` uses ``scikit-learn``'s defaults. # # We can define the cross-validation splitting strategy ourselves by passing an -# int, str or cross-validation generator to the ``cv`` parameter of +# ``int, str or cross-validation generator`` to the ``cv`` parameter of # :func:`.run_cross_validation`. The default described above is ``cv=None``. -# the second options is to pass only an integer to ``cv``. In that case, the +# the second option is to pass only an integer to ``cv``. In that case, the # same default splitting strategies will be used -# (:class:`~sklearn.model_selection.StratifiedKFold` for classification, -# :class:`~sklearn.model_selection.KFold` for regression), but the number of -# folds will be changed to the value of the provided integer (e.g. ``cv=10``). -# To define the entire splitting strategy, one can pass all scikit-learn +# (:class:`sklearn.model_selection.StratifiedKFold` for classification, +# :class:`sklearn.model_selection.KFold` for regression), but the number of +# folds will be changed to the value of the provided integer (e.g., ``cv=10``). +# To define the entire splitting strategy, one can pass all ``scikit-learn`` # compatible splitters :mod:`sklearn.model_selection` to ``cv``. However, -# Julearn provides a built-in set of additional splitters that can be found +# ``julearn`` provides a built-in set of additional splitters that can be found # under :mod:`.model_selection` (see more about them in :ref:`cv_splitter`). # The fourth option is to pass an iterable that yields the train and test # indices for each split. @@ -203,7 +203,7 @@ ############################################################################### # This will repeat 2 times a 5-fold stratified cross-validation. So the -# returned ``scores`` dataframe will have 10 rows. 
We set the ``random_state`` +# returned ``scores`` DataFrame will have 10 rows. We set the ``random_state`` # to an arbitrary integer to make the splitting of the data reproducible. print(scores) @@ -221,18 +221,19 @@ # default assumption for the scorer to be used to evaluate the # cross-validation, which is always the model's default scorer. Remember, we # used a support vector classifier with the ``y`` (target) variable being the -# species of the iris dataset (possible values: 'setosa', 'versicolor' or -# 'virginica'). Therefore we have a multi-class classification (not to be -# confused with a multi-label classification!). Checking scikit-learn's -# documentation of a support vector classifier's default scorer, we can see -# that this is the 'mean accuracy on the given test data and labels' -# :meth:`sklearn.svm.SVC.score`. +# species of the ``iris`` dataset (possible values: ``'setosa'``, +# ``'versicolor'`` or ``'virginica'``). Therefore we have a multi-class +# classification (not to be confused with a multi-label classification!). +# Checking ``scikit-learn``'s documentation of a support vector classifier's +# default scorer :meth:`sklearn.svm.SVC.score`, we can see that this is the +# 'mean accuracy on the given test data and labels'. # # With the ``scoring`` parameter of :func:`.run_cross_validation`, one can -# define the scoring function to be used. On top of the available scikit-learn -# :mod:`sklearn.metrics` julearn extends the functionality with more internal -# scorers and the possibility to define custom scorers. To see the available -# Julearn scorers, one can use the :func:`.list_scorers` function: +# define the scoring function to be used. On top of the available +# ``scikit-learn`` :mod:`sklearn.metrics`, ``julearn`` extends the functionality +# with more internal scorers and the possibility to define custom scorers. To see +# the available ``julearn`` scorers, one can use the :func:`.list_scorers` +# function: from julearn import scoring from pprint import pprint # for nice printing @@ -240,9 +241,9 @@ pprint(scoring.list_scorers()) ############################################################################### -# To use a Julearn scorer, one can pass the name of the scorer as a string to -# the ``scoring`` parameter of :func:`.run_cross_validation`. If multiple -# different scorers should be used, a list of strings can be passed. For +# To use a ``julearn`` scorer, one can pass the name of the scorer as a string +# to the ``scoring`` parameter of :func:`.run_cross_validation`. If multiple +# different scorers need to be used, a list of strings can be passed. For # example, if we were interested in the ``accuracy`` and the ``f1`` scores we # could do the following: @@ -260,7 +261,7 @@ ) ############################################################################### -# The ``scores`` dataframe will now have train- and test-score columns for both +# The ``scores`` DataFrame will now have train- and test-score columns for both # scorers: print(scores) diff --git a/examples/99_docs/run_model_inspection_docs.py b/examples/99_docs/run_model_inspection_docs.py index 7c8374141..bb1285ff8 100644 --- a/examples/99_docs/run_model_inspection_docs.py +++ b/examples/99_docs/run_model_inspection_docs.py @@ -2,6 +2,7 @@ # Vera Komeyer # Fede Raimondo # License: AGPL + """ Inspecting Models ================= @@ -18,9 +19,9 @@ the model's predictions, understand how the model works and make informed decisions about its deployment. 
-In this context, we will explore how to perform model inspection in julearn. -Julearn provides an intuitive suite of tools for model inspection and -interpretation. We will focus on how to inspect models in julearn's nested +In this context, we will explore how to perform model inspection in ``julearn``. +``julearn`` provides an intuitive suite of tools for model inspection and +interpretation. We will focus on how to inspect models in ``julearn``'s nested cross-validation workflow. With these techniques, we can gain a better understanding of how the model works and identify any patterns or anomalies that could affect its performance. This knowledge can help us deploy models more @@ -29,7 +30,6 @@ Let's start by importing some useful utilities: """ - from pprint import pprint import seaborn as sns import numpy as np @@ -42,10 +42,10 @@ ############################################################################## -# Now, let's configure julearn's logger to get some output as the pipeline is -# running and get some toy data to play with. In this example, we will use the -# penguin dataset, and classify the penguin species based on the continuous -# measures in the dataset. +# Now, let's configure ``julearn``'s logger to get some output as the pipeline +# is running and get some toy data to play with. In this example, we will use +# the ``penguin`` dataset, and classify the penguin species based on the +# continuous measures in the dataset. configure_logging(level="INFO") @@ -69,11 +69,11 @@ print(pipeline_creator) ############################################################################## -# Once this is set up, we can simply call julearn's +# Once this is set up, we can simply call ``julearn``'s # :func:`.run_cross_validation`. Notice, how we set the ``return_inspector`` # parameter to ``True``. Importantly, we also have to set the -# ``return_estimator`` parameter to ``"all"``. This is because julearn's -# :class:`.Inspector` extracts all relevant nformation from estimators after +# ``return_estimator`` parameter to ``"all"``. This is because ``julearn``'s +# :class:`.Inspector` extracts all relevant information from estimators after # the pipeline has been run. The pipeline will take a few minutes in our # example: @@ -94,7 +94,6 @@ # the cross-validation. The final model can be inspected using the ``.model`` # attribute. For example to get a quick overview over the model parameters, run: - # remember to actually import pprint as above, or just print out using print pprint(inspector.model.get_params()) @@ -115,7 +114,6 @@ print(inspector.model.get_fitted_params()["zscore__mean_"]) - ############################################################################## # In addition, sometimes it can be very useful to know what predictions were # made in each individual train-test split of the cross-validation. This is @@ -133,12 +131,11 @@ # This ``.folds`` attribute is actually an iterator, that can iterate over # every single fold used in the cross-validation, and it yields an instance of # a :class:`.FoldsInspector`, which can then be used to explore each model that -# was fitted during cross-validation. For example, we can collect the _C_ +# was fitted during cross-validation. For example, we can collect the ``C`` # parameters that were selected in each outer fold of our nested # cross-validation. 
That way, we can assess the amount of variance on that # particular parameter across folds: - c_values = [] for fold_inspector in inspector.folds: fold_model = fold_inspector.model @@ -147,7 +144,7 @@ ) ############################################################################## -# By printing out the unique values in the ``c_values`` list, we realise, that +# By printing out the unique values in the ``c_values`` list, we realize, that # actually there was not much variance across models. In fact, there was only # one parameter value ever selected. This may indicate that this is in fact # the optimal value, or it may indicate that there is a potential problem with @@ -163,7 +160,7 @@ # you can gain deeper insights, interpret your models effectively, and address # any issues that may arise. Model inspection serves as a valuable asset in the # deployment of machine learning models, ensuring transparency, -# interpretability, and reliable decision-making. With julearn's model +# interpretability, and reliable decision-making. With ``julearn``'s model # inspection capabilities, you can confidently navigate the complexities of # machine learning models and harness their full potential in real-world # applications. diff --git a/examples/99_docs/run_pipeline_docs.py b/examples/99_docs/run_pipeline_docs.py index 0b7d68168..55827af32 100644 --- a/examples/99_docs/run_pipeline_docs.py +++ b/examples/99_docs/run_pipeline_docs.py @@ -1,28 +1,29 @@ # Authors: Vera Komeyer # Fede Raimondo # License: AGPL + """ Model Building ============== -So far we know how to parametrize:func:`.run_cross_validation` in terms of the +So far we know how to parametrize :func:`.run_cross_validation` in terms of the input data (see :ref:`data_usage`). In this section, we will have a look on how we can parametrize the learning algorithm and the preprocessing steps, also known as the *pipeline*. A machine learning pipeline is a process to automate the workflow of -a predictive model. It can be thought of as a combination of pipes and +a predictive model. It can be thought of as a combination of pipes and filters. At a pipeline's starting point, the raw data is fed into the first filter. The output of this filter is then fed into the next filter -(through a pipe). In supervised Machine Learning, different filters inside the +(through a pipe). In supervised machine learning, different filters inside the pipeline modify the data, while the last step is a learning algorithm that generates predictions. Before using the pipeline to predict new data, the -pipeline has to be trained (*fitted*) on data. We call this, as scikit-learn +pipeline has to be trained (*fitted*) on data. We call this, as ``scikit-learn`` does, *fitting* the pipeline. -Julearn aims to provide a user-friendly way to build and evaluate complex +``julearn`` aims to provide a user-friendly way to build and evaluate complex machine learning pipelines. The :func:`.run_cross_validation` function is the -entry point to safely evaluate pipelines by making it easy to specify, +entry point to safely evaluate pipelines by making it easy to specify, customize and train the pipeline. We first have a look at the most basic pipeline, only consisting of a machine learning algorithm. Then we will make the pipeline incrementally more complex. @@ -34,11 +35,11 @@ One important aspect when building machine learning models is the selection of a learning algorithm. This can be specified in :func:`.run_cross_validation` -by setting the ``model`` parameter. 
This parameter can be any scikit-learn -compatible learning algorithm. However, Julearn provides a built-in list of -:ref:`available_models` that can be specified by name (see ``Name (str)`` -column in :ref:`available_models`). For example, we can simply set -``model=="svm"`` to use a support vector machine (SVM) [#1]_. +by setting the ``model`` parameter. This parameter can be any ``scikit-learn`` +compatible learning algorithm. However, ``julearn`` provides a list of built-in +:ref:`available_models` that can be specified by name (see ``Name`` column in +:ref:`available_models`). For example, we can simply set +``model=="svm"`` to use a Support Vector Machine (SVM) [#1]_. Let's first specify the data parameters as we learned in :ref:`data_usage`: """ @@ -58,7 +59,7 @@ } ############################################################################## -# Now we can run the cross validation with the SVM as learning algorithm: +# Now we can run the cross validation with SVM as the learning algorithm: scores = run_cross_validation( X=X, @@ -85,7 +86,7 @@ ############################################################################## # Feature preprocessing -# ----------------------- +# --------------------- # There are cases in which the input data, and in particular the features, # should be transformed before passing them to the learning algorithm. One # scenario can be that certain learning algorithms need the features in a @@ -95,15 +96,15 @@ # # Importantly in a machine learning workflow, all transformations done to the # data have to be done in a cv-consistent way. That means that -# data-transformation steps have to be done on the training data of each -# respective cross validation fold and then *only* apply the parameters of the +# data transformation steps have to be done on the training data of each +# respective cross-validation fold and then *only* apply the parameters of the # transformation to the validation data of the respective fold. One should # **never** do preprocessing on the entire dataset and then do # cross-validation on the already preprocessed features (or more # generally transformed data) because this leads to leakage of information from # the validation data into the model. This is exactly where # :func:`.run_cross_validation` comes in handy, because you can simply add your -# wished preprocessing step (:ref:`available_transformers`) and it +# desired preprocessing step (:ref:`available_transformers`) and it # takes care of doing the respective transformations in a cv-consistent manner. # # Let's have a look at how we can add a z-scoring step to our pipeline: @@ -122,22 +123,21 @@ ############################################################################## # .. note:: -# Learning algorithms (what we specified in the `model` parameter), are -# estimators. Preprocessing steps however, are usually transformers, because -# they transform the input data in a certain way. Therefore the parameter -# description in the api of :func:`.run_cross_validation`, -# defines valid input for the `preprocess` parameter as `TransformerLike`. +# Learning algorithms (what we specified in the ``model`` parameter), are +# **estimators**. Preprocessing steps however, are usually **transformers**, +# because they transform the input data in a certain way. Therefore, the +# parameter description in the API of :func:`.run_cross_validation`, +# defines valid input for the ``preprocess`` parameter as +# ``TransformerLike``:: # -# .. 
code-block:: -# -# preprocess : str, TransformerLike or list | None -# Transformer to apply to the features. If string, use one of the -# available transformers. If list, each element can be a string or -# scikit-learn compatible transformer. If None (default), no -# transformation is applied. +# preprocess : str, TransformerLike or list | None +# Transformer to apply to the features. If string, use one of the +# available transformers. If list, each element can be a string or +# scikit-learn compatible transformer. If None (default), no +# transformation is applied. ############################################################################## -# But what if we want to add more pre-processing steps? +# But what if we want to add more preprocessing steps? # For example, in the case that there are many features available, we might # want to reduce the dimensionality of the features before passing them to the # learning algorithm. A commonly used approach is a principal component @@ -145,7 +145,6 @@ # want to keep our previously applied z-scoring, we can simply add the PCA as # another preprocessing step as follows: - scores = run_cross_validation( X=X, y=y, @@ -160,7 +159,7 @@ ############################################################################## # This is nice, but with more steps added to the pipeline this can become -# intransparent. To simplify building complex pipelines, Julearn provides a +# opaque. To simplify building complex pipelines, ``julearn`` provides a # :class:`.PipelineCreator` which helps keeping things neat. # # .. _pipeline_creator: @@ -170,7 +169,7 @@ # # The :class:`.PipelineCreator` is a class that helps the user create complex # pipelines with straightforward usage by adding, in order, the desired steps -# to the pipeline. Once that the pipeline is specified, the +# to the pipeline. Once the pipeline is specified, the # :func:`.run_cross_validation` will detect that it is a pipeline creator and # will automatically create the pipeline and run the cross-validation. # @@ -190,9 +189,9 @@ ############################################################################## # Then we use the ``add`` method to add every desired step to the pipeline. -# Both, the pre-processing steps and the learning algorithm are added in the +# Both, the preprocessing steps and the learning algorithm are added in the # same way. -# As with the :func:`.run_cross_validation` functiona, one can use the names +# As with the :func:`.run_cross_validation` function, one can use the names # of the step as indicated in :ref:`available_pipeline_steps`. creator.add("zscore") @@ -216,10 +215,9 @@ print(scores) - ############################################################################## # Awesome! We covered how to create a basic machine learning pipeline and -# even added multiple feature pre-preprocessing steps. +# even added multiple feature prepreprocessing steps. # # Let's jump to the next important aspect in the process of building a machine # learning model: **Hyperparameters**. We here cover the basics of setting @@ -242,17 +240,17 @@ # support vector machine or the coefficients/weights in a linear or logistic # regression. # -# **Hyperparameters** in turn, are _configuration(s)_ of a learning algorithm, +# **Hyperparameters** in turn, are *configuration(s)* of a learning algorithm, # which cannot be estimated from data, but nevertheless need to be specified to # determine how the model parameters will be learnt. 
The best value for a # hyperparameter on a given problem is usually not known and therefore has to # be either set manually, based on experience from a previous similar problem, -# set by using a heuristic (rule of thumb) or by being _tuned_. Examples are +# set by using a heuristic (rule of thumb) or by being *tuned*. Examples are # the learning rate for training a neural network, the ``C`` and ``sigma`` # hyperparameters for support vector machines or the number of estimators in a # random forest. # -# Manually specifying hyperparameters with Julearn is as simple as using the +# Manually specifying hyperparameters with ``julearn`` is as simple as using the # :class:`.PipelineCreator` and set the hyperparameter when the step is added. # # Let's say we want to set the ``with_mean`` parameter of the z-score @@ -265,10 +263,11 @@ creator.add("svm") print(creator) + ############################################################################### # Usable transformers or estimators can be seen under # :ref:`available_pipeline_steps`. The basis for most of these steps are the -# respective scikit-learn estimators or transformers. To see the valid +# respective ``scikit-learn`` estimators or transformers. To see the valid # hyperparameters for a certain transformer or estimator, just follow the # respective link in :ref:`available_pipeline_steps` which will lead you to the # `scikit-learn`_ documentation where you can read more about them. @@ -284,6 +283,7 @@ creator.add("svm", C=0.9, kernel="linear") print(creator) + ############################################################################### # .. _apply_to_feature_types: # @@ -294,10 +294,10 @@ # :class:`.PipelineCreator` makes things easier. Beside the straightforward # definition of hyperparameters, the :class:`.PipelineCreator` also allows to # specify if a certain step must only be applied to certain features types -# (see:ref:`data_usage` on how to define feature types). +# (see :ref:`data_usage` on how to define feature types). # -# In out example, we can now choose to do two PCA steps, one for the *petal* -# featuers, and one for the *sepal* features. +# In our example, we can now choose to do two PCA steps, one for the *petal* +# features, and one for the *sepal* features. # # First, we need to define the ``X_types`` so we have both *petal* and *sepal* # features: @@ -308,8 +308,8 @@ } ############################################################################### -# Then, we modify the previous creator to add the ``pca`` step to the creator -# and specify that it should only be applied to the *petal* and *sepal* +# Then, we modify the previous creator to add the ``pca`` step to the creator +# and specify that it should only be applied to the *petal* and *sepal* # features. Since we also want the ``zscore`` applied to all features, we need # to specify this as well, indicating that we want to apply it to both # *petal* and *sepal* features: @@ -321,8 +321,9 @@ creator.add("svm") print(creator) + ############################################################################### -# We here additionally had specified as a hyperparameter of the _PCA_ +# We have additionally specified as a hyperparameter of the ``pca`` # that we want to use only the first component. For the ``svm`` we used # the default hyperparameters. 
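+#
+# To evaluate this typed pipeline, the creator is handed over to
+# :func:`.run_cross_validation` together with the ``X_types`` definition. A
+# minimal sketch (assuming the ``df``, ``X`` and ``y`` variables defined
+# earlier in this example) would look like this:
+#
+# .. code-block:: python
+#
+#     scores = run_cross_validation(
+#         X=X,
+#         y=y,
+#         data=df,
+#         X_types=X_types,
+#         model=creator,
+#     )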
#
@@ -342,9 +343,9 @@
 ###############################################################################
 # We covered how to set up basic pipelines, how to use the
 # :class:`.PipelineCreator`, how to use the ``apply_to`` parameter of the
-# :class:`.PipelineCreator` and covered basics of hyperparameters. Additional,
-# we saw a basic use-case of target pre-processing. In the next
-# step we will understand the returns of :func:`.run_cross_validation`, i.e.
+# :class:`.PipelineCreator` and covered basics of hyperparameters. Additionally,
+# we saw a basic use case of target preprocessing. In the next
+# step we will understand the returns of :func:`.run_cross_validation`, i.e.,
 # the model output and the scores of the performed cross-validation.
 #
 # .. topic:: References:
diff --git a/examples/99_docs/run_stacked_models_docs.py b/examples/99_docs/run_stacked_models_docs.py
index 3f844f805..6d537ce26 100644
--- a/examples/99_docs/run_stacked_models_docs.py
+++ b/examples/99_docs/run_stacked_models_docs.py
@@ -5,16 +5,15 @@
 Stacking Models
 ===============
 
-Scikit-learn already provides a stacking implementation for
+``scikit-learn`` already provides a stacking implementation for
 :class:`stacking regression` as well
-as for :class:`stacking classification`
-as well.
-
-Now, scikit-learn's stacking implementation will fit each estimator on all of
-the data. However, this may not always be what you want. Sometimes you want one
-estimator in the ensemble to be fitted on one type of features, while fitting
-another estimator on another type of features. Julearn's API provides some
-extra flexibility to build more flexible and customisable stacking pipelines.
+as for :class:`stacking classification`.
+
+Now, ``scikit-learn``'s stacking implementation will fit each estimator on all
+of the data. However, this may not always be what you want. Sometimes you want
+one estimator in the ensemble to be fitted on one type of features, while fitting
+another estimator on another type of features. ``julearn``'s API provides some
+extra flexibility to build customizable stacking pipelines.
 In order to explore its capabilities, let's first look at this simple example
 of fitting each estimator on all of the data. For example, we can stack a
 support vector regression (SVR) and a random forest regression (RF) to predict
@@ -23,7 +22,7 @@
 Fitting each estimator on all of the features
 ---------------------------------------------
 First, of course, let's import some necessary packages. Let's also configure
-Julearn's logger to get some additional information about what is happening:
+``julearn``'s logger to get some additional information about what is happening:
 """
 
 from sklearn.datasets import make_regression
@@ -38,22 +37,22 @@
 
 ###############################################################################
 # Now, that we have these out of the way, we can create some artificial toy
-# data to demonstrate a very simple stacking estimator within Julearn. We will
-# use a dataset with 20 features and 200 samples.
+# data to demonstrate a very simple stacking estimator within ``julearn``. We
+# will use a dataset with 20 features and 200 samples.
-# prepare data +# Prepare data X, y = make_regression(n_features=20, n_samples=200) -# make dataframe +# Make dataframe X_names = [f"feature_{x}" for x in range(1, 21)] data = pd.DataFrame(X) data.columns = X_names data["target"] = y ############################################################################### -# To build a stacking pipeline, we have to initialise each estimator that we +# To build a stacking pipeline, we have to initialize each estimator that we # want to use in stacking, and then of course the stacking estimator itself. -# Let's start by initialising an SVR. For this we can use the +# Let's start by initializing an SVR. For this we can use the # :class:`.PipelineCreator`. Keep in mind that this is only an example, and # the hyperparameter grids we use here are somewhat arbitrary: @@ -76,7 +75,7 @@ ############################################################################### # We can now provide these two models to a :class:`.PipelineCreator` to -# initialise a stacking model. The interface for this is very similar to a +# initialize a stacking model. The interface for this is very similar to a # :class:`sklearn.pipeline.Pipeline`: # Create the stacking model @@ -89,7 +88,7 @@ ############################################################################### # This final stacking :class:`.PipelineCreator` can now simply be handed over -# to Julearn's :func:`.run_cross_validation`: +# to ``julearn``'s :func:`.run_cross_validation`: scores, final = run_cross_validation( X=X_names, @@ -104,29 +103,28 @@ # Fitting each estimator on a specific feature type # ------------------------------------------------- # -# As you can see, fitting a standard scikit-learn stacking estimator is -# relatively simple with Julearn. However, sometimes it may be desirable to +# As you can see, fitting a standard ``scikit-learn`` stacking estimator is +# relatively simple with ``julearn``. However, sometimes it may be desirable to # have a bit more control over which features are used to fit each estimator. # For example, there may be two types of features. One of these feature types # we may want to use for fitting the SVR, and one of these feature types we # may want to use for fitting the RF. To demonstrate how this can be done in -# Julearn, let's now create some very similar toy data, but distinguish +# ``julearn``, let's now create some very similar toy data, but distinguish # between two different types of features: ``"type1"`` and ``"type2"``. 
-
-# prepare data
+# Prepare data
 X, y = make_regression(n_features=20, n_samples=200)
 
-# prepare feature names and types
+# Prepare feature names and types
 X_types = {
     "type1": [f"type1_{x}" for x in range(1, 11)],
     "type2": [f"type2_{x}" for x in range(1, 11)],
 }
 
-# first 10 features are "type1", second 10 features are "type2"
+# First 10 features are "type1", second 10 features are "type2"
 X_names = X_types["type1"] + X_types["type2"]
 
-# make df, apply correct column names according to X_names
+# Make dataframe, apply correct column names according to ``X_names``
 data = pd.DataFrame(X)
 data.columns = X_names
 data["target"] = y
@@ -140,7 +138,7 @@
 model_1.add("svm", kernel="linear", C=np.geomspace(1e-2, 1e2, 10))
 
 ###############################################################################
-# Afterwards, lets configure a :class:`.PipelineCreator` to fit a RF on the
+# Afterwards, let's configure a :class:`.PipelineCreator` to fit a RF on the
 # features of ``"type2"``:
 
 model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
@@ -166,7 +164,7 @@
     apply_to="*",
 )
 
-# run
+# Run
 scores, final = run_cross_validation(
     X=X_names,
     X_types=X_types,
diff --git a/examples/99_docs/run_target_transformer_docs.py b/examples/99_docs/run_target_transformer_docs.py
index 3de7bd472..f7f27a003 100644
--- a/examples/99_docs/run_target_transformer_docs.py
+++ b/examples/99_docs/run_target_transformer_docs.py
@@ -1,6 +1,7 @@
 # Authors: Vera Komeyer
 # Fede Raimondo
 # License: AGPL
+
 """
 Applying preprocessing to the target
 ------------------------------------
@@ -11,18 +12,17 @@
 when having a regression-task (continuous target variable), one might want to
 predict the z-scored target. This can be achieved by using a
 :class:`.TargetPipelineCreator`
-as one step in the general pipeline.
+as a step in the general pipeline.
 
-Lets start by loading the data and importing the required modules:
+Let's start by loading the data and importing the required modules:
 """
 import pandas as pd
 
 from julearn import run_cross_validation
 from julearn.pipeline import PipelineCreator, TargetPipelineCreator
-
 from sklearn.datasets import load_diabetes
 
 ###############################################################################
-# Load the diabetes dataset from sklearn as a pandas dataframe
+# Load the diabetes dataset from ``scikit-learn`` as a ``pandas.DataFrame``
 features, target = load_diabetes(return_X_y=True, as_frame=True)
 print("Features: \n", features.head())
@@ -48,8 +48,8 @@
 ##############################################################################
 # Next, we create the general pipeline using a :class:`.PipelineCreator`. We
-# pass the ``target_creator`` as one step of the pipeline and specify that it
-# should only be applied to the ``target``, which makes it clear for Julearn
+# pass the ``target_creator`` as a step of the pipeline and specify that it
+# should only be applied to the ``target``, which makes it clear for ``julearn``
 # to only apply it to ``y``:
 
 creator = PipelineCreator(
@@ -73,5 +73,5 @@
 # feature and target transformations. However, features transformations can be
 # directly specified as step in the :class:`.PipelineCreator`, while target
 # transformations have to be specified using the
-# :class:`.TargetPipelineCreator`, which is then passed to the overall
+# :class:`.TargetPipelineCreator`, which is then passed to the overall
 # :class:`.PipelineCreator` as an extra step.
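Because the target-transformer example above is only visible in fragments, here is a rough end-to-end sketch of the workflow it describes. The ``"continuous"`` X-type, the ``zscore`` target step and the ``linreg`` model are assumptions made for illustration, not lines taken from the example.

```python
# Rough sketch only -- X-type name, target step and model are assumptions.
import pandas as pd
from sklearn.datasets import load_diabetes

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator

# Diabetes data as a single DataFrame with a "target" column
features, target = load_diabetes(return_X_y=True, as_frame=True)
data = pd.concat([features, target], axis=1)
X = list(features.columns)

# Target pipeline: z-score the continuous target
target_creator = TargetPipelineCreator()
target_creator.add("zscore")

# General pipeline: the target pipeline is just another step,
# restricted to the target
creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add(target_creator, apply_to="target")
creator.add("linreg")

scores = run_cross_validation(
    X=X,
    y="target",
    data=data,
    X_types={"continuous": X},
    model=creator,
)
print(scores)
```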
diff --git a/examples/README.rst b/examples/README.rst
index 391712353..1f30d9f31 100644
--- a/examples/README.rst
+++ b/examples/README.rst
@@ -1,4 +1,4 @@
 Examples
 ========
 
-The following are a set of examples that use julearn.
\ No newline at end of file
+The following is a set of examples that use ``julearn``.
diff --git a/examples/XX_disabled/dis_run_n_jobs.py b/examples/XX_disabled/dis_run_n_jobs.py
index 6f9fc02fd..20d745d72 100644
--- a/examples/XX_disabled/dis_run_n_jobs.py
+++ b/examples/XX_disabled/dis_run_n_jobs.py
@@ -1,6 +1,6 @@
 """
-Parallelize Julearn
-===================
+Parallelize ``julearn``
+=======================
 
 In this example we will parallelize outer cross-validation
 and/or inner cross-validation for hyperparameter search.
@@ -8,22 +8,21 @@
 .. include:: ../../links.inc
 """
 # Authors: Sami Hamdan
-#
 # License: AGPL
+
 from seaborn import load_dataset
 
 from julearn import run_cross_validation
 
 ###############################################################################
-# prepare some simple standard input
+# Prepare some simple standard input.
 df_iris = load_dataset("iris")
 df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]
 X = ["sepal_length", "sepal_width", "petal_length"]
 y = "species"
 
-
 ###############################################################################
-# run without any parallelization
+# Run without any parallelization.
 model_params = {
     "svm__C": [1, 2, 3],
 }
@@ -38,11 +37,10 @@
 )
 
 ###############################################################################
-# To add parallelization to the outer cross-validation we
-# will add the n_jobs argument to run_cross_validation.
-# We can use verbose > 0 to get more information
-# about the parallelization done.
-# Here, I will set the parallel jobs to 2.
+# To add parallelization to the outer cross-validation, we will add the ``n_jobs``
+# argument to ``run_cross_validation``. We can use ``verbose > 0`` to get more
+# information about the parallelization done. Here, we'll set the parallel jobs
+# to 2.
 scores = run_cross_validation(
     X=X,
     y=y,
@@ -54,13 +52,12 @@
     verbose=3,
 )
 
-
 ###############################################################################
 # We can also parallelize over the hyperparameter search/inner cv.
-# This will work by using the n_jobs argument of the searcher itself, e.g.
-# by default `sklearn.model_selection.GridSearchCV`.
-# To adjust the parameters of the search we have to use the search_params
-# inside of the model_params like this:
+# This will work by using the ``n_jobs`` argument of the searcher itself, e.g.,
+# by default :class:`sklearn.model_selection.GridSearchCV`.
+# To adjust the parameters of the search, we have to use the ``search_params``
+# argument like this:
 model_params = dict(
     svm__C=[1, 2, 3],
 )
@@ -76,9 +73,8 @@
     search_params=search_params,
 )
 
-
 ###############################################################################
-# Depending on your resources you can use n_jobs for outer cv, inner cv or
-# even as a model_parameter for some models like `rf`.
-# Additionally, you can also use the scikit-learn's `parallel_backend` for
+# Depending on your resources, you can use ``n_jobs`` for the outer CV, the inner
+# CV, or even as a model parameter for some models like ``rf``.
+# Additionally, you can use ``scikit-learn``'s ``parallel_backend`` for
 # parallelization.
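Putting the two parallelization levels of this example together, a condensed sketch could look as follows. The grid values are illustrative, and ``problem_type="classification"`` plus ``search_params={"n_jobs": 2}`` are assumptions based on the surrounding text rather than lines shown in the diff.

```python
# Condensed sketch -- grid values and n_jobs settings are illustrative.
from seaborn import load_dataset

from julearn import run_cross_validation

df_iris = load_dataset("iris")
df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]
X = ["sepal_length", "sepal_width", "petal_length"]
y = "species"

model_params = {"svm__C": [1, 2, 3]}  # grid triggers a hyperparameter search
search_params = {"n_jobs": 2}         # parallelize the inner (search) CV

scores = run_cross_validation(
    X=X,
    y=y,
    data=df_iris,
    model="svm",
    problem_type="classification",
    model_params=model_params,
    search_params=search_params,
    n_jobs=2,    # parallelize the outer CV
    verbose=3,   # show joblib's parallelization messages
)
```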
diff --git a/examples/XX_disabled/dis_run_target_confound_removal.py b/examples/XX_disabled/dis_run_target_confound_removal.py
index 84f36c8c2..269974412 100644
--- a/examples/XX_disabled/dis_run_target_confound_removal.py
+++ b/examples/XX_disabled/dis_run_target_confound_removal.py
@@ -2,33 +2,27 @@
 Confound Removal (model comparison)
 ===================================
 
-This example uses the 'iris' dataset, performs simple binary classification
+This example uses the ``iris`` dataset and performs simple binary classification
 with and without confound removal using a Random Forest classifier.
 """
-
 # Authors: Shammi More
 # Federico Raimondo
 # Leonard Sasse
 # License: AGPL
 
-import matplotlib.pyplot as plt
-import pandas as pd
-import seaborn as sns
 from seaborn import load_dataset
 
 from julearn import run_cross_validation
-from julearn.model_selection import StratifiedBootstrap
 from julearn.pipeline import PipelineCreator, TargetPipelineCreator
 from julearn.utils import configure_logging
 
-
 ###############################################################################
-# Set the logging level to info to see extra information
+# Set the logging level to info to see extra information.
 configure_logging(level="INFO")
 
 ###############################################################################
-# Load the iris data from seaborn
+# Load the iris data from seaborn.
 df_iris = load_dataset("iris")
 
 ###############################################################################
@@ -36,18 +30,15 @@
 # classification.
 df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]
 
-
 ###############################################################################
 # As features, we will use the sepal length, width and petal length and use
 # petal width as confound.
-
 X = ["sepal_length", "sepal_width"]
 y = "petal_length"
 confounds = ["petal_width"]
 
-
-# In order to tell 'run_cross_validation' which columns are confounds,
-# and which columns are features, we have to define the X_types:
+# In order to tell ``run_cross_validation`` which columns are confounds,
+# and which columns are features, we have to define the ``X_types``:
 X_types = {"features": X, "confound": confounds}
 
 target_creator = TargetPipelineCreator()
@@ -71,4 +62,4 @@
     pos_labels=["virginica"],
 )
 
-print(scores_cr)
\ No newline at end of file
+scores_cr
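The tail of this example is cut off in the diff, so the following hedged sketch shows how the target confound removal it sets up might be completed. The ``confound_removal`` target step with a ``confounds="confound"`` argument, the ``rf`` model and the final call are assumptions based on the surrounding text; ``pos_labels`` is omitted here because the sketch treats ``petal_length`` as a regression target.

```python
# Hedged sketch of the set-up described above -- the target step's
# arguments and the model are assumptions, not code from the example.
from seaborn import load_dataset

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator

df_iris = load_dataset("iris")
df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]

X = ["sepal_length", "sepal_width"]
y = "petal_length"
confounds = ["petal_width"]
X_types = {"features": X, "confound": confounds}

# Remove the confound from the target before the model is fit
target_creator = TargetPipelineCreator()
target_creator.add("confound_removal", confounds="confound")

creator = PipelineCreator(problem_type="regression", apply_to="features")
creator.add(target_creator, apply_to="target")
creator.add("rf")

scores_cr = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df_iris,
    model=creator,
)
print(scores_cr)
```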