diff --git a/docs_sources/index.Rmd b/docs_sources/index.Rmd
index 9b79b88e..18335dda 100644
--- a/docs_sources/index.Rmd
+++ b/docs_sources/index.Rmd
@@ -12,7 +12,7 @@ These datasets cover a broad range of applications including binary/multi-class
 In the interactive [plotly](https://plotly.com/) chart below, each dot represents a dataset colored based on its associated task (classification vs. regression).
 In log scale, the *x* and *y* axis shows the number of observations and features respectively.
 Please click on the legend to hide/show the groups of datasets.
-Click on each dot to access the dataset's [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+Click on each dot to access the dataset's [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.

 *Note*: If a dataset has more than 20 features, we randomly chose 20 to be displayed in its profiling report.
 Therefore, please disregard the `Number of variables` in the corresponding report and, instead, use the correct `n_features` in the chart and table below.
@@ -84,7 +84,7 @@ ply

 Browse, sort, filter and search the complete table of summary statistics below.

-* Click on the dataset's name to access its [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+* Click on the dataset's name to access its [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.
 * Click on the GitHub Octocat to access its metadata.


diff --git a/paper/paper.md b/paper/paper.md
index e17f4c63..b35599a0 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -122,10 +122,10 @@ API reference guides that detail all user-facing functions and variables in PMLB

 ## Pandas profiling reports

-For each dataset, we use [`pandas-profiling`](https://pandas-profiling.github.io/pandas-profiling/) to generate summary statistic reports.
-In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `pandas-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
+For each dataset, we use [`ydata-profiling`](https://docs.profiling.ydata.ai/latest/) to generate summary statistic reports.
+In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `ydata-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
 Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes.
-For example, if a feature is flagged by `pandas-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
+For example, if a feature is flagged by `ydata-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
 The profiling reports can be accessed by clicking on the dataset name in the interactive data table or the data point in the interactive chart on the PMLB website.
 Alternatively, all reports can be viewed on the repository's [gh-pages](https://github.com/EpistasisLab/pmlb/tree/gh-pages/profile) branch, or generated manually by users on their local computing resources.
diff --git a/pmlb/profiling.py b/pmlb/profiling.py
index 9d80a7c1..d80bfe16 100644
--- a/pmlb/profiling.py
+++ b/pmlb/profiling.py
@@ -3,7 +3,7 @@ import subprocess

 import pandas as pd

-from pandas_profiling import ProfileReport
+from ydata_profiling import ProfileReport

 from .pmlb import (
     fetch_data, get_updated_datasets, last_commit_message
diff --git a/setup.py b/setup.py
index c3553dcb..031086af 100644
--- a/setup.py
+++ b/setup.py
@@ -41,7 +41,7 @@ def calculate_version():
     ],
     extras_require={
         'dev': ['nose', 'numpy', 'scipy', 'tabulate', 'parameterized',
-                'matplotlib', 'seaborn', 'pandas-profiling'],
+                'matplotlib', 'seaborn', 'ydata-profiling'],
     },
     classifiers=[
         'Intended Audience :: Developers',