Featurefilter

Featurefilter is a Python library for removing uninformative variables from datasets.

Features

  • 100% test coverage
  • Pandas backend
  • Support for scikit-learn pipelines
  • Support for scikit-learn selectors
  • PySpark backend (planned for version 0.2)

Usage Examples

All examples can also be found in the example notebook.

Remove columns with too many NA values

import numpy as np
import pandas as pd

from featurefilter import NaFilter

df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, np.nan]})

na_filter = NaFilter(max_na_ratio=0.5)
na_filter.fit_transform(df)
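
Here, column 'A' is missing two of its three values (an NA ratio of roughly 0.67, above the 0.5 threshold), while 'B' is missing only one, so 'A' should be the column dropped. Assuming the filter returns the filtered DataFrame (the pandas backend), the result can be checked like this:

filtered_df = na_filter.fit_transform(df)
print(list(filtered_df.columns))  # expected: ['B']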

Remove columns with too low or too high variance

import pandas as pd

from featurefilter import VarianceFilter

df = pd.DataFrame({'A': [0., 1.], 'B': [0., 0.]})

variance_filter = VarianceFilter()
variance_filter.fit_transform(df)
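
Column 'B' is constant and therefore has zero variance, while 'A' varies, so 'B' should be the column removed. As above, assuming the filtered DataFrame is returned:

filtered_df = variance_filter.fit_transform(df)
print(list(filtered_df.columns))  # expected: ['A']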

Remove columns with too high correlation to the target variable

import pandas as pd

from featurefilter import TargetCorrelationFilter

df = pd.DataFrame({'A': [0, 0], 'B': [0, 1], 'Y': [0, 1]})

target_correlation_filter = TargetCorrelationFilter(target_column='Y')
target_correlation_filter.fit_transform(df)
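
In this example column 'B' is identical to the target 'Y' (a correlation of 1.0), while 'A' is constant, so 'B' should be the column flagged and removed:

filtered_df = target_correlation_filter.fit_transform(df)
print('B' in filtered_df.columns)  # expected: False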

Remove columns using generalized linear models (GLMs)

import pandas as pd

from featurefilter import GLMFilter

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': [0, 0, 1, 1]})

glm_filter = GLMFilter(target_column='Y', top_features=1)
glm_filter.fit_transform(df)
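
With top_features=1 only the single most predictive column should be kept. Here 'A' matches 'Y' exactly while 'B' carries no signal, so 'A' is expected to survive (this assumes the filter ranks columns by the fitted GLM's coefficients):

filtered_df = glm_filter.fit_transform(df)
print('A' in filtered_df.columns, 'B' in filtered_df.columns)  # expected: True False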

Remove columns using tree-based models

import pandas as pd

from featurefilter import TreeBasedFilter

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': ['a', 'a', 'b', 'b']})

tree_based_filter = TreeBasedFilter(target_column='Y',
                                    categorical_target=True,
                                    top_features=1)
tree_based_filter.fit_transform(df)
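
Because 'Y' is categorical, categorical_target=True presumably makes the filter fit a classification tree rather than a regression tree. Column 'A' separates the classes 'a' and 'b' perfectly while 'B' does not, so with top_features=1 'A' should be the column kept:

filtered_df = tree_based_filter.fit_transform(df)
print('A' in filtered_df.columns, 'B' in filtered_df.columns)  # expected: True False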

Remove columns using multiple filters combined with scikit-learn's Pipeline API

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline

from featurefilter import NaFilter, VarianceFilter

df = pd.DataFrame({'A': [0, np.nan, np.nan],
                   'B': [0, 0, 0],
                   'C': [0, np.nan, 1]})

pipeline = Pipeline([
    ('na_filter', NaFilter(max_na_ratio=0.5)),
    ('variance_filter', VarianceFilter())
])

pipeline.fit_transform(df)
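
The steps are applied in order: the NaFilter should drop 'A' (two of three values missing), the VarianceFilter should then drop the constant column 'B', and 'C' (one missing value, non-zero variance) is expected to pass both filters:

filtered_df = pipeline.fit_transform(df)
print(list(filtered_df.columns))  # expected: ['C']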

Remove columns using existing selectors provided by scikit-learn

import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

from featurefilter import SklearnWrapper

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [0, 1, 0, 1],
                   'Y': [0, 0, 1, 1]})

model = RFECV(LinearRegression(),
              min_features_to_select=1,
              cv=3)
selector = SklearnWrapper(model, target_column='Y')
selector.fit_transform(df)
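
On this toy frame 'A' tracks 'Y' exactly while 'B' does not, so RFECV is expected to rank 'A' highest and, with min_features_to_select=1, eliminate 'B' (recursive feature elimination on four rows is only illustrative, so treat this as a sketch):

filtered_df = selector.fit_transform(df)
print('A' in filtered_df.columns)  # expected: True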
