Simple Pipeline Example

The Dataset

The following information was provided with the dataset when I downloaded it:

Thunder Basin Antelope Study

The data (X1, X2, X3, X4) are for each year.

X1 = spring fawn count/100
X2 = size of adult antelope population/100
X3 = annual precipitation (inches)
X4 = winter severity index (1=mild, 5=severe)

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.base import BaseEstimator
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression

antelope_df = pd.read_csv("antelope.csv")

antelope_df


	spring_fawn_count	adult_antelope_population	annual_precipitation	winter_severity_index
0	2.9	9.2	13.2	2.0
1	2.4	8.7	11.5	3.0
2	2.0	7.2	10.8	4.0
3	2.3	8.5	12.3	2.0
4	3.2	9.6	12.6	3.0
5	1.9	6.8	10.6	5.0
6	3.4	9.7	14.1	1.0
7	2.1	7.9	11.2	3.0

X = antelope_df.drop("spring_fawn_count", axis=1)
y = antelope_df["spring_fawn_count"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=3)

Code without a Pipeline

For the sake of example, let's say we want to add a binary column, low_precipitation, that indicates whether the annual precipitation was below 12 inches. (The transformer below keeps the original annual_precipitation column alongside the new one.)

class PrecipitationTransformer(BaseEstimator):
    """Adds a binary low_precipitation column derived from annual_precipitation
    
    Note: this class will be used inside a scikit-learn Pipeline
    
    Attributes:
        verbose: if True, prints out when fitting or transforming is happening
        
    Methods:
        _is_low(): returns 1 if record has precipitation below 12; 0 otherwise
        
        fit(): learns nothing from the data; returns self so the transformer
               can be used as a Pipeline step
               
        transform(): returns a copy of X with the low_precipitation column added
    """
    
    def __init__(self, verbose=False):
        self.verbose = verbose
    
    def fit(self, X, y=None):
        if self.verbose:
            print("fitting (PrecipitationTransformer)")
        return self
    
    
    def _is_low(self, annual_precipitation):
        """Flag if precipitation is less than 12"""
        if annual_precipitation < 12:
            return 1
        else:
            return 0
    
    
    def transform(self, X, y=None):
        """Copies X and modifies it before returning X_new"""
        if self.verbose:
            print("transforming (PrecipitationTransformer)")
        X_new = X.copy()
        X_new["low_precipitation"] = X_new["annual_precipitation"].apply(self._is_low)
        
        return X_new

We could use this custom transformer by itself:

precip_transformer = PrecipitationTransformer()
precip_transformer.fit(X_train)
X_train_precip_transformed = precip_transformer.transform(X_train)
X_train_precip_transformed


	adult_antelope_population	annual_precipitation	winter_severity_index	low_precipitation
7	7.9	11.2	3.0	1
2	7.2	10.8	4.0	1
4	9.6	12.6	3.0	0
3	8.5	12.3	2.0	0
6	9.7	14.1	1.0	0

We also could use a OneHotEncoder without a pipeline:

(winter_severity_index appears numeric but the data dictionary indicates that it's categorical)

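# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse_output=False on newer versions
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")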
ohe.fit(X_train_precip_transformed[["winter_severity_index"]])
winter_severity_encoded = pd.DataFrame(ohe.transform(X_train_precip_transformed[["winter_severity_index"]]), index=X_train_precip_transformed.index)
X_train_winter_transformed = pd.concat([winter_severity_encoded, X_train_precip_transformed], axis=1)
X_train_winter_transformed.drop("winter_severity_index", axis=1, inplace=True)
X_train_winter_transformed


	0	1	2	3	adult_antelope_population	annual_precipitation	low_precipitation
7	0.0	0.0	1.0	0.0	7.9	11.2	1
2	0.0	0.0	0.0	1.0	7.2	10.8	1
4	0.0	0.0	1.0	0.0	9.6	12.6	0
3	0.0	1.0	0.0	0.0	8.5	12.3	0
6	1.0	0.0	0.0	0.0	9.7	14.1	0
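
To make the encoding step concrete, here is a minimal pure-Python sketch of what the encoder does with this column. It is an illustration, not scikit-learn's implementation, and the helper names fit_categories and one_hot are hypothetical. The training values come from the table above; the test values are the winter_severity_index of the three held-out rows (0, 1 and 5), listed here in index order.

```python
def fit_categories(values):
    """Learn the sorted set of distinct categories seen during fitting."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a row of 0.0/1.0 flags, one flag per known category.

    A value that was not seen during fitting gets a row of all zeros,
    which mirrors the behaviour of handle_unknown="ignore".
    """
    return [[1.0 if value == cat else 0.0 for cat in categories] for value in values]


# winter_severity_index for training rows 7, 2, 4, 3, 6
train_severity = [3.0, 4.0, 3.0, 2.0, 1.0]
# winter_severity_index for held-out rows 0, 1, 5
test_severity = [2.0, 3.0, 5.0]

categories = fit_categories(train_severity)
encoded_test = one_hot(test_severity, categories)

print(categories)    # [1.0, 2.0, 3.0, 4.0]
print(encoded_test)  # [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

The all-zero last row shows why handle_unknown="ignore" matters here: severity 5 never appears in the training rows, so the encoder has no column for it.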

Then we could fit a model on the training set and evaluate it on the test set:

# instantiate model
model = LinearRegression()

# fit on training data
model.fit(X_train_winter_transformed, y_train)

# transform test data
X_test_precip_transformed = precip_transformer.transform(X_test)
test_winter_severity_encoded = pd.DataFrame(
    ohe.transform(X_test_precip_transformed[["winter_severity_index"]]), index=X_test_precip_transformed.index)
X_test_winter_transformed = pd.concat([test_winter_severity_encoded, X_test_precip_transformed], axis=1)
X_test_winter_transformed.drop("winter_severity_index", axis=1, inplace=True)

# evaluate on test data
model.score(X_test_winter_transformed, y_test)

0.4748448011930302

Not a very good score! But this is basically fake data anyway, and the test set has only 3 rows.
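
For reference, the number returned by score for a LinearRegression is the coefficient of determination, R². The sketch below computes it by hand in plain Python. The function name r_squared and the toy values are made up for illustration and are not taken from the antelope data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, for illustration only
y_true = [1.0, 2.0, 3.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.0, 2.0, 2.0])  # always predict the mean of y_true

print(perfect)    # 1.0
print(mean_only)  # 0.0
```

A score of 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and the value can be negative for a model that does worse than that.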

Let's show that same logic with a pipeline instead.

Code with a Pipeline

Let's add the steps one at a time.

First, just the custom transformer. Let's use verbose=True so we can see when it is fitting and transforming:

pipe1 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True))
])

pipe1.fit(X_train, y_train)

fitting (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True))],
         verbose=False)

pipe1.transform(X_train)

transforming (PrecipitationTransformer)


	adult_antelope_population	annual_precipitation	winter_severity_index	low_precipitation
7	7.9	11.2	3.0	1
2	7.2	10.8	4.0	1
4	9.6	12.6	3.0	0
3	8.5	12.3	2.0	0
6	9.7	14.1	1.0	0

Now add the OneHotEncoder. We have to wrap it inside a ColumnTransformer because it only applies to certain columns (we don't want to one-hot encode the entire dataframe).

pipe2 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True)),
    ("encode_winter", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"), ["winter_severity_index"])], remainder="passthrough"))
])

pipe2.fit(X_train, y_train)

fitting (PrecipitationTransformer)
transforming (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True)),
                ('encode_winter',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=False),
                                                  ['winter_severity_index'])],
                                   verbose=False))],
         verbose=False)

Note that fitting the pipeline now also calls transform on the PrecipitationTransformer, because the next step (the OHE) has to be fitted on the transformed output of the previous step. The OHE is the last step, so it is fitted but not yet asked to transform anything.
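
The order of those calls can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not scikit-learn's actual implementation; LoggingStep and fit_pipeline are hypothetical names introduced only for this example.

```python
class LoggingStep:
    """Hypothetical stand-in for a transformer that records which methods get called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def fit(self, X, y=None):
        self.log.append(f"fit {self.name}")
        return self

    def transform(self, X):
        self.log.append(f"transform {self.name}")
        return X


def fit_pipeline(steps, X, y=None):
    """Fit and transform every step except the last, then only fit the last one."""
    *intermediate, final = steps
    for step in intermediate:
        # Each later step is fitted on the output of the steps before it
        X = step.fit(X, y).transform(X)
    final.fit(X, y)
    return steps


log = []
fit_pipeline([LoggingStep("precip", log), LoggingStep("ohe", log)], X=[[1.0]])
print(log)  # ['fit precip', 'transform precip', 'fit ohe']
```

The log matches the printed output above: one fit and one transform for the first step, and only a fit for the last step.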

pipe2.transform(X_train)

transforming (PrecipitationTransformer)





array([[ 0.        ,  0.        ,  1.        ,  0.        ,  7.9000001 ,
        11.19999981,  1.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ,  7.19999981,
        10.80000019,  1.        ],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  9.6       ,
        12.60000038,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  8.5       ,
        12.30000019,  0.        ],
       [ 1.        ,  0.        ,  0.        ,  0.        ,  9.69999981,
        14.10000038,  0.        ]])

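We have lost the column labels at this point because the ColumnTransformer returns a NumPy array, and the columns are in a different order from the input dataframe (the one-hot encoded columns come first, followed by the passthrough columns), but these are the same 7 columns we had at this point without the pipeline.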
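
As a rough way to keep track of which array column is which, the ordering can be reproduced by hand. The sketch below is plain Python rather than a scikit-learn API; passthrough_column_order is a hypothetical helper, and the encoded column labels are names chosen for this example.

```python
def passthrough_column_order(input_columns, selected_columns, encoded_names):
    """Encoded columns first, then the untouched columns in their original order."""
    remainder = [col for col in input_columns if col not in selected_columns]
    return list(encoded_names) + remainder


# Columns coming out of the PrecipitationTransformer step
input_columns = [
    "adult_antelope_population",
    "annual_precipitation",
    "winter_severity_index",
    "low_precipitation",
]

# One label per category seen in the training rows (labels are illustrative)
encoded_names = [f"winter_severity_index_{cat}" for cat in (1.0, 2.0, 3.0, 4.0)]

output_columns = passthrough_column_order(
    input_columns, ["winter_severity_index"], encoded_names
)
for position, name in enumerate(output_columns):
    print(position, name)
```

This gives four encoded columns followed by adult_antelope_population, annual_precipitation and low_precipitation, which lines up with the 7 values in each row of the array above.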

We could stop right here and use the pipeline for preprocessing, but leave the model out of the pipeline:

model = LinearRegression()
model.fit(pipe2.transform(X_train), y_train)
model.score(pipe2.transform(X_test), y_test)

transforming (PrecipitationTransformer)
transforming (PrecipitationTransformer)





0.4748448011930302

Or we could go one step further and add the model to the pipeline:

pipe3 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True)),
    ("encode_winter", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"), ["winter_severity_index"])], remainder="passthrough")),
    ("linreg_model", LinearRegression())
])

pipe3.fit(X_train, y_train)

fitting (PrecipitationTransformer)
transforming (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True)),
                ('encode_winter',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=False),
                                                  ['winter_severity_index'])],
                                   verbose=False)),
                ('linreg_model',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

pipe3.score(X_test, y_test)

transforming (PrecipitationTransformer)





0.4748448011930302
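
The single "transforming" line printed by pipe3.score reflects that scoring only pushes the test data through the already-fitted steps; nothing is refitted on the test set. A simplified sketch of that flow, with hypothetical stand-in classes (this is not scikit-learn's code, and the returned score is a placeholder):

```python
class LoggingTransformer:
    """Hypothetical transformer that records each call and passes X through unchanged."""

    def __init__(self, log):
        self.log = log

    def fit(self, X, y=None):
        self.log.append("fit transformer")
        return self

    def transform(self, X):
        self.log.append("transform transformer")
        return X


class LoggingModel:
    """Hypothetical final estimator that records each call."""

    def __init__(self, log):
        self.log = log

    def fit(self, X, y=None):
        self.log.append("fit model")
        return self

    def score(self, X, y):
        self.log.append("score model")
        return 0.0  # placeholder value, not a real metric


def score_pipeline(steps, X, y):
    """Transform X with every intermediate step, then score with the final step."""
    *intermediate, final = steps
    for step in intermediate:
        X = step.transform(X)  # transform only: the steps were fitted on training data
    return final.score(X, y)


log = []
steps = [LoggingTransformer(log), LoggingModel(log)]
score_pipeline(steps, X=[[1.0]], y=[1.0])
print(log)  # ['transform transformer', 'score model']
```

Because the test data never reaches fit, the preprocessing learned from the training set is applied unchanged, which is the main safeguard a full pipeline provides.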
