Skip to content

Elegant πŸ“‘ text preprocessing pipeline 🚰 available as pip package 🐍 based on scikit-learn pipeline. Combines Transformer and Column Transformer into a single object.

Notifications You must be signed in to change notification settings

szymonrucinski/pippi-lang

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

28 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Text cleaning Pipeline

Build package Check styleRun Tests


Description

This code contains a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, HTML tags, changing letter size, and removing punctuation. Future code will include text-transformations like word-embedding and word-vectorization.

To install this package simply run:

pip install pippi-lang

Example

Elegant data pipelines are a key component of any data science project. They allow you to automate the process of cleaning, transforming, and analyzing data. This code is a simple example of how to create a pipeline for text data using cutom transformers and the sklearn Pipeline class.

from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,
    RemovePunctuation,
    RemoveHTMLTags,
)
from sklearn.pipeline import Pipeline
import pandas as pd

    pipeline = Pipeline(
        steps=[
            ("remove_stop_words", RemoveStopWords(columns=["review","sentiment"])),
            ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
            ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
            ("remove_punctuation", RemovePunctuation(columns=["review"])),
        ]
    )
    output = pipeline.fit_transform(df)
    df = pd.DataFrame(output, columns=["review", "sentiment"])

Pipeline Visualization:

[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] ->   [RemovePunctuation]

About

Elegant πŸ“‘ text preprocessing pipeline 🚰 available as pip package 🐍 based on scikit-learn pipeline. Combines Transformer and Column Transformer into a single object.

Topics

Resources

Stars

Watchers

Forks