Skip to content

Latest commit

 

History

History
46 lines (37 loc) · 2.08 KB

README.md

File metadata and controls

46 lines (37 loc) · 2.08 KB

Text cleaning Pipeline

Build package Check styleRun Tests


Description

This code contains a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, HTML tags, changing letter size, and removing punctuation. Future code will include text-transformations like word-embedding and word-vectorization.

To install this package simply run:

pip install pippi-lang

Example

Elegant data pipelines are a key component of any data science project. They allow you to automate the process of cleaning, transforming, and analyzing data. This code is a simple example of how to create a pipeline for text data using cutom transformers and the sklearn Pipeline class.

from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,
    RemovePunctuation,
    RemoveHTMLTags,
)
from sklearn.pipeline import Pipeline
import pandas as pd

    pipeline = Pipeline(
        steps=[
            ("remove_stop_words", RemoveStopWords(columns=["review","sentiment"])),
            ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
            ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
            ("remove_punctuation", RemovePunctuation(columns=["review"])),
        ]
    )
    output = pipeline.fit_transform(df)
    df = pd.DataFrame(output, columns=["review", "sentiment"])

Pipeline Visualization:

[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] ->   [RemovePunctuation]