Preprocessor is a Python library for preprocessing tweet data.
When building machine learning systems based on tweet data, a preprocessing step is required. This library makes it easy to clean, parse, or tokenize tweets.
It currently supports cleaning, tokenizing, and parsing:
- URLs
- Hashtags
- Mentions
- Reserved words (RT, FAV)
- Emojis
- Smileys
Preprocessor supports Python 2.7 and 3.3+.
>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'
>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
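
The index values in the parse example are character offsets into the tweet, with an exclusive end index. The sketch below reproduces them using only the standard library. It is not the library's implementation: the `Span` type, the `find_urls` helper, and the simplified URL pattern are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the library's match objects; not its real class.
Span = namedtuple("Span", ["start_index", "end_index", "match"])

# Simplified URL pattern, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return a Span for every URL-like substring in text."""
    return [Span(m.start(), m.end(), m.group())
            for m in URL_PATTERN.finditer(text)]


urls = find_urls("Preprocessor is #awesome https://github.com/s/preprocessor")
print(urls[0].start_index, urls[0].end_index)  # 25 58
print(urls[0].match)  # https://github.com/s/preprocessor
```

Slicing the original string with `text[25:58]` returns the same URL, which is why an exclusive end index is convenient here.
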
>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
By default, Preprocessor applies all of the options listed below. To restrict it to specific options, pass them to p.set_options as shown above.
Option Name | Option Short Code |
---|---|
URL | p.OPT.URL |
Mention | p.OPT.MENTION |
Hashtag | p.OPT.HASHTAG |
Reserved Words | p.OPT.RESERVED |
Emoji | p.OPT.EMOJI |
Smiley | p.OPT.SMILEY |
Number | p.OPT.NUMBER |
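
The effect of restricting options can be sketched with a small standard-library example. This is a minimal illustration under stated assumptions, not how the library works internally: the `PATTERNS` table, the option names used as dictionary keys, and this `clean` function are hypothetical, and the regular expressions are much simpler than what real tweets require.

```python
import re

# Simplified patterns, assumed for illustration only.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "HASHTAG": re.compile(r"#\w+"),
    "MENTION": re.compile(r"@\w+"),
}


def clean(text, options=None):
    """Remove every entity whose option is enabled (all if options is None)."""
    enabled = list(PATTERNS) if options is None else options
    for name in enabled:
        text = PATTERNS[name].sub("", text)
    # Collapse the whitespace left behind by removed entities.
    return " ".join(text.split())


tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
print(clean(tweet))                   # Preprocessor is
print(clean(tweet, options=["URL"]))  # Preprocessor is #awesome
```

With no options given, every pattern is applied; with only the URL option, the hashtag survives, which mirrors the behaviour shown in the set_options example.
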
Install using pip:
$ pip install tweet-preprocessor