-
Notifications
You must be signed in to change notification settings - Fork 53
/
Tasks to do
33 lines (29 loc) · 1.55 KB
/
Tasks to do
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
TODO:
- Phrases tagging and replacing the words – ex. Well done becomes well_done. PPDB is a good source for the same
- Named Entity Recognition
- NNP Phrases tagging , ex. International Institute of Information Technology
- Tagging item details before/after each streaming sentence, to maintain context. Like for Item X, the review may be ‘I loved it.’ This converts to <Item X> I loved it </Item X>
- Filters for sentiment, weighting outcomes by sentiment.
- Gender Detection
- Associate the headers description with pricing, reviews and embeddings
A. PRE-PROCESSING:
1. No reviews less than 10 words – find how much is getting cleaned up to decide the threshold
2. Create price ranges – 10, 100, 250, 500, 1000, 2000, 5000, 99999
3. Tag <PRICE> <CATEGORY> <PRODUCT ID> Title + ‘.’ + Review <PRODUCT ID> <CATEGORY> <PRICE>
B. TOKENIZER:
1. Phrase & Idioms Tagger - Scrape & load to Redis : bottom line - > bottom_line
2. Proper Noun Tagger: Donald Trump - > Donald_Trump
C. NAMED ENTITY TAGGER:
1. Pre-process the Wiki Titles to load into Redis from enwiki-latest-all-titles.gz (Download from https://dumps.wikimedia.org/enwiki/latest/):
- remove "_" and keep " "
- remove words between ()
- remove gibberish
- remove words with exclamation marks
- split on comma and make into different words
- remove words starting with small letters
- remove duplicates
2. Create NNP Parser to identify nouns
D. RECORD METADATA
1. Number of reviews processed
2. Exceptions caught
3. Total time taken