Batch vs. Streaming Classification - "Using the Model" #11

Open
jduprey opened this issue Apr 23, 2012 · 1 comment


jduprey commented Apr 23, 2012

Hello,

I wasn't sure which was the best forum for this issue/question: the Yahoo Groups or here. Issues seem to get more activity than the groups. (I've cross-posted: http://tech.groups.yahoo.com/group/y_lda/message/15)

I'm a total newbie to LDA, so please forgive me if I don't quite formulate this
question concisely.

The single machine instructions for "Using the Model"
(/Yahoo_LDA/docs/html/single__machine__usage.html#using_model) indicate that
you can run in either batch OR streaming mode.

In batch mode, the output is several files: lda.docToTop.txt, lda.topToWor.txt,
and lda.worToTop.txt.

lda.docToTop.txt is what I want: document-topic assignments.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (65,0.138889)
(54,0.111111) (9,0.0833333) (21,0.0833333) (27,0.0833333) (87,0.0833333)
(29,0.0555556) (52,0.0555556) (56,0.0555556) (72,0.0555556)

However, streaming mode seems to return per-document word-topic assignments,
similar to batch mode's lda.worToTop.txt.
e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,87)
(past,87) (months,72) (noticed,21) (guy,52) (surf,27) (magazine,87)
(published,10) (finally,21) (run,21) (copyright,54) (surfboards,27) (rights,54)
(reserved,54) (june,72) (launches,73) (improved,9) (site,54) (order,73)
(custom,56) (surfboards,27) (online,52) (improvements,9) (top,9) (selling,6)
(models,29) (middot,65) (rocket,44) (fish,56) (middot,65) (speed,65) (egg,95)
(middot,65) (classic,29) (middot,65) (squash,55)

Can I make streaming mode return doc - topic assignments?

If not, can I easily compute the doc-topic assignments from the doc word-topic
assignment output?

I would like to call the streaming mode from a Java process.

Please help. :)

Thanks!
-John


jduprey commented Apr 24, 2012

I found the logic in the batch mode that reports doc-topic:
void Unigram_Model_Training_Builder::create_output()

Basically, the doc-topic assignments are computed from the word-topic assignments: each topic's score is the number of words in the document assigned to that topic, divided by the total number of words in the document:
topicCount / totalNumWordsInDoc
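
For anyone who would rather not patch the C++, the same ratio can probably be computed on the Java side from the unmodified streaming output. This is only a minimal sketch: the class name `DocTopicSketch` and the method `topicProportions` are made up for illustration and are not part of Yahoo_LDA, and it assumes each streamed line carries `(word,topic)` pairs in the format shown in the first post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Yahoo_LDA.
final class DocTopicSketch {
    // Matches one "(word,topic)" pair as printed by streaming mode.
    private static final Pattern PAIR = Pattern.compile("\\(([^,()\\s]+),(\\d+)\\)");

    /** Returns topic -> topicCount / totalNumWordsInDoc for one streamed line. */
    static Map<Integer, Double> topicProportions(String line) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            // Count one more word assigned to this topic.
            counts.merge(Integer.parseInt(m.group(2)), 1, Integer::sum);
            total++;
        }
        Map<Integer, Double> proportions = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            proportions.put(e.getKey(), e.getValue() / (double) total);
        }
        return proportions; // empty when the line has no (word,topic) pairs
    }

    public static void main(String[] args) {
        // Shortened sample line; doc id and category are placeholders.
        String line = "doc1 some/category (surf,27) (middot,65) (surfboards,27) (site,54)";
        System.out.println(topicProportions(line)); // prints {27=0.5, 65=0.25, 54=0.25}
    }
}
```

As a sanity check, the streaming example in the first post has 36 word-topic pairs, 5 of them on topic 65, and 5/36 = 0.138889 matches the top entry of the lda.docToTop.txt line for the same document.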

The logic responsible for returning results in streaming mode is
void Unigram_Model_Streamer::write(void* token)

I added the logic from create_output() to the Unigram_Model_Streamer::write() method, and now it returns [doc-topic,score] [doc-topic,score] ... || (word,topic) (word,topic) ...

e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports www.sauritchsurfboards.com/ recreation/sports/aquatic_sports [3,0.0555556] [9,0.0555556] [12,0.0555556] [40,0.0555556] [78,0.0555556] [33,0.0555556] [58,0.0277778] [60,0.0277778] [65,0.0277778] [67,0.0277778] || (watch,7) (past,49) (months,73) (noticed,58) (guy,30) (surf,72) (magazine,44) (published,78) (finally,23) (run,9) (copyright,40) (surfboards,65) (rights,92) (reserved,42) (june,87) (launches,3) (improved,27) (site,29) (order,40) (custom,12) (surfboards,3) (online,69) (improvements,9) (top,57) (selling,60) (models,33) (middot,99) (rocket,78) (fish,16) (middot,35) (speed,97) (egg,26) (middot,12) (classic,67) (middot,10) (squash,33)
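
A Java caller could pull the doc-topic scores out of a line in this modified format roughly as follows. This is a sketch under assumptions: `CombinedLineSketch` and `topicScores` are hypothetical names, and it assumes the `[topic,score]` entries always come before the `||` separator as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for the modified streamer output; not part of Yahoo_LDA.
final class CombinedLineSketch {
    // Matches one "[topic,score]" entry; the score may use exponent notation.
    private static final Pattern ENTRY =
            Pattern.compile("\\[(\\d+),(\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?)\\]");

    /** Returns topic -> score from the part of the line before "||". */
    static Map<Integer, Double> topicScores(String line) {
        int sep = line.indexOf("||");
        // Without a separator, scan the whole line.
        String head = sep >= 0 ? line.substring(0, sep) : line;
        Map<Integer, Double> scores = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(head);
        while (m.find()) {
            scores.put(Integer.parseInt(m.group(1)), Double.parseDouble(m.group(2)));
        }
        return scores;
    }
}
```

The map keeps the entries in the order they appear on the line, so any ordering the streamer applies is preserved.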

Does anyone see an error in this logic?

If anyone is interested, I can post the code.
