Skip to content

doraemoncito/naivebayes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A simple Naive Bayes classifier written in Java

Build with:

mvn clean install

and run with

./target/naivebayes-1.0-SNAPSHOT-spring-boot.jar --newsgroups ../src/test/resources/mini_newsgroups --use-stemmer --stop-words ../src/main/resources/stop_words.txt --prune-level 3

A sample run on the supplied test data will yield results along these lines:

Shuffling samples...
Running fold 01
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 9066 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 34175 (1536  distinct words)
Class 'rec.autos': |docsj| = 89, P(vj) = 0.0494, n = 12769 (904  distinct words)
Class 'talk.religion.misc': |docsj| = 91, P(vj) = 0.0506, n = 23916 (1325  distinct words)
Class 'comp.windows.x': |docsj| = 93, P(vj) = 0.0517, n = 23168 (1229  distinct words)
Class 'rec.sport.baseball': |docsj| = 90, P(vj) = 0.0500, n = 15102 (979  distinct words)
Class 'comp.graphics': |docsj| = 93, P(vj) = 0.0517, n = 33930 (1699  distinct words)
Class 'talk.politics.mideast': |docsj| = 90, P(vj) = 0.0500, n = 27424 (1553  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 89, P(vj) = 0.0494, n = 12494 (825  distinct words)
Class 'sci.med': |docsj| = 94, P(vj) = 0.0522, n = 30483 (1740  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 89, P(vj) = 0.0494, n = 11228 (780  distinct words)
Class 'sci.crypt': |docsj| = 92, P(vj) = 0.0511, n = 22816 (1262  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 90, P(vj) = 0.0500, n = 13269 (884  distinct words)
Class 'misc.forsale': |docsj| = 86, P(vj) = 0.0478, n = 5505 (576  distinct words)
Class 'rec.motorcycles': |docsj| = 90, P(vj) = 0.0500, n = 12555 (949  distinct words)
Class 'talk.politics.misc': |docsj| = 89, P(vj) = 0.0494, n = 30792 (1528  distinct words)
Class 'sci.electronics': |docsj| = 86, P(vj) = 0.0478, n = 24285 (1310  distinct words)
Class 'rec.sport.hockey': |docsj| = 90, P(vj) = 0.0500, n = 15775 (1080  distinct words)
Class 'sci.space': |docsj| = 94, P(vj) = 0.0522, n = 25004 (1595  distinct words)
Class 'alt.atheism': |docsj| = 92, P(vj) = 0.0511, n = 20807 (1163  distinct words)
Class 'talk.politics.guns': |docsj| = 82, P(vj) = 0.0456, n = 21029 (1315  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[08] 00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [08] 00  00  00  00  00  00  00  00  00  00  00  00  01  01  00  00  00  01  | 02 rec.autos
 01  00 [04] 00  00  00  01  00  00  00  00  00  00  00  00  00  00  00  01  02  | 03 talk.religion.misc
 00  00  00 [02] 00  03  00  00  00  01  00  00  00  00  00  01  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [08] 00  00  00  00  00  00  00  00  01  01  00  00  00  00  00  | 05 rec.sport.baseball
 00  01  00  00  00 [06] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 06 comp.graphics
 01  00  00  00  00  00 [09] 00  00  00  00  00  00  00  00  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  02  00 [08] 00  00  00  00  00  00  00  00  00  00  01  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [05] 00  00  00  00  00  01  00  00  00  00  00  | 09 sci.med
 00  00  00  03  00  03  00  01  00 [03] 00  00  01  00  00  00  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  00  00  00  00  00 [08] 00  00  00  00  00  00  00  00  00  | 11 sci.crypt
 00  00  00  00  00  01  00  00  00  00  00 [07] 01  00  00  01  00  00  00  00  | 12 comp.sys.mac.hardware
 00  00  00  00  00  00  00  02  00  00  00  02 [06] 00  02  02  00  00  00  00  | 13 misc.forsale
 00  00  00  00  00  00  00  00  00  00  00  00  00 [07] 02  00  00  00  01  00  | 14 rec.motorcycles
 01  00  02  00  00  00  02  00  00  00  00  00  00  00 [06] 00  00  00  00  00  | 15 talk.politics.misc
 00  00  00  00  00  02  00  01  01  00  00  00  00  00  00 [10] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  00  00  00  00  01  00  00  00  00  00  00  00 [09] 00  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  01  00  00  00  00  00  00  00  00  00  00  00 [05] 00  00  | 18 sci.space
 00  00  04  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [04] 00  | 19 alt.atheism
 00  00  01  00  00  00  00  00  00  00  01  00  00  01  07  00  00  00  00 [08] | 20 talk.politics.guns

Correctly classified instances: 131 (65.5%)

Running fold 02
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 9078 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 32318 (1465  distinct words)
Class 'rec.autos': |docsj| = 85, P(vj) = 0.0472, n = 12651 (906  distinct words)
Class 'talk.religion.misc': |docsj| = 88, P(vj) = 0.0489, n = 22746 (1289  distinct words)
Class 'comp.windows.x': |docsj| = 89, P(vj) = 0.0494, n = 22663 (1232  distinct words)
Class 'rec.sport.baseball': |docsj| = 83, P(vj) = 0.0461, n = 12848 (884  distinct words)
Class 'comp.graphics': |docsj| = 91, P(vj) = 0.0506, n = 33192 (1662  distinct words)
Class 'talk.politics.mideast': |docsj| = 87, P(vj) = 0.0483, n = 23921 (1449  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 91, P(vj) = 0.0506, n = 12806 (821  distinct words)
Class 'sci.med': |docsj| = 92, P(vj) = 0.0511, n = 30003 (1696  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 92, P(vj) = 0.0511, n = 12863 (849  distinct words)
Class 'sci.crypt': |docsj| = 92, P(vj) = 0.0511, n = 23335 (1264  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 91, P(vj) = 0.0506, n = 14425 (925  distinct words)
Class 'misc.forsale': |docsj| = 92, P(vj) = 0.0511, n = 7013 (682  distinct words)
Class 'rec.motorcycles': |docsj| = 95, P(vj) = 0.0528, n = 13269 (986  distinct words)
Class 'talk.politics.misc': |docsj| = 89, P(vj) = 0.0494, n = 31665 (1583  distinct words)
Class 'sci.electronics': |docsj| = 91, P(vj) = 0.0506, n = 26110 (1372  distinct words)
Class 'rec.sport.hockey': |docsj| = 90, P(vj) = 0.0500, n = 16620 (1095  distinct words)
Class 'sci.space': |docsj| = 91, P(vj) = 0.0506, n = 25088 (1603  distinct words)
Class 'alt.atheism': |docsj| = 89, P(vj) = 0.0494, n = 19173 (1095  distinct words)
Class 'talk.politics.guns': |docsj| = 91, P(vj) = 0.0506, n = 22567 (1343  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[09] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [10] 00  00  00  00  00  00  00  00  01  00  00  01  01  01  00  00  00  01  | 02 rec.autos
 00  00 [05] 00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  02  03  | 03 talk.religion.misc
 00  00  00 [04] 00  05  00  00  00  00  00  00  00  00  01  01  00  00  00  00  | 04 comp.windows.x
 01  00  00  00 [13] 00  00  00  00  00  00  00  00  00  02  00  00  00  00  01  | 05 rec.sport.baseball
 00  00  00  00  00 [07] 00  00  01  00  01  00  00  00  00  00  00  00  00  00  | 06 comp.graphics
 00  00  00  00  00  00 [11] 00  00  00  00  00  00  00  02  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  00  00 [06] 00  00  00  03  00  00  00  00  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [08] 00  00  00  00  00  00  00  00  00  00  00  | 09 sci.med
 00  00  00  01  00  01  00  03  00 [01] 00  00  00  00  00  02  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  02  00  00  00  00  00 [05] 00  00  00  00  00  00  00  00  01  | 11 sci.crypt
 00  00  00  00  00  01  00  01  00  00  00 [06] 00  00  01  00  00  00  00  00  | 12 comp.sys.mac.hardware
 00  00  00  01  00  01  00  01  00  00  00  01 [04] 00  00  00  00  00  00  00  | 13 misc.forsale
 00  00  00  00  00  00  00  00  00  01  00  00  00 [04] 00  00  00  00  00  00  | 14 rec.motorcycles
 00  00  01  00  00  00  01  00  01  00  00  00  00  00 [06] 00  00  00  00  02  | 15 talk.politics.misc
 00  00  00  00  00  01  00  01  00  00  00  01  00  01  00 [05] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00 [09] 00  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [09] 00  00  | 18 sci.space
 03  00  04  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [04] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  01  00  00  00  01  00  00  00  00 [07] | 20 talk.politics.guns

Correctly classified instances: 133 (66.5%)

Running fold 03
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8875 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 33948 (1523  distinct words)
Class 'rec.autos': |docsj| = 89, P(vj) = 0.0494, n = 12257 (868  distinct words)
Class 'talk.religion.misc': |docsj| = 91, P(vj) = 0.0506, n = 22847 (1300  distinct words)
Class 'comp.windows.x': |docsj| = 91, P(vj) = 0.0506, n = 22753 (1215  distinct words)
Class 'rec.sport.baseball': |docsj| = 91, P(vj) = 0.0506, n = 15096 (993  distinct words)
Class 'comp.graphics': |docsj| = 89, P(vj) = 0.0494, n = 33244 (1683  distinct words)
Class 'talk.politics.mideast': |docsj| = 89, P(vj) = 0.0494, n = 25464 (1465  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 92, P(vj) = 0.0511, n = 13052 (832  distinct words)
Class 'sci.med': |docsj| = 84, P(vj) = 0.0467, n = 27051 (1596  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 91, P(vj) = 0.0506, n = 12016 (820  distinct words)
Class 'sci.crypt': |docsj| = 90, P(vj) = 0.0500, n = 21914 (1224  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 91, P(vj) = 0.0506, n = 14388 (899  distinct words)
Class 'misc.forsale': |docsj| = 96, P(vj) = 0.0533, n = 7250 (706  distinct words)
Class 'rec.motorcycles': |docsj| = 89, P(vj) = 0.0494, n = 12043 (915  distinct words)
Class 'talk.politics.misc': |docsj| = 93, P(vj) = 0.0517, n = 31674 (1575  distinct words)
Class 'sci.electronics': |docsj| = 87, P(vj) = 0.0483, n = 24098 (1313  distinct words)
Class 'rec.sport.hockey': |docsj| = 93, P(vj) = 0.0517, n = 16377 (1097  distinct words)
Class 'sci.space': |docsj| = 88, P(vj) = 0.0489, n = 23223 (1487  distinct words)
Class 'alt.atheism': |docsj| = 90, P(vj) = 0.0500, n = 18143 (1079  distinct words)
Class 'talk.politics.guns': |docsj| = 85, P(vj) = 0.0472, n = 18482 (1174  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[08] 00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  00  | 01 soc.religion.christian
 01 [05] 00  00  00  00  00  00  00  00  01  00  00  01  02  00  00  00  00  01  | 02 rec.autos
 02  00 [05] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  01  00  | 03 talk.religion.misc
 00  00  00 [03] 00  03  00  00  01  00  00  00  00  00  00  02  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [07] 00  01  01  00  00  00  00  00  00  00  00  00  00  00  00  | 05 rec.sport.baseball
 00  00  00  00  00 [08] 00  00  01  01  00  00  00  00  00  00  00  00  00  01  | 06 comp.graphics
 00  00  00  00  00  00 [10] 00  00  00  00  00  00  00  00  00  00  00  00  01  | 07 talk.politics.mideast
 00  00  00  01  00  01  00 [02] 00  01  00  02  00  00  00  01  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [15] 00  00  00  00  00  00  01  00  00  00  00  | 09 sci.med
 00  00  00  02  00  01  00  01  00 [03] 00  00  00  00  01  01  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  00  00  00  00  00 [08] 00  00  00  01  00  00  00  00  01  | 11 sci.crypt
 00  00  00  00  00  02  00  00  00  00  00 [05] 00  00  00  01  00  01  00  00  | 12 comp.sys.mac.hardware
 00  00  00  00  00  01  00  00  00  00  00  00 [03] 00  00  00  00  00  00  00  | 13 misc.forsale
 00  01  00  00  00  02  00  00  00  00  00  00  00 [08] 00  00  00  00  00  00  | 14 rec.motorcycles
 00  00  01  00  00  00  00  00  00  00  00  00  00  00 [06] 00  00  00  00  00  | 15 talk.politics.misc
 00  00  00  00  00  00  00  00  00  00  01  01  00  00  00 [11] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [07] 00  00  00  | 17 rec.sport.hockey
 00  00  01  00  00  00  00  00  01  00  00  00  00  00  01  00  00 [09] 00  00  | 18 sci.space
 00  00  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [07] 01  | 19 alt.atheism
 00  00  02  00  00  00  00  00  00  00  01  00  00  00  02  00  00  00  00 [10] | 20 talk.politics.guns

Correctly classified instances: 140 (70.0%)

Running fold 04
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8960 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 33410 (1480  distinct words)
Class 'rec.autos': |docsj| = 94, P(vj) = 0.0522, n = 14177 (968  distinct words)
Class 'talk.religion.misc': |docsj| = 91, P(vj) = 0.0506, n = 23203 (1309  distinct words)
Class 'comp.windows.x': |docsj| = 92, P(vj) = 0.0511, n = 23098 (1236  distinct words)
Class 'rec.sport.baseball': |docsj| = 92, P(vj) = 0.0511, n = 16383 (1021  distinct words)
Class 'comp.graphics': |docsj| = 89, P(vj) = 0.0494, n = 32849 (1656  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 26213 (1510  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 94, P(vj) = 0.0522, n = 13144 (838  distinct words)
Class 'sci.med': |docsj| = 88, P(vj) = 0.0489, n = 28875 (1684  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 90, P(vj) = 0.0500, n = 12142 (818  distinct words)
Class 'sci.crypt': |docsj| = 86, P(vj) = 0.0478, n = 18121 (1112  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 92, P(vj) = 0.0511, n = 14433 (917  distinct words)
Class 'misc.forsale': |docsj| = 91, P(vj) = 0.0506, n = 6736 (662  distinct words)
Class 'rec.motorcycles': |docsj| = 85, P(vj) = 0.0472, n = 11696 (887  distinct words)
Class 'talk.politics.misc': |docsj| = 90, P(vj) = 0.0500, n = 21664 (1332  distinct words)
Class 'sci.electronics': |docsj| = 94, P(vj) = 0.0522, n = 26488 (1387  distinct words)
Class 'rec.sport.hockey': |docsj| = 88, P(vj) = 0.0489, n = 16023 (1068  distinct words)
Class 'sci.space': |docsj| = 82, P(vj) = 0.0456, n = 22045 (1488  distinct words)
Class 'alt.atheism': |docsj| = 90, P(vj) = 0.0500, n = 20682 (1154  distinct words)
Class 'talk.politics.guns': |docsj| = 90, P(vj) = 0.0500, n = 21274 (1298  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[08] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  | 01 soc.religion.christian
 00 [04] 00  00  00  00  00  00  00  00  00  01  00  00  00  01  00  00  00  00  | 02 rec.autos
 04  00 [01] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  02  01  | 03 talk.religion.misc
 00  00  00 [04] 00  02  00  00  00  00  01  00  00  00  00  00  00  01  00  00  | 04 comp.windows.x
 00  00  00  00 [08] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 05 rec.sport.baseball
 00  00  00  00  00 [09] 00  00  01  00  00  00  00  00  00  01  00  00  00  00  | 06 comp.graphics
 00  00  01  00  01  00 [07] 00  00  00  00  00  00  00  00  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  00  00 [02] 00  00  00  03  00  00  00  01  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [11] 00  00  00  00  00  00  00  00  00  00  01  | 09 sci.med
 01  00  00  02  00  00  00  02  00 [02] 00  01  00  00  00  01  00  00  00  01  | 10 comp.os.ms-windows.misc
 00  00  02  00  00  00  00  00  00  00 [11] 00  01  00  00  00  00  00  00  00  | 11 sci.crypt
 00  00  00  00  00  01  00  00  00  00  00 [06] 00  00  00  01  00  00  00  00  | 12 comp.sys.mac.hardware
 00  01  00  00  00  00  00  01  00  00  00  00 [04] 02  00  01  00  00  00  00  | 13 misc.forsale
 02  01  01  00  00  00  00  00  00  00  00  00  00 [10] 00  01  00  00  00  00  | 14 rec.motorcycles
 01  00  02  00  00  00  00  00  00  00  00  00  00  00 [07] 00  00  00  00  00  | 15 talk.politics.misc
 00  00  00  00  00  00  00  01  00  00  00  01  00  00  00 [04] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  01  00  00  00  00  00  00  00  00  00  00  00 [11] 00  00  00  | 17 rec.sport.hockey
 01  00  01  00  00  00  00  01  00  00  00  00  00  00  02  01  00 [12] 00  00  | 18 sci.space
 03  01  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [04] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  00 [08] | 20 talk.politics.guns

Correctly classified instances: 133 (66.5%)

Running fold 05
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8905 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 89, P(vj) = 0.0494, n = 32484 (1473  distinct words)
Class 'rec.autos': |docsj| = 92, P(vj) = 0.0511, n = 13990 (957  distinct words)
Class 'talk.religion.misc': |docsj| = 89, P(vj) = 0.0494, n = 21843 (1264  distinct words)
Class 'comp.windows.x': |docsj| = 89, P(vj) = 0.0494, n = 21633 (1183  distinct words)
Class 'rec.sport.baseball': |docsj| = 89, P(vj) = 0.0494, n = 15932 (998  distinct words)
Class 'comp.graphics': |docsj| = 93, P(vj) = 0.0517, n = 34285 (1693  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 27991 (1572  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 92, P(vj) = 0.0511, n = 12927 (844  distinct words)
Class 'sci.med': |docsj| = 88, P(vj) = 0.0489, n = 24416 (1526  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 89, P(vj) = 0.0494, n = 11869 (807  distinct words)
Class 'sci.crypt': |docsj| = 87, P(vj) = 0.0483, n = 19920 (1152  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 92, P(vj) = 0.0511, n = 13389 (868  distinct words)
Class 'misc.forsale': |docsj| = 83, P(vj) = 0.0461, n = 5799 (576  distinct words)
Class 'rec.motorcycles': |docsj| = 92, P(vj) = 0.0511, n = 12545 (938  distinct words)
Class 'talk.politics.misc': |docsj| = 89, P(vj) = 0.0494, n = 31320 (1566  distinct words)
Class 'sci.electronics': |docsj| = 91, P(vj) = 0.0506, n = 25566 (1355  distinct words)
Class 'rec.sport.hockey': |docsj| = 89, P(vj) = 0.0494, n = 14150 (947  distinct words)
Class 'sci.space': |docsj| = 94, P(vj) = 0.0522, n = 24070 (1559  distinct words)
Class 'alt.atheism': |docsj| = 94, P(vj) = 0.0522, n = 20497 (1155  distinct words)
Class 'talk.politics.guns': |docsj| = 88, P(vj) = 0.0489, n = 21465 (1316  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[08] 00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  01  01  | 01 soc.religion.christian
 01 [07] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 02 rec.autos
 04  00 [02] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  02  02  | 03 talk.religion.misc
 00  00  01 [04] 00  05  00  00  01  00  00  00  00  00  00  00  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [09] 00  00  00  00  00  00  00  00  01  01  00  00  00  00  00  | 05 rec.sport.baseball
 01  00  00  02  00 [02] 00  00  00  00  00  01  00  00  00  01  00  00  00  00  | 06 comp.graphics
 01  00  00  00  01  01 [04] 00  00  00  00  00  00  00  02  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  02  00 [05] 00  00  00  01  00  00  00  00  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 01  00  00  00  00  00  00  00 [09] 00  00  00  00  00  00  00  00  00  00  02  | 09 sci.med
 00  00  00  01  00  04  00  02  00 [04] 00  00  00  00  00  00  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  01  00  00  00  00 [08] 00  00  00  02  01  00  00  00  01  | 11 sci.crypt
 00  00  00  00  00  01  00  01  00  01  00 [03] 00  00  01  01  00  00  00  00  | 12 comp.sys.mac.hardware
 00  02  00  00  00  02  00  00  00  01  00  01 [06] 01  00  02  00  02  00  00  | 13 misc.forsale
 00  01  00  00  00  00  00  00  00  00  00  00  00 [04] 02  00  00  00  01  00  | 14 rec.motorcycles
 02  00  01  00  00  00  00  00  00  00  00  00  00  00 [06] 00  00  00  00  02  | 15 talk.politics.misc
 00  01  00  00  00  01  00  00  00  00  00  01  00  00  01 [05] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  01  00  00  00  00  00  00  00  00  00  01  00 [08] 00  01  00  | 17 rec.sport.hockey
 00  00  00  00  00  01  00  00  00  00  00  00  00  00  02  00  00 [03] 00  00  | 18 sci.space
 00  00  01  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [05] 00  | 19 alt.atheism
 02  00  02  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  00 [07] | 20 talk.politics.guns

Correctly classified instances: 109 (54.5%)

Running fold 06
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8999 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 86, P(vj) = 0.0478, n = 33336 (1516  distinct words)
Class 'rec.autos': |docsj| = 93, P(vj) = 0.0517, n = 13555 (952  distinct words)
Class 'talk.religion.misc': |docsj| = 95, P(vj) = 0.0528, n = 23303 (1304  distinct words)
Class 'comp.windows.x': |docsj| = 89, P(vj) = 0.0494, n = 22505 (1216  distinct words)
Class 'rec.sport.baseball': |docsj| = 94, P(vj) = 0.0522, n = 16217 (1009  distinct words)
Class 'comp.graphics': |docsj| = 89, P(vj) = 0.0494, n = 33892 (1698  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 26641 (1505  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 85, P(vj) = 0.0472, n = 10680 (728  distinct words)
Class 'sci.med': |docsj| = 90, P(vj) = 0.0500, n = 29686 (1697  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 92, P(vj) = 0.0511, n = 12376 (840  distinct words)
Class 'sci.crypt': |docsj| = 94, P(vj) = 0.0522, n = 23553 (1281  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 84, P(vj) = 0.0467, n = 12430 (823  distinct words)
Class 'misc.forsale': |docsj| = 89, P(vj) = 0.0494, n = 6726 (666  distinct words)
Class 'rec.motorcycles': |docsj| = 92, P(vj) = 0.0511, n = 12519 (947  distinct words)
Class 'talk.politics.misc': |docsj| = 90, P(vj) = 0.0500, n = 31203 (1522  distinct words)
Class 'sci.electronics': |docsj| = 87, P(vj) = 0.0483, n = 24565 (1318  distinct words)
Class 'rec.sport.hockey': |docsj| = 90, P(vj) = 0.0500, n = 16460 (1077  distinct words)
Class 'sci.space': |docsj| = 88, P(vj) = 0.0489, n = 18553 (1330  distinct words)
Class 'alt.atheism': |docsj| = 92, P(vj) = 0.0511, n = 20152 (1129  distinct words)
Class 'talk.politics.guns': |docsj| = 90, P(vj) = 0.0500, n = 22003 (1334  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[10] 00  01  00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  01  00  | 01 soc.religion.christian
 00 [05] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  00  01  | 02 rec.autos
 00  00 [03] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  01  | 03 talk.religion.misc
 00  00  00 [07] 00  04  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 04 comp.windows.x
 00  00  01  00 [05] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 05 rec.sport.baseball
 00  00  00  01  00 [08] 00  01  00  00  00  00  00  00  00  01  00  00  00  00  | 06 comp.graphics
 01  00  00  00  00  00 [07] 00  00  00  00  00  00  00  01  00  00  00  00  00  | 07 talk.politics.mideast
 01  00  00  01  00  00  00 [09] 00  00  00  01  01  00  00  02  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 02  00  00  00  00  00  00  00 [08] 00  00  00  00  00  00  00  00  00  00  00  | 09 sci.med
 00  00  00  00  00  03  00  01  00 [03] 00  00  00  00  01  00  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  00  00  00  00  00 [06] 00  00  00  00  00  00  00  00  00  | 11 sci.crypt
 00  00  00  01  00  02  00  01  00  00  00 [09] 00  00  00  03  00  00  00  00  | 12 comp.sys.mac.hardware
 00  00  00  00  01  01  00  01  00  00  00  00 [06] 00  00  00  00  01  01  00  | 13 misc.forsale
 00  00  00  00  00  01  00  00  00  00  00  00  00 [07] 00  00  00  00  00  00  | 14 rec.motorcycles
 01  00  00  00  01  00  00  00  00  00  00  00  00  00 [04] 00  00  00  00  04  | 15 talk.politics.misc
 00  01  00  00  00  02  00  03  00  00  00  00  00  00  00 [07] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [10] 00  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  01  00 [08] 00  02  | 18 sci.space
 00  00  02  00  00  00  01  00  00  00  00  00  00  00  01  00  00  00 [04] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  04  00  00  00  00 [06] | 20 talk.politics.guns

Correctly classified instances: 132 (66.0%)

Running fold 07
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8885 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 90, P(vj) = 0.0500, n = 34307 (1533  distinct words)
Class 'rec.autos': |docsj| = 91, P(vj) = 0.0506, n = 13763 (961  distinct words)
Class 'talk.religion.misc': |docsj| = 87, P(vj) = 0.0483, n = 22072 (1256  distinct words)
Class 'comp.windows.x': |docsj| = 89, P(vj) = 0.0494, n = 22480 (1222  distinct words)
Class 'rec.sport.baseball': |docsj| = 89, P(vj) = 0.0494, n = 14576 (971  distinct words)
Class 'comp.graphics': |docsj| = 88, P(vj) = 0.0489, n = 26250 (1424  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 25439 (1478  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 89, P(vj) = 0.0494, n = 12190 (804  distinct words)
Class 'sci.med': |docsj| = 88, P(vj) = 0.0489, n = 25012 (1477  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 94, P(vj) = 0.0522, n = 13017 (857  distinct words)
Class 'sci.crypt': |docsj| = 81, P(vj) = 0.0450, n = 20399 (1156  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 91, P(vj) = 0.0506, n = 14320 (910  distinct words)
Class 'misc.forsale': |docsj| = 90, P(vj) = 0.0500, n = 6589 (648  distinct words)
Class 'rec.motorcycles': |docsj| = 89, P(vj) = 0.0494, n = 12193 (951  distinct words)
Class 'talk.politics.misc': |docsj| = 91, P(vj) = 0.0506, n = 29130 (1483  distinct words)
Class 'sci.electronics': |docsj| = 94, P(vj) = 0.0522, n = 26409 (1381  distinct words)
Class 'rec.sport.hockey': |docsj| = 92, P(vj) = 0.0511, n = 17175 (1122  distinct words)
Class 'sci.space': |docsj| = 88, P(vj) = 0.0489, n = 25111 (1597  distinct words)
Class 'alt.atheism': |docsj| = 94, P(vj) = 0.0522, n = 20387 (1160  distinct words)
Class 'talk.politics.guns': |docsj| = 94, P(vj) = 0.0522, n = 23437 (1379  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[10] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [05] 00  00  00  00  00  00  01  00  00  01  00  01  00  00  00  00  00  01  | 02 rec.autos
 02  00 [00] 00  00  00  00  00  00  00  00  00  00  01  01  01  00  00  07  01  | 03 talk.religion.misc
 00  00  00 [10] 00  01  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [09] 00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  | 05 rec.sport.baseball
 00  00  00  00  00 [11] 00  00  00  00  00  00  00  00  00  00  00  00  01  00  | 06 comp.graphics
 01  00  00  00  00  00 [06] 00  00  00  00  00  00  00  02  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  01  00 [07] 00  01  00  01  00  00  00  01  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [11] 00  00  00  00  00  00  00  00  00  01  00  | 09 sci.med
 00  00  00  00  00  02  00  01  00 [01] 00  01  00  00  00  01  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  01  00  00  01  00 [11] 01  00  00  02  01  00  00  00  02  | 11 sci.crypt
 00  00  00  00  00  01  00  00  00  00  01 [06] 00  00  00  01  00  00  00  00  | 12 comp.sys.mac.hardware
 00  00  00  00  00  01  01  01  00  00  00  00 [04] 00  00  02  00  01  00  00  | 13 misc.forsale
 00  00  00  00  01  00  01  00  00  00  00  00  00 [08] 01  00  00  00  00  00  | 14 rec.motorcycles
 00  01  00  00  00  00  00  00  00  00  00  00  00  00 [05] 00  00  00  00  03  | 15 talk.politics.misc
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  01 [04] 00  00  00  01  | 16 sci.electronics
 00  00  00  00  01  00  00  00  00  00  00  00  00  00  00  00 [07] 00  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00 [11] 00  00  | 18 sci.space
 02  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00 [03] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  01 [04] | 20 talk.politics.guns

Correctly classified instances: 133 (66.5%)

Running fold 08
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8882 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 34548 (1541  distinct words)
Class 'rec.autos': |docsj| = 85, P(vj) = 0.0472, n = 12745 (903  distinct words)
Class 'talk.religion.misc': |docsj| = 93, P(vj) = 0.0517, n = 22628 (1290  distinct words)
Class 'comp.windows.x': |docsj| = 87, P(vj) = 0.0483, n = 22503 (1227  distinct words)
Class 'rec.sport.baseball': |docsj| = 92, P(vj) = 0.0511, n = 15865 (1000  distinct words)
Class 'comp.graphics': |docsj| = 89, P(vj) = 0.0494, n = 27774 (1449  distinct words)
Class 'talk.politics.mideast': |docsj| = 88, P(vj) = 0.0489, n = 26240 (1525  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 91, P(vj) = 0.0506, n = 12749 (825  distinct words)
Class 'sci.med': |docsj| = 95, P(vj) = 0.0528, n = 30225 (1712  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 84, P(vj) = 0.0467, n = 11803 (806  distinct words)
Class 'sci.crypt': |docsj| = 89, P(vj) = 0.0494, n = 22412 (1249  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 94, P(vj) = 0.0522, n = 14512 (919  distinct words)
Class 'misc.forsale': |docsj| = 92, P(vj) = 0.0511, n = 7015 (674  distinct words)
Class 'rec.motorcycles': |docsj| = 91, P(vj) = 0.0506, n = 12552 (951  distinct words)
Class 'talk.politics.misc': |docsj| = 88, P(vj) = 0.0489, n = 28129 (1442  distinct words)
Class 'sci.electronics': |docsj| = 84, P(vj) = 0.0467, n = 12976 (885  distinct words)
Class 'rec.sport.hockey': |docsj| = 93, P(vj) = 0.0517, n = 16511 (1095  distinct words)
Class 'sci.space': |docsj| = 94, P(vj) = 0.0522, n = 24567 (1592  distinct words)
Class 'alt.atheism': |docsj| = 87, P(vj) = 0.0483, n = 20198 (1115  distinct words)
Class 'talk.politics.guns': |docsj| = 93, P(vj) = 0.0517, n = 22721 (1371  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[08] 00  00  00  00  00  00  00  01  00  00  00  00  00  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [07] 00  00  01  00  00  00  00  00  00  00  00  00  03  02  00  00  00  02  | 02 rec.autos
 01  00 [02] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  02  01  | 03 talk.religion.misc
 00  00  00 [06] 00  06  00  00  00  00  00  00  00  00  00  00  00  00  01  00  | 04 comp.windows.x
 00  00  00  00 [06] 00  00  00  00  00  00  00  00  00  00  00  00  00  02  00  | 05 rec.sport.baseball
 00  00  00  00  00 [06] 00  00  00  03  00  00  00  00  00  01  00  01  00  00  | 06 comp.graphics
 00  00  00  00  00  00 [12] 00  00  00  00  00  00  00  00  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  00  00  02  00 [04] 00  01  00  02  00  00  00  00  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  01  00  00  00  00  00 [04] 00  00  00  00  00  00  00  00  00  00  00  | 09 sci.med
 00  00  01  00  00  05  00  03  00 [04] 00  02  00  00  00  01  00  00  00  00  | 10 comp.os.ms-windows.misc
 01  00  00  00  00  00  01  00  00  00 [09] 00  00  00  00  00  00  00  00  00  | 11 sci.crypt
 00  00  00  00  00  00  00  00  00  00  00 [06] 00  00  00  00  00  00  00  00  | 12 comp.sys.mac.hardware
 00  00  00  00  00  00  00  02  01  00  01  00 [03] 00  00  01  00  00  00  00  | 13 misc.forsale
 00  01  00  00  00  00  00  00  00  00  00  00  01 [07] 00  00  00  00  00  00  | 14 rec.motorcycles
 00  00  00  00  00  00  01  00  00  00  00  00  00  00 [09] 00  00  00  01  01  | 15 talk.politics.misc
 01  00  00  00  00  00  00  00  00  00  00  00  01  00  00 [11] 00  00  00  03  | 16 sci.electronics
 00  00  00  00  01  00  00  00  00  01  00  00  00  00  00  00 [05] 00  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00 [05] 00  00  | 18 sci.space
 02  00  06  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [05] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  00 [05] | 20 talk.politics.guns

Correctly classified instances: 124 (62.0%)

Running fold 09
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8993 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 91, P(vj) = 0.0506, n = 33583 (1524  distinct words)
Class 'rec.autos': |docsj| = 91, P(vj) = 0.0506, n = 13900 (972  distinct words)
Class 'talk.religion.misc': |docsj| = 86, P(vj) = 0.0478, n = 21395 (1239  distinct words)
Class 'comp.windows.x': |docsj| = 91, P(vj) = 0.0506, n = 22921 (1216  distinct words)
Class 'rec.sport.baseball': |docsj| = 91, P(vj) = 0.0506, n = 15877 (986  distinct words)
Class 'comp.graphics': |docsj| = 92, P(vj) = 0.0511, n = 24467 (1461  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 25814 (1493  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 87, P(vj) = 0.0483, n = 11548 (773  distinct words)
Class 'sci.med': |docsj| = 92, P(vj) = 0.0511, n = 27555 (1648  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 89, P(vj) = 0.0494, n = 12758 (863  distinct words)
Class 'sci.crypt': |docsj| = 96, P(vj) = 0.0533, n = 23049 (1256  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 89, P(vj) = 0.0494, n = 14471 (916  distinct words)
Class 'misc.forsale': |docsj| = 89, P(vj) = 0.0494, n = 6718 (657  distinct words)
Class 'rec.motorcycles': |docsj| = 90, P(vj) = 0.0500, n = 12367 (928  distinct words)
Class 'talk.politics.misc': |docsj| = 90, P(vj) = 0.0500, n = 31178 (1560  distinct words)
Class 'sci.electronics': |docsj| = 90, P(vj) = 0.0500, n = 25112 (1342  distinct words)
Class 'rec.sport.hockey': |docsj| = 86, P(vj) = 0.0478, n = 16241 (1087  distinct words)
Class 'sci.space': |docsj| = 88, P(vj) = 0.0489, n = 24151 (1571  distinct words)
Class 'alt.atheism': |docsj| = 88, P(vj) = 0.0489, n = 19983 (1143  distinct words)
Class 'talk.politics.guns': |docsj| = 93, P(vj) = 0.0517, n = 21434 (1345  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[09] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [06] 00  00  00  00  00  00  00  00  00  00  00  01  01  01  00  00  00  00  | 02 rec.autos
 03  00 [05] 00  00  00  00  00  00  00  00  00  00  00  02  00  00  00  02  02  | 03 talk.religion.misc
 01  00  00 [03] 00  03  00  00  00  00  00  00  00  00  01  01  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [06] 00  00  00  00  00  00  00  00  00  00  00  03  00  00  00  | 05 rec.sport.baseball
 00  00  00  01  00 [05] 00  01  00  01  00  00  00  00  00  00  00  00  00  00  | 06 comp.graphics
 02  00  00  00  00  00 [06] 00  00  00  00  00  00  00  01  00  00  00  00  00  | 07 talk.politics.mideast
 00  00  00  01  00  01  00 [07] 00  01  00  02  00  00  00  01  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  00  00  00 [07] 00  00  00  00  00  01  00  00  00  00  00  | 09 sci.med
 01  00  01  00  00  01  00  01  00 [06] 01  00  00  00  00  00  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  00  00  00  00  00 [03] 00  00  00  01  00  00  00  00  00  | 11 sci.crypt
 00  00  00  00  00  02  00  00  00  00  00 [08] 00  00  00  01  00  00  00  00  | 12 comp.sys.mac.hardware
 00  01  00  02  00  03  00  00  00  00  00  01 [03] 00  00  01  00  00  00  00  | 13 misc.forsale
 00  00  00  00  00  01  00  00  00  00  00  00  00 [06] 02  01  00  00  00  00  | 14 rec.motorcycles
 02  00  01  00  00  00  00  00  00  00  00  01  00  00 [05] 00  00  00  00  01  | 15 talk.politics.misc
 00  01  00  00  00  01  00  00  01  00  00  01  00  00  00 [06] 00  00  00  00  | 16 sci.electronics
 00  00  01  00  02  00  00  00  00  00  00  00  00  00  00  00 [10] 00  01  00  | 17 rec.sport.hockey
 00  00  00  00  00  03  00  00  00  01  00  00  00  01  02  00  00 [05] 00  00  | 18 sci.space
 02  00  03  00  00  00  00  00  01  00  00  00  00  00  00  00  00  00 [06] 00  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00 [06] | 20 talk.politics.guns

Correctly classified instances: 118 (59.0%)

Running fold 10
===============
Learning Naive Bayes model... (1800 training samples)
Building vocabulary...
Vocabulary contains 8822 words
Building probability matrix...
Class 'soc.religion.christian': |docsj| = 89, P(vj) = 0.0494, n = 32271 (1493  distinct words)
Class 'rec.autos': |docsj| = 91, P(vj) = 0.0506, n = 13229 (919  distinct words)
Class 'talk.religion.misc': |docsj| = 89, P(vj) = 0.0494, n = 19302 (1147  distinct words)
Class 'comp.windows.x': |docsj| = 90, P(vj) = 0.0500, n = 11627 (802  distinct words)
Class 'rec.sport.baseball': |docsj| = 89, P(vj) = 0.0494, n = 15300 (979  distinct words)
Class 'comp.graphics': |docsj| = 87, P(vj) = 0.0483, n = 33474 (1694  distinct words)
Class 'talk.politics.mideast': |docsj| = 91, P(vj) = 0.0506, n = 26385 (1541  distinct words)
Class 'comp.sys.ibm.pc.hardware': |docsj| = 90, P(vj) = 0.0500, n = 12908 (849  distinct words)
Class 'sci.med': |docsj| = 89, P(vj) = 0.0494, n = 26608 (1516  distinct words)
Class 'comp.os.ms-windows.misc': |docsj| = 90, P(vj) = 0.0500, n = 12161 (833  distinct words)
Class 'sci.crypt': |docsj| = 93, P(vj) = 0.0517, n = 22361 (1231  distinct words)
Class 'comp.sys.mac.hardware': |docsj| = 86, P(vj) = 0.0478, n = 13697 (876  distinct words)
Class 'misc.forsale': |docsj| = 92, P(vj) = 0.0511, n = 7008 (690  distinct words)
Class 'rec.motorcycles': |docsj| = 87, P(vj) = 0.0483, n = 11174 (880  distinct words)
Class 'talk.politics.misc': |docsj| = 91, P(vj) = 0.0506, n = 31308 (1562  distinct words)
Class 'sci.electronics': |docsj| = 96, P(vj) = 0.0533, n = 26346 (1383  distinct words)
Class 'rec.sport.hockey': |docsj| = 89, P(vj) = 0.0494, n = 15805 (1046  distinct words)
Class 'sci.space': |docsj| = 93, P(vj) = 0.0517, n = 24080 (1582  distinct words)
Class 'alt.atheism': |docsj| = 84, P(vj) = 0.0467, n = 20216 (1121  distinct words)
Class 'talk.politics.guns': |docsj| = 94, P(vj) = 0.0522, n = 22600 (1353  distinct words)

Classifying 200 test samples with Naive Bayes...
Columns show predicted classification, rows show actual classification
 01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  | predicted / actual 
---------------------------------------------------------------------------------+--------------------
[11] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 01 soc.religion.christian
 00 [05] 00  00  00  00  00  00  00  00  00  00  00  01  00  02  00  00  00  01  | 02 rec.autos
 06  00 [03] 00  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  01  | 03 talk.religion.misc
 00  00  00 [06] 00  03  00  00  00  00  01  00  00  00  00  00  00  00  00  00  | 04 comp.windows.x
 00  00  00  00 [11] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  | 05 rec.sport.baseball
 00  00  00  00  00 [11] 00  01  00  01  00  00  00  00  00  00  00  00  00  00  | 06 comp.graphics
 00  00  00  00  00  00 [08] 00  00  00  00  00  00  00  00  00  00  00  00  01  | 07 talk.politics.mideast
 01  00  00  00  00  01  00 [07] 00  00  01  00  00  00  00  00  00  00  00  00  | 08 comp.sys.ibm.pc.hardware
 00  00  00  00  00  03  00  00 [07] 00  00  00  00  00  00  00  00  00  01  00  | 09 sci.med
 00  00  00  00  00  04  00  00  00 [03] 00  01  00  00  01  01  00  00  00  00  | 10 comp.os.ms-windows.misc
 00  00  00  00  00  00  00  00  00  00 [06] 00  00  00  01  00  00  00  00  00  | 11 sci.crypt
 00  00  00  00  00  02  00  01  00  00  01 [08] 00  00  00  02  00  00  00  00  | 12 comp.sys.mac.hardware
 00  01  00  00  00  00  00  00  00  00  00  00 [05] 00  00  02  00  00  00  00  | 13 misc.forsale
 00  01  00  00  00  00  00  00  00  00  00  00  00 [09] 00  01  00  01  00  01  | 14 rec.motorcycles
 00  00  00  00  00  00  00  00  00  00  00  00  00  00 [07] 00  00  01  00  01  | 15 talk.politics.misc
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00 [04] 00  00  00  00  | 16 sci.electronics
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00 [09] 01  00  00  | 17 rec.sport.hockey
 00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  01  00 [05] 00  00  | 18 sci.space
 04  01  01  00  00  00  01  00  00  00  00  00  00  00  00  00  00  00 [07] 02  | 19 alt.atheism
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  03  00  00  00  00 [03] | 20 talk.politics.guns

Correctly classified instances: 135 (67.5%)

Calculating average error and error bounds...
fold    error rate
----    ----------
  01        0.3450
  02        0.3350
  03        0.3000
  04        0.3350
  05        0.4550
  06        0.3400
  07        0.3350
  08        0.3800
  09        0.4100
  10        0.3250
----    ----------
mean error: 0.3560
std. dev.:  0.0146

Calculating confidence intervals...
Confidence interval at 90% --> [0.3293 , 0.3827]
Confidence interval at 95% --> [0.3231 , 0.3889]
Confidence interval at 98% --> [0.3149 , 0.3971]
Confidence interval at 99% --> [0.3087 , 0.4033]

About

A simple Naive Bayes classifier written in Java

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages