Document clustering program, in Java
ezraerb/DocumentCluster
This file describes DocumentCluster, a program for clustering text documents based on the similarity of their word frequencies.

Document words are first filtered against a specified stop word list, then stemmed using the classic Porter stemming algorithm. The resulting data is converted to Term Frequency-Inverse Document Frequency (TF-IDF) values and normalized so that each document is a vector of length one. The document data is represented internally as a sparse matrix with collapsed word columns. The vectors are then clustered using the classic k-means algorithm, with cosine similarity as the distance measure.

To install:
1. Copy the files to a directory.
2. Create a subdirectory 'data'.
3. Move the supplied stopwords.txt file to this subdirectory, or create a custom version. Words within the file must be specified on a single line, separated by commas.
4. Compile the source files.
5. Copy the files to be clustered into the data subdirectory.
6. Run as DocumentCluster [number of clusters] [list of files]

The number of clusters will be the number specified or half the number of files, whichever is less. Files that have no word overlap with other files after stop-word removal will be excluded from the clustering; this happens rarely in practice.

Copyright (C) 2013 Ezra Erb

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

I'd appreciate a note if you find this program useful or make updates. Please contact me through LinkedIn or GitHub (my profile also has a link to the code repository).
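Illustrative sketch (not taken from the program's source): the TF-IDF weighting, unit-length normalization, and cosine similarity steps described in this file could look roughly like the Java below. The class name TfIdfSketch and all method names are hypothetical. The IDF variant log(N / df) and the use of integer division for "half the number of files" are assumptions, since this file does not specify either. Input documents are assumed to be already stop-word filtered and stemmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and formula choices are assumptions, not the program's code.
public class TfIdfSketch {

    // Cluster count rule from the README: the requested number or half the
    // number of files, whichever is less. Integer division is an assumption.
    static int effectiveClusterCount(int requested, int fileCount) {
        return Math.min(requested, fileCount / 2);
    }

    // Builds one sparse, unit-length TF-IDF vector per document.
    // Each document is a list of already filtered and stemmed terms.
    static List<Map<String, Double>> tfIdfVectors(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            // Each distinct term counts once per document toward document frequency.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termCounts.add(tf);
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : termCounts) {
            Map<String, Double> vec = new HashMap<>();
            double sumSquares = 0.0;
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumed IDF variant: natural log of (document count / document frequency).
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double weight = e.getValue() * idf;
                if (weight != 0.0) { // keep the vector sparse
                    vec.put(e.getKey(), weight);
                    sumSquares += weight * weight;
                }
            }
            double norm = Math.sqrt(sumSquares);
            if (norm > 0.0) {
                vec.replaceAll((k, v) -> v / norm); // scale to length one
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // For unit-length vectors, cosine similarity reduces to the dot product.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            dot += e.getValue() * large.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("cat", "sat", "mat"),
                Arrays.asList("cat", "sat", "hat"),
                Arrays.asList("dog", "bark", "loud"));
        List<Map<String, Double>> vectors = tfIdfVectors(docs);
        System.out.printf("doc0 vs doc1: %.3f%n", cosine(vectors.get(0), vectors.get(1)));
        System.out.printf("doc0 vs doc2: %.3f%n", cosine(vectors.get(0), vectors.get(2)));
        System.out.println("clusters for 5 requested, 6 files: " + effectiveClusterCount(5, 6));
    }
}
```

In this toy input the third document shares no terms with the others, so its similarity to them is zero. That is the situation in which, per the description above, a file would be excluded from clustering. A k-means loop would then assign each vector to the centroid with the highest cosine similarity and recompute the centroids until the assignments stop changing.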