Clustering is an unsupervised machine learning technique that partitions data points into groups based on their similarity.
- Implement a cluster-based visualization of 10 (or more) relevant results retrieved for a given search query. Instead of a standard list-based result view, the visualization should support exploration by illustrating the relations (e.g. similarity) between the results.
Note: Any simple clustering algorithm such as K-means or HAC can be used. Solr's result clustering feature is also an option. Use any Java- or Python-based machine learning library, such as WEKA 3.8 or scikit-learn.
- Python: Document collection (web scraping)
- Java: Backend
- React TypeScript: Frontend
- Lucene: Relevant document retrieval
- Weka: Clustering
- Highcharts: Cluster visualization
- Run the `document_data_generator.py` file to generate 1000 .txt documents.
- These documents are indexed using the Lucene library (inverted indexing) after the server is started.
- Note: Before indexing, the indexing directory is cleared.
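
As a rough illustration of what inverted indexing means, the sketch below builds a term-to-documents map in plain Python. It is not Lucene's API: `tokenize`, `build_inverted_index`, and the two sample documents are hypothetical names and data chosen for this example only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}.

    Hypothetical helper for illustration; Lucene stores far more than this.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Made-up corpus standing in for the generated .txt files.
docs = {
    "doc1.txt": "Clustering groups similar documents",
    "doc2.txt": "Lucene indexes documents for search",
}
index = build_inverted_index(docs)
```

Looking up a query term in `index` returns the documents that contain it without scanning every file, which is the property the indexing step relies on.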
- The user is prompted for:
  - Search query.
  - Type of clustering algorithm.
  - Number of clusters.
- Input parameters are verified:
  - If the input values are valid, processing continues.
  - If the input is invalid, nothing is displayed.
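
The validation step could look roughly like the sketch below. The accepted algorithm names, the upper bound on the number of clusters, and the function name `validate_inputs` are assumptions made for illustration; the actual backend may check different things.

```python
# Assumed option names; the real application may expose different ones.
SUPPORTED_ALGORITHMS = {"kmeans", "hac"}


def validate_inputs(query, algorithm, k, max_docs=20):
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    if not query or not query.strip():
        errors.append("query must not be empty")
    if algorithm not in SUPPORTED_ALGORITHMS:
        errors.append("unsupported clustering algorithm")
    # k may be None when the "optimal clusters" option is selected.
    if k is not None and not (isinstance(k, int) and 1 <= k <= max_docs):
        errors.append("number of clusters must be an integer between 1 and max_docs")
    return errors
```

Returning the list of problems, rather than a bare boolean, would let the frontend decide whether to stay silent or show a message.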
- Based on the similarity metric, the top 20 documents are retrieved.
- These documents are grouped into K clusters using the K-means clustering algorithm.
- The text of the retrieved documents is vectorized using the TF-IDF weighting scheme.
- The resulting data instances are converted to Weka instances so that the Weka library can cluster them.
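
To make the TF-IDF step concrete, here is a minimal sketch that uses raw term frequency multiplied by log(N / df). This particular formula is an assumption; the weighting applied by the Weka-based backend may differ (for example in normalization or smoothing). `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


# Made-up, already tokenized documents.
docs = [["solar", "energy", "panel"], ["wind", "energy"], ["solar", "solar", "cell"]]
vocabulary, vectors = tfidf_vectors(docs)
```

Each document becomes one numeric vector with one entry per vocabulary term, which is the shape a clustering library expects as input.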
- Clustering process:
  - If a k value is entered, the documents are clustered into k clusters.
  - If the "optimal clusters" option is selected, k is determined by running the clustering for several candidate values of k.
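
The two branches above can be sketched as follows. This is a plain-Python illustration, not the Weka implementation: the deterministic farthest-point seeding is chosen only to keep the example reproducible, and using the silhouette score to pick k is an assumption, since the criterion the project uses is not stated. All function names are hypothetical.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point seeding (assumed here for reproducibility)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iterations=50):
    """Lloyd's k-means; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_values):
    """Try each candidate k and keep the one with the best silhouette score."""
    return max(k_values, key=lambda k: silhouette(points, kmeans(points, k)))


# Three well separated made-up groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels = kmeans(points, 3)
best_k = choose_k(points, range(2, 6))
```

When the user supplies k, only `kmeans` is needed; `choose_k` corresponds to the "optimal clusters" option and costs one full clustering run per candidate value.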
- Dimensionality reduction is performed using Principal Component Analysis (PCA) to display data instances in a 2D space.
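
For readers unfamiliar with PCA, the sketch below projects feature vectors onto their top two principal components using power iteration with deflation. It only illustrates the idea and is independent of whichever library routine the backend calls; `pca_2d` and the sample rows are hypothetical.

```python
import math


def pca_2d(rows, iterations=200):
    """Project rows onto their top two principal components via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)] for i in range(d)]

    components = []
    for _component in range(2):
        # Fixed, asymmetric start vector keeps the sketch deterministic.
        # Assumption: it is not orthogonal to the dominant eigenvector.
        v = [1.0 / (j + 1) for j in range(d)]
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # no variance left in the remaining directions
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflation: remove the found component so the next pass finds the following one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]


# Made-up 3D data whose variance is largest along x, then y.
rows = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, -0.5, 0.0]]
points_2d = pca_2d(rows)
```

Each input row yields an (x, y) pair, which is what a 2D chart such as a scatter plot needs for positioning the documents.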
- The frontend is built using React TypeScript (ReactTs).
- Clusters are visualized using:
- Packed Bubble plot (default view)
- Additional details:
- Hovering over clusters shows the data instances grouped within, along with their:
- Title.
- Content.
- Relevance score.
- Scatter Plot