Search Result Clustering

Description

Clustering is an unsupervised machine learning technique that partitions data points into groups based on their similarity.

Implement a cluster-based visualization of 10 (or more) relevant results retrieved for a given search query. Instead of a standard list-based result view, the visualization should support exploration by illustrating the relations (e.g. similarity) between the results.

Note: You can use any simple clustering algorithm, such as K-means or HAC (hierarchical agglomerative clustering). You can also use Solr's result clustering feature. Use any Java- or Python-based machine learning library, such as WEKA 3.8 or scikit-learn.

Tech Stack and Libraries Used

Technology:

Python: Document collection (web scraping)
Java: Backend
React TypeScript: Frontend

Libraries:

Lucene: Relevant document retrieval
Weka: Clustering
Highcharts: Cluster visualization

Home Page

Preprocessing and Resource Collection

Run document_data_generator.py to generate 1000 .txt documents.
After the server starts, these documents are indexed with the Lucene library (inverted index).
- Note: The index directory is cleared before indexing.
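
To illustrate what an inverted index is, here is a minimal Python sketch. It is not the project's Lucene code (the backend is Java), and the sample documents and function names are hypothetical. It only shows the idea of mapping each term to the documents that contain it.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents; the real collection is 1000 .txt files.
docs = {
    "doc1.txt": "Clustering groups similar search results",
    "doc2.txt": "Lucene builds an inverted index for search",
    "doc3.txt": "K-means is a clustering algorithm",
}
index = build_inverted_index(docs)
```

Lucene additionally stores term frequencies and positions and ranks matches by a relevance score. This sketch only answers which documents contain the query terms.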

Input

The user is prompted for:
- Search query.
- Type of clustering algorithm.
- Number of clusters.
The input parameters are validated:
- If all input values are valid, processing continues.
- If any input value is invalid, no results are displayed.
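
The validation step could look roughly like the following Python sketch. The accepted algorithm names, the upper bound on the number of clusters, and the SearchRequest type are assumptions made for illustration. The README does not specify the actual rules.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values: the README does not list the supported algorithms or limits.
SUPPORTED_ALGORITHMS = {"kmeans", "hac"}
MAX_CLUSTERS = 20  # assumed: no more clusters than retrieved documents

@dataclass
class SearchRequest:
    query: str
    algorithm: str
    num_clusters: Optional[int]  # None stands for the "optimal clusters" option

def validate(query, algorithm, num_clusters):
    """Return a normalized SearchRequest, or None if any input is invalid."""
    if not isinstance(query, str) or not query.strip():
        return None
    if not isinstance(algorithm, str):
        return None
    normalized = algorithm.strip().lower()
    if normalized not in SUPPORTED_ALGORITHMS:
        return None
    if num_clusters is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(num_clusters, bool) or not isinstance(num_clusters, int):
            return None
        if not 1 <= num_clusters <= MAX_CLUSTERS:
            return None
    return SearchRequest(query.strip(), normalized, num_clusters)
```

Returning None for invalid input mirrors the behavior described above: the caller simply shows nothing when validation fails.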

Processing

The top 20 documents are retrieved, ranked by their similarity (relevance) score for the query.
These documents are grouped into k clusters using the K-means clustering algorithm.
Before clustering, the text of the retrieved documents is vectorized using the TF-IDF weighting scheme.
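
As an illustration of TF-IDF vectorization, the following Python sketch builds one vector per document. It is a simplified stand-in, not the project's Weka-based code. The smoothed idf formula and the tokenizer are assumptions, and Weka's own weighting may differ in detail.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with one TF-IDF vector per input text."""
    tokenized = [tokenize(t) for t in texts]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(texts)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed smoothed idf, so that no term receives a zero or negative weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors

# Hypothetical toy documents.
vocabulary, vectors = tfidf_vectors(["apple banana apple", "banana cherry"])
```

In the first toy document, "apple" gets a higher weight than "banana": it occurs more often in that document and appears in fewer documents overall.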
Data instances are converted to Weka instances to make them compatible with the Weka library for clustering.
Clustering process:
- If a value of k is entered, the documents are clustered into that many clusters.
- If the "optimal clusters" option is selected, k is determined by running the clustering for several values of k and comparing the results.
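
The two modes above can be sketched in Python as follows. This is not the Weka implementation used by the backend. The farthest-first initialization and the use of the mean silhouette score to pick k are assumptions, since the README does not say which criterion the "optimal clusters" option uses.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic init (assumed): repeatedly pick the point farthest from all centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; higher means better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(distance(p, q) for q in own) / (len(own) - 1)
        b = min(sum(distance(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, k_values):
    """Run k-means for each candidate k and keep the best silhouette score."""
    best_k, best_score = None, -2.0
    for k in k_values:
        score = silhouette(points, kmeans(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Hypothetical 2D points forming three well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
labels = kmeans(points, 3)
best_k = choose_k(points, range(2, 6))
```

When the user enters k, only kmeans is needed. The choose_k helper corresponds to the "optimal clusters" option under the stated assumption about the selection criterion.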
Dimensionality reduction is performed using Principal Component Analysis (PCA) to display data instances in a 2D space.
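
A minimal Python sketch of projecting vectors onto their first two principal components is shown below. It uses power iteration with deflation on the covariance matrix. This is one simple way to compute PCA and is not necessarily how the backend does it. The sample rows are hypothetical.

```python
import math

def center(rows):
    """Subtract the column means so that every feature has zero mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def dominant_eigenvector(matrix, iterations=200):
    """Power iteration; returns a zero vector if the matrix is numerically zero."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return [0.0] * d
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v = dominant_eigenvector(cov)
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the found component before searching for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

# Hypothetical 3D rows that vary mostly along x, slightly along y, not along z.
rows = [[0.0, 0.0, 5.0], [4.0, 0.0, 5.0], [0.0, 1.0, 5.0], [4.0, 1.0, 5.0]]
coords = pca_2d(rows)
```

Each row of coords is an (x, y) pair that can be plotted directly. The first coordinate captures the direction of largest variance in the original vectors.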

Output

The frontend is built with React and TypeScript.
Clusters are visualized using:
- Packed bubble plot (default view)
  - Hovering over a cluster shows the data instances grouped within it, along with their:
    - Title
    - Content
    - Relevance score
- Scatter plot
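
As a rough illustration of how clustered results might be shaped for a packed bubble chart, the Python sketch below groups results into one series per cluster, with a name and a value for each bubble. The input field names, the choice of the relevance score as bubble size, and the extra content field are assumptions. The project's actual API payload is not described in this README.

```python
import json
from collections import defaultdict

def to_packed_bubble_series(results):
    """Group results by cluster into a list of {name, data} series."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["cluster"]].append({
            "name": r["title"],       # label shown for the bubble
            "value": r["score"],      # assumed: relevance score sets bubble size
            "content": r["content"],  # assumed extra field for the hover tooltip
        })
    return [{"name": f"Cluster {c + 1}", "data": grouped[c]}
            for c in sorted(grouped)]

# Hypothetical clustered search results.
results = [
    {"title": "Doc A", "content": "solar panels", "score": 2.5, "cluster": 0},
    {"title": "Doc B", "content": "wind farms", "score": 1.8, "cluster": 1},
    {"title": "Doc C", "content": "solar cells", "score": 2.1, "cluster": 0},
]
series = to_packed_bubble_series(results)
payload = json.dumps({"series": series})
```

The same grouped structure could feed the scatter plot by replacing name and value with the 2D coordinates of each document.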
