GitHub - KrishivGubba/Concurrent-WebCrawler: Java Web Crawler Project: This repository implements a multi-threaded web crawler in Java, designed to fetch and parse web pages concurrently. It includes classes for managing URL queues, tracking visited URLs, parsing HTML content, and handling data storage, rate limiting, and robots.txt compliance.

Because a web crawler is I/O-bound, it makes sense to use a multithreaded program that fetches data from several URLs at once. My machine has 16 CPU cores, and I've decided to use about 25 threads to fetch different URLs. Each thread spends most of its time waiting on network responses, so running more threads than cores keeps the CPU busy.
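
As a rough illustration of that threading choice, here is a minimal sketch of a fixed pool of 25 worker threads fetching a batch of URLs. It is not the repository's actual code: the class name `CrawlPoolSketch`, the method `fetchAll`, and the pluggable `fetcher` function are all hypothetical, and the fetcher in `main` only simulates network latency instead of issuing real HTTP requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch, not the repository's implementation.
class CrawlPoolSketch {
    // Thread count from the README text: more threads than the 16 cores,
    // because workers mostly block on I/O rather than compute.
    static final int THREAD_COUNT = 25;

    // Runs `fetcher` once per URL on a fixed-size pool and collects the results.
    // Assumption: `fetcher` never returns null (ConcurrentHashMap rejects null values).
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
        Map<String, String> results = new ConcurrentHashMap<>();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String url : urls) {
                Runnable task = () -> results.put(url, fetcher.apply(url));
                pending.add(pool.submit(task));
            }
            // Wait for every task; get() rethrows any failure from a worker.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        // Stand-in fetcher: sleeps to mimic network latency, then returns fake HTML.
        Map<String, String> pages = fetchAll(urls, url -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "<html>" + url + "</html>";
        });
        System.out.println(pages.size() + " pages fetched");
    }
}
```

In a real crawler the `fetcher` would be replaced by an HTTP fetch and HTML parse step, and the pool size would be worth tuning against the target sites' rate limits rather than fixed at 25.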

Folders and files (12 commits):

.idea
src
.gitignore
README.md
WebCrawler.iml
jsoup-1.18.1.jar
