Machine_Learning_Focused_Crawler

A focused web crawler that uses machine learning to fetch more relevant results.

The files are as follows:

1. Crawler_ML.py: The Python crawler. Run it as follows:

python Crawler_ML.py withoutML - To run Focused Crawler without Machine Learning
python Crawler_ML.py withML - To run Focused Crawler with Machine Learning

After either command is executed, the program prompts for the following input:

Please Enter the Query in small letters (Words Should be Spaced): election results
Please Enter the Number of Pages to Crawl: 1000

Currently, the crawler supports queries with only the following words:

'wildfires', 'california', 'brooklyn', 'dodgers', 'shahrukh', 'khan', 'pangolin', 'armadillo', 'world', 'cup', 'hurricane', 'florence', 'mac', 'miller', 'kate', 'spade', 'anthony', 'bourdain', 'black', 'panther', 'mega', 'million', 'results', 'stan', 'lee', 'demi', 'lovato', 'election'
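Because only the words above are supported, it can help to check a query before starting a crawl. The following is a minimal sketch, not code taken from Crawler_ML.py: the names `SUPPORTED_WORDS` and `unsupported_words` are hypothetical, and splitting the query on whitespace is an assumption based on the prompt's "Words Should be Spaced" hint.

```python
# Hypothetical helper; not part of Crawler_ML.py.
# The vocabulary below is copied from the supported-word list in this README.
SUPPORTED_WORDS = {
    'wildfires', 'california', 'brooklyn', 'dodgers', 'shahrukh', 'khan',
    'pangolin', 'armadillo', 'world', 'cup', 'hurricane', 'florence',
    'mac', 'miller', 'kate', 'spade', 'anthony', 'bourdain', 'black',
    'panther', 'mega', 'million', 'results', 'stan', 'lee', 'demi',
    'lovato', 'election',
}


def unsupported_words(query):
    """Return the words in `query` that are not in the supported vocabulary.

    Assumes a lowercase, space-separated query, as the crawler's prompt asks for.
    """
    return [word for word in query.lower().split() if word not in SUPPORTED_WORDS]


if __name__ == "__main__":
    for query in ("election results", "brooklyn bridge"):
        missing = unsupported_words(query)
        if missing:
            print(f"{query!r}: unsupported words {missing}")
        else:
            print(f"{query!r}: all words supported")
```

An empty list means every word in the query is supported.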

2. withoutML_election results.txt - Log file for the query 'election results' from the Focused Crawler without ML (large-topic query)

3. withML_election results.txt - Log file for the query 'election results' from the Focused Crawler with ML (large-topic query)

4. withoutML_brooklyn dodgers.txt - Log file for the query 'brooklyn dodgers' from the Focused Crawler without ML (rare-topic query)

5. withML_brooklyn dodgers.txt - Log file for the query 'brooklyn dodgers' from the Focused Crawler with ML (rare-topic query)

Note: Files 2, 3, 4, and 5 contain the following:

i) Name of the URL

ii) Time the URL was Crawled

iii) Size of the Page

iv) Status Code

v) HyperLink Text Info: (text, depth of the link)

vi) Estimated Promise (only for focused crawler)

vii) Cosine Relevance Score

viii) Statistics of the Entire Crawl that includes:

a) Crawl Start Time

b) Crawl End Time

c) Time it took to Crawl: hh:mm:ss

d) Harvest Score

6. Project Report.pdf - A PDF file that describes the project in detail.
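The log files list a Cosine Relevance Score per page and a Harvest Score for the whole crawl, but this README does not say how either is computed. As a rough illustration only, the sketch below assumes raw term-frequency vectors over whitespace-separated tokens for the cosine score, and treats the harvest score as the fraction of crawled pages whose relevance meets a threshold. The function names, the tokenization, and the `threshold=0.5` default are all assumptions, not the crawler's actual implementation; see the project report for the real definitions.

```python
import math
from collections import Counter


def cosine_relevance(query, page_text):
    """Cosine similarity between term-frequency vectors of query and page.

    Assumption: lowercase whitespace tokenization and raw counts (no TF-IDF).
    """
    query_counts = Counter(query.lower().split())
    page_counts = Counter(page_text.lower().split())
    # Only terms present in the query can contribute to the dot product.
    dot = sum(count * page_counts[term] for term, count in query_counts.items())
    norm = (math.sqrt(sum(c * c for c in query_counts.values()))
            * math.sqrt(sum(c * c for c in page_counts.values())))
    return dot / norm if norm else 0.0


def harvest_score(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance is at least `threshold`.

    Assumption: both this definition and the threshold value are illustrative.
    """
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for score in relevance_scores if score >= threshold)
    return relevant / len(relevance_scores)


if __name__ == "__main__":
    pages = [
        "election results for the state senate",
        "brooklyn dodgers history",
    ]
    scores = [cosine_relevance("election results", page) for page in pages]
    print(scores)
    print(harvest_score(scores))
```

Under these assumptions, a page that shares no terms with the query scores 0, and a crawl in which every page clears the threshold has a harvest score of 1.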

IlyasHabeeb/Machine_Learning_Focused_Crawler
