GitHub - pentanol2/Wep-Page-Spamicity-Detecor: A Python script to crawl content, then model and classify the spammy behavior of web pages

Spam web page detection

Spam classifier is a script for classifying web page spamicity. Spam web pages are pages that try to attract traffic without being relevant to the search query.

To check this, we verify that the web page heading conforms to the content of the page body. We have a dataset of URLs in the file hostnames.txt. The job performs the following tasks:

Data Cleaning: We check the validity of the URLs in our dataset by requesting each one of them and checking the HTTP response code. BeautifulSoup is the library used on the returned pages; it parses HTML and does not send the request itself.
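A minimal sketch of this validity check, assuming plain HTTP GET requests made with the standard library's urllib. Which HTTP client the script actually uses is not stated here. The names is_reachable and filter_working_hosts, the http:// prefix, the timeout, and the "2xx or 3xx counts as valid" rule are illustrative assumptions, not taken from spam_classifier.py.

```python
import urllib.request


def is_reachable(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return True if `url` answers with a 2xx or 3xx HTTP status.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        # urlopen raises HTTPError for 4xx/5xx responses and URLError for
        # connection failures; both are subclasses of OSError.
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # ValueError covers malformed URLs (for example, a missing scheme).
        return False


def filter_working_hosts(hostnames, check=is_reachable):
    """Keep only the hostnames whose http:// URL passes `check`."""
    # Assumption: the dataset holds bare hostnames, so a scheme is prepended.
    return [host for host in hostnames if check("http://" + host)]
```

Reading hostnames.txt line by line and passing the stripped lines to filter_working_hosts would produce the cleaned hostname list.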

Data Preprocessing: In this stage we crawl the web page content corresponding to the cleaned hostname list, then perform the following two steps:

Text Vectorization: Using the TF-IDF algorithm, we vectorize the text of both the heading and the body content of each page. The output is a pair of vectors of TF-IDF-weighted word frequencies, one for the heading and the other for the content.

Cosine Similarity: To numerically quantify the score of each web page, we compute the cosine similarity of each vector pair. The resulting number is the preprocessed feature we use in our classification problem.
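The two preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's own code: the corpus for the IDF term is taken to be just the heading and body of one page, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 is one common variant, and the helper names tokenize, tfidf_pair, cosine_similarity, and heading_body_similarity are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_pair(heading, body):
    """Return TF-IDF vectors for heading and body over a shared vocabulary."""
    docs = [Counter(tokenize(heading)), Counter(tokenize(body))]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    vectors = []
    for counts in docs:
        vec = []
        for term in vocab:
            # Document frequency: in how many of the two texts the term occurs.
            df = sum(1 for d in docs if term in d)
            # Assumed smoothed IDF; terms shared by both texts get weight 1.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            vec.append(counts[term] * idf)
        vectors.append(vec)
    return vectors[0], vectors[1]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def heading_body_similarity(heading, body):
    """Single numeric feature for one page: cosine of its TF-IDF vector pair."""
    return cosine_similarity(*tfidf_pair(heading, body))
```

A heading that shares no words with the body scores 0, and a heading identical to the body scores 1 (up to floating-point rounding), so lower values suggest a heading that does not match its page.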

Model Training: In this step we train a logistic regression model to classify each page as spam or not spam by mapping the cosine similarity to the actual labels column in our dataset.
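As a rough sketch of this training step, the following fits a one-feature logistic regression with batch gradient descent in plain Python. It is not the script's implementation, which may well rely on a library. The label convention (1 = spam, 0 = not spam), the learning rate, the epoch count, and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(similarities, labels, lr=0.5, epochs=5000):
    """Fit weight w and bias b so that sigmoid(w * x + b) approximates P(spam)."""
    w, b = 0.0, 0.0
    n = len(similarities)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(similarities, labels):
            # Gradient of the log loss with respect to the linear score.
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        # Average the gradients over the dataset and take one descent step.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_spam(similarity, w, b, threshold=0.5):
    """Return True when the predicted spam probability reaches the threshold."""
    return sigmoid(w * similarity + b) >= threshold
```

On toy data where spam pages have low heading-to-body similarity, the learned weight comes out negative, so a lower cosine similarity yields a higher predicted spam probability.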

Folders and files

History: 3 commits

.idea
.gitignore
README.md
final_frame.csv
hostnames.txt
labels.txt
patch
raw-assessments.txt
requirements.txt
setup.py
spam_classifier.py
wokring_host.csv
wokring_links_hosts.csv
working_links_assess.csv
working_links_labels.csv
