A Python script to crawl web page content, then model and classify the spammy behavior of web pages

pentanol2/Wep-Page-Spamicity-Detecor

Spam web page detection

Spam classifier is a script for classifying web page spamicity. Spam web pages are pages that try to attract search traffic without being relevant to the search query.

To check this, we measure how well the heading of each web page conforms to the content of the page body. The dataset is a list of URLs stored in hostnames.txt. The job performs the following tasks:

  1. Data Cleaning: We check the validity of the URLs in our dataset by requesting each one and checking the HTTP response code. The script relies on the BeautifulSoup library, which parses the returned HTML rather than issuing the request itself.

  2. Data Preprocessing: In this stage we crawl the web page content for each hostname in the cleaned list, then perform the following two steps:

    Text Vectorization: Using the tf-idf algorithm, we vectorize both the heading and the body content of each page. The output is a pair of vectors of weighted word frequencies, one for the heading and the other for the body.

    Cosine Similarity: To numerically quantify the score of each web page, we compute the cosine similarity of its pair of vectors. The resulting number is the preprocessed feature used in the classification problem.

  3. Model Training: In this step we train a logistic regression model to classify each page as spam or not spam by mapping the cosine similarity to the actual labels column in our dataset.
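
The preprocessing step above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's actual code: the function names are hypothetical, the tokenizer is a simple regular expression, and the tf-idf weights are assumed to be fitted on the heading/body pair of a single page with a smoothed IDF. The script itself may fit tf-idf differently, for example over the whole corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumption: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty heading or body shares nothing with the other
    return dot / (norm_u * norm_v)


def heading_body_similarity(heading, body):
    """Feature for one page: cosine similarity of its heading and body."""
    heading_vec, body_vec = tfidf_vectors([heading, body])
    return cosine_similarity(heading_vec, body_vec)
```

A heading whose words never occur in the body scores 0, and the score approaches 1 as the heading and body use the same words in the same proportions.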

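The model training step can be illustrated with a single-feature logistic regression fitted by batch gradient descent. This is a sketch under stated assumptions rather than the script's implementation: the function names, learning rate, epoch count, and toy data are all made up for illustration, and the label encoding (1 for spam, 0 for not spam) is assumed, since the labels column format is not described here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(similarities, labels, lr=0.5, epochs=5000):
    """Fit P(spam | s) = sigmoid(w * s + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(similarities)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for s, y in zip(similarities, labels):
            # Gradient of the log loss with respect to the logit.
            error = sigmoid(w * s + b) - y
            grad_w += error * s
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_spam(similarity, w, b, threshold=0.5):
    """Return True when the predicted spam probability reaches the threshold."""
    return sigmoid(w * similarity + b) >= threshold


# Made-up toy data (assumption): low heading/body similarity is labelled spam (1).
toy_similarities = [0.05, 0.10, 0.15, 0.20, 0.70, 0.80, 0.85, 0.90]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = train_logistic_regression(toy_similarities, toy_labels)
```

On this toy data the learned weight is negative, so a lower cosine similarity raises the predicted probability of spam. Real labels may not separate this cleanly on a single feature.
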