sitemap-generator

Project sitemap-generator helps a user create a sitemap tree of a given website.

It crawls the given website, excludes external links (those outside the website's domain), and builds a sitemap tree from all the crawled paths/links.
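
For illustration only, the approach above can be sketched in a few lines of Python. This is not the repository's actual implementation: the function names (crawl, build_tree) and the use of requests and BeautifulSoup are assumptions, and the real crawler also tracks URLs it fails to fetch.

    # Minimal sketch of the crawl-and-build-tree idea (not the project's actual code).
    # Assumes the third-party packages `requests` and `beautifulsoup4` are installed.
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup


    def crawl(start_url):
        """Breadth-first crawl of start_url, skipping links outside its domain."""
        domain = urlparse(start_url).netloc
        seen, queue, paths = set(), [start_url], []
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # a real crawler would record this URL as failed
            paths.append(urlparse(url).path or "/")
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).netloc == domain:  # exclude external links
                    queue.append(link)
        return paths


    def build_tree(paths):
        """Fold paths like /a/b and /a/c into a nested dict (the sitemap tree)."""
        tree = {}
        for path in paths:
            node = tree
            for part in filter(None, path.split("/")):
                node = node.setdefault(part, {})
        return tree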

This project can be executed:

On your local machine

Prerequisites:

  • Python3. To install, refer to this.

Setup:

  • Clone this project on your machine.
  • It is good practice to create a virtualenv. To install the virtualenv utility, refer to this.
    virtualenv -p python3.5 venv
    source venv/bin/activate
  • Install dependencies
    pip install -r requirements.txt
  • Set up the config.toml file (a sketch of reading these values follows this list). Sample file:
    [server]
    url =  ""
    port = ""
    
    # Valid values: local, server
    type = "local"
    
    [log]
    # Valid values: error, debug, info, warn
    level = "info"

Once the setup is done, we are ready to try it out locally.

Using the CLI

  • Getting help:
    $ python sitemapctl.py -h
    usage: sitemapctl.py [-h] [--url URL]
    
    Sitemap generator
    
    optional arguments:
      -h, --help  show this help message and exit
      --url URL   Eg. 'https://www.mywebsite.com'
  • Executing locally (a sketch of the CLI entry point follows this list):
    $ python sitemapctl.py --url "https://www.somewebsite.com"
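
The help output above maps onto a small argparse entry point. The following is only a sketch of what sitemapctl.py might look like; generate_sitemap is a hypothetical stand-in for whatever crawl routine the project actually invokes.

    # Sketch of an argparse entry point matching the help output above.
    import argparse


    def generate_sitemap(url):
        # Hypothetical placeholder: the real project crawls `url` and returns a sitemap tree.
        return {"url": url, "failed": [], "sitemap": {}}


    def main():
        parser = argparse.ArgumentParser(description="Sitemap generator")
        parser.add_argument("--url", help="Eg. 'https://www.mywebsite.com'")
        args = parser.parse_args()
        if args.url:
            print(generate_sitemap(args.url))


    if __name__ == "__main__":
        main()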

Inside Docker

Inside Docker, an API server is started so that the sitemap-generator service is available as a REST endpoint.

Prerequisites:

  • Docker needs to be installed. To install, refer to this.
  • Python3. To install, refer to this.

Setup

  • Clone this project on your machine.
  • It is good practice to create a virtualenv. To install the virtualenv utility, refer to this.
    virtualenv -p python3.5 venv
    source venv/bin/activate
  • Install dependencies
    pip install -r requirements.txt
  • Set up the config.toml file. Sample file:
    [server]
    url =  "http://localhost"
    port = "5002"
    
    # Valid values: local, server
    type = "server"
    
    [log]
    # Valid values: error, debug, info, warn
    level = "info"
  • Create a Docker image:
    docker build -t crawler:latest .
  • Run the Docker container. Make sure to use the correct port in the docker run command:
    docker run -p 5002:5002 crawler:latest

Using the CLI

  • Use of the CLI is the same as for the local setup:
    $ python sitemapctl.py --url "https://www.somewebsite.com"

Using APIs

  • Health check:
    $ curl -i -H "Content-Type: application/json" -X GET http://localhost:5002/_health
    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 24
    Server: Werkzeug/0.15.2 Python/3.6.8
    Date: Mon, 29 Apr 2019 07:19:44 GMT
    
    {
      "msg": "I am ok!"
    }
  • Generate sitemap: This returns a JSON object with url, failed (URLs sitemap-generator failed to crawl), and sitemap. A YAML-based sitemap tree is dumped in the logs as well; a minimal sketch of the server's two endpoints follows this list.
    curl -i -H "Content-Type: application/json" -X POST http://localhost:5002/crawl -d '{"url": "https://www.somewebsite.com"}'
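
The two endpoints above (and the Werkzeug server header in the health-check response) suggest a small Flask-style app. The sketch below only mirrors the request and response shapes shown in the curl examples; it is not the repository's server code.

    # Sketch of a Flask app exposing the two endpoints shown above.
    # The handler bodies are placeholders, not the project's actual logic.
    from flask import Flask, jsonify, request

    app = Flask(__name__)


    @app.route("/_health", methods=["GET"])
    def health():
        return jsonify({"msg": "I am ok!"})


    @app.route("/crawl", methods=["POST"])
    def crawl():
        url = request.get_json()["url"]
        # A real handler would crawl `url`, then report failed URLs and the sitemap tree.
        return jsonify({"url": url, "failed": [], "sitemap": {}})


    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5002)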

On a Kubernetes cluster

sitemap-generator can be deployed on a Kubernetes cluster as a service and accessed via the sitemapctl.py CLI or its API endpoints.

Prerequisites:

  • A Kubernetes cluster. Minikube can be used as well. To install Minikube, refer to this.
  • kubectl CLI. To install, refer to this.
  • Docker needs to be installed. To install, refer to this.
  • A Docker registry to push and pull Docker images. Docker Hub can be used for this purpose.
  • Python3. To install, refer to this.

Setup on the Kubernetes cluster

  • Clone this project on your machine.

  • It is good practice to create a virtualenv. To install the virtualenv utility, refer to this.

    virtualenv -p python3.5 venv
    source venv/bin/activate
  • Install dependencies

    pip install -r requirements.txt
  • Set up the config.toml file. Sample file:

    [server]
    url =  "http://localhost"
    port = "5002"
    
    # Valid values: local, server
    type = "server"
    
    [log]
    # Valid values: error, debug, info, warn
    level = "info"
  • Create a Docker image:

    docker build -t crawler:latest .
  • Push the Docker image to the Docker registry:

    docker tag crawler:latest mydockerhubregistry/crawler:latest
    docker push mydockerhubregistry/crawler:latest
  • Correctly set up the Kubernetes definition YAML files present in the kube directory. Make sure to use the correct image name and port values in the definition files.

  • Deploy the application:

    cd kube
    kubectl create -f crawler-deployment.yaml -f crawler-svc.yaml

Setup of the CLI to access the Kubernetes cluster:

  • Set up the config.toml file again, now pointing to the Kubernetes cluster.
  • Set url to the FQDN or IP address of the Kubernetes cluster. In the case of Minikube, use the minikube status command to get the IP address of the VM.
    $ minikube status
    host: Running
    kubelet: Running
    apiserver: Running
    kubectl: Correctly Configured: pointing to minikube-vm at 192.168.39.138
  • Get the port at which the crawler service is exposed by using the kubectl CLI.
    $ kubectl get svc
    NAME          TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)       AGE
    crawler-svc   LoadBalancer   10.102.13.28   <pending>     80:31192/TCP  85m
    
    Here 80:31192/TCP means the service's port 80 is mapped to node port 31192, so 31192 is the port to be used.
  • Sample config.toml:
    [server]
    url =  "http://192.168.39.138"
    port = "31192"
    
    # Valid values: local, server
    type = "server"
    
    [log]
    # Valid values: error, debug, info, warn
    level = "info"

Using the CLI

  • Use of the CLI is the same as for the local setup:
    $ python sitemapctl.py --url "https://www.somewebsite.com"

Using APIs

  • Health check:
    $ curl -i -H "Content-Type: application/json" -X GET http://192.168.39.138:31192/_health
    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 24
    Server: Werkzeug/0.15.2 Python/3.6.8
    Date: Mon, 29 Apr 2019 07:19:44 GMT
    
    {
      "msg": "I am ok!"
    }
  • Generate sitemap: This returns a JSON object with url, failed (URLs sitemap-generator failed to crawl), and sitemap. A YAML-based sitemap tree is dumped in the logs as well.
    curl -i -H "Content-Type: application/json" -X POST http://192.168.39.138:31192/crawl -d '{"url": "https://www.somewebsite.com"}'
