This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

Automated link checker for legalcode and license URLs

cc-archive/cc-link-checker

Creative Commons Link Checker

This Python script scrapes all the license files and automates the detection of broken links, timeout errors, and other link issues.
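
As a rough illustration of the kind of check described above, the sketch below classifies a single URL as working, broken, or timed out. It uses only the Python standard library, which is not necessarily the HTTP client the script itself uses, and the names `check_link` and `fetch` are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def check_link(url, timeout=10.0, fetch=urllib.request.urlopen):
    """Return (ok, detail) for one URL.

    `fetch` is injectable so the function can be exercised without network
    access; it must accept (url, timeout=...) like urllib.request.urlopen.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return False, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Connection-level failure; a connect timeout arrives wrapped here.
        if isinstance(exc.reason, socket.timeout):
            return False, "timeout"
        return False, f"unreachable: {exc.reason}"
    except socket.timeout:
        # A timeout while reading the response is raised directly.
        return False, "timeout"
    return status < 400, f"HTTP {status}"
```

Calling `check_link(url)` with the default `fetch` performs a real request, so the result depends on the network and on the remote server at that moment.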

unitAndLint | License: MIT | Code style: black | Chat: on Slack

Prerequisites

  • Python 3
  • Console with UTF-8 support

Installation

There are two suggested ways to install. Follow User if you only want to run the script. Follow Development if you want to work on the script itself.

User

  1. Clone the repo
    git clone https://github.com/creativecommons/cc-link-checker.git
  2. Install dependencies using the Pipfile (requires pipenv)
    pipenv install

Development

We recommend using pipenv to create a virtual environment and install dependencies.

  1. Clone the repo
    git clone https://github.com/creativecommons/cc-link-checker.git
  2. Create a virtual environment and install all dependencies
    • Normal
      pipenv install --dev
    • Use sync to install the last successful environment. For example:
      pipenv sync --dev
  3. Run the script:
    pipenv run link_checker

Usage

pipenv run link_checker -h
usage: link_checker [-h] {deeds,legalcode,rdf,index,combined,canonical} ...

Check for broken links in Creative Commons license deeds, legalcode, and rdf

optional arguments:
  -h, --help            show this help message and exit

subcommands (a single subcomamnd is required):
  {deeds,legalcode,rdf,index,combined,canonical}
    deeds               check the links for each license's deed
    legalcode           check the links for each license's legalcode
    rdf                 check the links for each license's RDF
    index               check the links within index.rdf
    combined            Combined check (deeds, legalcode, rdf, and index)
    canonical           print canonical license URLs

Also see the help output each subcommand
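
The subcommand layout in this help output matches what Python's argparse produces with sub-parsers. The following is a minimal sketch of such a layout, not the project's actual source; the function name `build_parser` and the `dest` name `subcommand` are assumptions made for illustration.

```python
import argparse

# Subcommand names and summaries copied from the help output above.
SUBCOMMANDS = [
    ("deeds", "check the links for each license's deed"),
    ("legalcode", "check the links for each license's legalcode"),
    ("rdf", "check the links for each license's RDF"),
    ("index", "check the links within index.rdf"),
    ("combined", "Combined check (deeds, legalcode, rdf, and index)"),
    ("canonical", "print canonical license URLs"),
]


def build_parser():
    """Build a parser that requires exactly one subcommand."""
    parser = argparse.ArgumentParser(
        prog="link_checker",
        description="Check for broken links in Creative Commons license "
        "deeds, legalcode, and rdf",
    )
    # required=True makes argparse exit with a usage error when no
    # subcommand is given (available since Python 3.7).
    subparsers = parser.add_subparsers(dest="subcommand", required=True)
    for name, summary in SUBCOMMANDS:
        subparsers.add_parser(name, help=summary)
    return parser
```

With this layout, options such as --local belong to the individual sub-parsers, so they are written after the subcommand name on the command line.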

deeds

pipenv run link_checker deeds -h
usage: link_checker deeds [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                          [--local] [--output-errors [output_file]]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
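
The option shapes listed above map onto standard argparse features: repeatable -q/-v flags are counters, and `--output-errors [output_file]` is an option with an optional value. The sketch below shows one way to declare them; the helper names and the choice of WARNING as the base log level are assumptions, not taken from the script.

```python
import argparse
import logging


def build_option_parser():
    """Declare options shaped like those of the deeds subcommand."""
    parser = argparse.ArgumentParser(prog="link_checker deeds")
    # action="count" increments the value each time the flag appears.
    parser.add_argument("-q", "--quiet", action="count", default=0)
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("--root-url", default="https://creativecommons.org")
    parser.add_argument("--limit", type=int, default=10)
    # nargs="?" means: absent -> default, bare flag -> const,
    # flag with a value -> that value.
    parser.add_argument(
        "--output-errors",
        nargs="?",
        const="errorlog.txt",
        default=None,
        metavar="output_file",
    )
    return parser


def log_level(args, base=logging.WARNING):
    """Map the -q/-v counts to a logging level (base level is assumed)."""
    level = base + 10 * (args.quiet - args.verbose)
    # Clamp to the range of the standard logging levels.
    return max(logging.DEBUG, min(logging.CRITICAL, level))
```

Under these assumptions, passing -v twice lowers the threshold from WARNING to DEBUG, and a bare --output-errors selects the default file name errorlog.txt.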

legalcode

pipenv run link_checker legalcode -h
usage: link_checker legalcode [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                              [--local] [--output-errors [output_file]]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)

rdf

pipenv run link_checker rdf -h
usage: link_checker rdf [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                        [--local] [--local-index] [--output-errors [output_file]]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
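
The --local and --local-index descriptions above each name an environment variable with a fallback default. A minimal sketch of that lookup follows; the function name `resolve_local_paths` is hypothetical, and how the script treats an empty variable is not stated in the help text, so this sketch simply uses whatever value is set.

```python
import os

# Defaults quoted from the help output above.
DEFAULT_LICENSE_LOCAL_PATH = "../creativecommons.org/docroot/legalcode"
DEFAULT_INDEX_RDF_LOCAL_PATH = "./index.rdf"


def resolve_local_paths(environ=os.environ):
    """Return (legalcode_dir, index_rdf_path), preferring the environment."""
    license_path = environ.get("LICENSE_LOCAL_PATH", DEFAULT_LICENSE_LOCAL_PATH)
    index_path = environ.get("INDEX_RDF_LOCAL_PATH", DEFAULT_INDEX_RDF_LOCAL_PATH)
    return license_path, index_path
```

Passing a plain dict as `environ` makes the fallback easy to try without changing the real environment.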

index

pipenv run link_checker index -h
usage: link_checker index [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                          [--local-index] [--output-errors [output_file]]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)

combined

pipenv run link_checker combined -h
usage: link_checker combined [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                             [--local] [--local-index]
                             [--output-errors [output_file]]

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)

canonical

pipenv run link_checker canonical -h
usage: link_checker canonical [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                              [--local] [--include-gnu]

optional arguments:
  -h, --help           show this help message and exit
  -q, --quiet          decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL  set root URL (default: 'https://creativecommons.org')
  --limit LIMIT        Limit check lists to specified integer
  -v, --verbose        increase verbosity (can be specified multiple times)
  --local              process local filesystem legalcode files to determine valid
                       license paths (uses LICENSE_LOCAL_PATH environment variable
                       and falls back to default:
                       '../creativecommons.org/docroot/legalcode')
  --include-gnu        include GNU licenses in addition to Creative Commons
                       licenses
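
To show how --root-url could combine with a license path to form a URL, here is a small sketch. The function name `canonical_url` and the example path are illustrative assumptions; the help text does not describe how the script derives its paths.

```python
def canonical_url(root_url, license_path):
    """Join a root URL and a license path with exactly one slash between."""
    return root_url.rstrip("/") + "/" + license_path.lstrip("/")


def canonical_urls(root_url, license_paths):
    """Build one URL per license path, preserving the input order."""
    return [canonical_url(root_url, path) for path in license_paths]
```

Changing only `root_url` points the same set of paths at a different host, which is the effect the --root-url option suggests.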

Integrating with CI

Because the script can scrape licenses from local storage, it can be run in CI in two steps:

  1. Clone this repo in the CI container

    git clone https://github.com/creativecommons/cc-link-checker.git ~/cc-link-checker
  2. Run link_checker.py with a subcommand (for example, combined) in local (--local) and output-errors (--output-errors) mode

    python link_checker.py combined --local --output-errors

The configuration for GitHub Actions, for example, is present here.
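
A CI job may also want to inspect the JUnit-style summary that --output-errors writes. The sketch below lists the failing test cases in such a report. It assumes the common JUnit XML layout, in which each `<testcase>` element carries a `<failure>` or `<error>` child when it did not pass; the exact structure of this script's report is not documented here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def failed_cases(junit_xml_text):
    """Return the names of test cases that contain a failure or error."""
    root = ET.fromstring(junit_xml_text)
    failed = []
    # iter() walks the whole tree, so it works whether the root element is
    # <testsuites>, a single <testsuite>, or something nested deeper.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name", ""))
    return failed
```

A follow-up step could read test-summary/junit-xml-report.xml, pass its text to `failed_cases`, and exit with a non-zero status when the returned list is not empty.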

Unit Testing

Unit tests are written using the pytest framework. To run them:

  1. Install dev dependencies
    • macOS with Homebrew
      pipenv install --dev --python /usr/local/opt/[email protected]/libexec/bin/python
    • General
      pipenv install --dev
  2. Run unit tests
    pipenv run pytest -v

Tooling

Troubleshooting

  • UnicodeEncodeError:

    This error is raised when the console does not support UTF-8.

  • Failing Lint build:

    Ensure style/syntax is correct:

    pipenv run black .
    pipenv run isort .
    pipenv run flake8 .
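
For the UnicodeEncodeError item above, a quick way to see whether the console is the cause is to look at the encoding Python has chosen for standard output. This is a diagnostic sketch only, and the function name is hypothetical.

```python
import sys


def console_supports_utf8(stream=None):
    """Report whether the given text stream (default: stdout) uses UTF-8."""
    stream = sys.stdout if stream is None else stream
    # Some replaced or closed streams have no usable encoding attribute.
    encoding = getattr(stream, "encoding", None) or ""
    # Normalize spellings such as "UTF-8", "utf_8", and "utf8".
    return encoding.lower().replace("-", "").replace("_", "") == "utf8"
```

If this returns False, setting the PYTHONIOENCODING environment variable to utf-8 before running the script is one standard Python way to override the stream encoding.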
    

Code of conduct

CODE_OF_CONDUCT.md:

The Creative Commons team is committed to fostering a welcoming community. This project and all other Creative Commons open source projects are governed by our Code of Conduct. Please report unacceptable behavior to [email protected] per our reporting guidelines.

Contributing

We welcome contributions for bug fixes, enhancements, and documentation. Please see CONTRIBUTING.md before contributing.

License