Oscar Merida edited this page Nov 4, 2024 · 4 revisions

Usage

Tome is a contrib module that understands Drupal and knows how to export published pages.

Run scripts/tome-static.sh to kick off an export that saves pages to html/. To crawl pages efficiently, Tome spins up multiple export processes. Each process exports one or more paths.

  • The TOME_PROCESS_COUNT environment variable controls the number of parallel processes to run.
  • The --path-count=X command line switch sets how many pages to export per process.
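
As a rough illustration of how the two settings combine, here is a minimal sketch. It assumes scripts/tome-static.sh accepts --path-count directly, which this page does not confirm. The wrapper name run_tome_export, the TOME_SCRIPT override, and the default values 4 and 10 are hypothetical and only there for illustration.

```shell
# Hypothetical wrapper: run the export with a chosen process count and
# number of pages per process.
# Assumption: scripts/tome-static.sh accepts --path-count on its command line.
run_tome_export() {
  process_count="${1:-4}"        # assumed default, tune for your machine
  paths_per_process="${2:-10}"   # assumed default
  # TOME_SCRIPT is a hypothetical override so the script location can be changed.
  TOME_PROCESS_COUNT="$process_count" \
    "${TOME_SCRIPT:-scripts/tome-static.sh}" --path-count="$paths_per_process"
}
```

Calling `run_tome_export 2 5` would then run two parallel processes that each export five pages at a time.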

As each process crawls pages, it collects local URLs as "invoked paths". Most of these are CSS, image, and JavaScript files, but they also include paths that are linked within content or UI elements. Between each requested path, Tome resets core caches.

Events

Tome provides a number of Events to allow custom code to affect how it operates. The usagov_ssg_postprocessing module defines custom listeners for our use.

  1. TomeEventSubscriber:
    • MODIFY_HTML: Replace links to /es with /es/
    • PATH_PLACEHOLDER: Exclude paths that start with /es/node/ or /node/ and end in a node ID (NID) from the export.
    • COLLECT_PATHS: Exclude entire directories defined by the usagov_tome_static_path_exclude_directories setting.
  2. PagerPathSubscriber:
    • MODIFY_HTML: For links in the HTML with letter= in the query string (for example letter=O), rewrite the link to use a hash fragment instead. Used on stage agency index.
    • MODIFY_DESTINATION: If Tome is about to export a page with letter= in the query string, modify the destination URL in the same way as above.
  3. PublishedPagesSubscriber:
    • MODIFY_HTML: Extract and parse the datalayer as each path is exported, then add or update a row in files/published-pages.csv.
  4. RequestPrepareSubscriber:
    • REQUEST_PREPARE: Clear contrib caches that Tome does not handle, mainly the pathalias cache and menu-related caches.
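
To make the TomeEventSubscriber rules above more concrete, here is a shell illustration of the two patterns. This is not the module's PHP code. The function names are hypothetical, and the exact patterns are assumptions: "end in a NID" is read as an all-digit final segment directly after /node/, and the link rewrite is assumed to target an exact href="/es".

```shell
# Hypothetical helper: succeed if a path looks like a raw node path
# that the PATH_PLACEHOLDER rule would exclude.
# Assumption: the NID is an all-digit segment directly after /node/.
is_excluded_node_path() {
  printf '%s\n' "$1" | grep -Eq '^(/es)?/node/[0-9]+$'
}

# Hypothetical helper: rewrite links to /es so they point to /es/.
# Assumption: only an exact href="/es" is rewritten.
fix_es_links() {
  sed -E 's#href="/es"#href="/es/"#g'
}
```

For example, `is_excluded_node_path /es/node/45` succeeds, while `is_excluded_node_path /about` fails.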

Published Pages

PublishedPagesSubscriber looks for a <script id="taxonomy-data"> element in the HTML exported for a page. It then decodes the JSON payload and calculates additional fields for it, including Page ID, Friendly URL, and Full URL. If the nodeID matches a row in the CSV file, it updates that row. Otherwise, it adds a new row.
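
The update-or-add behavior can be sketched in shell as follows. This is only an illustration of the logic, not the subscriber's PHP implementation. It assumes the nodeID is the first CSV column and that the new row contains no backslashes (awk interprets escape sequences in -v values). The function name upsert_row is hypothetical.

```shell
# Hypothetical helper: replace the row whose first column equals the node ID,
# or append the row if no existing row matches.
# Assumption: nodeID is the first column of the CSV.
upsert_row() {
  csv="$1"
  node_id="$2"
  new_row="$3"
  tmp="$(mktemp)"
  awk -F, -v id="$node_id" -v row="$new_row" '
    $1 == id { print row; found = 1; next }   # matching row: replace it
    { print }                                 # every other row: keep as-is
    END { if (!found) print row }             # no match: append a new row
  ' "$csv" > "$tmp" && mv "$tmp" "$csv"
}
```

For example, `upsert_row files/published-pages.csv 42 "42,Some title"` would replace the row for node 42 if it exists and append it otherwise.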

Testing Changes

If you're updating the Tome export process and need to validate that your changes do not affect the output:

  1. Make two exports for comparison

    • Make a Tome export of your local site using the dev branch, then rename html/ to html-dev/. Also save a renamed copy of the published pages CSV written by the usagov_ssg_postprocessing module during the dev run, so you can use it in the CSV comparison step.
    • Switch to your branch and make a Tome export. Leave the output in html/ so it can be compared against html-dev/.
  2. Compare HTML (tidy-config.txt config attached)

    • Use html-tidy and find to "normalize" the HTML output in each directory, using the option to remove indentation. Ensure the HTML folder and its contents are writable, then run a command like the following in each HTML folder:

      find . -name '*.html' -type f -print -exec tidy -config ../tidy-config.txt --warn-proprietary-attributes false -mq '{}' \;

    • Use diff and grep to compare the contents of both directories and confirm the HTML is the same between branches. A command like the following, where diff ignores the cache-busting query string Drupal adds to CSS and JS files, gives you a file to review in a text editor:

      diff -brwd --strip-trailing-cr --exclude=_data -I " rel=\"stylesheet\">" -I "<script src=" -I js-view-dom-id --exclude=themes html-dev/ html/ > html.diff

    • The diff file above should not have any lines starting with angle brackets (< or >), which is how diff marks lines that differ.
  3. Compare the number of redirects in each export directory to confirm it's the same. Use this command inside each directory:

      grep -r 'http-equiv="refresh"' . | wc -l

  4. Compare published pages CSV output. Sort each file the same way first, because Tome may not output paths in the same order between runs. Sorting by page type, node id, full url, and title should be sufficient. Use diff to compare the output. A visual diff tool like WinMerge or Meld may be more useful in spotting differences.
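
The redirect and CSV checks above can be wrapped in small shell helpers, sketched here under a few assumptions. The function names are hypothetical. The CSV helper sorts whole lines rather than sorting by page type, node id, full url, and title, because the column positions are not given on this page. Any consistent ordering is enough to line the two files up for diff.

```shell
# Hypothetical helper: count lines containing a meta refresh tag
# anywhere under the directory given as $1.
count_redirects() {
  grep -r 'http-equiv="refresh"' "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: succeed if both export directories contain
# the same number of redirects.
same_redirect_count() {
  [ "$(count_redirects "$1")" -eq "$(count_redirects "$2")" ]
}

# Hypothetical helper: diff two CSV files after sorting each one.
# Assumption: sorting whole lines (not specific columns) is good enough.
sorted_csv_diff() {
  a="$(mktemp)"
  b="$(mktemp)"
  LC_ALL=C sort "$1" > "$a"
  LC_ALL=C sort "$2" > "$b"
  diff "$a" "$b"
  rc=$?                 # 0 when the sorted files are identical
  rm -f "$a" "$b"
  return "$rc"
}
```

For example, `same_redirect_count html-dev html && echo "redirect counts match"` checks the redirect step, and `sorted_csv_diff` prints any rows that differ between the dev CSV and the new one.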