clss

Today we continue to review web crawling tools from @shinmera.

This review will be short because I’ll reuse code from the previous post and improve it by replacing four lines with only one.

A system CLSS allows you to use CSS3 selectors when working with HTML nodes produced by Plump.

Here is how we can improve our simple Twitter crawler:

POFTHEDAY> (defvar *raw-html*
             (dex:get "https://twitter.com/shinmera"))

POFTHEDAY> (defvar *html* (plump:parse *raw-html*))

           ;; Now I'll replaced these lines
           ;; with one clss:select call:
           ;;
           ;; (remove-if-not (lambda (div)
           ;;                  (str:containsp "tweet-text"
           ;;                                 (plump:attribute div "class")))
           ;;                (plump:get-elements-by-tag-name *html* "p"))
POFTHEDAY> (defparameter *posts*
             (clss:select ".tweet-text" *html*))

POFTHEDAY> (type-of *posts*)
(VECTOR T 40)

POFTHEDAY> (loop repeat 5
                 for post across *posts*
                 for full-text = (plump:render-text post)
                 for short-text = (str:shorten 40 full-text)
                 do (format t "- ~A~2%" short-text))
- Hi, I'm a #gamedev. My latest project...

- いらっしゃいませ～！pic.twitter.com/wwaWDD6B3Q

- The logic of Splatoon.pic.twitter.com...

- The AI is extremely rough still, but ...

- pic.twitter.com/Cpvqytce5G

As a bonus, I want to show you that CLSS supports even pseudoclasses:

POFTHEDAY> (plump:parse "
<ul>
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ul>
")
#<PLUMP-DOM:ROOT {10031A93A3}>

POFTHEDAY> (clss:select "li:first-child"
                        *)
#(#<PLUMP-DOM:ELEMENT li {100322C883}>)

POFTHEDAY> (plump:serialize * nil)
"<li>First item</li>"

Sadly, the documentation says only that it supports almost all CSS3 selectors, but don’t enumerate them. However, we can learn this from sources:

POFTHEDAY> (rutils:hash-table-keys
            clss::*pseudo-selectors*)
("ROOT" "NTH-CHILD" "NTH-LAST-CHILD" "NTH-OF-TYPE" "NTH-LAST-OF-TYPE"
 "FIRST-CHILD" "LAST-CHILD" "FIRST-OF-TYPE" "LAST-OF-TYPE" "ONLY-CHILD"
 "ONLY-OF-TYPE" "EMPTY" "LINK" "VISITED" "ACTIVE" "HOVER" "FOCUS" "TARGET"
 "LANG" "ENABLED" "DISABLED" "CHECKED" "FIRST-LINE" "FIRST-LETTER" "BEFORE"
 "AFTER" "WARNING" "NOT" "FIRST-ONLY")

There is a clss:define-pseudo-selector macro which allows defining a custom pseudo-selector.

Yesterday we’ll learn about a more sophisticated tool for web scraping - lQuery.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0073-clss.org

0073-clss.org

clss

Files

0073-clss.org

Latest commit

History

0073-clss.org

File metadata and controls

clss