Today we continue to review web crawling tools from @shinmera.
This review will be short because I’ll reuse code from the previous post and improve it by replacing four lines with only one.
The CLSS system lets you use CSS3 selectors when working with HTML nodes produced by Plump.
Here is how we can improve our simple Twitter crawler:
POFTHEDAY> (defvar *raw-html*
             (dex:get "https://twitter.com/shinmera"))
POFTHEDAY> (defvar *html* (plump:parse *raw-html*))
;; Now I'll replace these lines
;; with one clss:select call:
;;
;; (remove-if-not (lambda (div)
;;                  (str:containsp "tweet-text"
;;                                 (plump:attribute div "class")))
;;                (plump:get-elements-by-tag-name *html* "p"))
POFTHEDAY> (defparameter *posts*
             (clss:select ".tweet-text" *html*))
POFTHEDAY> (type-of *posts*)
(VECTOR T 40)
POFTHEDAY> (loop repeat 5
                 for post across *posts*
                 for full-text = (plump:render-text post)
                 for short-text = (str:shorten 40 full-text)
                 do (format t "- ~A~2%" short-text))
- Hi, I'm a #gamedev. My latest project...
- いらっしゃいませ~!pic.twitter.com/wwaWDD6B3Q
- The logic of Splatoon.pic.twitter.com...
- The AI is extremely rough still, but ...
- pic.twitter.com/Cpvqytce5G
As a bonus, I want to show you that CLSS even supports pseudo-classes:
POFTHEDAY> (plump:parse "
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ul>
")
#<PLUMP-DOM:ROOT {10031A93A3}>
POFTHEDAY> (clss:select "li:first-child"
                        *)
#(#<PLUMP-DOM:ELEMENT li {100322C883}>)
POFTHEDAY> (plump:serialize * nil)
"<li>First item</li>"
Sadly, the documentation says only that it supports almost all CSS3 selectors, but doesn't enumerate them. However, we can learn this from the source code:
POFTHEDAY> (rutils:hash-table-keys
            clss::*pseudo-selectors*)
("ROOT" "NTH-CHILD" "NTH-LAST-CHILD" "NTH-OF-TYPE" "NTH-LAST-OF-TYPE"
"FIRST-CHILD" "LAST-CHILD" "FIRST-OF-TYPE" "LAST-OF-TYPE" "ONLY-CHILD"
"ONLY-OF-TYPE" "EMPTY" "LINK" "VISITED" "ACTIVE" "HOVER" "FOCUS" "TARGET"
"LANG" "ENABLED" "DISABLED" "CHECKED" "FIRST-LINE" "FIRST-LETTER" "BEFORE"
"AFTER" "WARNING" "NOT" "FIRST-ONLY")
There is also a clss:define-pseudo-selector macro, which lets you define a custom pseudo-selector.
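Here is a minimal sketch of how such a definition might look. I'm assuming the macro takes a name, a lambda list whose first variable is bound to the node being tested, and a body that returns true when the node matches. The long-text selector and the 10 character threshold are made up for this illustration and are not part of CLSS:

```lisp
;; Assumes Quicklisp is available to load the two systems.
(ql:quickload '(:plump :clss) :silent t)

;; LONG-TEXT is a hypothetical selector name made up for this sketch.
;; The body returns true for nodes whose text is longer than 10 characters.
(clss:define-pseudo-selector long-text (node)
  (< 10 (length (plump:text node))))

(defparameter *doc*
  (plump:parse "<ul><li>Hi</li><li>A much longer item</li></ul>"))

;; The custom selector is used like any built-in one, with a colon prefix.
;; Only the second <li> should pass the length check.
(defparameter *long-items*
  (clss:select "li:long-text" *doc*))
```

If the assumption about the lambda list holds, *long-items* can be processed the same way as *posts* above, for example with plump:render-text.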
Tomorrow we’ll learn about a more sophisticated tool for web scraping - lQuery.