dom_query by Example
Note
This guide is currently in progress.
The Document struct in dom_query is designed to handle full HTML documents. You can create a Document by passing in HTML content, which can be provided in several formats: &str, String, or StrTendril.
use dom_query::Document;
use tendril::StrTendril;
// HTML content as a string slice
let contents_str = r#"<!DOCTYPE html>
<html><head><title>Test Page</title></head><body></body></html>"#;
let doc = Document::from(contents_str);
// HTML content as a String
let contents_string = contents_str.to_string();
let doc = Document::from(contents_string);
// HTML content as a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let doc = Document::from(contents_tendril);
// Checking the root element of the `Document`
assert!(doc.root().is_document());
When parsing a full HTML document, Document recognizes a <!DOCTYPE> if it appears at the start of the input. In this case, the Doctype is added as the first child of the root Document node. If the input has no <!DOCTYPE>, no Doctype node is added.
assert!(doc.root().first_child().unwrap().is_doctype());
For cases where you need to parse only a part of an HTML document, such as a snippet or component, dom_query provides Document::fragment(). This function also accepts &str, String, or StrTendril, but behaves a little differently from Document::from() in that it treats the input as a fragment instead of a full document.
use dom_query::Document;
use tendril::StrTendril;
// Parsing an HTML fragment from a string slice
let contents_str = r#"<div><p>Example Fragment</p></div>"#;
let fragment = Document::fragment(contents_str);
// Parsing from a String
let contents_string = contents_str.to_string();
let fragment = Document::fragment(contents_string);
// Parsing from a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let fragment = Document::fragment(contents_tendril);
// Checking the root element of the fragment
assert!(!fragment.root().is_document());
assert!(fragment.root().is_fragment());
Note that Document::fragment() ignores Doctype declarations and parses only the fragment itself.
// Confirming Doctype is excluded in the fragment
assert!(!fragment.root().first_child().unwrap().is_doctype());
Document::fragment() is also used internally within the library to create new elements within the document tree.
The dom_query crate provides several selection methods to locate HTML elements in the document. Using CSS-like selectors, you can select both single and multiple elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test Page</title>
</head>
<body>
<h1>Test Page</h1>
<ul>
<li>One</li>
<li><a href="/2">Two</a></li>
<li><a href="/3">Three</a></li>
</ul>
</body>
</html>"#;
let document = Document::from(html);
// Select a single element
let a = document.select("ul li:nth-child(2)");
let text = a.text().to_string();
assert!(text == "Two");
// Selecting multiple elements
document.select("ul > li:has(a)").iter().for_each(|sel| {
assert!(sel.is("li"));
});
// Optionally select an element with `try_select`, which returns an `Option`
let no_sel = document.try_select("p");
assert!(no_sel.is_none());
The Selection::is method checks whether any element in the current selection matches a given selector, without performing a deep search within the elements.
dom_query supports the pseudo-classes provided by the selectors crate, plus a few of its own.
See also: List of supported CSS pseudo-classes
To retrieve only the first match of a selector, use the Selection::select_single method. It is useful when you want a single match without iterating through all matches.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<ul class="list">
<li>1</li><li>2</li><li>3</li>
</ul>
<ul class="list">
<li>4</li><li>5</li><li>6</li>
</ul>
</body>
</html>"#.into();
// selecting the first match
let single_selection = doc.select_single(".list");
assert_eq!(single_selection.length(), 1);
assert_eq!(single_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// selecting all matches
let selection = doc.select(".list");
assert_eq!(selection.length(), 2);
// but property methods called on a multi-element selection usually return the result for the first match
assert_eq!(selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// This creates a Selection from the first node in the selection
let first_selection = doc.select(".list").first();
assert_eq!(first_selection.length(), 1);
assert_eq!(first_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// This approach also creates a new Selection from the next node, each iteration
let next_selection = doc.select(".list").iter().next().unwrap();
assert_eq!(next_selection.length(), 1);
assert_eq!(next_selection.inner_html().to_string().trim(), "<li>1</li><li>2</li><li>3</li>");
// currently, to get data from all matches you need to iterate over them:
let all_matched: String = selection
.iter()
.map(|s| s.inner_html().trim().to_string())
.collect();
assert_eq!(
all_matched,
"<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
// same as above, but a little cheaper, because we iterate over the nodes
// and do not create a new Selection on each iteration
let all_matched: String = doc
.select(".list").nodes()
.iter()
.map(|s| s.inner_html().trim().to_string())
.collect();
assert_eq!(
all_matched,
"<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
Elements can be selected in relation to a parent element. Here, a Document is queried for ul elements, and then descendant selectors are applied within that context.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test Page</title>
</head>
<body>
<h1>Test Page</h1>
<ul class="list-a">
<li>One</li>
<li><a href="/2">Two</a></li>
<li><a href="/3">Three</a></li>
</ul>
<ul class="list-b">
<li><a href="/4">Four</a></li>
</ul>
</body>
</html>"#;
let document = Document::from(html);
// selecting parent elements
let ul = document.select("ul");
ul.select("li").iter().for_each(|el| {
// a nested select matches only descendants of the selected `ul` elements
assert!(el.is("li"));
});
// a descendant selector may also reference elements above the parent selection,
// which helps to specify the exact element you want to select
let el = ul.select("body ul.list-b li").first();
let text = el.text();
assert_eq!("Four", text.to_string());
For repeated queries, dom_query allows using precompiled matchers. This approach enhances performance when matching the same pattern across multiple documents.
use dom_query::{Document, Matcher};
let html1 = r#"<!DOCTYPE html><html><head><title>Test Page 1</title></head><body></body></html>"#;
let html2 = r#"<!DOCTYPE html><html><head><title>Test Page 2</title></head><body></body></html>"#;
let doc1 = Document::from(html1);
let doc2 = Document::from(html2);
// create a matcher once, reuse on different documents
let title_matcher = Matcher::new("title").unwrap();
let title_el1 = doc1.select_matcher(&title_matcher);
assert_eq!(title_el1.text(), "Test Page 1".into());
let title_el2 = doc2.select_matcher(&title_matcher);
assert_eq!(title_el2.text(), "Test Page 2".into());
let title_single = doc1.select_single_matcher(&title_matcher);
assert_eq!(title_single.text(), "Test Page 1".into());
You can use Node::ancestors() to retrieve the sequence of ancestor nodes for a given element in the document tree, which can be helpful when you need to navigate upward from a specific node.
use dom_query::{Document, Selection};
let doc: Document = r#"<!DOCTYPE html>
<html>
<head>Test</head>
<body>
<div id="great-ancestor">
<div id="grand-parent">
<div id="parent">
<div id="child">Child</div>
</div>
</div>
</div>
</body>
</html>
"#.into();
// Select an element
let child_sel = doc.select("#child");
assert!(child_sel.exists());
// Access the selected node
let child_node = child_sel.nodes().first().unwrap();
// Get all ancestor nodes for the `#child` node
let ancestors = child_node.ancestors(None);
let ancestor_sel = Selection::from(ancestors);
// In this case, all ancestor nodes up to the root <html> are included
assert!(ancestor_sel.is("html")); // Root <html> is included
assert!(ancestor_sel.is("#parent")); // Direct parent is also included
// `Selection::is` performs a shallow match, so it will not match `#child` in this selection.
assert!(!ancestor_sel.is("#child"));
// You can limit the number of ancestor nodes returned by specifying `max_limit`
let limited_ancestors = child_node.ancestors(Some(2));
let limited_ancestor_sel = Selection::from(limited_ancestors);
// With a limit of 2, only `#grand-parent` and `#parent` ancestors are included
assert!(limited_ancestor_sel.is("#grand-parent"));
assert!(limited_ancestor_sel.is("#parent"));
assert!(!limited_ancestor_sel.is("#great-ancestor")); // This node is excluded due to the limit
The dom_query crate provides versatile selector pseudo-classes, built on both its own functionality and the capabilities of the selectors crate. These pseudo-classes allow targeting elements based on attributes, text content, and context within the document.
use dom_query::Document;
let html = include_str!("../test-pages/rustwiki_2024.html");
let doc = Document::from(html);
// Search for list items (`li`) within a `tr` element that contains an `a` element
// with the title "Programming paradigm"
let paradigm_selection = doc.select(
r#"table tr:has(a[title="Programming paradigm"]) td.infobox-data ul > li"#
);
println!("Rust programming paradigms:");
for item in paradigm_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// Select items based on `th` containing text "Influenced by" and
// the following `tr` containing `td` with list items.
let influenced_by_selection = doc.select(
r#"table tr:has-text("Influenced by") + tr td ul > li > a"#
);
println!("Rust influenced by:");
for item in influenced_by_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// Extract all links within a paragraph containing "foreign function interface" text.
// Since a part of the text is in a separate tag, we use the `:contains` pseudo-class.
let links_selection = doc.select(
r#"p:contains("Rust has a foreign function interface") a[href^="/"]"#
);
println!("Links in the FFI block:");
for item in links_selection.iter() {
println!(" {}", item.attr("href").unwrap());
}
println!("{:-<50}", "");
// :only-text selects an element that contains only a single text node,
// with no child elements.
// It can be combined with other pseudo-classes to achieve more specific selections.
// For example, to select a <div> inside an <a>
// that has no siblings and no child elements other than text.
println!("Single <div> inside an <a> with text only:");
for el in doc.select("a div:only-text:only-child").iter() {
println!("{}", el.text().trim());
}
- :has(selector): Finds elements that contain a matching element anywhere within.
- :has-text("text"): Matches elements based on their immediate text content, ignoring any nested elements. This makes it ideal for selecting nodes where the direct text is crucial for differentiation.
- :contains("text"): Selects elements containing the specified text within them, useful when searching in a block of text.
- :only-text: Selects elements that contain only a single text node, with no other child nodes.
These pseudo-classes allow for precise and expressive searches within the DOM, selecting elements by structure, attributes, or text content. For a full list of supported pseudo-classes, refer to the Supported CSS Pseudo-Classes List.
Serialization enables extracting HTML content of elements, either with or without outer tags. This can be useful for accessing structured content within elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div class="content"><h1>Test Page</h1></div></body>
</html>"#;
let doc = Document::from(html);
let heading_selector = doc.select("div.content");
// Serialization including the outer HTML tag
let content = heading_selector.html();
assert_eq!(content.to_string(), r#"<div class="content"><h1>Test Page</h1></div>"#);
// Serialization excluding the outer HTML tag
let inner_content = heading_selector.inner_html();
assert_eq!(inner_content.to_string(), "<h1>Test Page</h1>");
The html() and inner_html() methods return serialized content as StrTendril. If no elements match the selector, html() and inner_html() will return an empty value, whereas try_html() and try_inner_html() return an Option<StrTendril>, allowing for handling of None.
// Using `try_html()`, which returns an Option<StrTendril>.
// If there are no matching elements, it returns None.
let opt_no_content = doc.select("div.no-content").try_html();
assert_eq!(opt_no_content, None);
// The `html()` method will return an empty `StrTendril` if there are no matches
let no_content = doc.select("div.no-content").html();
assert_eq!(no_content, "".into());
// Similarly, `inner_html()` and `try_inner_html()` work the same way
assert_eq!(doc.select("div.no-content").try_inner_html(), None);
assert_eq!(doc.select("div.no-content").inner_html(), "".into());
The text() method retrieves all descendant text content within the selected element, concatenating any nested text nodes into a single string.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;
let doc = Document::from(html);
let body_selection = doc.select("body div").first();
let text = body_selection.text();
assert_eq!(text.to_string(), "Test Page");
The immediate_text() method retrieves the immediate text content of the selected element, excluding any text content from its descendants.
This is useful when you need to access the text content of an element without including the text content of its child elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;
let doc = Document::from(html);
let body_selection = doc.select("body div h1").first();
// accessing immediate text without descendants
let text = body_selection.immediate_text();
assert_eq!(text.to_string(), "Test ");
The dom_query crate provides several methods for accessing and manipulating the attributes of an HTML element.
Note
All methods listed below apply to both Selection and Node.
You can use the attr() method to retrieve the value of a specific attribute. If the attribute does not exist, it will return None.
You can use the attr_or() method to retrieve the value of a specific attribute, and return a default value if the attribute does not exist.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><input hidden="" id="k" class="important" type="hidden" name="k" data-k="100"></body>
</html>"#;
let doc = Document::from(html);
let mut input_selection = doc.select("input[name=k]");
let val = input_selection.attr("data-k").unwrap();
assert_eq!(val.to_string(), "100");
// try to get an attribute that does not exist
let val_or = input_selection.attr_or("data-l", "0");
assert_eq!(val_or.to_string(), "0");
You can use the remove_attr() method to remove a specific attribute from the element.
If it is called on a Selection, it removes the attribute from all elements in the selection.
input_selection.remove_attr("data-k");
You can use the remove_attrs() method to remove multiple attributes from the element.
If it is called on a Selection, it removes all listed attributes from all elements in the selection.
input_selection.remove_attrs(&["id", "class"]);
You can use the set_attr() method to set the value of a specific attribute.
If it is called on a Selection, it sets the attribute on all elements in the selection.
input_selection.set_attr("data-k", "200");
You can use the has_attr() method to check if a specific attribute exists on the element.
If it is called on a Selection, it checks whether the attribute exists on the first element in the selection.
let is_hidden = input_selection.has_attr("hidden");
assert!(is_hidden);
You can use the remove_all_attrs() method to remove all attributes from the element.
If it is called on a Selection, it removes all attributes from all elements in the selection.
input_selection.remove_all_attrs();
assert_eq!(input_selection.html(), r#"<input>"#.into());
The dom_query crate provides various methods to manipulate the DOM. Below are some examples demonstrating how to append new HTML nodes, set new content, remove selections, and replace selections with new HTML.
use dom_query::Document;
let html_contents = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div class="content">
<p>9,8,7</p>
</div>
<div class="remove-it">
Remove me
</div>
<div class="replace-it">
<div>Replace me</div>
</div>
</body>
</html>"#;
let doc = Document::from(html_contents);
// Select the div with class "content"
let mut content_selection = doc.select("body .content");
// Append a new HTML node to the selection
content_selection.append_html(r#"<div class="inner">inner block</div>"#);
assert!(doc.select("body .content .inner").exists());
// Set a new content to the selection, replacing existing content
let mut set_selection = doc.select(".inner");
set_selection.set_html(r#"<p>1,2,3</p>"#);
assert_eq!(doc.select(".inner").html(), r#"<div class="inner"><p>1,2,3</p></div>"#.into());
// Remove the selection with class "remove-it"
doc.select(".remove-it").remove();
assert!(!doc.select(".remove-it").exists());
// Replace the selection with new HTML; the current selection itself will not change
let mut replace_selection = doc.select(".replace-it");
replace_selection.replace_with_html(r#"<div class="replaced">Replaced</div>"#);
assert_eq!(replace_selection.text().trim(), "Replace me");
// But the document will reflect the changes
assert_eq!(doc.select(".replaced").text(),"Replaced".into());
- Append HTML:
  - The append_html method is used to add a new HTML node to the existing selection.
- Set HTML:
  - The set_html method replaces the existing content of the selection with new HTML.
- Remove Selection:
  - The remove method deletes the elements matching the selector from the document.
- Replace with HTML:
  - The replace_with_html method replaces the selected elements with new HTML. Note that the selection itself remains unchanged, but the document reflects the new content.
The dom_query crate allows you to rename selected elements without changing their contents. Selection::rename renames every element in the selection, while Node::rename renames a single element.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div class="content">
<div>1</div>
<div>2</div>
<div>3</div>
<span>4</span>
</div>
</body>
</html>"#.into();
let mut sel = doc.select("div.content > div, div.content > span");
// Before renaming, there are 3 `div` and 1 `span`
assert_eq!(sel.length(), 4);
sel.rename("p");
// After renaming, there are no `div` and `span` elements
assert_eq!(doc.select("div.content > div, div.content > span").length(), 0);
// But there are four `p` elements
assert_eq!(doc.select("div.content > p").length(), 4);
The dom_query crate allows you to create and manipulate HTML elements with ease. Below are examples demonstrating how to create new elements, set attributes, append HTML, and replace content.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div id="main">
<p id="first">It's</p>
</div>
</body>
</html>"#.into();
// Selecting the node we want to attach a new element to
let main_sel = doc.select_single("#main");
let main_node = main_sel.nodes().first().unwrap();
// Creating a simple element
let el = doc.tree.new_element("p");
// Setting attributes
el.set_attr("id", "second");
// Setting text content
el.set_text("test");
main_node.append_child(&el);
assert!(doc.select(r#"#main #second:has-text("test")"#).exists());
// Appending a more complex element using `append_html`
main_node.append_html(r#"<p id="third">Wonderful</p>"#);
assert_eq!(doc.select("#main #third").text().as_ref(), "Wonderful");
assert!(doc.select("#first").exists());
// Replacing existing element content with new HTML using `set_html`
main_node.set_html(r#"<p id="the-only">Wonderful</p>"#);
assert_eq!(doc.select("#main #the-only").text().as_ref(), "Wonderful");
assert!(!doc.select("#first").exists());
// Completely replacing the contents of the node,
// including itself, using `replace_with_html`
main_node.replace_with_html(r#"<span>Tweedledum</span> and <span>Tweedledee</span>"#);
assert!(!doc.select("#main").exists());
assert_eq!(doc.select("span + span").text().as_ref(), "Tweedledee");
- Creating a Simple Element:
  - Use doc.tree.new_element() to create a new element.
  - Set attributes using node.set_attr().
  - Set text content using node.set_text().
  - Append the new element to the selected node using node.append_child().
- Appending HTML:
  - Use append_html to add a more complex HTML node to the existing selection.
  - This method is more convenient for adding multiple elements to the selected node.
- Setting New HTML Content:
  - Use set_html to replace the existing content of the selected node with new HTML.
  - It changes the inner HTML contents of the node.
- Replacing Node Contents Completely:
  - Use replace_with_html to replace the entire content of the node, including the node itself.
Additionally, methods like replace_with_html, set_html, and append_html can specify more than one element in the provided string.