dom

Helper functions for "net/html" that make it easier to interact with *html.Node.

🚀 Getting Started - 📚 Documentation - 🧑‍💻 Examples

Installation

go get -u github.com/JohannesKaufmann/dom

Note

This "dom" library was developed for the needs of the html-to-markdown library. That being said, please submit any functions that you need.

Getting Started

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `
	<ul>
		<li><a href="github.com/JohannesKaufmann/dom">dom</a></li>
		<li><a href="github.com/JohannesKaufmann/html-to-markdown">html-to-markdown</a></li>
	</ul>
	`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	// - - - //

	firstLink := dom.FindFirstNode(doc, func(node *html.Node) bool {
		return dom.NodeName(node) == "a"
	})

	fmt.Println("href:", dom.GetAttributeOr(firstLink, "href", ""))
}

Node vs Element

The naming scheme in this library is:

"Node" means *html.Node{}
- This means any node in the tree of nodes.
"Element" means *html.Node{Type: html.ElementNode}
- This means only nodes with the type ElementNode. For example <p>, <span>, <a>, ... but not #text, comments, ...

For most functions, there are two versions. For example:

FirstChildNode() and FirstChildElement()
AllChildNodes() and AllChildElements()
...
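To make the Node/Element distinction concrete, here is a minimal sketch that runs with only the standard library. The `node` type and the helpers `newParent`, `allChildNodes` and `allChildElements` are hypothetical stand-ins written for this illustration; they mirror the idea behind the library functions but are not their implementation.

```go
package main

import "fmt"

// nodeType and node are hypothetical stand-ins for html.NodeType and *html.Node.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	FirstChild  *node
	NextSibling *node
}

// newParent creates an element and links the given children as its sibling chain.
func newParent(name string, children ...*node) *node {
	parent := &node{Type: elementNode, Data: name}
	var last *node
	for _, child := range children {
		if last == nil {
			parent.FirstChild = child
		} else {
			last.NextSibling = child
		}
		last = child
	}
	return parent
}

// allChildNodes returns every direct child, whatever its type.
func allChildNodes(parent *node) []*node {
	var children []*node
	for c := parent.FirstChild; c != nil; c = c.NextSibling {
		children = append(children, c)
	}
	return children
}

// allChildElements returns only the direct children that are elements.
func allChildElements(parent *node) []*node {
	var elements []*node
	for _, c := range allChildNodes(parent) {
		if c.Type == elementNode {
			elements = append(elements, c)
		}
	}
	return elements
}

func main() {
	// Roughly: <p>Hello <b>world</b>!</p>
	p := newParent("p",
		&node{Type: textNode, Data: "Hello "},
		newParent("b", &node{Type: textNode, Data: "world"}),
		&node{Type: textNode, Data: "!"},
	)
	fmt.Println("nodes:", len(allChildNodes(p)))
	fmt.Println("elements:", len(allChildElements(p)))
}
```

The "Node" variant sees the two #text children and the <b>, while the "Element" variant only sees the <b>.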

Documentation

Attributes & Content

You can get the attributes of a node using GetAttribute, GetAttributeOr or the more specialized GetClasses that returns a slice of strings.

For matching nodes, HasID and HasClass can be used.

If you want to collect the #text of all the child nodes, you can call CollectText.

name := dom.NodeName(node)
// "h2"

href := dom.GetAttributeOr(node, "href", "")
// "github.com"

isHeading := dom.HasClass(node, "repo__name")
// `true`

content := dom.CollectText(node)
// "Lorem ipsum"
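In HTML, the class attribute holds a whitespace-separated list of names, so matching a class means comparing whole names rather than substrings. The sketch below illustrates that idea with the standard library only. The functions `classes` and `hasClass` are hypothetical and take the raw attribute value; it is an assumption that GetClasses and HasClass behave comparably.

```go
package main

import (
	"fmt"
	"strings"
)

// classes splits a raw class attribute value on any run of whitespace.
func classes(attr string) []string {
	return strings.Fields(attr)
}

// hasClass reports whether want is one of the whole class names in attr.
func hasClass(attr, want string) bool {
	for _, c := range classes(attr) {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	attr := "repo__name  repo__name--large"

	fmt.Println(classes(attr))
	fmt.Println(hasClass(attr, "repo__name"))
	fmt.Println(hasClass(attr, "repo")) // only a prefix, not a whole class name
}
```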

Children & Siblings

You can already use node.FirstChild to get the first child node. For convenience, there are also FirstChildNode() and FirstChildElement(), which return a *html.Node.

To get all direct children, use AllChildNodes and AllChildElements, which return []*html.Node.

For siblings, the same pattern applies:

PrevSiblingNode and PrevSiblingElement
NextSiblingNode and NextSiblingElement
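The difference between the two sibling variants matters most when whitespace sits between tags, because the parser keeps it as a #text node. Below is a small standard-library sketch. The `node` type, `chain` and `nextSiblingElement` are hypothetical names for this illustration, and it is an assumption that NextSiblingElement skips non-element siblings in the same way.

```go
package main

import "fmt"

// nodeType and node are hypothetical stand-ins for html.NodeType and *html.Node.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	NextSibling *node
}

// chain links the nodes as consecutive siblings and returns the first one.
func chain(nodes ...*node) *node {
	if len(nodes) == 0 {
		return nil
	}
	for i := 0; i+1 < len(nodes); i++ {
		nodes[i].NextSibling = nodes[i+1]
	}
	return nodes[0]
}

// nextSiblingElement walks forward until it finds a sibling that is an element.
func nextSiblingElement(n *node) *node {
	for s := n.NextSibling; s != nil; s = s.NextSibling {
		if s.Type == elementNode {
			return s
		}
	}
	return nil
}

func main() {
	// An <h3>, a whitespace-only #text node, and a <p>.
	h3 := chain(
		&node{Type: elementNode, Data: "h3"},
		&node{Type: textNode, Data: "\n"},
		&node{Type: elementNode, Data: "p"},
	)

	fmt.Println("direct sibling is text:", h3.NextSibling.Type == textNode)
	fmt.Println("next element:", nextSiblingElement(h3).Data)
}
```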

Find Nodes

Searching for nodes deep in the tree is made easier with:

firstParagraph := dom.FindFirstNode(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// *html.Node


allParagraphs := dom.FindAllNodes(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// []*html.Node
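As a rough mental model, a search like this can be pictured as a depth-first walk that tests each node with the callback. The sketch below uses only the standard library. The `node` type, `el`, `findFirst` and `findAll` are hypothetical names, and details such as whether the start node itself is tested are assumptions rather than documented behaviour of FindFirstNode and FindAllNodes.

```go
package main

import "fmt"

// node is a hypothetical stand-in for *html.Node with only the fields needed here.
type node struct {
	Data        string
	FirstChild  *node
	NextSibling *node
}

// el creates a node and links the given children as its sibling chain.
func el(name string, children ...*node) *node {
	n := &node{Data: name}
	var last *node
	for _, c := range children {
		if last == nil {
			n.FirstChild = c
		} else {
			last.NextSibling = c
		}
		last = c
	}
	return n
}

// findFirst returns the first match in depth-first (document) order, or nil.
func findFirst(start *node, match func(*node) bool) *node {
	if match(start) {
		return start
	}
	for c := start.FirstChild; c != nil; c = c.NextSibling {
		if found := findFirst(c, match); found != nil {
			return found
		}
	}
	return nil
}

// findAll collects every match in depth-first (document) order.
func findAll(start *node, match func(*node) bool) []*node {
	var matches []*node
	if match(start) {
		matches = append(matches, start)
	}
	for c := start.FirstChild; c != nil; c = c.NextSibling {
		matches = append(matches, findAll(c, match)...)
	}
	return matches
}

func main() {
	doc := el("body",
		el("div", el("p", el("#text"))),
		el("p", el("#text")),
	)
	isParagraph := func(n *node) bool { return n.Data == "p" }

	fmt.Println("paragraphs:", len(findAll(doc, isParagraph)))
	fmt.Println("first is nested:", findFirst(doc, isParagraph) == doc.FirstChild.FirstChild)
}
```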

🧑‍💻 Example code, find
🧑‍💻 Example code, selectors

Get next/previous neighbors

What is special about this? The order!

If you are somewhere in the DOM, you can call GetNextNeighborNode to get the next node, even if it is further up the tree. The nodes come back in document order, that is, the order in which you would see them in the DOM.

node := startNode
for node != nil {
    fmt.Println(dom.NodeName(node))

    node = dom.GetNextNeighborNode(node)
}

If we start the for loop at the <button> and repeatedly call GetNextNeighborNode, the nodes are visited in this order:

#document
├─html
│ ├─head
│ ├─body
│ │ ├─nav
│ │ │ ├─p
│ │ │ │ ├─#text "up"
│ │ ├─main
│ │ │ ├─button   *️⃣
│ │ │ │ ├─span  0️⃣
│ │ │ │ │ ├─#text "start"  1️⃣
│ │ │ ├─div  2️⃣
│ │ │ │ ├─h3  3️⃣
│ │ │ │ │ ├─#text "heading"  4️⃣
│ │ │ │ ├─p  5️⃣
│ │ │ │ │ ├─#text "description"  6️⃣
│ │ ├─footer  7️⃣
│ │ │ ├─p  8️⃣
│ │ │ │ ├─#text "down"  9️⃣

If you only want to visit the element nodes (and skip the #text nodes), you can use GetNextNeighborElement instead.

If you want to skip the children, you can use GetNextNeighborNodeExcludingOwnChild. In the example above, when starting at the <button>, the next node would be the <div>.

The same functions also exist for the previous nodes, e.g. GetPrevNeighborNode.
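The traversal order described above can be sketched in a few lines: go to the first child if there is one, otherwise to the next sibling, otherwise climb up until some ancestor has a next sibling. The code below uses only the standard library. The `node` type, `el`, `buildExample`, `nextNeighbor` and `nextNeighborExcludingOwnChild` are hypothetical names for this illustration, not the library's implementation.

```go
package main

import "fmt"

// node is a hypothetical stand-in for *html.Node with only the fields needed here.
type node struct {
	Data        string
	Parent      *node
	FirstChild  *node
	NextSibling *node
}

// el creates a node and links the given children under it.
func el(name string, children ...*node) *node {
	n := &node{Data: name}
	var last *node
	for _, c := range children {
		c.Parent = n
		if last == nil {
			n.FirstChild = c
		} else {
			last.NextSibling = c
		}
		last = c
	}
	return n
}

// nextNeighborExcludingOwnChild skips n's children: it returns the next
// sibling of n, or of the closest ancestor that has one.
func nextNeighborExcludingOwnChild(n *node) *node {
	for cur := n; cur != nil; cur = cur.Parent {
		if cur.NextSibling != nil {
			return cur.NextSibling
		}
	}
	return nil
}

// nextNeighbor returns the node that follows n in document order.
func nextNeighbor(n *node) *node {
	if n.FirstChild != nil {
		return n.FirstChild
	}
	return nextNeighborExcludingOwnChild(n)
}

// buildExample recreates the <body> of the tree above and returns the <button>.
func buildExample() *node {
	button := el("button", el("span", el("#text")))
	el("body",
		el("nav", el("p", el("#text"))),
		el("main", button, el("div", el("h3", el("#text")), el("p", el("#text")))),
		el("footer", el("p", el("#text"))),
	)
	return button
}

func main() {
	button := buildExample()

	for n := nextNeighbor(button); n != nil; n = nextNeighbor(n) {
		fmt.Println(n.Data)
	}
	fmt.Println("skipping children:", nextNeighborExcludingOwnChild(button).Data)
}
```

Climbing through the parents is what lets the walk continue from the last #text inside <main> to the <footer>.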

🧑‍💻 Example code, next basics
🧑‍💻 Example code, next inside a loop

Remove & Replace Node

if dom.HasClass(node, "lang__old") {
	newNode := &html.Node{
		Type: html.TextNode,
		Data: "🪦",
	}
	dom.ReplaceNode(node, newNode)
}


for _, node := range emptyTextNodes {
	dom.RemoveNode(node)
}

🧑‍💻 Example code, remove and replace

Unwrap Node

#document
├─html
│ ├─head
│ ├─body
│ │ ├─article   *️⃣
│ │ │ ├─h3
│ │ │ │ ├─#text "Heading"
│ │ │ ├─p
│ │ │ │ ├─#text "short description"

If we take the input above and run UnwrapNode(articleNode) we can "unwrap" the <article>. That means removing the <article> while keeping the children (<h3> and <p>).

#document
├─html
│ ├─head
│ ├─body
│ │ ├─h3
│ │ │ ├─#text "Heading"
│ │ ├─p
│ │ │ ├─#text "short description"
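The unwrap step can be pictured as splicing the children into the parent's child list at the position of the removed node. The following standard-library sketch shows that idea on a tiny tree. The `node` type, `el`, `relink`, `unwrap` and `childNames` are hypothetical names, and this is not how UnwrapNode is implemented; with a real *html.Node you would typically move children with its InsertBefore and RemoveChild methods.

```go
package main

import "fmt"

// node is a hypothetical stand-in for *html.Node with only the fields needed here.
type node struct {
	Data        string
	Parent      *node
	FirstChild  *node
	NextSibling *node
}

// relink makes children the complete, ordered child list of parent.
func relink(parent *node, children []*node) {
	parent.FirstChild = nil
	for i, c := range children {
		c.Parent = parent
		c.NextSibling = nil
		if i == 0 {
			parent.FirstChild = c
		} else {
			children[i-1].NextSibling = c
		}
	}
}

// el creates a node with the given children.
func el(name string, children ...*node) *node {
	n := &node{Data: name}
	relink(n, children)
	return n
}

// unwrap removes n from its parent and puts n's children in its place.
func unwrap(n *node) {
	parent := n.Parent
	if parent == nil {
		return // nothing to unwrap into
	}
	var spliced []*node
	for sibling := parent.FirstChild; sibling != nil; sibling = sibling.NextSibling {
		if sibling != n {
			spliced = append(spliced, sibling)
			continue
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			spliced = append(spliced, child)
		}
	}
	relink(parent, spliced)
	n.Parent, n.FirstChild, n.NextSibling = nil, nil, nil
}

// childNames lists the names of the direct children of parent.
func childNames(parent *node) []string {
	var names []string
	for c := parent.FirstChild; c != nil; c = c.NextSibling {
		names = append(names, c.Data)
	}
	return names
}

func main() {
	article := el("article", el("h3", el("#text")), el("p", el("#text")))
	body := el("body", article)

	fmt.Println("before:", childNames(body))
	unwrap(article)
	fmt.Println("after: ", childNames(body))
}
```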

RenderRepresentation

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `<a href="/about">Read More</a>`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(dom.RenderRepresentation(doc))
}

The tree representation helps to visualize the tree structure of the DOM and makes the #text nodes stand out.

Tip

This function is useful for debugging and test cases, for example in neighbors_test.go.

#document
├─html
│ ├─head
│ ├─body
│ │ ├─a (href=/about)
│ │ │ ├─#text "Read More"

By comparison, the regular "net/html" Render() function would have produced this:

<html><head></head><body><a href="/about">Read More</a></body></html>

🧑‍💻 Example code, dom representation
