A dead simple Go HTML whitelist-sanitization library.
It efficiently supports the following types of HTML sanitization through simple programmatic or JSON configuration:
- Removal of all non-whitelisted elements
- Unwrapping of all non-whitelisted elements
Given a whitelist configuration
{
"elements": {
"div": ["id", "class"],
"b": [],
"i": []
}
}
and basic input
<div class="my-class" style="position:relative;">
<i>Something emphasized</i>
<p>
here is a
<i>paragraph</i>
</p>
<b>Something bold</b>
</div>
Removal
Removal of non-whitelisted elements in the provided example would yield
<div class="my-class">
<i>Something emphasized</i>
<b>Something bold</b>
</div>
Note how the style attribute was removed from the div element and the p element was removed entirely.
Unwrapping
Unwrapping of non-whitelisted elements in the provided example would yield
<div class="my-class">
<i>Something emphasized</i>
here is a
<i>paragraph</i>
<b>Something bold</b>
</div>
Note how the style attribute was still removed from the div element, while the p element was 'unwrapped' (i.e. its children were attached to its parent).
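The difference between the two modes can be sketched on a toy tree. This is a minimal illustration rather than the library's implementation: the Node type and the remove, unwrap, render, el, text, and sample helpers are all hypothetical, and attribute filtering is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, simplified DOM node used only for illustration;
// it is not the type the library works with.
type Node struct {
	Tag      string // empty for text nodes
	Text     string // used only when Tag is empty
	Children []*Node
}

// remove drops every non-whitelisted element together with its subtree.
func remove(nodes []*Node, allowed map[string]bool) []*Node {
	var out []*Node
	for _, n := range nodes {
		switch {
		case n.Tag == "":
			out = append(out, n)
		case allowed[n.Tag]:
			out = append(out, &Node{Tag: n.Tag, Children: remove(n.Children, allowed)})
		}
	}
	return out
}

// unwrap drops every non-whitelisted element but splices its sanitized
// children into the parent's child list at the same position.
func unwrap(nodes []*Node, allowed map[string]bool) []*Node {
	var out []*Node
	for _, n := range nodes {
		kids := unwrap(n.Children, allowed)
		switch {
		case n.Tag == "":
			out = append(out, n)
		case allowed[n.Tag]:
			out = append(out, &Node{Tag: n.Tag, Children: kids})
		default:
			out = append(out, kids...)
		}
	}
	return out
}

// render serializes nodes back to a compact HTML-like string.
func render(nodes []*Node) string {
	var b strings.Builder
	for _, n := range nodes {
		if n.Tag == "" {
			b.WriteString(n.Text)
			continue
		}
		b.WriteString("<" + n.Tag + ">")
		b.WriteString(render(n.Children))
		b.WriteString("</" + n.Tag + ">")
	}
	return b.String()
}

func text(s string) *Node                { return &Node{Text: s} }
func el(tag string, kids ...*Node) *Node { return &Node{Tag: tag, Children: kids} }

// sample builds a reduced version of the example input (no attributes).
func sample() []*Node {
	return []*Node{el("div",
		el("i", text("Something emphasized")),
		el("p", text("here is a "), el("i", text("paragraph"))),
		el("b", text("Something bold")),
	)}
}

func main() {
	allowed := map[string]bool{"div": true, "b": true, "i": true}
	fmt.Println(render(remove(sample(), allowed)))
	fmt.Println(render(unwrap(sample(), allowed)))
}
```

In the sketch, removal discards the p subtree as a whole, while unwrapping keeps the text and the nested i element in the position where p used to be.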
Create a JSON configuration. The currently supported options are:
key | value type | default | description |
---|---|---|---|
stripComments | boolean | false | Whether or not to strip comment nodes |
stripWhitespace | boolean | false | Whether or not to strip whitespace (leading and trailing tabs or spaces) |
elements | Object | {} | A map whose keys are whitelisted element tags and whose values are arrays of whitelisted attributes for that element |
{
"stripComments": true,
"stripWhitespace": true,
"elements": {
"html": ["xmlns"],
"head": [],
"body": [],
"div": ["id", "class"]
}
}
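To make the shape of the configuration concrete, here is a minimal sketch that decodes it with the standard encoding/json package. The Config struct, parseConfig, and allows are hypothetical names chosen for illustration and are not the library's own types. The sketch assumes that omitted keys fall back to the documented defaults, and it shows that strict JSON parsers such as encoding/json reject a trailing comma.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical struct mirroring the documented keys;
// the library's own type may differ.
type Config struct {
	StripComments   bool                `json:"stripComments"`
	StripWhitespace bool                `json:"stripWhitespace"`
	Elements        map[string][]string `json:"elements"`
}

// parseConfig decodes a JSON whitelist configuration. Absent boolean keys
// keep Go's zero value (false), and a missing "elements" key is normalized
// to an empty map, matching the defaults in the table.
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return Config{}, err
	}
	if c.Elements == nil {
		c.Elements = map[string][]string{}
	}
	return c, nil
}

// allows reports whether attr is whitelisted for tag.
func (c Config) allows(tag, attr string) bool {
	for _, a := range c.Elements[tag] {
		if a == attr {
			return true
		}
	}
	return false
}

func main() {
	raw := []byte(`{
		"stripComments": true,
		"elements": {"div": ["id", "class"], "b": []}
	}`)
	c, err := parseConfig(raw)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	// prints: true false true false
	fmt.Println(c.StripComments, c.StripWhitespace,
		c.allows("div", "class"), c.allows("div", "style"))

	// A trailing comma is not valid JSON, so decoding fails.
	_, err = parseConfig([]byte(`{"elements": {"div": ["id"],}}`))
	fmt.Println(err != nil) // prints: true
}
```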
Create a sanitize.Whitelist object from a JSON file with sanitize.WhitelistFromFile(filepath string) or from a []byte with sanitize.NewWhitelist(byteArray []byte), and use it to parse some HTML:
whitelist, err := sanitize.WhitelistFromFile("./path/to/file.json")
// or create from a json []byte
// whitelist, err := sanitize.NewWhitelist(byteArray)
if err != nil {
	log.Fatal(err) // the configuration could not be loaded
}
f, _ := os.Open("./path/to/example.html")
sanitized, _ := whitelist.SanitizeRemove(f) // takes any io.Reader
fmt.Printf("sanitized html: %s\n", sanitized)
whitelist, err := sanitize.WhitelistFromFile("./path/to/file.json")
if err != nil {
	log.Fatal(err)
}
f, _ := os.Open("./path/to/example.html")
// the calls below are alternatives; each one consumes the reader
// it is given, so pass a fresh io.Reader to each call

// sanitize a full HTML document by removing
// non-whitelisted elements and attributes
sanitized, _ := whitelist.SanitizeRemove(f)
// sanitize a full HTML document by reattaching
// the children of non-whitelisted elements to the
// parent of the removed element; also removes
// non-whitelisted attributes from any element
sanitized, _ = whitelist.SanitizeUnwrap(f)
// sanitize an HTML document fragment (i.e. no html,
// head, or body tags) by removing
// non-whitelisted elements and attributes
sanitized, _ = whitelist.SanitizeRemoveFragment(f)
// sanitize an HTML document fragment (i.e. no html,
// head, or body tags) by reattaching
// the children of non-whitelisted elements to the
// parent of the removed element; also removes
// non-whitelisted attributes from any element
sanitized, _ = whitelist.SanitizeUnwrapFragment(f)
- Support sanitization that unwraps non-whitelisted nodes, allowing the text and/or whitelisted subtree through
- Whitelist-level configuration options (eg. stripWhitespace)
- Efficient attribute checking by not allocating a new slice on every whitelisted attribute for an element
- Support sanitization of HTML fragments (instead of just full documents)
- Support non-string type attribute values
- Refactor configuration parsing to have []byte interface instead of expecting a filepath
- Create sane defaults
- Usable godoc documentation
Head over to the issues page or open a pull request. Please ensure your code is documented, all existing tests pass, and any new features have tests before submitting a pull request. If you want to check whether a pull request for a new feature would be accepted, feel free to open an issue first.
MIT