diff --git a/.coveragerc b/.coveragerc index 3204f75c0..25dd4f829 100644 --- a/.coveragerc +++ b/.coveragerc @@ -1,7 +1,7 @@ [run] omit= - pymdownx/spoilers.py + pymdownx/plainhtml.py [report] omit= - pymdownx/spoilers.py + pymdownx/plainhtml.py diff --git a/.dictionary b/.dictionary index 867b2afea..a4eb3612e 100644 --- a/.dictionary +++ b/.dictionary @@ -62,6 +62,7 @@ SVG SVGs Slugify SmartSymbols +StripHTML Stylesheets SuperFences Tasklist diff --git a/docs/src/markdown/changelog.md b/docs/src/markdown/changelog.md index a9471ce61..a7981be4a 100644 --- a/docs/src/markdown/changelog.md +++ b/docs/src/markdown/changelog.md @@ -2,52 +2,53 @@ ## 4.6.0 -- **NEW**: Arithmatex now *just* uses the script wrapper output as it is the most reliable output, and now previews can be achieved by providing a span with class `MathJax_Preview` that gets auto hidden when the math is rendered. `insert_as_script`, `tex_inline_wrap`, and `tex_block_wrap` have all been deprecated as they are now entirely unnecessary. A new option has been added called `preview` that controls whether the script output generates a preview or not when the rendered math output is loading. Users no longer need to configure `tex2jax.js` in there MathJax configuration anymore. -- **FIX**: PlainHTML has better script and style content avoidance to keep from stripping HTML tags and attributes from style and script content. -- **FIX**: PlainHTML can strip attributes that are not quoted. +- **NEW**: Arithmatex now *just* uses the script wrapper output as it is the most reliable output, and now previews can be achieved by providing a span with class `MathJax_Preview` that gets auto hidden when the math is rendered. `insert_as_script`, `tex_inline_wrap`, and `tex_block_wrap` have all been deprecated as they are now entirely unnecessary. A new option has been added called `preview` that controls whether the script output generates a preview or not when the rendered math output is loading. 
Users no longer need to configure `tex2jax.js` in there MathJax configuration anymore. (#171) +- **NEW**: PlainHTML has been renamed to StripHTML. `strip_attributes` is now a list instead of a string with a default of `[]`. `pymdownx.plainhtml` is still available with the old convention for backwards compatibility, but will be removed for version 5.0. (!176) +- **FIX**: PlainHTML has better script and style content avoidance to keep from stripping HTML tags and attributes from style and script content. (!174) +- **FIX**: PlainHTML can strip attributes that are not quoted. (!174) ## 4.5.1 Nov 28, 2017 -- **FIX**: If an invalid provider is given, default to `github`. If no `user` or `repo` is specified, do not convert links that depend on those default values (#169). +- **FIX**: If an invalid provider is given, default to `github`. If no `user` or `repo` is specified, do not convert links that depend on those default values. (#169) ## 4.5.0 Nov 26, 2017 -- **NEW**: Add GitLab style compare link shorthand and link shortening (#160). -- **NEW**: Deprecate GitHub extension. It is now recommended to just include the extensions you want to create a GitHub feel instead of relying on a an extension to package something close-ish (#159). +- **NEW**: Add GitLab style compare link shorthand and link shortening. (#160) +- **NEW**: Deprecate GitHub extension. It is now recommended to just include the extensions you want to create a GitHub feel instead of relying on a an extension to package something close-ish. (#159) ## 4.4.0 Nov 23, 2017 -- **NEW**: Add social media mentions -- Twitter only right now (#156). +- **NEW**: Add social media mentions -- Twitter only right now. (#156) - **FIX**: Use correct regular expression for GitLab and Bitbucket. ## 4.3.0 Nov 14, 2017 -- **NEW**: Shorthand format for referencing non-default provider commits, issues, pulls, and mentions (!147). -- **NEW**: Shorthand format for mentioning a repo via `@user/repo` (!149). 
-- **NEW**: Add repository provider specific classes (!149). -- **NEW**: Make repository labels configurable (!149). +- **NEW**: Shorthand format for referencing non-default provider commits, issues, pulls, and mentions. (!147) +- **NEW**: Shorthand format for mentioning a repo via `@user/repo`. (!149) +- **NEW**: Add repository provider specific classes. (!149) +- **NEW**: Make repository labels configurable. (!149) - **FIX**: Adjust pattern boundaries auto-links. ## 4.2.0 Nov 13, 2017 -- **NEW**: MagicLink can now auto-link a GitHub like shorthand for repository references (!139). -- **NEW**: MagicLink now renders pull request links with a slightly different output from issues (!139). -- **NEW**: Deprecate `base_repo_url` in MagicLink in favor of the new `provider`, `user`, and `repo` (!139). -- **NEW**: MagicLink now adds classes to repository links (!139). -- **NEW**: MagicLink now adds title to repository links (!139). -- **NEW**: MagicLink no longer styles repository commit hashes as code (!143). -- **FIX**: MagicLink repository link outputs now better reflect default user and repository context (!143). -- **FIX**: PlainHTML should not strip tags that are part of JavaScript code (!140). +- **NEW**: MagicLink can now auto-link a GitHub like shorthand for repository references. (!139) +- **NEW**: MagicLink now renders pull request links with a slightly different output from issues. (!139) +- **NEW**: Deprecate `base_repo_url` in MagicLink in favor of the new `provider`, `user`, and `repo`. (!139) +- **NEW**: MagicLink now adds classes to repository links. (!139) +- **NEW**: MagicLink now adds title to repository links. (!139) +- **NEW**: MagicLink no longer styles repository commit hashes as code. (!143) +- **FIX**: MagicLink repository link outputs now better reflect default user and repository context. (!143) +- **FIX**: PlainHTML should not strip tags that are part of JavaScript code. 
(!140) ## 4.1.0 @@ -59,22 +60,22 @@ Oct 11, 2017 Aug 29, 2017 -- **NEW**: Details extension will now derive a title from the class if only a class is provided (#107). +- **NEW**: Details extension will now derive a title from the class if only a class is provided. (#107) - **NEW**: Remove deprecated legacy emoji generator format. - **NEW**: Remove deprecated `use_codehilite_settings`. - **NEW**: Remove deprecated `spoilers` extension redirect. - **NEW**: Update emoji databases: EmojiOne (3.1.2) and Twemoji to (2.5.0). ## 3.5.0 Jun 13, 2017 -- **NEW**: Add new slugs to preserve case (!103). -- **NEW**: Add new GFM specific slug (both percent encoded and normal) that only lowercases ASCII chars just like GFM does (#101). -- **FIX**: PathConverter should not try and convert obscured email address (with HTML entities) (#100). -- **FIX**: Don't normalize Unicode in slugs with `NFKD`, use `NFC` instead (#98). -- **FIX**: Don't let EscapeAll escape CriticMarkup placeholders. EscapeAll will no longer escape `STX` and `ETX`; they will just pass through (#95). -- **FIX**: Replace CriticMarkup placeholders after replacing raw HTML placeholders (#95). +- **NEW**: Add new slugs to preserve case. (!103) +- **NEW**: Add new GFM specific slug (both percent encoded and normal) that only lowercases ASCII chars just like GFM does. (#101) +- **FIX**: PathConverter should not try and convert obscured email address (with HTML entities). (#100) +- **FIX**: Don't normalize Unicode in slugs with `NFKD`, use `NFC` instead. (#98) +- **FIX**: Don't let EscapeAll escape CriticMarkup placeholders. EscapeAll will no longer escape `STX` and `ETX`; they will just pass through. (#95) +- **FIX**: Replace CriticMarkup placeholders after replacing raw HTML placeholders. (#95) ## 3.4.0 @@ -88,8 +89,8 @@ Jun 1, 2017 May 26, 2017 -- **NEW**: Added support for pull request link shortening in MagicLink (!88). 
-- **NEW**: Added new Spoilers extension (#85). +- **NEW**: Added support for pull request link shortening in MagicLink. (!88) +- **NEW**: Added new Spoilers extension. (#85) ## 3.2.1 @@ -119,14 +120,14 @@ Apr 16, 2017 - **NEW**: Added Keys extension. - **NEW**: Generalized custom fences (#60). `flow` and `sequence` fence are now just custom fences and can be disabled simply by overwriting the `custom_fences` setting. -- **NEW**: Remove deprecated `no_nl2br` in GitHub extension (#24). -- **NEW**: Remove deprecated HeaderAnchor extension (#24). -- **NEW**: Remove deprecated PyMdown extension (#24). -- **NEW**: Remove deprecated GitHubEmoji extension (#24). -- **NEW**: Remove deprecated `nested` option in SuperFences (#24). +- **NEW**: Remove deprecated `no_nl2br` in GitHub extension. (#24) +- **NEW**: Remove deprecated HeaderAnchor extension. (#24) +- **NEW**: Remove deprecated PyMdown extension. (#24) +- **NEW**: Remove deprecated GitHubEmoji extension. (#24) +- **NEW**: Remove deprecated `nested` option in SuperFences. (#24) - **NEW**: Wrapper extensions (such as GitHub and Extra) can now allow setting the included sub extensions settings (#61). Workaround settings that directly set specific extensions settings has been removed. - **NEW**: Deprecated `use_codehilite_settings` in SuperFences and InlineHilite and now does nothing. The settings will be removed in the future. If `pymdownx.highlight` is used, it's settings will be used instead of CodeHilite. Eventually, the both SuperFences and InlineHilite will require `pymdownx.highlight` to be used and will have CodeHilite support stripped. -- **FIX**: Fix MathJax CDN references and usage in documentation. MathJax CDN is shutting down and must now use Cloudflare CDN (#63). +- **FIX**: Fix MathJax CDN references and usage in documentation. MathJax CDN is shutting down and must now use Cloudflare CDN. 
(#63) ## 2.0.0 @@ -140,12 +141,12 @@ Feb 12, 2017 Jan 27, 2017 -- **NEW**: MagicLink special repository link shortener for GitHub, GitLab, and Bitbucket (#49). -- **FIX**: GitHub asterisk emphasis should never have had smart enabled for it (#50). +- **NEW**: MagicLink special repository link shortener for GitHub, GitLab, and Bitbucket. (#49) +- **FIX**: GitHub asterisk emphasis should never have had smart enabled for it. (#50) - **FIX**: MagicLink fix for compatibility with wrapped symbols like `~`, `*` etc. which are commonly used. - **FIX**: MagicLink encodes emails like Python Markdown does for consistency. -- **FIX**: MagicLink doesn't allow Unicode for email and does allow Unicode in a URL (#53). -- **FIX**: InlineHilite now returns a proper `etree` element so that the `attr_list` extension and function properly with it (#48). +- **FIX**: MagicLink doesn't allow Unicode for email and does allow Unicode in a URL. (#53) +- **FIX**: InlineHilite now returns a proper `etree` element so that the `attr_list` extension and function properly with it. (#48) - **FIX**: InlineHilite will no longer break if Pygments is not installed (478b410a2199d55f3e70b452516511d3810c61a5). ## 1.7.0 diff --git a/docs/src/markdown/extensions/plainhtml.md b/docs/src/markdown/extensions/plainhtml.md deleted file mode 100644 index a53236ec5..000000000 --- a/docs/src/markdown/extensions/plainhtml.md +++ /dev/null @@ -1,35 +0,0 @@ -# PlainHTML - -## Overview - -PlainHTML is a simple extension that is run at the end of post-processing. It searches the final output stripping things like `style`, `id`, `class`, and `on` attributes from HTML tags. It also removes HTML comments. If you have no desire to see these, this can strip them out. Though it does its best to be loaded at the very end of the process, it helps to include this one last when loading up your extensions. If needed, plain HTML can also be configured to strip out just comments or just attributes etc. - -!!! 
example "Strip Comment" - - ``` - - - Here is a test. - ``` - - ```html -

Here is a test.

- ``` - -Because comments aren't stripped until the end in a post-processing step, they are present throughout the entire Markdown conversion process and could possibly affect parsing, so be careful how you generally insert comments. - -!!! caution "Warning" - This is not meant to be a sanitizer for HTML. This is just meant to try and strip out style, script, classes, etc. to provide a plain HTML output for the times this is desired; this is not meant as a security extension. If you want something to secure the output, you should consider running a sanitizer like [Bleach][bleach]. - -## Options - -By default, PlainHTML strips the following attributes: `style`, `id`, `class`, and `on`. PlainHTML also strips HTML comments. If desired, its behavior can be configured to strip less or even more, but it is limited to attributes and comments. - -Option | Type | Default |Description ------------------------- |------- | ----------------------- |----------- -`strip_comments` | bool | `#!py True` | Strip HTML comments during post process. -`strip_js_on_attributes` | bool | `#!py True` | Strip JavaScript script attributes with the pattern on* during post process. -`strip_attributes` | string | `#!py 'id class style'` | A string specifying attribute names separated by spaces. - ---8<-- "links.md" diff --git a/docs/src/markdown/extensions/striphtml.md b/docs/src/markdown/extensions/striphtml.md new file mode 100644 index 000000000..3abe07910 --- /dev/null +++ b/docs/src/markdown/extensions/striphtml.md @@ -0,0 +1,38 @@ +# StripHTML + +## Overview + +StripHTML (formerly known as PlainHTML) is a simple extension that is run at the end of post-processing. It searches the final output stripping out unwanted comments and/or tag attributes. Though it does its best to be loaded at the very end of the process, it helps to include this one last when loading up your extensions. + +!!! example "Strip Comment" + + ``` + + + Here is a test. + ``` + + ```html +

Here is a test.

+ ``` + +Because comments aren't stripped until the end in a post-processing step, they are present throughout the entire Markdown conversion process and could possibly affect parsing, so be careful how you generally insert comments. + +!!! caution "Warning" + This is not meant to be a sanitizer for HTML. This is just meant to try and strip out style, script, classes, etc. to provide a plain HTML output for the times this is desired; this is not meant as a security extension. If you want something to secure the output, you should consider running a sanitizer like [Bleach][bleach]. + +## Options + +By default, StripHTML strips HTML comments and JavaScript `on*` attributes; no other attributes are stripped unless they are listed in `strip_attributes`. If desired, its behavior can be configured to strip less or even more, but it is limited to attributes and comments. + +Option | Type | Default | Description +------------------------ |--------- | ------------ | ----------- +`strip_comments` | bool | `#!py3 True` | Strip HTML comments during post process. +`strip_js_on_attributes` | bool | `#!py3 True` | Strip JavaScript script attributes with the pattern on* during post process. +`strip_attributes` | [string] | `#!py3 []` | A list of tag attribute names to strip. + +!!! warning "Deprecation 4.6.0" + StripHTML used to be known as `pymdownx.plainhtml`, but has been renamed to `pymdownx.striphtml`. The old `plainhtml` is still available. `plainhtml` treats `strip_attributes` as a string of attributes separated by spaces and has a default of `#!py3 "id style class"`. It is encouraged to migrate to using `pymdownx.striphtml` as `pymdownx.plainhtml` will be removed in version 5.0. 
+ +--8<-- "links.md" diff --git a/docs/src/markdown/index.md b/docs/src/markdown/index.md index 79f646f01..cafe77df6 100644 --- a/docs/src/markdown/index.md +++ b/docs/src/markdown/index.md @@ -57,7 +57,7 @@ Check out documentation on each extension to learn more about how to configure a [Highlight](extensions/highlight.md) allows you to configure the syntax highlighting of SuperFences and InlineHilite. Also passes standard Markdown indented code blocks through the syntax highlighter. !!! summary "InlineHilite" - [InlineHilite](extensions/inlinehilite.md) highlights inline code: `#!py from module import function as func`. + [InlineHilite](extensions/inlinehilite.md) highlights inline code: `#!py3 from module import function as func`. !!! summary "Keys" [Keys](extensions/keys.md) makes inserting key inputs into documents as easy as pressing ++ctrl+alt+delete++. @@ -71,9 +71,6 @@ Check out documentation on each extension to learn more about how to configure a !!! summary "PathConverter" [PathConverter](extensions/pathconverter.md) converts paths to absolute or relative to a given base path. -!!! summary "PlainHTML" - [PlainHTML](extensions/plainhtml.md) can strip out HTML comments and specific tag attributes. - !!! summary "ProgressBar" [ProgressBar](extensions/progressbar.md) creates progress bars quick and easy. @@ -85,6 +82,9 @@ Check out documentation on each extension to learn more about how to configure a !!! summary "Snippets" [Snippets](extensions/snippets.md) include other Markdown or HTML snippets into the current Markdown file being parsed. +!!! summary "StripHTML" + [StripHTML](extensions/striphtml.md) can strip out HTML comments and specific tag attributes. + !!! summary "SuperFences" [SuperFences](extensions/superfences.md) is like Python Markdown's fences, but better. Nest fences under lists, admonitions, and other syntaxes. Also create special custom fences for content like UML. 
diff --git a/docs/src/mkdocs.yml b/docs/src/mkdocs.yml index 130347f9c..9e7c2395c 100644 --- a/docs/src/mkdocs.yml +++ b/docs/src/mkdocs.yml @@ -44,7 +44,7 @@ pages: - MagicLink: extensions/magiclink.md - Mark: extensions/mark.md - PathConverter: extensions/pathconverter.md - - PlainHTML: extensions/plainhtml.md + - StripHTML: extensions/striphtml.md - ProgressBar: extensions/progressbar.md - SmartSymbols: extensions/smartsymbols.md - Snippets: extensions/snippets.md @@ -102,8 +102,7 @@ markdown_extensions: - pymdownx.progressbar: - pymdownx.arithmatex: - pymdownx.mark: - - pymdownx.plainhtml: - strip_attributes: '' + - pymdownx.striphtml: - pymdownx.snippets: base_path: docs/src/markdown/_snippets - pymdownx.keys: diff --git a/mkdocs.yml b/mkdocs.yml index 3764b0b6e..8dd7e5d7f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -44,7 +44,7 @@ pages: - MagicLink: extensions/magiclink.md - Mark: extensions/mark.md - PathConverter: extensions/pathconverter.md - - PlainHTML: extensions/plainhtml.md + - StripHTML: extensions/striphtml.md - ProgressBar: extensions/progressbar.md - SmartSymbols: extensions/smartsymbols.md - Snippets: extensions/snippets.md @@ -102,8 +102,7 @@ markdown_extensions: - pymdownx.progressbar: - pymdownx.arithmatex: - pymdownx.mark: - - pymdownx.plainhtml: - strip_attributes: '' + - pymdownx.striphtml: - pymdownx.snippets: base_path: docs/src/markdown/_snippets - pymdownx.keys: diff --git a/pymdownx/plainhtml.py b/pymdownx/plainhtml.py index 2602253f1..f07257e71 100644 --- a/pymdownx/plainhtml.py +++ b/pymdownx/plainhtml.py @@ -25,82 +25,9 @@ """ from __future__ import unicode_literals from markdown import Extension -from markdown.postprocessors import Postprocessor -import re - - -RE_TAG_HTML = re.compile( - r'''(?x) - (?: - (?P(?:\r?\n?\s*)(?:\s*)(?=\r?\n)|)| - (?P - (?P<(?Pstyle|script)) - (?P(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+))?)*) - (?P>.*?) 
- )| - (?P<(?P[\w\:\.\-]+)) - (?P(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+))?)*) - (?P\s*(?P/)?>)| - (?P[\w\:\.\-]+)\s*>) - ) - ''', - re.DOTALL | re.UNICODE -) - -TAG_BAD_ATTR = r'''(?x) -(?P - (?: - \s+(?:%s) - (?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+)) - )* -) -''' - - -class PlainHtmlPostprocessor(Postprocessor): - """Post processor to strip out unwanted content.""" - - def repl(self, m): - """Replace comments and unwanted attributes.""" - - if m.group('comments'): - tag = '' if self.strip_comments else m.group('comments') - else: - if m.group('scripts'): - tag = m.group('script_open') - if self.re_attributes is not None: - tag += self.re_attributes.sub('', m.group('script_attr')) - else: - tag += m.group('script_attr') - tag += m.group('script_rest') - elif m.group('close_tag'): - tag = m.group(0) - else: - tag = m.group('open') - if self.re_attributes is not None: - tag += self.re_attributes.sub('', m.group('attr')) - else: - tag += m.group('attr') - tag += m.group('close') - return tag - - def run(self, text): - """Strip out ids and classes for a simplified HTML output.""" - - attr_str = self.config.get('strip_attributes', 'id class style').strip() - attributes = [re.escape(a) for a in attr_str.split(' ')] if attr_str else [] - if self.config.get('strip_js_on_attributes', True): - attributes.append(r'on[\w]+') - if len(attributes): - self.re_attributes = re.compile( - TAG_BAD_ATTR % '|'.join(attributes), - re.DOTALL | re.UNICODE - ) - else: - self.re_attributes = None - self.strip_comments = self.config.get('strip_comments', True) - - return RE_TAG_HTML.sub(self.repl, text) +from . 
import striphtml +from .util import PymdownxDeprecationWarning +import warnings class PlainHtmlExtension(Extension): @@ -131,8 +58,22 @@ def __init__(self, *args, **kwargs): def extendMarkdown(self, md, md_globals): """Strip unwanted attributes to give a plain HTML.""" - plainhtml = PlainHtmlPostprocessor(md) - plainhtml.config = self.getConfigs() + warnings.warn( + "'PlainHTML' has been renamed to 'StripHTML' (pymdownx.striphtml).\n" + "The usage of pymdownx.plainhtml is deprecated and will be removed\n" + "in the future. It is advised to switch over to StripHTML, but please\n" + "read the documentation as some of the option formats and defaults are\n" + "different in the new StripHTML extension.", + PymdownxDeprecationWarning + ) + + config = self.getConfigs() + plainhtml = striphtml.StripHtmlPostprocessor( + config.get('strip_comments'), + config.get('strip_js_on_attributes'), + config.get('strip_attributes').split(), + md + ) md.postprocessors.add("plain-html", plainhtml, "_end") md.registerExtension(self) diff --git a/pymdownx/striphtml.py b/pymdownx/striphtml.py new file mode 100644 index 000000000..56980937f --- /dev/null +++ b/pymdownx/striphtml.py @@ -0,0 +1,152 @@ +""" +Strip HTML (previously named Plain HTML). + +pymdownx.striphtml +An extension for Python Markdown. +Strip classes, styles, and ids from html + +MIT license. + +Copyright (c) 2014 - 2017 Isaac Muse + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated +documentation files (the "Software"), to deal in the Software without restriction, including without limitation +the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, +and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions +of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED +TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF +CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER +DEALINGS IN THE SOFTWARE. +""" +from __future__ import unicode_literals +from markdown import Extension +from markdown.postprocessors import Postprocessor +import re + + +RE_TAG_HTML = re.compile( + r'''(?x) + (?: + (?P(?:\r?\n?\s*)(?:\s*)(?=\r?\n)|)| + (?P + (?P<(?Pstyle|script)) + (?P(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+))?)*) + (?P>.*?) + )| + (?P<(?P[\w\:\.\-]+)) + (?P(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+))?)*) + (?P\s*(?P/)?>)| + (?P[\w\:\.\-]+)\s*>) + ) + ''', + re.DOTALL | re.UNICODE +) + +TAG_BAD_ATTR = r'''(?x) +(?P + (?: + \s+(?:%s) + (?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'`=<>]+)) + )* +) +''' + + +class StripHtmlPostprocessor(Postprocessor): + """Post processor to strip out unwanted content.""" + + def __init__(self, strip_comments, strip_js_on_attributes, strip_attributes, md): + """Initialize.""" + + self.strip_comments = strip_comments + self.re_attributes = None + attributes = [re.escape(a.strip()) for a in strip_attributes] + if strip_js_on_attributes: + attributes.append(r'on[\w]+') + if attributes: + self.re_attributes = re.compile( + TAG_BAD_ATTR % '|'.join(attributes), + re.DOTALL | re.UNICODE + ) + + super(StripHtmlPostprocessor, self).__init__(md) + + def repl(self, m): + """Replace comments and unwanted attributes.""" + + if m.group('comments'): + tag = '' if self.strip_comments else m.group('comments') + else: + if m.group('scripts'): + tag = m.group('script_open') + if self.re_attributes is not None: + tag += self.re_attributes.sub('', m.group('script_attr')) + else: + tag 
+= m.group('script_attr') + tag += m.group('script_rest') + elif m.group('close_tag'): + tag = m.group(0) + else: + tag = m.group('open') + if self.re_attributes is not None: + tag += self.re_attributes.sub('', m.group('attr')) + else: + tag += m.group('attr') + tag += m.group('close') + return tag + + def run(self, text): + """Strip out comments and/or unwanted attributes.""" + + strip = self.strip_comments or self.re_attributes is not None + return RE_TAG_HTML.sub(self.repl, text) if strip else text + + +class StripHtmlExtension(Extension): + """StripHTML extension.""" + + def __init__(self, *args, **kwargs): + """Initialize.""" + + self.config = { + 'strip_comments': [ + True, + "Strip HTML comments at the end of processing. " + "- Default: True" + ], + 'strip_attributes': [ + [], + "A list of tag attribute names to strip. " + "- Default: []" + ], + 'strip_js_on_attributes': [ + True, + "Strip JavaScript script attributes with the pattern on*. 
" + " - Default: True" + ] + } + super(StripHtmlExtension, self).__init__(*args, **kwargs) + + def extendMarkdown(self, md, md_globals): + """Strip unwanted HTML attributes and/or comments.""" + + config = self.getConfigs() + striphtml = StripHtmlPostprocessor( + config.get('strip_comments'), + config.get('strip_js_on_attributes'), + config.get('strip_attributes'), + md + ) + md.postprocessors.add("strip-html", striphtml, "_end") + md.registerExtension(self) + + +def makeExtension(*args, **kwargs): + """Return extension.""" + + return StripHtmlExtension(*args, **kwargs) diff --git a/pymdownx/util.py b/pymdownx/util.py index b393547b7..307f0d749 100644 --- a/pymdownx/util.py +++ b/pymdownx/util.py @@ -21,7 +21,7 @@ if PY34: import html # noqa html_unescape = html.unescape # noqa - else: + else: # pragma: no cover html_unescape = HTMLParser().unescape # noqa else: uchr = unichr # noqa diff --git a/tests/extensions/plainhtml/tests.yml b/tests/extensions/plainhtml/tests.yml deleted file mode 100644 index 7002a89eb..000000000 --- a/tests/extensions/plainhtml/tests.yml +++ /dev/null @@ -1,11 +0,0 @@ -__default__: {} - -plainhtml: - extensions: - pymdownx.plainhtml: - -plainhtml (no attr strip): - extensions: - pymdownx.plainhtml: - strip_js_on_attributes: false - strip_attributes: '' diff --git a/tests/extensions/plainhtml/plainhtml (no attr strip).html b/tests/extensions/striphtml/striphtml (no attr strip).html similarity index 100% rename from tests/extensions/plainhtml/plainhtml (no attr strip).html rename to tests/extensions/striphtml/striphtml (no attr strip).html diff --git a/tests/extensions/plainhtml/plainhtml (no attr strip).txt b/tests/extensions/striphtml/striphtml (no attr strip).txt similarity index 100% rename from tests/extensions/plainhtml/plainhtml (no attr strip).txt rename to tests/extensions/striphtml/striphtml (no attr strip).txt diff --git a/tests/extensions/plainhtml/plainhtml.html b/tests/extensions/striphtml/striphtml.html similarity index 100% 
rename from tests/extensions/plainhtml/plainhtml.html rename to tests/extensions/striphtml/striphtml.html diff --git a/tests/extensions/plainhtml/plainhtml.txt b/tests/extensions/striphtml/striphtml.txt similarity index 100% rename from tests/extensions/plainhtml/plainhtml.txt rename to tests/extensions/striphtml/striphtml.txt diff --git a/tests/extensions/striphtml/tests.yml b/tests/extensions/striphtml/tests.yml new file mode 100644 index 000000000..6e9f90787 --- /dev/null +++ b/tests/extensions/striphtml/tests.yml @@ -0,0 +1,14 @@ +__default__: {} + +striphtml: + extensions: + pymdownx.striphtml: + strip_attributes: + - id + - style + - class + +striphtml (no attr strip): + extensions: + pymdownx.striphtml: + strip_js_on_attributes: false
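
Migration note: the option change in this diff (`strip_attributes` moving from a space-separated string defaulting to `'id class style'` to a list defaulting to `[]`) can be illustrated with a small conversion sketch. This is only a sketch under stated assumptions: `migrate_plainhtml_config` and `OLD_DEFAULT_ATTRIBUTES` are hypothetical names that are not part of `pymdownx`, and the code assumes the old options are held in a plain dictionary.

```python
# Hypothetical helper, NOT part of pymdownx: converts an option dict written
# for the old `pymdownx.plainhtml` into the form `pymdownx.striphtml` expects.

# Old PlainHTML default for `strip_attributes`, as stated in the removed docs.
OLD_DEFAULT_ATTRIBUTES = "id class style"


def migrate_plainhtml_config(old_config):
    """Return a StripHTML-style option dict from a PlainHTML-style one."""
    new_config = {}

    # The two boolean options keep their names and meaning.
    for key in ("strip_comments", "strip_js_on_attributes"):
        if key in old_config:
            new_config[key] = old_config[key]

    # Old format: one space-separated string. New format: a list of names.
    # StripHTML defaults to [], so the old default is written out explicitly
    # to preserve the previous behavior.
    attr_str = old_config.get("strip_attributes", OLD_DEFAULT_ATTRIBUTES)
    new_config["strip_attributes"] = attr_str.split()
    return new_config


if __name__ == "__main__":
    # Old default configuration (no options given).
    print(migrate_plainhtml_config({}))
    # Old "no attr strip" test configuration from this diff.
    print(migrate_plainhtml_config(
        {"strip_js_on_attributes": False, "strip_attributes": ""}
    ))
```

The second call mirrors the removed `plainhtml (no attr strip)` test case: an empty string becomes an empty list, which is already the StripHTML default, so the new `tests.yml` can omit `strip_attributes` there.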