Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Switch to disable String-to-HTML-parsing #268

Open
SmithChart opened this issue Oct 19, 2022 · 14 comments
Open

RFC: Switch to disable String-to-HTML-parsing #268

SmithChart opened this issue Oct 19, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@SmithChart
Copy link
Contributor

Cheers,
the real world hit my Lona project again. Before starting with a PR I would like to hear your opinion on how to go forward here.

tl;dr: I would like to add a settings-switch that disables the auto HTML-parsing in HTML() -objects.
IMHO auto-parsing is a footgun. With it enabled the developer has to escape each and every user input that could be passed (directly) into an HTML-object at some point. If the developers fails to do so a user can either break the application or inject custom HTML into the frontend.
And: With this feature enabled HTML-objects behave differently than other nodes (where there is no auto-parsing).

Consider the following example:

from lona.html import HTML, H1, Div
from lona import LonaApp, LonaView


app = LonaApp(__file__)


@app.route('/1')
class MyView(LonaView):
    def handle_request(self, request):
        html = HTML(
            H1('Hello World'),
            '<p>We can create HTML by simply writing some string here...</p>',
        )
        print(html.nodes)
        return html


@app.route('/2')
class MyView(LonaView):
    def handle_request(self, request):
        user_input = 'Somebody <[email protected]>'
        html = HTML(
            H1('Hello World'),
            user_input,
        )
        print(html)
        return html


@app.route('/3')
class MyView(LonaView):
    def handle_request(self, request):
        user_input = 'Somebody <[email protected]>'
        html = HTML(
            H1('Hello World'),
            Div(user_input),
        )
        print(html)
        return html


app.run(port=8080)

In /1 Lona will take the String containing the <p> and will construct an actual Node from that. The output will be:

<h1 data-lona-node-id="1">
  Hello World
</h1>
<!--lona-widget:7-->
<p data-lona-node-id="5">
  We can create HTML by simply writing some string here...
</p>
<!--end-lona-widget:7-->

Now in /2 we don't have a nice developer formatted string but some user input. (In our case this most often is an E-Mail message-ID we paste in a comment field for future reference).
Lona will assume that something containing < or > must be HTML and tries to parse it. This will fail:

LonaRuntimeWorker_0            ERROR    15:26:16.171305 lona.view_runtime Exception raised while running <__main__.MyView object at 0x7f9c782153d0>
  Traceback (most recent call last):
    File "/home/chris/work/Projects/intern/lag-intranet/env/lib/python3.9/site-packages/lona/view_runtime.py", line 318, in start
      raw_response_dict = self.view.handle_request(self.request) or ''
    File "/home/chris/work/Projects/intern/lag-intranet/demo.py", line 23, in handle_request
      html = HTML(
    File "/home/chris/work/Projects/intern/lag-intranet/env/lib/python3.9/site-packages/lona/html/widgets.py", line 22, in __init__
      self.nodes.append(HTML(node))
    File "/home/chris/work/Projects/intern/lag-intranet/env/lib/python3.9/site-packages/lona/html/widgets.py", line 25, in __init__
      self.nodes = html_string_to_node_list(
    File "/home/chris/work/Projects/intern/lag-intranet/env/lib/python3.9/site-packages/lona/html/parsing.py", line 205, in html_string_to_node_list
      raise ValueError(
  ValueError: Invalid html: missing end tag </[email protected]>

In /3 the user input will now be passed to the frontend without being parsed as HTML by neither Lona nor the Browser.

I am not sure if anybody is using Lona for the reason that one can build HTML-trees from strings. At least I / we are not using it that way.
I would suggest to add a switch to the settings that would disable the isinstance(node, str) check here:

if isinstance(node, str):
. This way any string that is passed into a HTML would be directly passed on to the client and user input could not break the application in this way.

By leaving parsing enabled by default the behaviour of the application would not change. But disabling it would still allow an developer to pass user input to the frontend inside a HTML and not only inside a Node.

On the other hand one could also argue that having different behaviour in parsing between HTML and Node is not desired. But I don't have made a decision if I would like to go that far.

So: What's your opinion: Add that switch? Should the feature stay enabled by default?

@fscherf
Copy link
Member

fscherf commented Oct 19, 2022

Hi @SmithChart,

currently there is no way to implement a settings switch that changes the behavior of the lona.html API, because HTML nodes don't get a reference to the server they get rendered by.

lona.html.HTML supports autoescaping if needed. When a given string starts with \, the string gets transformed into a text node directly (the \ gets removed, even if nothing needs to be escaped). Because it gets send to the frontend as text node, and gets rendered as such, this mechanism isn't vulnerable to XSS attacks.

That code like in your example crashes is a feature, not a bug. If your HTML snippet would be rendered into a website, the browser would try to display the email address as a HTML tag which would result in broken output. Lona always crashes if something is given that may not be displayable as intended by the developer.

@SmithChart
Copy link
Contributor Author

currently there is no way to implement a settings switch that changes the behavior of the lona.html API, because HTML nodes don't get a reference to the server they get rendered by.

Ack.

lona.html.HTML supports autoescaping if needed. When a given string starts with \, the string gets transformed into a text node directly (the \ gets removed, even if nothing needs to be escaped). Because it gets send to the frontend as text node, and gets rendered as such, this mechanism isn't vulnerable to XSS attacks.

I have seen that feature in the source. Should we advertise that in the documentation?

I could fix all these problems by simply wrapping all user input in a <span> or something. In this case all texts will simply be send in text mode and displayed the way I want.
I guess the point I want to make is: From a users-perspective it's not expected that HTML and Node behave differently.

That code like in your example crashes is a feature, not a bug. If your HTML snippet would be rendered into a website, the browser would try to display the email address as a HTML tag which would result in broken output. Lona always crashes if something is given that may not be displayable as intended by the developer.

I can't reproduce this behaviour. On my side neither Firefox nor Chromium render text that contain HTML-tags. I'll add the following view to my example above:

@app.route('/4')
class MyView(LonaView):
    def handle_request(self, request):
        user_input = '<strong>This text is not bold!</strong>'
        devel_input = '<strong>This text is bold!</strong>'
        html = HTML(
            H1('Hello World'),
            Div(user_input),
            f'\\{user_input}',
            devel_input,
        )
        print(html)
        return html

This will give me an output like:
file0qUU9A

@fscherf
Copy link
Member

fscherf commented Oct 21, 2022

I have seen that feature in the source. Should we advertise that in the documentation?

https://lona-web.org/1.x/end-user-documentation/html.html#using-html-strings

I could fix all these problems by simply wrapping all user input in a <span> or something. In this case all texts will simply be send in text mode and displayed the way I want. I guess the point I want to make is: From a users-perspective it's not expected that HTML and Node behave differently.

Hm. The now widget, soon node is named HTML. I thought it makes sense that it accepts HTML in various forms. Do you have a better proposal?

@fscherf fscherf added the enhancement New feature or request label Oct 23, 2022
@SmithChart
Copy link
Contributor Author

SmithChart commented Oct 25, 2022

I have seen that feature in the source. Should we advertise that in the documentation?

https://lona-web.org/1.x/end-user-documentation/html.html#using-html-strings

Sorry, must have missed that :/

I could fix all these problems by simply wrapping all user input in a <span> or something. In this case all texts will simply be send in text mode and displayed the way I want. I guess the point I want to make is: From a users-perspective it's not expected that HTML and Node behave differently.

Hm. The now widget, soon node is named HTML. I thought it makes sense that it accepts HTML in various forms. Do you have a better proposal?

I am not against parsing a String as HTML. But maybe it should not be enabled by default. In my opinion parsing string as HTML is the special case, while forwarding user input to the client is the default case.

In my case: HTML('<some></text>') forwards the HTML to the client and HTML('<strong>Hello World</strong>', parse_html=True) does the parsing.

But currently I would need user_input='<some></text>' + HTML(f'\\{user_input}'). or HTML('\\'+user_input) every time I handle user input.

@maratori
Copy link
Member

I agree that parsing HTML is a special case. And I think it isn't used in 90% (or more?) apps now.

But I'm not sure we should change something in lona 1. It may be redesigned in lona 2.

I could fix all these problems by simply wrapping all user input in a <span> or something.

I recommend to wrap it with TextNode.

@fscherf
Copy link
Member

fscherf commented Oct 25, 2022

I agree that parsing HTML is a special case. And I think it isn't used in 90% (or more?) apps now.

For me it's a pretty important feature because its great for building components for third-party libraries.

https://github.com/lona-web-org/lona-bootstrap-5/blob/master/lona_bootstrap_5/modal.py#L12

But I'm not sure we should change something in lona 1. It may be redesigned in lona 2.

I am open to proposals! But to be honest, i am not unhappy withe the current api

@maratori
Copy link
Member

I'm not insist, but it could be a dedicated lona.html.ParseHTML instead of lona.html.HTML.

@fscherf
Copy link
Member

fscherf commented Oct 26, 2022

If we decide lona.html.HTML shouldn't parse HTML strings anymore, and we decide we get rid of widgets or VirtualNodes or however we call them in the future, lona.html.HTML will be obsolete. lona.html.HTML was always a special node that had more of an dsl (domain specific language) purpose.

Code like:

return HTML(
    H1('Hello World'),
    P('Lorem Ipsum'),
)

never produced an actual <html> tag, but described the users intend pretty good. For me lona.html.HTML always was more like a container to wrap up what i want to show to the browser.

I thought it was pretty straight forward that lona.html.HTML can parse strings into nodes because that's pretty much what the browser does: You give it a string and it converts it to high-level objects that you can transform (the DOM), or you skip the parsing step and give it the high-level objects directly.

@fscherf
Copy link
Member

fscherf commented Jan 30, 2023

Much time has passed, and i am still not sure what to do here. I am not completely happy with what lona.html.HTML does at the moment, because it does multiple things at once, and some of them seem to be counter intuitive to some users.

Note: I am describing the behavior that is currently planned for Lona 2. lona.html.HTML behaves differently in Lona 1.
(https://github.com/lona-web-org/lona/blob/master/lona/html/parsing.py#L202)

1. It gets used as a token, to communicate "this function/view returns HTML"

Every HTML tree, used by a Lona view or component can have exactly one root node. So code like

return (
    H1('Hello'),
    P('World'),
)

would not work. When lona.html.HTML gets used, it works

return HTML(
    H1('Hello'),
    P('World'),
)

because lona.html.HTML wraps all HTML, that does not have exactly one root node, into one div (without the p, lona.html.HTML will return just the h1 tag).

2. It "corrects" HTML node trees with more than one root node

When given a HTML tree, as string or as Node objects, that has more than one root node, the whole tree gets wrapped into one div. If the given tree already has only one root node, the given tree gets returned.

3. It parses HTML strings

Since lona.html.HTML always was supposed to return a HTML node tree, that can be rendered by the JavaScript client, instead of producing a <html> tag, it made sense to me to allow HTML as a string. I agree, this can be confusing, because its the only "tag" that can do that, so code like this will not work as expected.

html = HTML(
    '<div id="foo">foo</div>',
    Div('<div id="bar">bar</div>'),
)

Possible Solution

I agree, it would be more intuitive to split these functionalities, and add lona.html.parse_html or something like this. I am not sure how i want to handle cases like this then:

return HTML(
    H1('Hello'),
    P('World'),
)

We could either enforce that the application code has to provide exactly one root node, and wrap all HTML into one div if needed (which will be pretty much everytime), or we allow lists of nodes as return type, and Lona wraps all nodes into one node if needed (this creates more problems for the renderer though)

There is even one more problem: lona.html.HTML does not return a <html> tag, and no tag in the Lona standard library does. You can create one using HTML('<html></html>') but the resulting tag is not lona.html.HTML, but an auto-generated one. For SPA's this is no problem. Most views don't need a real HTML tag, but when the Node model gets used in an non-interactive view, it would be nice if all HTML5 tags were supported and acted consistently.

I lean towards a solution where we add lona.html.parse_html and raise an error if a view returns more than one node. It's less convenient, but also less magic, that has to be explained. Also we then can add all missing HTML5 tags to the standard library.

@SmithChart
Copy link
Contributor Author

I lean towards a solution where we add lona.html.parse_html and raise an error if a view returns more than one node. It's less convenient, but also less magic, that has to be explained. Also we then can add all missing HTML5 tags to the standard library.

Even some more time has passed here... (Sorry for disappearing.)
What is the current situation with HTML5 Tags? If I am right they are now part of the standard library?

I agree, it would be more intuitive to split these functionalities, and add lona.html.parse_html or something like this. I am not sure how i want to handle cases like this then:

return HTML(
    H1('Hello'),
    P('World'),
)

Making this impossible will very likely break 100% of my views :-)

@fscherf
Copy link
Member

fscherf commented May 31, 2023

Even some more time has passed here... (Sorry for disappearing.) What is the current situation with HTML5 Tags? If I am right they are now part of the standard library?

No worries!
Yes. I added all missing HTML5 tags in #380

Making this impossible will very likely break 100% of my views :-)

Same goes for me and all views in the test_project folder :) Such a fundamental change could be made in a major release like Lona 2 though.

I am still hesitant to remove this syntax because I (completely biased) think it is very intuitive and good at communicating the intend of that return statement.

@fscherf
Copy link
Member

fscherf commented Jul 7, 2023

@SmithChart, @maratori: I think I know what I want to do here now and opened #436

The PR adds lona.html.parse_html which does what you would think. It is the old lona.html.parsing.html_string_to_node_list but with a simpler name and a new optional argument to flatten the parsed nodes.

parse_html('<div>Hello World</div>')  # returns one div
parse_html('<div>Hello</div><div>World</div>')  # returns a list of two divs

lona.html.HTML and lona.html.HTML2 will still be able parse HTML strings, but the feature will be deprecated in the documentation and removed in Lona 2.

lona.html.HTML will be replaced by lona.html.HTML2 in Lona 2, and will be only used for this syntax:

class MyView(View):
    def handle_request(self, request):
        return HTML(
            H1('Hello World'),
            P('Lorem Ipsum'),
        )

@SmithChart
Copy link
Contributor Author

I like the idea moving String-HTML-Parsing into a separate function. This still allows to easily integrate HTML-source for library binding and (at the same time) reduces the chance of accidentally parsing user-input as HTML.

Deprecation sounds like a good plan for me. Do we have a roadmap with Lona 2 changes somewhere?

@fscherf
Copy link
Member

fscherf commented Jul 10, 2023

I like the idea moving String-HTML-Parsing into a separate function. This still allows to easily integrate HTML-source for library binding and (at the same time) reduces the chance of accidentally parsing user-input as HTML.

Great! I will update the PR and the documentation and un-draft it afterwards.

Deprecation sounds like a good plan for me. Do we have a roadmap with Lona 2 changes somewhere?

No. I have a private todo-list, and you can find all spots when greping for 2.0 in the code :)
A proper roadmap is a good idea though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants