Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML insertion / format: html performance issue #234

Open
cxj opened this issue Sep 1, 2020 · 19 comments
Open

HTML insertion / format: html performance issue #234

cxj opened this issue Sep 1, 2020 · 19 comments

Comments

@cxj
Copy link
Contributor

cxj commented Sep 1, 2020

What's going on here, that causes such a large performance hit when inserting HTML into a template?

<?php
require "vendor/autoload.php";

// 1-kilobyte string of random characters:
$oneK = random_bytes(1024);

// 1-kilobyte string of HTML:
$html = <<<HTML
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
HTML;

$xml = ' <main> @@ </main> ';

$tss = <<<TSS
main {
    content: data(html);
    content-mode: replace;
    format: html;
}
TSS;

$tss2 = <<<TSS2
main {
    content: data(html);
    content-mode: replace;
}
TSS2;

$data['html']  = str_repeat($html, 4000);
$data2['html'] = str_repeat($oneK, 4000);

test($xml, $tss,  $data2, "format html with random string");
test($xml, $tss2, $data2, "no format with random string");

test($xml, $tss,  $data, "format html with HTML");
test($xml, $tss2, $data, "no format with HTML");

function test($xml, $tss, $data, $title) {
    $template = new Transphporm\Builder($xml, $tss);

    $start = microtime(true);
    $template->output($data)->body;
    $stop = microtime(true);

    echo sprintf("%30s time: %12.6f seconds\n", $title, ($stop - $start));
}

And the output on my laptop:

format html with random string time:     0.033743 seconds
  no format with random string time:     0.002971 seconds
         format html with HTML time:    36.562475 seconds
           no format with HTML time:     0.123278 seconds
@TRPB
Copy link
Member

TRPB commented Sep 1, 2020

this is likely caused by DOMDocument having to parse the new HTML and import it into a the primary document. It's worth testing this without Transphporm by creating one document and parsing the 1kb HTML in a second instance, then inserting it into the first using $doc->importnode($node, true)

With that test done, we can then work out the overhead of Transphporm on this process.

@TRPB
Copy link
Member

TRPB commented Sep 1, 2020

ok, it's not that. Here's a test doing that:

<?php

$oneK = random_bytes(1024);

// 1-kilobyte string of HTML:
$html = <<<HTML
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
HTML;








$doc = new \DomDocument;

$doc->loadHtml('<div>
	</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);



$doc2 = new \DomDocument;

$doc2->loadHtml('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);



$element = $doc->getElementsByTagName('div')[0];

foreach ($doc2->getElementsByTagName('div')[0]->childNodes as $node) {
	$imported = $doc->importNode($node, true);
	$element->appendChild($imported);
}

var_dump($doc->saveHtml());

This runs fairly quickly. The issue must be here:
https://github.com/Level-2/Transphporm/blob/6f5ff948d8d9af276699669b1ff7a35bd1dbee7a/src/Property/Content.php

I tested the convertNode method in isolation with this data (I just plugged it into the test above) and it's not that either. I wonder if the generator is slow.

@TRPB
Copy link
Member

TRPB commented Sep 1, 2020

Here's a very cut down version of what Transphporm is doing. There's nothing slow here:

<?php

$oneK = random_bytes(1024);

// 1-kilobyte string of HTML:
$html = <<<HTML
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
<tr><td>one </td><td>two</td><td>three</td><td>four</td><td>five</td><td>six</td><td>seven</td></tr>
HTML;



 function getNode($node, $document) {
		foreach ($node as $n) {
			if (is_array($n)) {
				foreach (getNode($n, $document) as $new) yield $new;
			}
			else {
				yield convertNode($n, $document);
			}
		}
	}

	 function convertNode($node, $document) {
		if ($node instanceof \DomElement || $node instanceof \DOMComment) {
			$new = $document->importNode($node, true);
		}
		else {
			if ($node instanceof \DomText) $node = $node->nodeValue;
			$new = $document->createElement('text');

			$new->appendChild($document->createTextNode($node));
			$new->setAttribute('transphporm', 'text');
		}
		return $new;
	}


	 function appendContent($element, $content) {
		foreach (getNode($content, $element->ownerDocument) as $node) {
			$element->appendChild($node);
		}
	}


$doc = new \DomDocument;

$doc->loadHtml('<div>
	</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);



$doc2 = new \DomDocument;

$doc2->loadHtml('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);



$element = $doc->getElementsByTagName('div')[0];

appendContent($element, $doc2->getElementsByTagName('div'));

var_dump($doc->saveHtml());

@cxj
Copy link
Contributor Author

cxj commented Sep 1, 2020

The time involved to generate the HTML in Transphporm grows exponentially, so small tests all look pretty fast. For example, the 4 million row test I posted takes roughly 37 seconds, but doubling to 8 million rows takes 267 seconds. That makes me suspect something like undesired recursive behavior on the part of DomDocument, or something else of that nature. The HTML string being inserted should not need to be parsed, to my way of thinking, but rather assumed correct. Am I misunderstanding something or overlooking some need?

@cxj
Copy link
Contributor Author

cxj commented Sep 1, 2020

Increasing the HTML being appended in your example above (appendContent()) from the 1K test to a 4M test increases the CPU time consumed by a factor of about 1400 in my initial testing.

Hmm, this may be fairly useless data, as much of that time might just be moving bytes around in memory during a copy or something.

Yeah, it looks linear going from 4M to 8M, as the time just doubles.

@cxj
Copy link
Contributor Author

cxj commented Sep 1, 2020

This call graph (from xdebug/callgrind) makes it appear that DOMElement::replaceChild() is the expensive operation.

callgraph

@cxj
Copy link
Contributor Author

cxj commented Sep 1, 2020

Not calling that replaceChild() method if the text attribute is empty greatly speeds up the test case, by a factor of 9 to 40 times faster. However, it does break numerous PHPUnit tests, not surprisingly. The generated HTML for my test case continues to be correct, so it appears at least for replacing an element with HTML as in the test, that functionality is not needed.

@cxj
Copy link
Contributor Author

cxj commented Sep 2, 2020

format: html seems to be the real killer here. It literally increases the CPU time consumed by a factor of 100 for a 1000 row table (approximately 1 million characters of HTML).

@cxj
Copy link
Contributor Author

cxj commented Sep 2, 2020

I think this is because when using format: html there is some looping resulting in repeated calls to DOMDocument::createTextNode() and DOMDocument::importNode(). The loop iterates once for each row in my test table, i.e. for each <tr> element, but not for other elements, e.g. the columns <td>.

I do not understand why the loop iterates on the <tr> element but not other elements. If it can handle an entire row of <td> elements in one go, why not multiple rows?

Since I have between 1,000 and 30,000 rows in my table, those DOMDocument method calls get expensive being repeated so many times.

@TRPB
Copy link
Member

TRPB commented Sep 2, 2020

It's because the HTML gets imported into a DOMDocument, so it wraps in a container tag as XML cannot have multiple root nodes. Because the container tag should not appear in the output, it has to iterate over all the child nodes and import them individually. Add a tbody tag around the rows and it will import just that one element.

@cj-clx
Copy link

cj-clx commented Sep 3, 2020 via email

@jpeurich
Copy link

jpeurich commented Sep 3, 2020

Tom @TRPB : There seems to be a number of "nice to know" things related to Transphporm. Any chance the documentation will ever be updated to reflect them and features that aren't yet documented? What is your procedure for others to add documentation updates?

@cxj
Copy link
Contributor Author

cxj commented Sep 3, 2020

Wrapping the inserted HTML in a container such as <tbody> does indeed greatly improve performance!

@jpeurich I cannot speak for Tom, but I think changes to the README or added documentation in the project repository itself are probably handled via Pull Requests. Changes to the Wiki pages appears to be open to anyone with a GitHub account.

@TRPB
Copy link
Member

TRPB commented Sep 3, 2020

Any chance the documentation will ever be updated to reflect them and features that aren't yet documented? What is your procedure for others to add documentation updates?

There are some features which need better documentation, the plugin system in particular. Now that I've finished my PhD I can spend more time on my open source projects. :)

In this instance, though, it's just a quirk of how DOMDocument works and I hadn't really envisaged the feature being used with such a large amount of HTML that it would make a noticeable difference.

This looks like it might be a faster alternative: https://www.php.net/manual/en/domdocumentfragment.appendxml.php but looking online it will only accept valid XML (there is no appendHTML equivalent like there is for loadXML/loadHTML) so any HTML entities or self-closing tags left open will break it.

There are a few solutions:

  1. Take the performance hit (leave the code as it is)
  2. Determine whether the code is XML or HTML and on the fly decide whether to use appendXML or importNode like it does currently. This will be much slower for small blocks of HTML because determining whether it is XML or HTML will incur a performance cost.
  3. Add format: xml which uses the method above, leaving format: html as is. It is then up to the developer to provide valid XML to the XML formatter. I might experiment with this but it adds to Transphporm's API without any actual new feature and I haven't yet tested how much faster (if at all) DOMDocumentFragment is for this task.

edit: Actually, there might be some performance benefits to using DOMDocumentFragment any time Transphporm adds to the document.

@cxj
Copy link
Contributor Author

cxj commented Sep 3, 2020

Congratulations on finishing your PhD!

Solution no. 3 above would suit my purposes well and would be the option I would choose.

I would be happy to A/B test alternatives for performance.

edit: once I've had some breakfast, I'll attempt to make some DOMDocumentFragment tests.

@TRPB
Copy link
Member

TRPB commented Sep 3, 2020

I've just had another thought on this.

Transphporm currently turns your rows into this:

<template>
   <tr>...</tr>
   <tr>...</tr>
   <tr>...</tr>
</template>

Then parses this into a DOMDocument before extracting each and importing it into the target element:

<table>
   <tr>...</tr>
   <tr>...</tr>
   <tr>...</tr>
</table>

I wonder if it would be faster to:

  1. Import the entire node at once <table><template><tr>...</tr></template></table>
  2. Move each <tr> and textNode up into its grandparent
  3. Delete the (now empty) <template> tag.

I haven't tested this but I would assume that moving an existing element around the document is faster than importing a new element into it.

@jpeurich
Copy link

jpeurich commented Sep 3, 2020

Tom @TRPB, Yes - Congratulations on completing your PhD. Sorry for getting off track in this thread, but this had to be said. Thanks for Transphporm.

@cj-clx
Copy link

cj-clx commented Sep 3, 2020

Now I find, in my "toy" command line test case, adding the container <tbody> around my inserted row content sped things up tremendously. However, in my production web code, modifying my code to insert or replace the <tbody> element with <tbody> wrapped rows, it actually lengthens the execution time by an approximate factor of 4. That's some strange WTF behavior.

@cxj
Copy link
Contributor Author

cxj commented Sep 16, 2020

We ultimately had to build the large HTML table using just straight PHP with string interpolation (e.g. HEREDOC). As described in the original issue up top, inserting this string into the template still consumed too much time. So we've unfortunately resorted to generating the full page HTML from Transphporm templating but containing a marker string as an HTML comment at the table. Then we str_replace() the table into that marker string at the end before sending the entire thing to the browser. This is very fast, but horribly tightly coupled and loses some of the great features that Transphporm provides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants