Parser: Propose new hand-coded parser (#8083)

* Parser: Propose new hand-coded PHP parser

  For some time we've needed a more performant PHP parser for the first stage of parsing the `post_content` document.

  See #1681 (early exploration)
  See #8044 (parser performance issue)
  See #1775 (parser performance, fixed in php-pegjs)

  I'm proposing this implementation of the spec parser as an alternative to the auto-generated parser from the PEG definition.

  This is not yet ready to go but I wanted to get the code in a branch so I can iterate on it and garner early feedback.

  This should eventually provide a setup fixture for #6831 wherein we are testing alternate parser implementations.

  - designed as a basic recursive-descent
  - but doesn't recurse on the call-stack, recurses via trampoline
  - moves linearly through document in one pass
  - relies on RegExp for tokenization
  - nested blocks include the nested content in their `innerHTML`; this needs to go away
  - create test fixture
  - figure out where to save this file

* Fix issue with containing the nested innerHTML
* Also handle newlines as whitespace
* Use classes for some static typing
* add type hints
* remove needless comment
* space where space is due
* meaningless rename
* remove needless function call
* harmonize with spec parser
* don't forget freeform HTML before blocks
* account for oddity in spec-parser
* add some polish, fix a thing
* comment it
* add JS version too
* Change `.` to `[^]` because `/s` isn't well supported in JS

  The `s` flag on the RegExp object informs the engine to treat a dot character as a class that includes the newline character. Without it newlines aren't considered in the dot.

  Since this flag is new to JavaScript and not well supported in different browsers I have removed it in favor of an explicit class of characters that _does_ include the newline, namely the open exclusion of `[^]` which permits all input characters.

  Hat tip to @Hywan for finding this.

* Move code into `/packages` directory, prepare for review
* take out names from RegExp pattern to not fail tests
* Fix bug in parser: store HTML soup in stack frames while parsing

  Previously we were sending all "HTML soup" segments of HTML between blocks to the output list before any blocks were processed. We should have been tracking these segments during the parsing and only spit them out when closing a block at the top level.

  This change stores the index into the input document at which that soup starts if it exists and then produces the freeform block when adding a block to the output from the parse frame stack.

* fix whitespace
* fix oddity in spec
* match styles
* use class name filter on server-side parser class
* fix whitespace
* Document extensibility
* fix typo in example code
* Push failing parsing test
* fix lazy/greedy bug in parser regexp
* Docs: Fix typos, links, tweak style.
* update from PR feedback
* trim docs
* Load default block parser, replacing PEG-generated one
* Expand `?:` shorthand for PHP 5.2 compat
* add fixtures test for default parser
* spaces to tabs
* could we need no assoc?
* fill out return array
* put that assoc back in there
* isometrize
* rename and add 0
* Conditionally include the parser class
* Add docblocks
* Standardize the package configuration
WordPress · Sep 6, 2018 · 694a19b
1 parent bbca724
commit 694a19b
Showing 20 changed files with 1,004 additions and 8 deletions.
diff --git a/docs/extensibility.md b/docs/extensibility.md
@@ -74,3 +74,9 @@ There are some advanced block features which require opt-in support in the theme
 ## Autocomplete
 
 Autocompleters within blocks may be extended and overridden. See [autocomplete](../docs/extensibility/autocomplete.md).
+
+## Block Parsing and Serialization
+
+Posts in the editor move through a couple of different stages between being stored in `post_content` and appearing in the editor. Because the blocks themselves are data structures that live in memory, a parsing step is needed to build them from the stored format in the database and a serialization step to write them back.
+
+Customizing the parser is an advanced topic that you can learn more about in the [Extending the Parser](../docs/extensibility/parser.md) section.
diff --git a/docs/extensibility/parser.md b/docs/extensibility/parser.md
@@ -0,0 +1,36 @@
+# Extending the Parser
+
+When the editor is interacting with blocks, these are stored in memory as data structures comprising a few basic properties and attributes. Upon saving a working post we serialize these data structures into a specific HTML structure and save the resultant string into the `post_content` property of the post in the WordPress database. When we load that post back into the editor we have to make the reverse transformation to build those data structures from the serialized format in HTML.
+
+The process of loading the serialized HTML into the editor is performed by the _block parser_. The formal specification for this transformation is encoded in the parsing expression grammar (PEG) inside the `@wordpress/block-serialization-spec-parser` package. The editor provides a default parser implementation of this grammar but there may be various reasons for replacing that implementation with a custom implementation. We can inject our own custom parser implementation through the appropriate filter.
+
+## Server-side parser
+
+Plugins have access to the parser if they want to process posts in their structured form instead of a plain HTML-as-string representation.
+
+## Client-side parser
+
+The editor uses the client-side parser while interactively working in a post. The plain HTML-as-string representation is sent to the browser by the backend and then the editor performs the first parse to initialize itself.
+
+## Filters
+
+To replace the server-side parser, use the `block_parser_class` filter. The filter receives the name of the current parser class as a string and returns the name of the class to use. This class is expected to expose a `parse` method.
+
+_Example:_
+
+```php
+class EmptyParser {
+    public function parse( $post_content ) {
+        // Return an empty document.
+        return array();
+    }
+}
+
+function my_plugin_select_empty_parser( $prev_parser_class ) {
+    return 'EmptyParser';
+}
+
+add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 );
+```
+
+> **Note**: At the present time it's not possible to replace the client-side parser.
diff --git a/docs/manifest.json b/docs/manifest.json
@@ -287,6 +287,12 @@
 		"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-library/README.md",
 		"parent": "packages"
 	},
+	{
+		"title": "@wordpress/block-serialization-default-parser",
+		"slug": "packages-block-serialization-default-parser",
+		"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-serialization-default-parser/README.md",
+		"parent": "packages"
+	},
 	{
 		"title": "@wordpress/block-serialization-spec-parser",
 		"slug": "packages-block-serialization-spec-parser",

diff --git a/lib/blocks.php b/lib/blocks.php
@@ -66,8 +66,20 @@ function gutenberg_parse_blocks( $content ) {
 		);
 	}
 
-	$parser = new Gutenberg_PEG_Parser;
-	return $parser->parse( _gutenberg_utf8_split( $content ) );
+	/**
+	 * Filter to allow plugins to replace the server-side block parser
+	 *
+	 * @since 3.8.0
+	 *
+	 * @param string $parser_class Name of block parser class
+	 */
+	$parser_class = apply_filters( 'block_parser_class', 'WP_Block_Parser' );
+	// Load default block parser for server-side parsing if the default parser class is being used.
+	if ( 'WP_Block_Parser' === $parser_class ) {
+		require_once dirname( __FILE__ ) . '/../packages/block-serialization-default-parser/parser.php';
+	}
+	$parser = new $parser_class();
+	return $parser->parse( $content );
 }
 
 /**

diff --git a/lib/client-assets.php b/lib/client-assets.php
@@ -275,6 +275,13 @@ function gutenberg_register_scripts_and_styles() {
 		filemtime( gutenberg_dir_path() . 'build/dom/index.js' ),
 		true
 	);
+	wp_register_script(
+		'wp-block-serialization-default-parser',
+		gutenberg_url( 'build/block-serialization-default-parser/index.js' ),
+		array(),
+		filemtime( gutenberg_dir_path() . 'build/block-serialization-default-parser/index.js' ),
+		true
+	);
 	wp_register_script(
 		'wp-block-serialization-spec-parser',
 		gutenberg_url( 'build/block-serialization-spec-parser/index.js' ),
@@ -386,7 +393,7 @@ function gutenberg_register_scripts_and_styles() {
 		array(
 			'wp-autop',
 			'wp-blob',
-			'wp-block-serialization-spec-parser',
+			'wp-block-serialization-default-parser',
 			'wp-data',
 			'wp-deprecated',
 			'wp-dom',

diff --git a/lib/load.php b/lib/load.php
@@ -29,7 +29,6 @@
 require dirname( __FILE__ ) . '/compat.php';
 require dirname( __FILE__ ) . '/plugin-compat.php';
 require dirname( __FILE__ ) . '/i18n.php';
-require dirname( __FILE__ ) . '/parser.php';
 require dirname( __FILE__ ) . '/register.php';
 
 

diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -20,6 +20,7 @@
 		"@wordpress/autop": "file:packages/autop",
 		"@wordpress/blob": "file:packages/blob",
 		"@wordpress/block-library": "file:packages/block-library",
+		"@wordpress/block-serialization-default-parser": "file:packages/block-serialization-default-parser",
 		"@wordpress/block-serialization-spec-parser": "file:packages/block-serialization-spec-parser",
 		"@wordpress/blocks": "file:packages/blocks",
 		"@wordpress/components": "file:packages/components",

diff --git a/packages/block-serialization-default-parser/.npmrc b/packages/block-serialization-default-parser/.npmrc
@@ -0,0 +1 @@
+package-lock=false
diff --git a/packages/block-serialization-default-parser/CHANGELOG.md b/packages/block-serialization-default-parser/CHANGELOG.md
@@ -0,0 +1,3 @@
+## 1.0.0
+
+-   Initial release.
diff --git a/packages/block-serialization-default-parser/README.md b/packages/block-serialization-default-parser/README.md
@@ -0,0 +1,126 @@
+# Block Serialization Default Parser
+
+This library contains the default block serialization parser implementations for WordPress documents. It provides native PHP and JavaScript parsers that implement the specification from `@wordpress/block-serialization-spec-parser` and normally operate on the document stored in `post_content`.
+
+## Installation
+
+Install the module:
+
+```bash
+npm install @wordpress/block-serialization-default-parser --save
+```
+
+_This package assumes that your code will run in an **ES2015+** environment. If you're using an environment that has limited or no support for ES2015+ such as lower versions of IE then using [core-js](https://github.com/zloirock/core-js) or [@babel/polyfill](https://babeljs.io/docs/en/next/babel-polyfill) will add support for these methods. Learn more about it in [Babel docs](https://babeljs.io/docs/en/next/caveats)._
+
+## Usage
+
+Input post:
+```html
+<!-- wp:columns {"columns":3} -->
+<div class="wp-block-columns has-3-columns"><!-- wp:column -->
+<div class="wp-block-column"><!-- wp:paragraph -->
+<p>Left</p>
+<!-- /wp:paragraph --></div>
+<!-- /wp:column -->
+
+<!-- wp:column -->
+<div class="wp-block-column"><!-- wp:paragraph -->
+<p><strong>Middle</strong></p>
+<!-- /wp:paragraph --></div>
+<!-- /wp:column -->
+
+<!-- wp:column -->
+<div class="wp-block-column"></div>
+<!-- /wp:column --></div>
+<!-- /wp:columns -->
+```
+
+Parsing code:
+```js
+import { parse } from '@wordpress/block-serialization-default-parser';
+
+parse( post ) === [
+    {
+        blockName: "core/columns",
+        attrs: {
+            columns: 3
+        },
+        innerBlocks: [
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [
+                    {
+                        blockName: "core/paragraph",
+                        attrs: null,
+                        innerBlocks: [],
+                        innerHTML: "\n<p>Left</p>\n"
+                    }
+                ],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            },
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [
+                    {
+                        blockName: "core/paragraph",
+                        attrs: null,
+                        innerBlocks: [],
+                        innerHTML: "\n<p><strong>Middle</strong></p>\n"
+                    }
+                ],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            },
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            }
+        ],
+        innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n'
+    }
+];
+```
+
+## Theory
+
+### What is different about this one from the spec-parser?
+
+This is a recursive-descent parser that scans linearly once through the input document. Instead of directly recursing it utilizes a trampoline mechanism to prevent stack overflow. It minimizes data copying and passing through the use of globals for tracking state through the parse. Between every token (a block comment delimiter) we can instrument the parser and intervene should we want to; for example we might put a hard limit on how long we can be parsing a document or provide additional debugging diagnostics for a document.
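The trampoline idea can be illustrated with a minimal JavaScript sketch. This is not the package's code: `maxDepth`, `step`, and the explicit `stack` are hypothetical names chosen for illustration, and the "grammar" is just nested parentheses. A flat driver loop keeps calling a step function, so deep nesting grows an array instead of the call stack.

```javascript
// Hypothetical illustration: measure the deepest nesting of "(" ... ")"
// without recursing on the call stack.
function maxDepth( input ) {
	const stack = []; // explicit stack of open frames, replaces call-stack recursion
	let offset = 0;
	let deepest = 0;

	// One step consumes one character and reports whether to keep going.
	function step() {
		if ( offset >= input.length ) {
			return false;
		}
		const char = input[ offset++ ];
		if ( char === '(' ) {
			stack.push( offset );
			deepest = Math.max( deepest, stack.length );
		} else if ( char === ')' && stack.length > 0 ) {
			stack.pop();
		}
		return true;
	}

	// The trampoline: a flat loop that bounces back into step().
	while ( step() ) {
		// Between steps we could check a time budget or log diagnostics.
	}

	return deepest;
}

const deep = '('.repeat( 100000 ) + ')'.repeat( 100000 );
console.log( maxDepth( deep ) ); // 100000, with no stack overflow
```

Because control returns to the loop after every step, the loop body is also the natural place for the instrumentation mentioned above, such as a hard limit on parsing time.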
+
+The spec parser is defined via a _Parsing Expression Grammar_ (PEG) which answers many questions inherently that we must answer explicitly in this parser. The goal for this implementation is to match the characteristics of the PEG so that it can be directly swapped out and so that the only changes are better runtime performance and memory usage.
+
+### How does it work?
+
+Every serialized Gutenberg document is nominally an HTML document which, in addition to normal HTML, may also contain specially designed HTML comments -- the block comment delimiters -- which separate and isolate the blocks serialized in the document.
+
+This parser attempts to create a state machine around the transitions triggered from those delimiters -- the "tokens" of the grammar. Every time we find one we should only be doing one of two things:
+
+ - enter a new block;
+ - exit out of a block.
+
+Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it as the next `innerBlock` on the parent block below it in the block stack (the place where we track open blocks). The details are documented below.
+
+The biggest challenge in this parser is correctly accounting for the indices required to construct the `innerHTML` values for each block at every level of nesting depth. We take a simple approach:
+
+ - Start each newly opened block with an empty `innerHTML`.
+ - Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts.
+ - Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts.
+ - When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts.
+ - If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`.
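The accounting steps above can be sketched in JavaScript under strong simplifying assumptions: delimiters carry no attributes, there are no void blocks, HTML outside of any block is ignored, and closer names are not checked against openers. `parseNested`, `tokenStart`, and `prevOffset` are hypothetical names, not the package's API.

```javascript
// Simplified sketch of the innerHTML index accounting. All names are hypothetical.
function parseNested( input ) {
	const delimiter = /<!--\s+(\/)?wp:([a-z][a-z0-9\/-]*)\s+-->/g;
	const output = [];
	const stack = []; // open blocks, innermost last
	let match;

	while ( ( match = delimiter.exec( input ) ) !== null ) {
		const tokenStart = match.index;
		const tokenEnd = tokenStart + match[ 0 ].length;

		if ( match[ 1 ] !== '/' ) {
			// Opener: start with an empty innerHTML and remember where content begins.
			stack.push( {
				block: { blockName: match[ 2 ], innerBlocks: [], innerHTML: '' },
				tokenStart,
				prevOffset: tokenEnd,
			} );
			continue;
		}

		const frame = stack.pop();
		if ( frame === undefined ) {
			continue; // stray closer: ignored in this sketch
		}
		// Closer: add content from the last inner block (or the opener) to here.
		frame.block.innerHTML += input.slice( frame.prevOffset, tokenStart );

		const parent = stack[ stack.length - 1 ];
		if ( parent === undefined ) {
			output.push( frame.block ); // closed at the top level
			continue;
		}
		// Add the parent's own HTML that precedes this inner block, then nest it.
		parent.block.innerHTML += input.slice( parent.prevOffset, frame.tokenStart );
		parent.block.innerBlocks.push( frame.block );
		parent.prevOffset = tokenEnd;
	}
	return output;
}

const [ columns ] = parseNested(
	'<!-- wp:columns --><div><!-- wp:column --><p>A</p><!-- /wp:column -->\n' +
	'<!-- wp:column --><p>B</p><!-- /wp:column --></div><!-- /wp:columns -->'
);
console.log( JSON.stringify( columns.innerHTML ) ); // "<div>\n</div>"
```

The case with no inner blocks needs no special handling here: `prevOffset` still points just past the opener, so the single slice taken at the closer covers the entire content.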
+
+### I meant, how does it perform?
+
+This parser operates much faster than the generated parser from the specification. Because we know more about the parsing than the PEG does we can take advantage of several tricks to improve our speed and memory usage:
+
+ - We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing on a character-by-character basis we can allow the PCRE RegExp engine to skip over large swaths of the document for us in order to find those tokens.
+ - Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead.
+ - Not copying all those strings means that we'll also skip many memory allocations.
+
+Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules so as to guard against _e.g._ catastrophic backtracking that would break the PEG guarantees.
+
+However, since our "token language" of the block comment delimiters is _regular_ and _can_ be trivially matched with RegExp patterns, we can do that here and then something magical happens: we jump out of PHP or JavaScript and into a highly-optimized RegExp engine written in C or C++ on the host system. We thereby leave the virtual machine and its overhead.
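As a rough JavaScript counterpart to the `preg_match()` offset technique, the sketch below sets `lastIndex` on a global RegExp so the engine resumes scanning at a given position without slicing the input. The pattern is a simplified assumption, not the pattern shipped in this package, and `nextToken` and its return shape are hypothetical. It uses `[^]` rather than `.` with the `s` flag so that attributes may span newlines.

```javascript
// Hypothetical tokenizer sketch: find the next block comment delimiter at or
// after `offset`. The pattern is a simplification for illustration only.
function nextToken( input, offset ) {
	const pattern = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+(\{[^]*?\}\s+)?(\/)?-->/g;
	pattern.lastIndex = offset; // resume at offset, no substring copy needed

	const match = pattern.exec( input );
	if ( match === null ) {
		return null; // no more delimiters
	}

	const [ text, closer, namespace, name, attrsJson, voidMarker ] = match;

	let attrs = null;
	if ( attrsJson ) {
		try {
			attrs = JSON.parse( attrsJson );
		} catch ( error ) {
			attrs = null; // malformed JSON: treated as no attributes in this sketch
		}
	}

	let type = 'block-opener';
	if ( closer ) {
		type = 'block-closer';
	} else if ( voidMarker ) {
		type = 'void-block';
	}

	return {
		type,
		blockName: ( namespace || 'core/' ) + name,
		attrs,
		start: match.index,
		length: text.length,
	};
}

const doc = 'intro<!-- wp:columns {"columns":3,\n"x":{"y":1}} -->body<!-- /wp:columns -->';
const first = nextToken( doc, 0 );
const second = nextToken( doc, first.start + first.length );
console.log( first.type, first.blockName, first.attrs.columns ); // block-opener core/columns 3
console.log( second.type ); // block-closer
```

Only a number travels between calls, so the caller can walk the whole document by passing `start + length` of the previous token as the next `offset`.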
+
+<br/><br/><p align="center"><img src="https://s.w.org/style/images/codeispoetry.png?1" alt="Code is Poetry." /></p>
diff --git a/packages/block-serialization-default-parser/package.json b/packages/block-serialization-default-parser/package.json
@@ -0,0 +1,29 @@
+{
+	"name": "@wordpress/block-serialization-default-parser",
+	"version": "1.0.0-rc.0",
+	"description": "Block serialization specification parser for WordPress posts.",
+	"author": "The WordPress Contributors",
+	"license": "GPL-2.0-or-later",
+	"keywords": [
+		"wordpress",
+		"block",
+		"parser"
+	],
+	"homepage": "https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-default-parser/README.md",
+	"repository": {
+		"type": "git",
+		"url": "https://github.com/WordPress/gutenberg.git"
+	},
+	"bugs": {
+		"url": "https://github.com/WordPress/gutenberg/issues"
+	},
+	"main": "build/index.js",
+	"module": "build-module/index.js",
+	"react-native": "src/index",
+	"dependencies": {
+		"@babel/runtime": "^7.0.0"
+	},
+	"publishConfig": {
+		"access": "public"
+	}
+}