Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Feature: Try out parsica #20

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from
Draft

WIP: Feature: Try out parsica #20

wants to merge 21 commits into from

Conversation

mhsdesign
Copy link
Contributor

@mhsdesign mhsdesign commented May 18, 2023

@mhsdesign
Copy link
Contributor Author

mhsdesign commented May 19, 2023

Hi @grebaldi ;) We talked about using a parser combinator, and i do still like the idea. So i started prototyping with the library http://parsica-php.github.io

Beginning with 3870cd6 all tests using the ExpressionNode::fromString (which now uses parsica under the hood) succeed again ;)
So i implemented nearly all the logic for expression parsing (except a few edgecases) and have a complete replacement with parsica.

It was fun to do and i do like the declarative style. The only current downside so far seems to be performance. The tests on branch main take about 23ms and this branch with parsica partially implemented takes already twice as much 44ms.

There are some pitfalls when using parsica as one can build accidentally quite inefficent parsers (as demonstrated by the custom coded \PackageFactory\ComponentEngine\Parser\Parser\UtilityParser::strings, which safes currently about ~70ms in performance than if we were to do it the parsica api way.)

Edit: A faster, but less popular library than parsica seems to be: https://github.com/jubianchi/ppc/tree/master

mhsdesign added 4 commits May 19, 2023 16:01
…able functional tests

unit test pass
functional test 15 fail, because of comments and the space and new line handling of TextNode
@mhsdesign
Copy link
Contributor Author

mhsdesign commented May 19, 2023

15 Functional tests fails because of insufficient comment, and whitespace handling in textnodes and other funny things.

But basically its now working :D And every node can be parsed with parsica.

Edit: all test pass ;)

@mhsdesign
Copy link
Contributor Author

mhsdesign added 2 commits May 29, 2023 11:13
this is to avoid infinite recursion, in case the nested parser calls `ExpressionParser::get()`
@mhsdesign
Copy link
Contributor Author

This looks also pretty neat and up to date / maintained: https://github.com/mrsuh/php-bison-skeleton/tree/master

Copy link
Member

@grebaldi grebaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mhsdesign,

I finally came around to do a full code review on your parsica PR :) (most of my code comments are inconsequential nitpicks, I just added them to offer my thoughts on some patterns I've noticed)

First of all, I have to point out what a wonderful job you've done on the implementation. I am particularly amazed by the speed at which you've learned the ins and outs of parsica and were able to apply that knowledge onto the ComponentEngine language. You are indeed virtuous in the parsing business :)

The functional-style that is implied with parsica's API looks quite elegant. There are however some places (like the UtilityParser class) that reveal the awkwardness resulting from PHP's limited support for the functional paradigm.

That may also be the cause for a nasty surprise I encountered when I ran the tests with Xdebug on:

Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames

I was able to remove this warning by increasing Xdebug's built-in limit on call-stack depth, which verifies that there's not an actual infinite loop. Parsica simply causes those deep call-stacks - which isn't surprising given its approach. That however is very likely the reason for the deterioration of performance that you've mentioned.

Putting performance aside, I must add another point of criticism to the overall approach:

The Component Engine will offer a published language, which makes it imho a bit problematic to depend on a third-party at runtime for parsing it. In composer land, if we depend on parsica (or any of the other runtime parser-generators you've listed), then our dependency may add risk to consumers who depend on parsica as well. Additionally, if we'd go for a third-party library for parsing, we'd need to make sure not to leak any details of that library through our public interface.

That is why I'd much prefer a solution like https://github.com/mrsuh/php-bison-skeleton, because it only adds a dev dependency which is uncritical to Component Engine consumers.

However, I'm not quite convinced that Bison is right for us. Citing https://www.gnu.org/software/bison/:

You need to be fluent in C or C++ programming in order to use Bison.

Well, that's not entirely true, but in principle, contributors of the Component Engine would need to understand additional languages (other than PHP) like:

Not sure if I'm being too petty about this, but I think as long as we're just targeting the PHP platform, the Component Engine should stay within the limitations of PHP. The first reason is that debugging and quality control will be more approachable for contributors when PHP is all they have to deal with (I'm saying "they", but I'm very much including myself in this :D).

The second reason is more speculative: I believe that staying within PHP's limits puts the right amount of pressure on the design of the Component Engine language to stay as simple as possible (simpler language leads to simpler parser) - whereas a more complex solution like Bison may lead to feature creep.

The approach of a static parser generator is interesting though.

As outlined in #26, the Component Engine will eventually require a formal language definition. As of right now, the document mentions EBNF (which is also the basis of Bison's grammar language). But I've done some more digging in that area:

With ISO-14977 EBNF is backed by a public standard. But as this article argues, the standard is not unproblematic.

The solution is supposed to be ABNF (Augmented Backus-Naur Form), which is backed by the open standard RFC 5234 (plus another one: RFC 7405 adding case sensitive string support to ABNF).

So, an idea would be to have a static parser generator that is able to generate a parser from textual ABNF (or EBNF for that matter) descriptions. Such a solution could then easily be externalized and used elsewhere (for instance: generic DSLs in Neos :D).

But my research wasn't finished as I've stumbled upon this one: https://www.crockford.com/mckeeman.html

This document describes the McKeeman Form, an alternative notation for grammars that is much simpler than its EBNF or ABNF counterparts. This is the notation that is used for the formal definition of JSON (see: https://www.json.org/json-en.html).

Having seen that, I'm more than intruiged. I believe it would be absolutely worthwhile to create a static parser generator that reads the McKeeman Form rather than ABNF or EBNF. As a side effect, this would generate a very concise formal language definition for the Component Engine :)

But, I digress. This sounds a lot like a separate project already 😅.

Back to your PR:

I think you've shown that the Token-Parser can (and should) indeed be replaced. This made me realize a thing that I had been thinking about for quite a while now: What actually ought to be our public API?

The answer is: The AST :)

So, my major takeaway from your efforts in implementing a parser on parsica-basis is:

  1. The AST object definitions are best separated from the parsing logic.
  2. A parser generator (if any) should best act statically
  3. The McKeeman Form may be an interesting format for expressing the Component Engine language grammar
  4. A McKeeman Form parser may be able to replace the Token Parser

This is all I got for now :) I'm very interested in your thoughts on all this.

And again: Thanks so much for your effort!

Source::fromString($enumDeclarationAsString)
)->getIterator()
);
return EnumDeclarationParser::parseFromString($enumDeclarationAsString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EnumDeclarationNode::fromString might as well be removed then.

Source::fromString($identifierAsString)
)->getIterator()
);
return IdentifierParser::get()->tryString($identifierAsString)->output();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole method IdentifierNode::fromString should be moved to IdentifierParser::parseFromString I think.

private function __construct(
public readonly Source $source,
public function __construct(
public readonly Path $sourcePath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent move!

Source::fromString($moduleAsString)
)->getIterator()
);
return ModuleParser::parseFromString($moduleAsString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what I'm gonna say :D

Source::fromString($stringLiteralAsString)
)->getIterator()
);
return StringLiteralParser::get()->tryString($stringLiteralAsString)->output();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same:
StringLiteralNode::fromString -> StringLiteralParser::parseFromString

Comment on lines +59 to +67
public static function hasPrecedence(Precedence $precedence): Parser
{
return self::precedenceLookahead()->bind(function (Precedence $precedenceByNextCharacters) use ($precedence) {
if ($precedence->mustStopAt($precedenceByNextCharacters)) {
return fail($precedence->name . ' must stop at ' . $precedenceByNextCharacters->name);
}
return succeed();
})->label('precedence(' . $precedence->name . ')');
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotta be honest: I do not understand at all what is happening here 😄

Coming from the ExpressionParser it looks like lookAhead is used here to understand whether we must leave the current nesting level of the expression parser. That'd be quite expensive, but I don't know if my guess is right - I sure have trouble wrapping my head around the higher-orderness of it all 😅

Comment on lines +41 to +42
string('true')->voidLeft(new BooleanLiteralNode(true)),
string('false')->voidLeft(new BooleanLiteralNode(false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
string('true')->voidLeft(new BooleanLiteralNode(true)),
string('false')->voidLeft(new BooleanLiteralNode(false))
string('true')->map(fn() => new BooleanLiteralNode(true)),
string('false')->map(fn() => new BooleanLiteralNode(false))

voidLeft is deprecated, but imho it is unnecessary API noise on parsica's side to begin with (perhaps that's why it's deprecated, but the deprecation message @TODO needs test looks more like a "note to oneself" :D).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes he admitted that the deprecation only marks untested code ^^

/** @return Parser<NullLiteralNode> */
public static function get(): Parser
{
return self::$i ??= string('null')->voidLeft(new NullLiteralNode());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return self::$i ??= string('null')->voidLeft(new NullLiteralNode());
return self::$i ??= string('null')->map(fn() => new NullLiteralNode());

Comment on lines +66 to +67
// @todo specification
return alphaChar()->append(takeWhile(orPred(isAlphaNum(), isEqual('-'))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point :D I don't think we had dashes covered before 😅


private static function stringLiteral(): Parser
{
// @todo escapes? or allow `single unescaped $ dollar?`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick test under ECMAScript:

> deno repl
Deno 1.34.3
exit using ctrl+d, ctrl+c, or close()
> console.log(`Can there be a single $ in here?`);
Can there be a single $ in here?

I'd say, that speaks for

allow single unescaped $ dollar?

@mhsdesign
Copy link
Contributor Author

I think we can close this. It was a fun idea to implement and would actually work mostly. But i consider this a really valuable point:

The Component Engine will offer a published language, which makes it imho a bit problematic to depend on a third-party at runtime for parsing it. In composer land, if we depend on parsica (or any of the other runtime parser-generators you've listed), then our dependency may add risk to consumers who depend on parsica as well. Additionally, if we'd go for a third-party library for parsing, we'd need to make sure not to leak any details of that library through our public interface.

and besides this, parsica itself is not really stable yet and not performant at all - no one is working actively on the project since two years which makes this even harder to use as an actual dependency (without investing resources into it ourselves, which would be out of scope ^^)

Anyways i really like to help with the rewrite at #34

For anyone coming to this pr later on, you might see that i did odd things with the precendence and stuff - i did this before i found out about parsicas own expression support, which i didnt leverage:

https://parsica-php.github.io/docs/next/tutorial/20_expressions#arithmetic

the currently build expression parsing is actually not really functional considering complex usecases like: -1 * 2 - 20 * -1, but it currently doenst also work in the to be replaced parser see #27 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants