-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Feature: Try out parsica #20
base: main
Are you sure you want to change the base?
Conversation
…etPrecedenceMatcher` before: 23ms now: 53ms
back to 23ms ;)
All ExpressionParser tests are green
Hi @grebaldi ;) We talked about using a parser combinator, and i do still like the idea. So i started prototyping with the library http://parsica-php.github.io Beginning with 3870cd6 all tests using the It was fun to do and i do like the declarative style. The only current downside so far seems to be performance. The tests on branch main take about 23ms and this branch with parsica partially implemented takes already twice as much 44ms. There are some pitfalls when using parsica as one can build accidentally quite inefficent parsers (as demonstrated by the custom coded Edit: A faster, but less popular library than parsica seems to be: https://github.com/jubianchi/ppc/tree/master |
…able functional tests unit test pass functional test 15 fail, because of comments and the space and new line handling of TextNode
15 Functional tests fails because of insufficient comment, and whitespace handling in textnodes and other funny things. But basically its now working :D And every node can be parsed with parsica. Edit: all test pass ;) |
ALL TEST PASS :D ~90ms
But cache ExpressionParser which speeds it up to 68ms
parser combinators in php land:
or even use a functional library under the hood and build it mostly yourself: https://www.sitepoint.com/functional-programming-with-phunkie-parser-combinators-in-php/ |
this is to avoid infinite recursion, in case the nested parser calls `ExpressionParser::get()`
partially reverts: 0b23265
This looks also pretty neat and up to date / maintained: https://github.com/mrsuh/php-bison-skeleton/tree/master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mhsdesign,
I finally came around to do a full code review on your parsica PR :) (most of my code comments are inconsequential nitpicks, I just added them to offer my thoughts on some patterns I've noticed)
First of all, I have to point out what a wonderful job you've done on the implementation. I am particularly amazed by the speed at which you've learned the ins and outs of parsica and were able to apply that knowledge onto the ComponentEngine language. You are indeed virtuous in the parsing business :)
The functional-style that is implied with parsica's API looks quite elegant. There are however some places (like the UtilityParser
class) that reveal the awkwardness resulting from PHP's limited support for the functional paradigm.
That may also be the cause for a nasty surprise I encountered when I ran the tests with Xdebug on:
Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames
I was able to remove this warning by increasing Xdebug's built-in limit on call-stack depth, which verifies that there's not an actual infinite loop. Parsica simply causes those deep call-stacks - which isn't surprising given its approach. That however is very likely the reason for the deterioration of performance that you've mentioned.
Putting performance aside, I must add another point of criticism to the overall approach:
The Component Engine will offer a published language, which makes it imho a bit problematic to depend on a third-party at runtime for parsing it. In composer land, if we depend on parsica (or any of the other runtime parser-generators you've listed), then our dependency may add risk to consumers who depend on parsica as well. Additionally, if we'd go for a third-party library for parsing, we'd need to make sure not to leak any details of that library through our public interface.
That is why I'd much prefer a solution like https://github.com/mrsuh/php-bison-skeleton, because it only adds a dev dependency which is uncritical to Component Engine consumers.
However, I'm not quite convinced that Bison is right for us. Citing https://www.gnu.org/software/bison/:
You need to be fluent in C or C++ programming in order to use Bison.
Well, that's not entirely true, but in principle, contributors of the Component Engine would need to understand additional languages (other than PHP) like:
- Bison's grammar language (https://www.gnu.org/software/bison/manual/bison.html)
- M4 (https://www.gnu.org/software/m4/manual/m4.html)
Not sure if I'm being too petty about this, but I think as long as we're just targeting the PHP platform, the Component Engine should stay within the limitations of PHP. The first reason is that debugging and quality control will be more approachable for contributors when PHP is all they have to deal with (I'm saying "they", but I'm very much including myself in this :D).
The second reason is more speculative: I believe that staying within PHP's limits puts the right amount of pressure on the design of the Component Engine language to stay as simple as possible (simpler language leads to simpler parser) - whereas a more complex solution like Bison may lead to feature creep.
The approach of a static parser generator is interesting though.
As outlined in #26, the Component Engine will eventually require a formal language definition. As of right now, the document mentions EBNF (which is also the basis of Bison's grammar language). But I've done some more digging in that area:
With ISO-14977 EBNF is backed by a public standard. But as this article argues, the standard is not unproblematic.
The solution is supposed to be ABNF (Augmented Backus-Naur Form), which is backed by the open standard RFC 5234 (plus another one: RFC 7405 adding case sensitive string support to ABNF).
So, an idea would be to have a static parser generator that is able to generate a parser from textual ABNF (or EBNF for that matter) descriptions. Such a solution could then easily be externalized and used elsewhere (for instance: generic DSLs in Neos :D).
But my research wasn't finished as I've stumbled upon this one: https://www.crockford.com/mckeeman.html
This document describes the McKeeman Form, an alternative notation for grammars that is much simpler than its EBNF or ABNF counterparts. This is the notation that is used for the formal definition of JSON (see: https://www.json.org/json-en.html).
Having seen that, I'm more than intruiged. I believe it would be absolutely worthwhile to create a static parser generator that reads the McKeeman Form rather than ABNF or EBNF. As a side effect, this would generate a very concise formal language definition for the Component Engine :)
But, I digress. This sounds a lot like a separate project already 😅.
Back to your PR:
I think you've shown that the Token-Parser can (and should) indeed be replaced. This made me realize a thing that I had been thinking about for quite a while now: What actually ought to be our public API?
The answer is: The AST :)
So, my major takeaway from your efforts in implementing a parser on parsica-basis is:
- The AST object definitions are best separated from the parsing logic.
- A parser generator (if any) should best act statically
- The McKeeman Form may be an interesting format for expressing the Component Engine language grammar
- A McKeeman Form parser may be able to replace the Token Parser
This is all I got for now :) I'm very interested in your thoughts on all this.
And again: Thanks so much for your effort!
Source::fromString($enumDeclarationAsString) | ||
)->getIterator() | ||
); | ||
return EnumDeclarationParser::parseFromString($enumDeclarationAsString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EnumDeclarationNode::fromString
might as well be removed then.
Source::fromString($identifierAsString) | ||
)->getIterator() | ||
); | ||
return IdentifierParser::get()->tryString($identifierAsString)->output(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole method IdentifierNode::fromString
should be moved to IdentifierParser::parseFromString
I think.
private function __construct( | ||
public readonly Source $source, | ||
public function __construct( | ||
public readonly Path $sourcePath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent move!
Source::fromString($moduleAsString) | ||
)->getIterator() | ||
); | ||
return ModuleParser::parseFromString($moduleAsString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what I'm gonna say :D
Source::fromString($stringLiteralAsString) | ||
)->getIterator() | ||
); | ||
return StringLiteralParser::get()->tryString($stringLiteralAsString)->output(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same:
StringLiteralNode::fromString
-> StringLiteralParser::parseFromString
public static function hasPrecedence(Precedence $precedence): Parser | ||
{ | ||
return self::precedenceLookahead()->bind(function (Precedence $precedenceByNextCharacters) use ($precedence) { | ||
if ($precedence->mustStopAt($precedenceByNextCharacters)) { | ||
return fail($precedence->name . ' must stop at ' . $precedenceByNextCharacters->name); | ||
} | ||
return succeed(); | ||
})->label('precedence(' . $precedence->name . ')'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotta be honest: I do not understand at all what is happening here 😄
Coming from the ExpressionParser
it looks like lookAhead
is used here to understand whether we must leave the current nesting level of the expression parser. That'd be quite expensive, but I don't know if my guess is right - I sure have trouble wrapping my head around the higher-orderness of it all 😅
string('true')->voidLeft(new BooleanLiteralNode(true)), | ||
string('false')->voidLeft(new BooleanLiteralNode(false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string('true')->voidLeft(new BooleanLiteralNode(true)), | |
string('false')->voidLeft(new BooleanLiteralNode(false)) | |
string('true')->map(fn() => new BooleanLiteralNode(true)), | |
string('false')->map(fn() => new BooleanLiteralNode(false)) |
voidLeft
is deprecated, but imho it is unnecessary API noise on parsica's side to begin with (perhaps that's why it's deprecated, but the deprecation message @TODO needs test
looks more like a "note to oneself" :D).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes he admitted that the deprecation only marks untested code ^^
/** @return Parser<NullLiteralNode> */ | ||
public static function get(): Parser | ||
{ | ||
return self::$i ??= string('null')->voidLeft(new NullLiteralNode()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return self::$i ??= string('null')->voidLeft(new NullLiteralNode()); | |
return self::$i ??= string('null')->map(fn() => new NullLiteralNode()); |
// @todo specification | ||
return alphaChar()->append(takeWhile(orPred(isAlphaNum(), isEqual('-')))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point :D I don't think we had dashes covered before 😅
|
||
private static function stringLiteral(): Parser | ||
{ | ||
// @todo escapes? or allow `single unescaped $ dollar?` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick test under ECMAScript:
> deno repl
Deno 1.34.3
exit using ctrl+d, ctrl+c, or close()
> console.log(`Can there be a single $ in here?`);
Can there be a single $ in here?
I'd say, that speaks for
allow
single unescaped $ dollar?
I think we can close this. It was a fun idea to implement and would actually work mostly. But i consider this a really valuable point:
and besides this, parsica itself is not really stable yet and not performant at all - no one is working actively on the project since two years which makes this even harder to use as an actual dependency (without investing resources into it ourselves, which would be out of scope ^^) Anyways i really like to help with the rewrite at #34 For anyone coming to this pr later on, you might see that i did odd things with the precendence and stuff - i did this before i found out about parsicas own expression support, which i didnt leverage: https://parsica-php.github.io/docs/next/tutorial/20_expressions#arithmetic the currently build expression parsing is actually not really functional considering complex usecases like: |
@grebaldi