Introduction
============

This project is a PHP parser **written in PHP itself**.

What is this for?
-----------------

A parser is useful for [static analysis][0], manipulation of code and basically any other
application dealing with code programmatically. A parser constructs an [Abstract Syntax Tree][1]
(AST) of the code and thus allows dealing with it in an abstract and robust way.

There are other ways of processing source code. One that PHP supports natively is using the
token stream generated by [`token_get_all`][2]. The token stream is much more low level than
the AST and thus has different applications: it also lets you analyze the exact formatting of
a file. On the other hand, the token stream is much harder to deal with for more complex analysis.
For example, an AST abstracts away the fact that, in PHP, variables can be written as `$foo`, but also
as `$$bar`, `${'foobar'}` or even `${!${''}=barfoo()}`. You don't have to worry about recognizing
all the different syntaxes from a stream of tokens.
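
To get a feel for how low level the token stream is, here is a minimal sketch that prints the
token names for three ways of writing a variable. It only relies on the built-in `token_get_all`
and `token_name` functions; the helper name `tokenNames` is made up for this illustration.

```php
<?php
// Hypothetical helper: map a PHP snippet to a flat list of token names.
function tokenNames(string $code): array
{
    $names = [];
    foreach (token_get_all($code) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

$snippets = ['<?php $foo;', '<?php $$bar;', "<?php \${'foobar'};"];
foreach ($snippets as $code) {
    echo $code, "\n    ", implode(' ', tokenNames($code)), "\n";
}
```

Each spelling yields a different token sequence, so code working on tokens has to recognize every
one of them separately. That is the kind of detail the AST hides.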

Another question is: Why would I want to have a PHP parser *written in PHP*? Well, PHP might not be
a language especially suited for fast parsing, but processing the AST is much easier in PHP than it
would be in other, faster languages like C. Furthermore, the people most likely to want to do
programmatic PHP code analysis are, incidentally, PHP developers, not C developers.

What can it parse?
------------------

The parser supports parsing PHP 7 and PHP 8 code, with the following exceptions:

* Namespaced names containing whitespace (e.g. `Foo \ Bar` instead of `Foo\Bar`) are not supported.
  These are illegal in PHP 8, but are legal in earlier versions. However, PHP-Parser does not
  support them for any version.

PHP-Parser 4.x had full support for parsing PHP 5. PHP-Parser 5.x has only limited support, with the
following caveats:

* Some variable expressions like `$$foo[0]` are valid in both PHP 5 and PHP 7, but are interpreted
  differently. In such cases, the PHP 7 AST will always be constructed (using `($$foo)[0]`
  rather than `${$foo[0]}`).
* Declarations of the form `global $$var[0]` are not supported in PHP 7 and will cause a parse
  error. In error recovery mode, it is possible to continue parsing after such declarations.

As the parser is based on the tokens returned by `token_get_all` (which is only able to lex the PHP
version it runs on), a wrapper for emulating tokens from newer versions is additionally provided.
This makes it possible to parse PHP 8.3 source code while running on PHP 7.4, for example. This
emulation is not perfect, but works well in practice.

Finally, it should be noted that the parser aims to accept all valid code, not reject all invalid
code. It will generally accept code that is only valid in newer versions (even when targeting an
older one), and accept code that is syntactically correct, but would result in a compiler error.

What output does it produce?
----------------------------

The parser produces an [Abstract Syntax Tree][1] (AST), also known as a node tree. What this looks
like can best be seen in an example. The program `<?php echo 'Hi', 'World';` will give you a node tree
that looks roughly like this:

```
array(
    0: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: Hi
            )
            1: Scalar_String(
                value: World
            )
        )
    )
)
```

This matches the structure of the code: An echo statement, which takes two strings as expressions,
with the values `Hi` and `World`.

You can also see that the AST does not contain any whitespace information (but most comments are saved).
However, it does retain accurate position information, which can be used to inspect precise formatting.
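
As a rough sketch of how such a dump can be produced with PHP-Parser 5.x, the snippet below parses
the example program and prints its node tree. It assumes that the library was installed with
Composer and that `vendor/autoload.php` sits next to the script; adjust the path for your setup.

```php
<?php
// Assumption: nikic/php-parser 5.x installed via Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Error;
use PhpParser\NodeDumper;
use PhpParser\ParserFactory;

$code = "<?php echo 'Hi', 'World';";

// Create a parser for the newest PHP version the library supports.
$parser = (new ParserFactory())->createForNewestSupportedVersion();

try {
    // parse() returns an array of statement nodes.
    $stmts = $parser->parse($code);
} catch (Error $e) {
    echo 'Parse error: ', $e->getMessage(), "\n";
    exit(1);
}

// Print the node tree in human-readable form.
echo (new NodeDumper())->dump($stmts), "\n";
```

The nodes are plain objects, so the same information is also available programmatically, for
example as `$stmts[0]->exprs[1]->value`.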

What else can it do?
--------------------

Apart from the parser itself, this package also bundles support for some other, related features:

* Support for pretty printing, which is the act of converting an AST into PHP code. Please note
  that "pretty printing" does not imply that the output is especially pretty. That's just what
  it's called ;)
* Support for serializing and unserializing the node tree to JSON.
* Support for dumping the node tree in a human-readable form (see the section above for an
  example of what the output looks like).
* Infrastructure for traversing and changing the AST (node traverser and node visitors).
* A node visitor for resolving namespaced names.
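
The following sketch combines two of these features: it changes the AST with a node visitor and
then converts the result back to PHP code with the pretty printer. It assumes PHP-Parser 5.x
installed with Composer and `vendor/autoload.php` next to the script. The visitor class
`UppercaseStrings` is a made-up example and not part of the library.

```php
<?php
// Assumption: nikic/php-parser 5.x installed via Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Node;
use PhpParser\Node\Scalar\String_;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;
use PhpParser\PrettyPrinter\Standard;

// Hypothetical visitor for illustration: upper-cases every string literal.
class UppercaseStrings extends NodeVisitorAbstract
{
    public function leaveNode(Node $node)
    {
        if ($node instanceof String_) {
            $node->value = strtoupper($node->value);
        }
        return null; // Returning null keeps the (modified) node in place.
    }
}

$parser = (new ParserFactory())->createForNewestSupportedVersion();
$stmts = $parser->parse("<?php echo 'Hi', 'World';");

// Run the visitor over the whole tree.
$traverser = new NodeTraverser();
$traverser->addVisitor(new UppercaseStrings());
$stmts = $traverser->traverse($stmts);

// Convert the modified AST back into PHP source code.
$output = (new Standard())->prettyPrintFile($stmts);
echo $output, "\n";
```

The bundled name-resolving visitor (`PhpParser\NodeVisitor\NameResolver`) is added to a traverser
in the same way as the custom visitor here.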

 [0]: http://en.wikipedia.org/wiki/Static_program_analysis
 [1]: http://en.wikipedia.org/wiki/Abstract_syntax_tree
 [2]: http://php.net/token_get_all