PHP-Parser/lib/PhpParser/Lexer.php

<?php declare(strict_types=1);

namespace PhpParser;

require __DIR__ . '/compatibility_tokens.php';

class Lexer {
    /** @var list<Token> List of tokens */
    protected array $tokens;

    /**
     * Tokenize the provided source code.
     *
     * The token array is in the same format as provided by the PhpToken::tokenize() method in
     * PHP 8.0. The tokens are instances of PhpParser\Token, to abstract over a polyfill
     * implementation in earlier PHP versions.
     *
     * The token array is terminated by a sentinel token with token ID 0.
     * The token array does not discard any tokens (i.e. whitespace and comments are included).
     * The token position attributes are offsets into this token array.
     *
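     * A minimal usage sketch, not a definitive recipe. It assumes the ErrorHandler\Collecting
     * class that ships with this library, so lexing errors are collected instead of thrown;
     * the variable names are illustrative only:
     *
     *     $lexer = new Lexer();
     *     $errorHandler = new ErrorHandler\Collecting();
     *     $tokens = $lexer->tokenize('<?php echo 1;', $errorHandler);
     *     // The last element of $tokens is the sentinel token with ID 0.
     *     // Any lexing errors are available through $errorHandler->getErrors().
     *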
     * @param string $code The source code to tokenize.
     * @param ErrorHandler|null $errorHandler Error handler to use for lexing errors. Defaults to
     *                                        ErrorHandler\Throwing.
     * @return Token[] Tokens
     */
    public function tokenize(string $code, ?ErrorHandler $errorHandler = null): array {
        if (null === $errorHandler) {
            $errorHandler = new ErrorHandler\Throwing();
        }

        $scream = ini_set('xdebug.scream', '0');

        $this->tokens = @Token::tokenize($code);
        $this->postprocessTokens($errorHandler);

        if (false !== $scream) {
            ini_set('xdebug.scream', $scream);
        }

        return $this->tokens;
    }

    private function handleInvalidCharacter(Token $token, ErrorHandler $errorHandler): void {
        $chr = $token->text;
        if ($chr === "\0") {
            // PHP truncates the error message after a null byte, so this needs a special case.
            $errorMsg = 'Unexpected null byte';
        } else {
            $errorMsg = sprintf(
                'Unexpected character "%s" (ASCII %d)', $chr, ord($chr)
            );
        }

        $errorHandler->handleError(new Error($errorMsg, [
            'startLine' => $token->line,
            'endLine' => $token->line,
            'startFilePos' => $token->pos,
            'endFilePos' => $token->pos,
        ]));
    }

    private function isUnterminatedComment(Token $token): bool {
        return $token->is([\T_COMMENT, \T_DOC_COMMENT])
            && substr($token->text, 0, 2) === '/*'
            && substr($token->text, -2) !== '*/';
    }

    protected function postprocessTokens(ErrorHandler $errorHandler): void {
        // This function reports errors (bad characters and unterminated comments) in the token
        // array, and performs certain canonicalizations:
        //  * Use PHP 8.1 T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG and
        //    T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG tokens used to disambiguate intersection types.
        //  * Add a sentinel token with ID 0.

        $numTokens = \count($this->tokens);
        if ($numTokens === 0) {
            // Empty input edge case: Just add the sentinel token.
            $this->tokens[] = new Token(0, "\0", 1, 0);
            return;
        }

        for ($i = 0; $i < $numTokens; $i++) {
            $token = $this->tokens[$i];
            if ($token->id === \T_BAD_CHARACTER) {
                $this->handleInvalidCharacter($token, $errorHandler);
            }

            if ($token->id === \ord('&')) {
                $next = $i + 1;
                while (isset($this->tokens[$next]) && $this->tokens[$next]->id === \T_WHITESPACE) {
                    $next++;
                }
                $followedByVarOrVarArg = isset($this->tokens[$next]) &&
                    $this->tokens[$next]->is([\T_VARIABLE, \T_ELLIPSIS]);
                $token->id = $followedByVarOrVarArg
                    ? \T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG
                    : \T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG;
            }
        }

        // Check for unterminated comment
        $lastToken = $this->tokens[$numTokens - 1];
        if ($this->isUnterminatedComment($lastToken)) {
            $errorHandler->handleError(new Error('Unterminated comment', [
                'startLine' => $lastToken->line,
                'endLine' => $lastToken->getEndLine(),
                'startFilePos' => $lastToken->pos,
                'endFilePos' => $lastToken->getEndPos(),
            ]));
        }

        // Add sentinel token.
        $this->tokens[] = new Token(0, "\0", $lastToken->getEndLine(), $lastToken->getEndPos());
    }

    /**
     * Returns the token array for the last tokenized source code.
     *
     * @return Token[] Array of tokens
     */
    public function getTokens(): array {
        return $this->tokens;
    }
}