Optimize next*() functions in DirectLex, add test for offset. Update Lexer documents.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@90 48356398-32a2-884e-a903-53898d9a118a
2025-10-17 06:56:06 +02:00 · 2006-07-22 18:55:34 +00:00
parent eac83995e1
commit ac1e62e043
3 changed files with 45 additions and 50 deletions
--- a/docs/lexer.txt
+++ b/docs/lexer.txt
@@ -2,27 +2,40 @@
 Lexer

 The lexer parses a string of SGML-style markup and converts them into
-corresponding tokens. It doesn't check for correctness, although it's
+corresponding tokens. It doesn't check for well-formedness, although it's
 internal mechanism may make this automatic (such as the case of DOMLex).

 We have several implementations of the Lexer:

-DirectLex - our in-house implementation
+DirectLex [4,5] - our in-house implementation
    DirectLex has absolutely no dependencies, making it a reasonably good
-    default for PHP4.  Written with efficiency in mind, it is generally
-    faster than the PEAR parser, although the two are very close and usually
-    overlap a bit.  It will support UTF-8 completely eventually.
+    default for PHP4.  Written with efficiency in mind, it is up to two
+    times faster than the PEAR parser.  It will support UTF-8 completely
+    eventually.

-PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+PEARSax3 [4,5] - uses the PEAR package XML_HTMLSax3 to parse
    PEAR, not suprisingly, also has a SAX parser for HTML.  I don't know
-    very much about implementation, but it's fairly well written.  You need
-    to have PEAR added to your path to use it though.  Not sure whether or
-    not it's UTF-8 aware.
+    very much about its implementation, but it's fairly well written.  However,
+    its abstraction comes at a price: performance.  You need to have it installed,
+    and if the API changes, it might break our adapter.  Not sure whether or not
+    it's UTF-8 aware, but it has some trouble parsing entities.

-DOMLex - uses the PHP5 core extension DOM to parse
+DOMLex [5] - uses the PHP5 core extension DOM to parse
    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
    It gives us a forgiving HTML parser, which we use to transform the HTML
-    into a DOM, and then into the tokens.  It is extremely fast, and is the
+    into a DOM, and then into the tokens.  It is blazingly fast, and is the
    default choice for PHP 5.  However, entity resolution may be troublesome,
-    though it's UTF-8 is excellent.
+    though its UTF-8 is excellent.  Also, any empty elements will have empty
+    tokens associated with them, even if this is prohibited.

+We use tokens because a DOM representation would:
+
+1. Require more processing power to create,
+2. Require recursion to iterate,
+3. Have to be compatible with PHP 5's DOM,
+4. Contain the entire document structure (html and body are not needed), and
+5. Offer an unknown improvement in readability.
+
+The last item means that the functions for manipulating tokens are already
+fairly compact; when they are well commented, further abstraction may not
+be needed.