diff --git a/docs/lexer.txt b/docs/lexer.txt
new file mode 100644
index 00000000..a59557ac
--- /dev/null
+++ b/docs/lexer.txt
@@ -0,0 +1,28 @@
+
+Lexer
+
+The lexer parses a string of SGML-style markup and converts it into
+corresponding tokens. It doesn't check for correctness, although its
+internal mechanism may make this automatic (such as the case of DOMLex).
+
+We have several implementations of the Lexer:
+
+DirectLex - our in-house implementation
+    DirectLex has absolutely no dependencies, making it a reasonably good
+    default for PHP4. Written with efficiency in mind, it is generally
+    faster than the PEAR parser, although the two are very close and usually
+    overlap a bit. It will support UTF-8 completely eventually.
+
+PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+    PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
+    very much about its implementation, but it's fairly well written. You need
+    to have PEAR added to your path to use it though. Not sure whether or
+    not it's UTF-8 aware.
+
+DOMLex - uses the PHP5 core extension DOM to parse
+    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
+    It gives us a forgiving HTML parser, which we use to transform the HTML
+    into a DOM, and then into the tokens. It is extremely fast, and is the
+    default choice for PHP 5. However, entity resolution may be troublesome,
+    though its UTF-8 support is excellent.
+
diff --git a/docs/security.txt b/docs/security.txt
index 7b7963e4..95fd0ccb 100644
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
+Security
 
 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting it
 as UTF-8).
 
-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
-waaaay too many quirks for a little parser to handle.
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
+has waaaay too many quirks for a little parser to handle. We did not select
+strict in order to prevent ourselves from being too draconic on users.
 
 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
diff --git a/docs/spec.txt b/docs/spec.txt
index 40fb4c81..d2c00419 100644
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -1,5 +1,5 @@
 
-HTML Purifier
+HTML Purifier Specification
 by Edward Z. Yang
 
 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.
 
 In summary:
 
-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
    into acceptable forms (i.e. )
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,10 +49,10 @@ In summary:
    important for tables).
 5. Validate attributes according to more restrictive definitions based on the
    RFCs.
-6. Translate back into a string.
+6. Translate back into a string. (Generator)
 
 HTML Purifier is best suited for documents that require a rich array of
-HTML tags. Things like blog comments are, in all likelihood, most appropriately
+HTML tags. Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
 all this functionality (or not written in HTML at all).
 
@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).
 
 == STAGE 1 - parsing ==
 
-    Status: A (see source, mainly internal raw)
+    Status: A (see source, mainly internals and UTF-8)
 
-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
-can make the two interfaces compatible. This means that we need a lot
-of little classes:
+The Lexer (currently we have three choices) handles parsing into Tokens.
 
-* StartTag(name, attributes) is openHandler
-* EndTag(name) is closeHandler
-* EmptyTag(name, attributes) is openHandler (is in array of empties)
-* Data(text) is dataHandler
+Here are the mappings for Lexer_PEARSax3
+
+* Start(name, attributes) is openHandler
+* End(name) is closeHandler
+* Empty(name, attributes) is openHandler (is in array of empties)
+* Data(parse(text)) is dataHandler
 * Comment(text) is escapeHandler (has leading -)
-* CharacterData(text) is escapeHandler (has leading [)
+* Data(text) is escapeHandler (has leading [, CDATA)
 
 Ignorable/not being implemented (although we probably want to output them raw):
 
 * ProcessingInstructions(text) is piHandler
 * JavaOrASPInstructions(text) is jaspHandler
 
-Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
-
 == STAGE 2 - remove foreign elements ==