From ca1453401fb1b59328e6b4978913667a0819c729 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Fri, 25 Aug 2006 03:01:16 +0000
Subject: [PATCH] Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
---
 TODO                  | 21 +++++++++++++++------
 docs/code-quality.txt |  8 +++++---
 docs/config-ideas.txt |  1 +
 docs/config.txt       |  2 ++
 docs/security.txt     | 17 +++++++++--------
 docs/spec.txt         |  6 ++++--
 6 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/TODO b/TODO
index 9b03817f..3643491a 100644
--- a/TODO
+++ b/TODO
@@ -3,23 +3,32 @@ Todo List
 Core:
  - Finish table and shorthand CSS attributes
     - border-collapse, caption-side, empty-cells, table-layout, vertical-align
-    - background
+    - background (and friends)
     - border, border-*
     - font
     - list-style
  - Implement all non-essential attribute transforms
  - Microsoft Word HTML cleaning
  - Plugins for major CMSes
+ - Rewrite *Definition and Config relationship, add various "levels" of cleaning
+ - Support other character encodings out-of-the-box
+ - Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output
 
 Code issues:
  - Massive profiling, make it faster!
  - Make URI validation routines tighter (especially mailto)
  - Distinguish between different types of URIs, for instance, a mailto URI
    in IMG SRC is nonsensical
- - Factor out Host validation to its own AttrDef
- - Rewrite table's child definition
- - Silently drop content inbetween SCRIPT tags
+ - Rewrite table's child definition to be faster, smart, and regexp free
+ - Silently drop content inbetween SCRIPT tags (can be generalized to allow
+   specification of elements that, when detected as foreign, trigger removal
+   of children, although unbalanced tags could wreck havoc (or at least delete
+   the rest of the document).
 
 Enhancements:
- - Do fixes for Firefox's inability to handle COL alignment props (Bug 915)
- - Pretty-printing HTML
\ No newline at end of file
+ - Fixes for Firefox's inability to handle COL alignment props (Bug 915)
+ - Pretty-printing HTML
+ - Hooks for adding custom processors to custom namespaced tags and attributes,
+   offer default implementation
+ - Auto-paragraphing (be sure to leverage fact that we know when things
+   shouldn't be paragraphed, such as lists and tables).
diff --git a/docs/code-quality.txt b/docs/code-quality.txt
index 33751ca4..7c23fc7f 100644
--- a/docs/code-quality.txt
+++ b/docs/code-quality.txt
@@ -21,13 +21,15 @@ AttrDef
         variable overwriting, missing validation for query, fragment and
         path, no percent-encode fixing
     CSS - parser doesn't accept advanced CSS (fringe)
+    Number - constructor interface is inconsistent with Integer
 AttrTransform - doesn't accept AttrContext, non-validating
-    Lang - invalid xml:lang value can overwrite valid lang value (fringe)
 ChildDef - not-allowed nodes translated to text, likely invalid handling
-Config - "load configuration" hooks missing, rich set* accessors missing
+Config - "load configuration" hooks missing, rich set* accessors missing,
+    needs redefined relationship with the definitions
 Strategy
     FixNesting - cannot bubble nodes out of structures
-    MakeWellFormed - insufficient automatic closing definitions
+    MakeWellFormed - insufficient automatic closing definitions (check HTML
+        spec for optional end tags).
     RemoveForeignElements - should be run in parallel with MakeWellFormed
 URIScheme - needs to have callable generic checks
     ftp - missing typecode check
diff --git a/docs/config-ideas.txt b/docs/config-ideas.txt
index f9d9085f..852b3aca 100644
--- a/docs/config-ideas.txt
+++ b/docs/config-ideas.txt
@@ -28,6 +28,7 @@ time. Note the naming convention:
 %Namespace.Directive
 
 %Attr.MaxWidth, %Attr.MaxHeight - caps for width and height related checks.
+    (a hack in Pixels for an image crashing attack could be replaced by this)
 
 %URI.Munge - will munge all URIs to a different URI, which should redirect the
     user to the applicable page. A urlencoded version of the URI
diff --git a/docs/config.txt b/docs/config.txt
index 84d920fc..dda4b2b2 100644
--- a/docs/config.txt
+++ b/docs/config.txt
@@ -17,6 +17,8 @@ are passed. These classes are: HTMLPurifier::*, Generator::generateFromTokens
 and Lexer::tokenizeHTML. However, whenever a valid configuration object is
 defined, that object should be used.
 
+-- the following is projected changes to the configuration system --
+
 In relation to HTMLDefinition and CSSDefinition, there are going to be some
 major structural changes to enable the easy configuration of these objects.
 Due to the intricacy of these objects, it's not feasible to ask an average
diff --git a/docs/security.txt b/docs/security.txt
index f077691c..f388559d 100644
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -9,11 +9,11 @@ to be effective. Things to remember:
 1. UTF-8. Currently, the parser runs under the assumption that it is dealing
 with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
 character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, you should switch. Now. (in future versions, however,
-I may make the character encoding configurable, but there's only so much I
-can do). Make sure any input is properly converted to UTF-8, or the parser
-will mangle it badly (though it won't be a security risk if you're outputting
-it as UTF-8 though).
+your character encoding, you should switch. Now. Make sure any input is
+properly converted to UTF-8, or the parser will mangle it badly
+(though it won't be a security risk if you're outputting it as UTF-8 though).
+We will be adding out-of-the-box support for the other major character
+encodings shortly.
 
 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
@@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
 this may be configurable in the future.
 
 3. IDs. They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. Without setting
-%Attr.IDBlacklist to the proper
+rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
+needs to be set: we may want to consider disallowing IDs by default to
+save lazy programmers.
 
 4. [PROJECTED] Links. We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
@@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old bad
 taste. A neat feature would be the ability to define acceptable colors in a
 document, but that's not likely to be implemented for a while. In the
 meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) cna't mess up your layout.
+can be quite useful) can't mess up your layout.
diff --git a/docs/spec.txt b/docs/spec.txt
index c51848ad..db88e9c5 100644
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
 still get the nesting wrong: SafeHtmlChecker.class.php does not prevent tags
 from being nested within each other).
 
-This document seeks to detail the inner workings of HTML Purifier. The first
+This document no longer is a detailed description of how HTMLPurifier works,
+as those descriptions have been moved to the appropriate code. The first
 draft was drawn up after two rough code sketches and the implementation of a
 forgiving lexer. You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
@@ -52,4 +53,5 @@ In summary:
 HTML Purifier is best suited for documents that require a rich array of
 HTML tags. Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
-all this functionality (or not written in HTML at all).
+all this functionality (or not written in HTML at all), although this may
+be changing in the future.