From ca1453401fb1b59328e6b4978913667a0819c729 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang" <edwardzyang@thewritingpot.com>
Date: Fri, 25 Aug 2006 03:01:16 +0000
Subject: [PATCH] Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
---
 TODO                  | 21 +++++++++++++++------
 docs/code-quality.txt |  8 +++++---
 docs/config-ideas.txt |  1 +
 docs/config.txt       |  2 ++
 docs/security.txt     | 17 +++++++++--------
 docs/spec.txt         |  6 ++++--
 6 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/TODO b/TODO
index 9b03817f..3643491a 100644
--- a/TODO
+++ b/TODO
@@ -3,23 +3,32 @@ Todo List
 Core:
  - Finish table and shorthand CSS attributes
     - border-collapse, caption-side, empty-cells, table-layout, vertical-align
-    - background
+    - background (and friends)
     - border, border-*
     - font
     - list-style
  - Implement all non-essential attribute transforms
  - Microsoft Word HTML cleaning
  - Plugins for major CMSes
+ - Rewrite *Definition and Config relationship, add various "levels" of cleaning
+ - Support other character encodings out-of-the-box
+ - Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output
 
 Code issues:
  - Massive profiling, make it faster!
  - Make URI validation routines tighter (especially mailto)
  - Distinguish between different types of URIs, for instance, a mailto URI
    in IMG SRC is nonsensical
- - Factor out Host validation to its own AttrDef
- - Rewrite table's child definition
- - Silently drop content inbetween SCRIPT tags
+ - Rewrite table's child definition to be faster, smarter, and regexp-free
+ - Silently drop content in between SCRIPT tags (can be generalized to allow
+   specification of elements that, when detected as foreign, trigger removal
+   of children, although unbalanced tags could wreak havoc, or at least delete
+   the rest of the document).
 
 Enhancements:
- - Do fixes for Firefox's inability to handle COL alignment props (Bug 915)
- - Pretty-printing HTML
\ No newline at end of file
+ - Fixes for Firefox's inability to handle COL alignment props (Bug 915)
+ - Pretty-printing HTML
+ - Hooks for adding custom processors to custom namespaced tags and attributes,
+   offer default implementation
+ - Auto-paragraphing (be sure to leverage the fact that we know when things
+   shouldn't be paragraphed, such as lists and tables).
diff --git a/docs/code-quality.txt b/docs/code-quality.txt
index 33751ca4..7c23fc7f 100644
--- a/docs/code-quality.txt
+++ b/docs/code-quality.txt
@@ -21,13 +21,15 @@ AttrDef
         variable overwriting, missing validation for query, fragment and path,
         no percent-encode fixing
     CSS - parser doesn't accept advanced CSS (fringe)
+    Number - constructor interface is inconsistent with Integer
 AttrTransform - doesn't accept AttrContext, non-validating
-    Lang - invalid xml:lang value can overwrite valid lang value (fringe)
 ChildDef - not-allowed nodes translated to text, likely invalid handling
-Config - "load configuration" hooks missing, rich set* accessors missing
+Config - "load configuration" hooks missing, rich set* accessors missing,
+    needs redefined relationship with the definitions
 Strategy
     FixNesting - cannot bubble nodes out of structures
-    MakeWellFormed - insufficient automatic closing definitions
+    MakeWellFormed - insufficient automatic closing definitions (check HTML
+        spec for optional end tags).
     RemoveForeignElements - should be run in parallel with MakeWellFormed
 URIScheme - needs to have callable generic checks
     ftp - missing typecode check
diff --git a/docs/config-ideas.txt b/docs/config-ideas.txt
index f9d9085f..852b3aca 100644
--- a/docs/config-ideas.txt
+++ b/docs/config-ideas.txt
@@ -28,6 +28,7 @@ time.  Note the naming convention: %Namespace.Directive
 
 %Attr.MaxWidth, 
 %Attr.MaxHeight - caps for width and height related checks.
+    (a hack in Pixels for an image crashing attack could be replaced by this)
 
 %URI.Munge - will munge all URIs to a different URI, which should redirect
     the user to the applicable page. A urlencoded version of the URI
diff --git a/docs/config.txt b/docs/config.txt
index 84d920fc..dda4b2b2 100644
--- a/docs/config.txt
+++ b/docs/config.txt
@@ -17,6 +17,8 @@ are passed.  These classes are: HTMLPurifier::*, Generator::generateFromTokens
 and Lexer::tokenizeHTML.  However, whenever a valid configuration object
 is defined, that object should be used.
 
+-- the following are projected changes to the configuration system --
+
 In relation to HTMLDefinition and CSSDefinition, there are going to be some
 major structural changes to enable the easy configuration of these objects.
 Due to the intricacy of these objects, it's not feasible to ask an average
diff --git a/docs/security.txt b/docs/security.txt
index f077691c..f388559d 100644
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -9,11 +9,11 @@ to be effective. Things to remember:
 1. UTF-8. Currently, the parser runs under the assumption that it is dealing
 with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
 character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, you should switch. Now. (in future versions, however,
-I may make the character encoding configurable, but there's only so much I
-can do). Make sure any input is properly converted to UTF-8, or the parser
-will mangle it badly (though it won't be a security risk if you're outputting
-it as UTF-8 though).
+your character encoding, you should switch. Now. Make sure any input is
+properly converted to UTF-8, or the parser will mangle it badly
+(though it won't be a security risk if you're outputting it as UTF-8).
+We will be adding out-of-the-box support for the other major character
+encodings shortly.
 
 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
@@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
 this may be configurable in the future.
 
 3. IDs. They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. Without setting
-%Attr.IDBlacklist to the proper 
+rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
+needs to be set: we may want to consider disallowing IDs by default to
+save lazy programmers.
 
 4. [PROJECTED] Links. We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
@@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old
 bad taste.  A neat feature would be the ability to define acceptable colors
 in a document, but that's not likely to be implemented for a while.  In the
 meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) cna't mess up your layout.
+can be quite useful) can't mess up your layout.
diff --git a/docs/spec.txt b/docs/spec.txt
index c51848ad..db88e9c5 100644
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
 still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
 tags from being nested within each other).
 
-This document seeks to detail the inner workings of HTML Purifier.  The first
+This document is no longer a detailed description of how HTML Purifier works,
+as those descriptions have been moved to the appropriate code.  The first
 draft was drawn up after two rough code sketches and the implementation of a
 forgiving lexer.  You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
@@ -52,4 +53,5 @@ In summary:
 HTML Purifier is best suited for documents that require a rich array of
 HTML tags.  Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
-all this functionality (or not written in HTML at all).
+all this functionality (or not written in HTML at all), although this may
+change in the future.