mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-17 05:58:15 +01:00

Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-08-25 03:01:16 +00:00
parent dcec92e7b3
commit ca1453401f
6 changed files with 36 additions and 19 deletions

TODO

@@ -3,23 +3,32 @@ Todo List
Core:
- Finish table and shorthand CSS attributes
- border-collapse, caption-side, empty-cells, table-layout, vertical-align
- background
- background (and friends)
- border, border-*
- font
- list-style
- Implement all non-essential attribute transforms
- Microsoft Word HTML cleaning
- Plugins for major CMSes
- Rewrite *Definition and Config relationship, add various "levels" of cleaning
- Support other character encodings out-of-the-box
- Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output
Code issues:
- Massive profiling, make it faster!
- Make URI validation routines tighter (especially mailto)
- Distinguish between different types of URIs, for instance, a mailto URI
in IMG SRC is nonsensical
- Factor out Host validation to its own AttrDef
- Rewrite table's child definition
- Silently drop content inbetween SCRIPT tags
- Rewrite table's child definition to be faster, smart, and regexp free
- Silently drop content inbetween SCRIPT tags (can be generalized to allow
specification of elements that, when detected as foreign, trigger removal
of children, although unbalanced tags could wreck havoc (or at least delete
the rest of the document).
Enhancements:
- Do fixes for Firefox's inability to handle COL alignment props (Bug 915)
- Pretty-printing HTML
- Fixes for Firefox's inability to handle COL alignment props (Bug 915)
- Pretty-printing HTML
- Hooks for adding custom processors to custom namespaced tags and attributes,
offer default implementation
- Auto-paragraphing (be sure to leverage fact that we know when things
shouldn't be paragraphed, such as lists and tables).


@@ -21,13 +21,15 @@ AttrDef
variable overwriting, missing validation for query, fragment and path,
no percent-encode fixing
CSS - parser doesn't accept advanced CSS (fringe)
Number - constructor interface is inconsistent with Integer
AttrTransform - doesn't accept AttrContext, non-validating
Lang - invalid xml:lang value can overwrite valid lang value (fringe)
ChildDef - not-allowed nodes translated to text, likely invalid handling
Config - "load configuration" hooks missing, rich set* accessors missing
Config - "load configuration" hooks missing, rich set* accessors missing,
needs redefined relationship with the definitions
Strategy
FixNesting - cannot bubble nodes out of structures
MakeWellFormed - insufficient automatic closing definitions
MakeWellFormed - insufficient automatic closing definitions (check HTML
spec for optional end tags).
RemoveForeignElements - should be run in parallel with MakeWellFormed
URIScheme - needs to have callable generic checks
ftp - missing typecode check


@@ -28,6 +28,7 @@ time. Note the naming convention: %Namespace.Directive
%Attr.MaxWidth,
%Attr.MaxHeight - caps for width and height related checks.
(a hack in Pixels for an image crashing attack could be replaced by this)
%URI.Munge - will munge all URIs to a different URI, which should redirect
the user to the applicable page. A urlencoded version of the URI


@@ -17,6 +17,8 @@ are passed. These classes are: HTMLPurifier::*, Generator::generateFromTokens
and Lexer::tokenizeHTML. However, whenever a valid configuration object
is defined, that object should be used.
-- the following is projected changes to the configuration system --
In relation to HTMLDefinition and CSSDefinition, there are going to be some
major structural changes to enable the easy configuration of these objects.
Due to the intricacy of these objects, it's not feasible to ask an average


@@ -9,11 +9,11 @@ to be effective. Things to remember:
1. UTF-8. Currently, the parser runs under the assumption that it is dealing
with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
your character encoding, you should switch. Now. (in future versions, however,
I may make the character encoding configurable, but there's only so much I
can do). Make sure any input is properly converted to UTF-8, or the parser
will mangle it badly (though it won't be a security risk if you're outputting
it as UTF-8 though).
your character encoding, you should switch. Now. Make sure any input is
properly converted to UTF-8, or the parser will mangle it badly
(though it won't be a security risk if you're outputting it as UTF-8 though).
We will be adding out-of-the-box support for the other major character
encodings shortly.
2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
@@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
this may be configurable in the future.
3. IDs. They need to be unique, but without some knowledge of the
rest of the document, it's difficult to know what's unique. Without setting
%Attr.IDBlacklist to the proper
rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
needs to be set: we may want to consider disallowing IDs by default to
save lazy programmers.
4. [PROJECTED] Links. We're not going to try for spam protection (although
some hooks for such a module might be nice) but we may offer the ability to
@@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old
bad taste. A neat feature would be the ability to define acceptable colors
in a document, but that's not likely to be implemented for a while. In the
meantime, be sure to make sure that floated elements (permitted, since they
can be quite useful) cna't mess up your layout.
can be quite useful) can't mess up your layout.


@@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).
This document seeks to detail the inner workings of HTML Purifier. The first
This document no longer is a detailed description of how HTMLPurifier works,
as those descriptions have been moved to the appropriate code. The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals
@@ -52,4 +53,5 @@ In summary:
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
all this functionality (or not written in HTML at all), although this may
be changing in the future.