diff --git a/NEWS b/NEWS
index d312c8ce..e25a41b1 100644
--- a/NEWS
+++ b/NEWS
@@ -24,6 +24,16 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
. Refactored parseData() to general Lexer class
. Tester named "HTML Purifier" not "HTMLPurifier"
+1.1.1, released 2006-09-24
+! Configuration option to Tidy up output for indentation to make up
+ for dropped whitespace by DOMLex (pretty-printing for the entire application
+ should be done by a page-wide Tidy)
+- Various documentation updates
+- Fixed parse error in configuration documentation script
+- Fixed fatal error in benchmark scripts, slightly augmented
+- As far as possible, whitespace is preserved in-between table children
+- Sample test-settings.php file included
+
1.1.0, released 2006-09-16
! Directive documentation generation using XSLT
! XHTML can now be turned off, output becomes
diff --git a/docs/colors.txt b/docs/colors.txt
new file mode 100644
index 00000000..e0b74e45
--- /dev/null
+++ b/docs/colors.txt
@@ -0,0 +1,23 @@
+
+Colors
+ Hammering some sense into those content-makers
+
+Your website probably has a color-scheme. Green on white, purple on yellow,
+whatever. When you give users the ability to style their content, you may
+want them to keep in line with your styling. If your website is all
+about light colors, you don't want a user to come in and vandalize your
+page with a deep maroon.
+
+This is an extremely silly feature proposal, but I'm writing it down anyway.
+
+What if the user could constrain the colors specified in inline styles? You
+are only allowed to use these shades of dark green for text and these shades
+of light yellow for the background. At the very least, you could ensure
+that we did not have pale yellow text on white.
+
+Implementation issues:
+1. Requires the color attribute definition to know, currently, what the text
+and background colors are. This becomes difficult when classes are thrown
+into the mix.
+2. The user still has to define the permissible colors; how does one do
+something like that?
diff --git a/docs/filter-levels.txt b/docs/filter-levels.txt
index 09b0563f..52f4a05b 100644
--- a/docs/filter-levels.txt
+++ b/docs/filter-levels.txt
@@ -20,15 +20,32 @@ can further be customized using simpler configuration options.
Here are some fuzzy levels you could set:
1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
- code, em, i, strike, strong; however, you could get away with only a, b and
- i; also having p and pre tags would be helpful.
-2. Pages - As permissive as possible without allowing XSS. No protection
+ code, em, i, strike, strong; however, you could get away with only a, em and
+ p; also having blockquote and pre tags would be helpful.
+2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
+ pre, div, span and h[2-6] (the last three are for specially formatted
+ posts, div and span require associated classes or inline styling enabled
+ to be useful)
+3. Pages - As permissive as possible without allowing XSS. No protection
against bad design sense, unfortunantely. Suitable for wiki and page
environments.
-3. Lint - Accept everything in the spec, a Tidy wannabe.
+4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
+ get implemented as it would require routines for things like
+ and friends to be implemented, which is a lot of work for not a lot of
+ benefit)
-I've also decomposed tags into risk levels. An asterisk indicates that no one
-really uses that tag, tilde indicates it's deprecated.
+One final note: when you start axing tags that are more commonly used, you
+run the risk of accidentally destroying user data, especially if the data
+is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
+make forbidden element to text transformations desirable (for example, images).
+
+
+
+== Element Risk Analysis ==
+
+Legend:
+ [danger level] - regular tags / uncommon tags ~ deprecated tags
+ [danger level]* - rare tags
1 - blockquote, code, em, i, p, tt / strong, sub, sup
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
@@ -38,30 +55,76 @@ really uses that tag, tilde indicates it's deprecated.
5 - a
7 - area, map
+These are special-use tags; they should be enabled on a blanket basis.
+
Lists - dd, dl, dt, li, ol, ul ~ menu, dir
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
+
Forms - fieldset, form, input, lable, legend, optgroup, option, select, textarea
XSS - noscript, object, script ~ applet
-
Meta - base, basefont, body, head, html, link, meta, style, title
Frames - frame, frameset, iframe
And tag specific notes:
-a - general problems involving linkspam
-b - too much bold is bad, typographically speaking bold is discouraged
-br - often misused
+a - general problems involving linkspam
+b - too much bold is bad, typographically speaking bold is discouraged
+br - often misused
center - CSS, usually no legit use
del - only useful in editing context
div - little meaning in certain contexts i.e. blog comment
-h1 - usually no legit use, as header is already set by application
-h* - not needed in blog comments
-hr - usually not necessary in blog comments
-img - could be extremely undesirable if linking to external pics
+h1 - usually no legit use, as header is already set by application
+h* - not needed in blog comments
+hr - usually not necessary in blog comments
+img - could be extremely undesirable if linking to external pics (CSRF, goatse)
pre - could use formatting, only useful in code contexts
-q - very little support
-s - transform into span with styling or del?
+q - very little support
+s - transform into span with styling or del?
small - technically presentational
span - depends on attribute allowances
sub, sup - specialized
-u - little legit use, prefer class with text-decoration
+u - little legit use, prefer class with text-decoration
+
+Based on the riskiness of the items, we may want to offer a %HTML.DisableImages
+directive and put URI filtering higher up on the priority list.
+
+
+== Attribute Risk Analysis ==
+
+We actually have a surprisingly small assortment of allowed attributes (the
+rest are deprecated in strict, and thus we opted not to allow them, even
+though our output is XHTML Transitional by default.)
+
+Required URI - img.alt, img.src, a.href
+Medium risk - *.class, *.dir
+High risk - img.height, img.width, *.id, *.style
+
+Table - colgroup/col.span, td/th.rowspan, td/th.colspan
+Uncommon - *.title, *.lang, *.xml:lang
+Rare - td/th.abbr, table.summary, {table}.charoff
+Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
+Presentational - {table}.align, {table}.valign, table.frame, table.rules,
+ table.border
+Partially presentational - table.cellpadding, table.cellspacing,
+ table.width, col.width, colgroup.width
+
+
+== CSS Risk Analysis ==
+
+There are certain CSS properties that are extremely useful inline, but then
+as you get to more presentation oriented styling it may not always be
+appropriate to inline them.
+
+Useful - clear, float, border-collapse, caption-side
+
+These CSS properties can break layouts if used improperly. We have excluded
+any CSS properties that are not currently implemented (such as position).
+
+Dangerous, can go outside container - float
+Easy to abuse - font-size, font-family (font), width
+Colored - background-color (background), border-color (border), color
+Dramatic - border, list-style-position (list-style), margin, padding,
+ text-align, text-indent, text-transform, vertical-align, line-height
+
+Dramatic properties substantially change the look of text in ways that should
+probably have been reserved to other areas.
diff --git a/docs/strictness.txt b/docs/strictness.txt
new file mode 100644
index 00000000..b4f9268b
--- /dev/null
+++ b/docs/strictness.txt
@@ -0,0 +1,25 @@
+
+Is HTML Purifier Strict or Transitional?
+ A little bit of helpful guidance
+
+Despite the fact that HTML Purifier professes only to support transitional
+HTML, it rejects a lot of attributes and elements that are actually, indeed,
+valid. You can investigate progress.html to find out precisely what we
+are doing to these *deprecated* attributes.
+
+However, users have found that Strict HTML imposes some quite unreasonable
+restrictions on certain things. The start and value attributes in ol and
+li (respectively) perhaps are the most contested. There is currently no
+widely supported browser method short of JavaScript that can replace these
+two deprecated attributes. HTML Purifier does not currently support them, but
+it might behoove us to do so while our output is still transitional.
+
+Fortunately, that's the only real bugger case. The others have near-perfect
+CSS equivalents, and were presentational anyway. However, the other question
+pops up: should we always convert these to the CSS forms when 1. the spec
+allows them anyway and 2. older browsers support them better? After all, the
+whole point of CSS is to separate styling from content, so inline styling
+doesn't solve that problem.
+
+It's an icky question, and we'll have to deal with it as more and more
+transforms get implemented.
diff --git a/library/HTMLPurifier/HTMLDefinition.php b/library/HTMLPurifier/HTMLDefinition.php
index 34791ee7..9ef7d1c1 100644
--- a/library/HTMLPurifier/HTMLDefinition.php
+++ b/library/HTMLPurifier/HTMLDefinition.php
@@ -56,6 +56,7 @@ class HTMLPurifier_HTMLDefinition
/**
* String name of parent element HTML will be going into.
+ * @todo Allow this to be overloaded by user config
* @public
*/
var $info_parent = 'div';
@@ -111,12 +112,19 @@ class HTMLPurifier_HTMLDefinition
//////////////////////////////////////////////////////////////////////
// info[]->child : defines allowed children for elements
- // entities: prefixed with e_ and _ replaces .
+ // entities: prefixed with e_ and _ replaces . from DTD
+ // double underlines are entities we made up
// we don't use an array because that complicates interpolation
// strings are used instead of arrays because if you use arrays,
// you have to do some hideous manipulation with array_merge()
+ // todo: determine whether or not having allowed children
+ // that aren't allowed globally affects security (it shouldn't)
+ // if above works out, extend children definitions to include all
+ // possible elements (allowed elements will dictate which ones
+ // get dropped)
+
$e_special_extra = 'img';
$e_special_basic = 'br | span | bdo';
$e_special = "$e_special_basic | $e_special_extra";
@@ -142,16 +150,18 @@ class HTMLPurifier_HTMLDefinition
$e_block = "p | $e_heading | div | $e_lists | $e_blocktext | table";
$e__flow = "#PCDATA | $e_block | $e_inline | $e_misc";
$e_Flow = new HTMLPurifier_ChildDef_Optional($e__flow);
- $e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | $e_special".
- " | $e_fontstyle | $e_phrase | $e_inline_forms | $e_misc_inline");
+ $e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA".
+ " | $e_special | $e_fontstyle | $e_phrase | $e_inline_forms".
+ " | $e_misc_inline");
$e_pre_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | a".
" | $e_special_basic | $e_fontstyle_basic | $e_phrase_basic".
" | $e_inline_forms | $e_misc_inline");
- $e_form_content = new HTMLPurifier_ChildDef_Optional(''); //unused
- $e_form_button_content = new HTMLPurifier_ChildDef_Optional(''); // unused
+ $e_form_content = new HTMLPurifier_ChildDef_Optional('');//unused
+ $e_form_button_content = new HTMLPurifier_ChildDef_Optional('');//unused
$this->info['ins']->child =
- $this->info['del']->child = new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
+ $this->info['del']->child =
+ new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
$this->info['blockquote']->child=
$this->info['dd']->child =
@@ -225,7 +235,7 @@ class HTMLPurifier_HTMLDefinition
//////////////////////////////////////////////////////////////////////
// info[]->type : defines the type of the element (block or inline)
- // reuses $e_Inline and $e_block
+ // reuses $e_Inline and $e_Block
foreach ($e_Inline->elements as $name) {
$this->info[$name]->type = 'inline';
@@ -243,7 +253,7 @@ class HTMLPurifier_HTMLDefinition
$this->info['a']->excludes = array('a' => true);
$this->info['pre']->excludes = array_flip(array('img', 'big', 'small',
- // technically in spec, but we don't allow em anyway
+ // technically useless, but good to be in-depth
'object', 'applet', 'font', 'basefont'));
//////////////////////////////////////////////////////////////////////
@@ -253,6 +263,8 @@ class HTMLPurifier_HTMLDefinition
// by the transform classes. It will, however, do simple and slightly
// complex attribute value substitution
+ // the question of varying allowed attributes is more entangling.
+
$e_Text = new HTMLPurifier_AttrDef_Text();
// attrs, included in almost every single one except for a few,
@@ -297,7 +309,8 @@ class HTMLPurifier_HTMLDefinition
$this->info['table']->attr['summary'] = $e_Text;
- $this->info['table']->attr['border'] = new HTMLPurifier_AttrDef_Pixels();
+ $this->info['table']->attr['border'] =
+ new HTMLPurifier_AttrDef_Pixels();
$e_Length = new HTMLPurifier_AttrDef_Length();
$this->info['table']->attr['cellpadding'] =
@@ -329,7 +342,7 @@ class HTMLPurifier_HTMLDefinition
$this->info['q']->attr['cite'] = $e_URI;
//////////////////////////////////////////////////////////////////////
- // UNIMP : info_tag_transform : transformations of tags
+ // info_tag_transform : transformations of tags
$this->info_tag_transform['font'] = new HTMLPurifier_TagTransform_Font();
$this->info_tag_transform['menu'] = new HTMLPurifier_TagTransform_Simple('ul');
@@ -339,6 +352,9 @@ class HTMLPurifier_HTMLDefinition
//////////////////////////////////////////////////////////////////////
// info[]->auto_close : tags that automatically close another
+ // todo: determine whether or not SGML-like modeling based on
+ // mandatory/optional end tags would be a better policy
+
// make sure you test using isset() not !empty()
// these are all block elements: blocks aren't allowed in P