mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-08-05 13:47:24 +02:00
Sync 1.1 branch as much as possible with trunk.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@476 48356398-32a2-884e-a903-53898d9a118a
10
NEWS
@@ -24,6 +24,16 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
|
|||||||
. Refactored parseData() to general Lexer class
|
. Refactored parseData() to general Lexer class
|
||||||
. Tester named "HTML Purifier" not "HTMLPurifier"
|
. Tester named "HTML Purifier" not "HTMLPurifier"
|
||||||
|
|
||||||
|
1.1.1, released 2006-09-24
|
||||||
|
! Configuration option to optionally Tidy up output for indentation to make up
|
||||||
|
for dropped whitespace by DOMLex (pretty-printing for the entire application
|
||||||
|
should be done by a page-wide Tidy)
|
||||||
|
- Various documentation updates
|
||||||
|
- Fixed parse error in configuration documentation script
|
||||||
|
- Fixed fatal error in benchmark scripts, slightly augmented
|
||||||
|
- As far as possible, whitespace is preserved in-between table children
|
||||||
|
- Sample test-settings.php file included
|
||||||
|
|
||||||
1.1.0, released 2006-09-16
|
1.1.0, released 2006-09-16
|
||||||
! Directive documentation generation using XSLT
|
! Directive documentation generation using XSLT
|
||||||
! XHTML can now be turned off, output becomes <br>
|
! XHTML can now be turned off, output becomes <br>
|
||||||
|
23
docs/colors.txt
Normal file
@@ -0,0 +1,23 @@

Colors
Hammering some sense into those content-makers

Your website probably has a color-scheme. Green on white, purple on yellow,
whatever. When you give users the ability to style their content, you may
want them to keep in line with your styling. If your website is all
about light colors, you don't want a user to come in and vandalize your
page with a deep maroon.

This is an extremely silly feature proposal, but I'm writing it down anyway.

What if the user could constrain the colors specified in inline styles? You
are only allowed to use these shades of dark green for text and these shades
of light yellow for the background. At the very least, you could ensure
that we did not have pale yellow text on a white background.

Implementation issues:
1. Requires the color attribute definition to know, currently, what the text
   and background colors are. This becomes difficult when classes are thrown
   into the mix.
2. The user still has to define the permissible colors; how does one do
   something like that?
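The proposal above could be pictured roughly as follows. This is only a sketch, assuming the site owner supplies a list of permitted hex colors per CSS property; the function name `constrainColor` and the `$palette` structure are hypothetical and are not part of HTML Purifier.

```php
<?php
// Hypothetical sketch: not an HTML Purifier API.
// $palette maps a CSS property to the hex colors the site owner permits.
$palette = array(
    'color'            => array('#003300', '#004d00', '#006600'), // dark greens
    'background-color' => array('#ffffcc', '#ffffe0'),            // light yellows
);

/**
 * Returns the normalized color if it is permitted for $property,
 * otherwise false. Only 3- and 6-digit hex colors are understood here.
 */
function constrainColor($property, $value, $palette) {
    if (!isset($palette[$property])) {
        return false; // no palette defined for this property
    }
    $value = strtolower(trim($value));
    // Expand shorthand #abc to #aabbcc so comparisons are uniform.
    if (preg_match('/^#([0-9a-f])([0-9a-f])([0-9a-f])$/', $value, $m)) {
        $value = '#' . $m[1] . $m[1] . $m[2] . $m[2] . $m[3] . $m[3];
    }
    if (!preg_match('/^#[0-9a-f]{6}$/', $value)) {
        return false; // not a hex color this sketch recognizes
    }
    return in_array($value, $palette[$property], true) ? $value : false;
}
```

This covers only the second implementation issue (defining permissible colors). It does nothing about the first: the check has no idea what text or background color is currently in effect.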
@@ -20,15 +20,32 @@ can further be customized using simpler configuration options.
|
|||||||
Here are some fuzzy levels you could set:
|
Here are some fuzzy levels you could set:
|
||||||
|
|
||||||
1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
|
1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
|
||||||
code, em, i, strike, strong; however, you could get away with only a, b and
|
code, em, i, strike, strong; however, you could get away with only a, em and
|
||||||
i; also having p and pre tags would be helpful.
|
p; also having blockquote and pre tags would be helpful.
|
||||||
2. Pages - As permissive as possible without allowing XSS. No protection
|
2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
|
||||||
|
pre, div, span and h[2-6] (the last three are for specially formatted
|
||||||
|
posts, div and span require associated classes or inline styling enabled
|
||||||
|
to be useful)
|
||||||
|
3. Pages - As permissive as possible without allowing XSS. No protection
|
||||||
against bad design sense, unfortunately. Suitable for wiki and page
|
against bad design sense, unfortunately. Suitable for wiki and page
|
||||||
environments.
|
environments.
|
||||||
3. Lint - Accept everything in the spec, a Tidy wannabe.
|
4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
|
||||||
|
get implemented as it would require routines for things like <object>
|
||||||
|
and friends to be implemented, which is a lot of work for not a lot of
|
||||||
|
benefit)
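The first two levels enumerate concrete tagsets, so they can be sketched as a simple lookup. The level keys and the `tagAllowed` helper below are hypothetical illustrations, not configuration that HTML Purifier provides; the Pages and Lint levels are left out because the text does not list their tags.

```php
<?php
// Hypothetical sketch: level names and this lookup are illustrative only.
$levels = array(
    // minimal comment set plus the "helpful" blockquote and pre
    'comments' => array('a', 'em', 'p', 'blockquote', 'pre'),
    // the usual forum tagset, with h[2-6] expanded
    'bbcode'   => array('b', 'i', 'img', 'a', 'blockquote', 'pre',
                        'div', 'span', 'h2', 'h3', 'h4', 'h5', 'h6'),
);

/** Returns true if $tag belongs to the tagset registered for $level. */
function tagAllowed($tag, $level, $levels) {
    if (!isset($levels[$level])) {
        return false; // unknown level: allow nothing
    }
    return in_array(strtolower($tag), $levels[$level], true);
}
```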
|
||||||
|
|
||||||
I've also decomposed tags into risk levels. An asterisk indicates that no one
|
One final note: when you start axing tags that are more commonly used, you
|
||||||
really uses that tag, tilde indicates it's deprecated.
|
run the risk of accidentally destroying user data, especially if the data
|
||||||
|
is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
|
||||||
|
make forbidden element to text transformations desirable (for example, images).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
== Element Risk Analysis ==
|
||||||
|
|
||||||
|
Legend:
|
||||||
|
[danger level] - regular tags / uncommon tags ~ deprecated tags
|
||||||
|
[danger level]* - rare tags
|
||||||
|
|
||||||
1 - blockquote, code, em, i, p, tt / strong, sub, sup
|
1 - blockquote, code, em, i, p, tt / strong, sub, sup
|
||||||
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
|
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
|
||||||
@@ -38,30 +55,76 @@ really uses that tag, tilde indicates it's deprecated.
|
|||||||
5 - a
|
5 - a
|
||||||
7 - area, map
|
7 - area, map
|
||||||
|
|
||||||
|
These are special-use tags; they should be enabled on a blanket basis.
|
||||||
|
|
||||||
Lists - dd, dl, dt, li, ol, ul ~ menu, dir
|
Lists - dd, dl, dt, li, ol, ul ~ menu, dir
|
||||||
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
|
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
|
||||||
|
|
||||||
Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
|
Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
|
||||||
XSS - noscript, object, script ~ applet
|
XSS - noscript, object, script ~ applet
|
||||||
|
|
||||||
Meta - base, basefont, body, head, html, link, meta, style, title
|
Meta - base, basefont, body, head, html, link, meta, style, title
|
||||||
Frames - frame, frameset, iframe
|
Frames - frame, frameset, iframe
|
||||||
|
|
||||||
And tag specific notes:
|
And tag specific notes:
|
||||||
|
|
||||||
a - general problems involving linkspam
|
a - general problems involving linkspam
|
||||||
b - too much bold is bad, typographically speaking bold is discouraged
|
b - too much bold is bad, typographically speaking bold is discouraged
|
||||||
br - often misused
|
br - often misused
|
||||||
center - CSS, usually no legit use
|
center - CSS, usually no legit use
|
||||||
del - only useful in editing context
|
del - only useful in editing context
|
||||||
div - little meaning in certain contexts i.e. blog comment
|
div - little meaning in certain contexts i.e. blog comment
|
||||||
h1 - usually no legit use, as header is already set by application
|
h1 - usually no legit use, as header is already set by application
|
||||||
h* - not needed in blog comments
|
h* - not needed in blog comments
|
||||||
hr - usually not necessary in blog comments
|
hr - usually not necessary in blog comments
|
||||||
img - could be extremely undesirable if linking to external pics
|
img - could be extremely undesirable if linking to external pics (CSRF, goatse)
|
||||||
pre - could use formatting, only useful in code contexts
|
pre - could use formatting, only useful in code contexts
|
||||||
q - very little support
|
q - very little support
|
||||||
s - transform into span with styling or del?
|
s - transform into span with styling or del?
|
||||||
small - technically presentational
|
small - technically presentational
|
||||||
span - depends on attribute allowances
|
span - depends on attribute allowances
|
||||||
sub, sup - specialized
|
sub, sup - specialized
|
||||||
u - little legit use, prefer class with text-decoration
|
u - little legit use, prefer class with text-decoration
|
||||||
|
|
||||||
Based on the riskiness of the items, we may want to offer an %HTML.DisableImages
directive and put URI filtering higher up on the priority list.


== Attribute Risk Analysis ==

We actually have a surprisingly small assortment of allowed attributes (the
rest are deprecated in strict, and thus we opted not to allow them, even
though our output is XHTML Transitional by default.)

Required URI - img.alt, img.src, a.href
Medium risk - *.class, *.dir
High risk - img.height, img.width, *.id, *.style

Table - colgroup/col.span, td/th.rowspan, td/th.colspan
Uncommon - *.title, *.lang, *.xml:lang
Rare - td/th.abbr, table.summary, {table}.charoff
Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
Presentational - {table}.align, {table}.valign, table.frame, table.rules,
    table.border
Partially presentational - table.cellpadding, table.cellspacing,
    table.width, col.width, colgroup.width


== CSS Risk Analysis ==

There are certain CSS properties that are extremely useful inline, but then
as you get to more presentation oriented styling it may not always be
appropriate to inline them.

Useful - clear, float, border-collapse, caption-side

These CSS properties can break layouts if used improperly. We have excluded
any CSS properties that are not currently implemented (such as position).

Dangerous, can go outside container - float
Easy to abuse - font-size, font-family (font), width
Colored - background-color (background), border-color (border), color
Dramatic - border, list-style-position (list-style), margin, padding,
    text-align, text-indent, text-transform, vertical-align, line-height

Dramatic properties substantially change the look of text in ways that should
probably have been reserved to other areas.
25
docs/strictness.txt
Normal file
@@ -0,0 +1,25 @@

Is HTML Purifier Strict or Transitional?
A little bit of helpful guidance

Despite the fact that HTML Purifier professes only to support transitional
HTML, it rejects a lot of attributes and elements that are actually, indeed,
valid. You can investigate progress.html to find out precisely what we
are doing to these *deprecated* attributes.

However, users have found that Strict HTML imposes some quite unreasonable
restrictions on certain things. The start and value attributes in ol and
li (respectively) perhaps are the most contested. There is currently no
widely supported browser method short of JavaScript that can replace these
two deprecated attributes. HTML Purifier does not currently support them, but
it might behoove us to do so while our output is still transitional.

Fortunately, that's the only real bugger case. The others have near-perfect
CSS equivalents, and were presentational anyway. However, the other question
pops up: should we always convert these to the CSS forms when 1. the spec
allows them anyway and 2. older browsers support them better? After all, the
whole point of CSS is to separate styling from content, so inline styling
doesn't solve that problem.

It's an icky question, and we'll have to deal with it as more and more
transforms get implemented.
@@ -56,6 +56,7 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
|
|
||||||
/**
|
/**
|
||||||
* String name of parent element HTML will be going into.
|
* String name of parent element HTML will be going into.
|
||||||
|
* @todo Allow this to be overloaded by user config
|
||||||
* @public
|
* @public
|
||||||
*/
|
*/
|
||||||
var $info_parent = 'div';
|
var $info_parent = 'div';
|
||||||
@@ -111,12 +112,19 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
//////////////////////////////////////////////////////////////////////
|
//////////////////////////////////////////////////////////////////////
|
||||||
// info[]->child : defines allowed children for elements
|
// info[]->child : defines allowed children for elements
|
||||||
|
|
||||||
// entities: prefixed with e_ and _ replaces .
|
// entities: prefixed with e_ and _ replaces . from DTD
|
||||||
|
// double underlines are entities we made up
|
||||||
|
|
||||||
// we don't use an array because that complicates interpolation
|
// we don't use an array because that complicates interpolation
|
||||||
// strings are used instead of arrays because if you use arrays,
|
// strings are used instead of arrays because if you use arrays,
|
||||||
// you have to do some hideous manipulation with array_merge()
|
// you have to do some hideous manipulation with array_merge()
|
||||||
|
|
||||||
|
// todo: determine whether or not having allowed children
|
||||||
|
// that aren't allowed globally affects security (it shouldn't)
|
||||||
|
// if above works out, extend children definitions to include all
|
||||||
|
// possible elements (allowed elements will dictate which ones
|
||||||
|
// get dropped)
|
||||||
|
|
||||||
$e_special_extra = 'img';
|
$e_special_extra = 'img';
|
||||||
$e_special_basic = 'br | span | bdo';
|
$e_special_basic = 'br | span | bdo';
|
||||||
$e_special = "$e_special_basic | $e_special_extra";
|
$e_special = "$e_special_basic | $e_special_extra";
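The comments in this hunk prefer strings to arrays because entity strings compose by plain interpolation. The sketch below shows that composition, plus one way such a string could be turned into a lookup table. `parseElementList` is a hypothetical helper written for illustration; how the ChildDef classes actually parse these strings is not shown in this diff.

```php
<?php
// Entity strings compose by interpolation, mirroring the DTD entities.
$e_special_extra = 'img';
$e_special_basic = 'br | span | bdo';
$e_special = "$e_special_basic | $e_special_extra";

/**
 * Hypothetical helper (not HTML Purifier code): split a pipe-separated
 * element list into a lookup array keyed by element name.
 */
function parseElementList($spec) {
    $lookup = array();
    foreach (explode('|', $spec) as $name) {
        $name = trim($name);
        if ($name !== '') {
            $lookup[$name] = true;
        }
    }
    return $lookup;
}

$lookup = parseElementList($e_special);
```

With arrays, the same composition would need an `array_merge()` call at every level of nesting, which is the "hideous manipulation" the comment refers to.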
|
||||||
@@ -142,16 +150,18 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
$e_block = "p | $e_heading | div | $e_lists | $e_blocktext | table";
|
$e_block = "p | $e_heading | div | $e_lists | $e_blocktext | table";
|
||||||
$e__flow = "#PCDATA | $e_block | $e_inline | $e_misc";
|
$e__flow = "#PCDATA | $e_block | $e_inline | $e_misc";
|
||||||
$e_Flow = new HTMLPurifier_ChildDef_Optional($e__flow);
|
$e_Flow = new HTMLPurifier_ChildDef_Optional($e__flow);
|
||||||
$e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | $e_special".
|
$e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA".
|
||||||
" | $e_fontstyle | $e_phrase | $e_inline_forms | $e_misc_inline");
|
" | $e_special | $e_fontstyle | $e_phrase | $e_inline_forms".
|
||||||
|
" | $e_misc_inline");
|
||||||
$e_pre_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | a".
|
$e_pre_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | a".
|
||||||
" | $e_special_basic | $e_fontstyle_basic | $e_phrase_basic".
|
" | $e_special_basic | $e_fontstyle_basic | $e_phrase_basic".
|
||||||
" | $e_inline_forms | $e_misc_inline");
|
" | $e_inline_forms | $e_misc_inline");
|
||||||
$e_form_content = new HTMLPurifier_ChildDef_Optional(''); //unused
|
$e_form_content = new HTMLPurifier_ChildDef_Optional('');//unused
|
||||||
$e_form_button_content = new HTMLPurifier_ChildDef_Optional(''); // unused
|
$e_form_button_content = new HTMLPurifier_ChildDef_Optional('');//unused
|
||||||
|
|
||||||
$this->info['ins']->child =
|
$this->info['ins']->child =
|
||||||
$this->info['del']->child = new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
|
$this->info['del']->child =
|
||||||
|
new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
|
||||||
|
|
||||||
$this->info['blockquote']->child=
|
$this->info['blockquote']->child=
|
||||||
$this->info['dd']->child =
|
$this->info['dd']->child =
|
||||||
@@ -225,7 +235,7 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
//////////////////////////////////////////////////////////////////////
|
//////////////////////////////////////////////////////////////////////
|
||||||
// info[]->type : defines the type of the element (block or inline)
|
// info[]->type : defines the type of the element (block or inline)
|
||||||
|
|
||||||
// reuses $e_Inline and $e_block
|
// reuses $e_Inline and $e_Block
|
||||||
|
|
||||||
foreach ($e_Inline->elements as $name) {
|
foreach ($e_Inline->elements as $name) {
|
||||||
$this->info[$name]->type = 'inline';
|
$this->info[$name]->type = 'inline';
|
||||||
@@ -243,7 +253,7 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
|
|
||||||
$this->info['a']->excludes = array('a' => true);
|
$this->info['a']->excludes = array('a' => true);
|
||||||
$this->info['pre']->excludes = array_flip(array('img', 'big', 'small',
|
$this->info['pre']->excludes = array_flip(array('img', 'big', 'small',
|
||||||
// technically in spec, but we don't allow em anyway
|
// technically useless, but good to be in-depth
|
||||||
'object', 'applet', 'font', 'basefont'));
|
'object', 'applet', 'font', 'basefont'));
|
||||||
|
|
||||||
//////////////////////////////////////////////////////////////////////
|
//////////////////////////////////////////////////////////////////////
|
||||||
@@ -253,6 +263,8 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
// by the transform classes. It will, however, do simple and slightly
|
// by the transform classes. It will, however, do simple and slightly
|
||||||
// complex attribute value substitution
|
// complex attribute value substitution
|
||||||
|
|
||||||
|
// the question of varying allowed attributes is more entangling.
|
||||||
|
|
||||||
$e_Text = new HTMLPurifier_AttrDef_Text();
|
$e_Text = new HTMLPurifier_AttrDef_Text();
|
||||||
|
|
||||||
// attrs, included in almost every single one except for a few,
|
// attrs, included in almost every single one except for a few,
|
||||||
@@ -297,7 +309,8 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
|
|
||||||
$this->info['table']->attr['summary'] = $e_Text;
|
$this->info['table']->attr['summary'] = $e_Text;
|
||||||
|
|
||||||
$this->info['table']->attr['border'] = new HTMLPurifier_AttrDef_Pixels();
|
$this->info['table']->attr['border'] =
|
||||||
|
new HTMLPurifier_AttrDef_Pixels();
|
||||||
|
|
||||||
$e_Length = new HTMLPurifier_AttrDef_Length();
|
$e_Length = new HTMLPurifier_AttrDef_Length();
|
||||||
$this->info['table']->attr['cellpadding'] =
|
$this->info['table']->attr['cellpadding'] =
|
||||||
@@ -329,7 +342,7 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
$this->info['q']->attr['cite'] = $e_URI;
|
$this->info['q']->attr['cite'] = $e_URI;
|
||||||
|
|
||||||
//////////////////////////////////////////////////////////////////////
|
//////////////////////////////////////////////////////////////////////
|
||||||
// UNIMP : info_tag_transform : transformations of tags
|
// info_tag_transform : transformations of tags
|
||||||
|
|
||||||
$this->info_tag_transform['font'] = new HTMLPurifier_TagTransform_Font();
|
$this->info_tag_transform['font'] = new HTMLPurifier_TagTransform_Font();
|
||||||
$this->info_tag_transform['menu'] = new HTMLPurifier_TagTransform_Simple('ul');
|
$this->info_tag_transform['menu'] = new HTMLPurifier_TagTransform_Simple('ul');
|
||||||
@@ -339,6 +352,9 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
//////////////////////////////////////////////////////////////////////
|
//////////////////////////////////////////////////////////////////////
|
||||||
// info[]->auto_close : tags that automatically close another
|
// info[]->auto_close : tags that automatically close another
|
||||||
|
|
||||||
|
// todo: determine whether or not SGML-like modeling based on
|
||||||
|
// mandatory/optional end tags would be a better policy
|
||||||
|
|
||||||
// make sure you test using isset() not !empty()
|
// make sure you test using isset() not !empty()
|
||||||
|
|
||||||
// these are all block elements: blocks aren't allowed in P
|
// these are all block elements: blocks aren't allowed in P
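One likely reason for the isset() advice, assuming the lookup is built with array_flip() the way the `pre` excludes list is earlier in this file: array_flip() maps the first name to the integer 0, which empty() treats as empty. The element names below are illustrative only.

```php
<?php
// Illustrative lookup; the element names are examples, not the real list.
$auto_close_p = array_flip(array('div', 'ul', 'table'));
// $auto_close_p is now array('div' => 0, 'ul' => 1, 'table' => 2)

// isset() only asks whether the key exists (and is not null).
$foundWithIsset = isset($auto_close_p['div']);

// !empty() also rejects falsy values, so the key mapped to 0 is missed.
$foundWithNotEmpty = !empty($auto_close_p['div']);

var_dump($foundWithIsset);    // the key exists
var_dump($foundWithNotEmpty); // the value 0 counts as empty
```

Only the first name in the list is affected, which makes this kind of bug easy to overlook in testing.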
|
||||||
|