Update spec.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@144 48356398-32a2-884e-a903-53898d9a118a
2025-10-15 14:14:24 +02:00 · 2006-08-03 00:21:27 +00:00
parent 064fd603d3
commit 19081ffdf2
1 changed files with 9 additions and 99 deletions
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -64,24 +64,7 @@ The rest of this document is pending moving into their associated classes.
 == STAGE 3 - make well formed ==
    Status: A- (not as good as possible)
 Now we step through the whole thing and correct nesting issues. Most of the
 time, it's making sure the tags match up, but there's some trickery going on
 for HTML's quirks. They are:
 * Set of tags that close P
        'address', 'blockquote', 'dd',      'dir',       'div', 
        'dl',      'dt',         'h1',     'h2',      'h3',        'h4', 
        'h5',      'h6',         'hr',
        'ol',     'p',       'pre', 
        'table',   'ul' 
 * Li closes li
 * more?
 We also want to do translations, like from FONT to SPAN with STYLE.
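
As an illustration of the two quirks listed above, here is a minimal PHP sketch. The function names and the flat token format (array('start', 'p') and so on) are assumptions made for this example and do not come from the spec; only the "closes P" and "li closes li" rules are handled.

```php
<?php
// Hypothetical helper: does an incoming start tag implicitly close the
// element that is currently open?
function closesOpenElement($open, $incoming)
{
    // Tags that close an open <p>, copied from the list in the spec.
    static $closesP = array(
        'address', 'blockquote', 'dd', 'dir', 'div', 'dl', 'dt',
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'ol', 'p', 'pre',
        'table', 'ul',
    );
    if ($open === 'p') {
        return in_array($incoming, $closesP, true);
    }
    // "Li closes li".
    return $open === 'li' && $incoming === 'li';
}

// Walks a flat token list and inserts the end tags that HTML lets authors
// omit. Tokens are array(type, name) pairs; the format is an assumption.
function insertImpliedEndTags(array $tokens)
{
    $stack  = array();
    $result = array();
    foreach ($tokens as $token) {
        list($type, $name) = $token;
        if ($type === 'start') {
            $open = end($stack); // false when nothing is open
            if ($open !== false && closesOpenElement($open, $name)) {
                $result[] = array('end', array_pop($stack));
            }
            $stack[] = $name;
        } elseif ($type === 'end' && end($stack) === $name) {
            array_pop($stack);
        }
        $result[] = $token;
    }
    return $result;
}
```

Only the innermost open element is checked, so an li nested inside a ul inside another li is left alone. Mismatched end tags are passed through unchanged in this sketch.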
@@ -128,61 +111,15 @@ The way, I suppose, one would check for it, is whenever a node is removed,
 scroll to its parent start, and re-evaluate it. Make sure you're able to do
 that with minimal code repetition.
 EDITOR'S NOTE: this behavior is not implemented by default, because the
 default configuration has a setup that ensures that cascading node removals
 will never happen. However, there will be warning signs in case someone tries
 to hack it further.
 The most complex case can probably be done by using some fancy regular
 expressions and transformations. However, it doesn't seem right that, say,
-a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
-however, may be too difficult (or not, see below).
-
-This code was excerpted from the PEAR class XML_DTD. It implements regexp
+a stray <b> in a <table> can cause the entire table to be removed. Depending
+on how much work we want to do, this will at least need a custom child
+definition, and at most require extra element bubbling capabilities to be
+added.
 checking.
 --
 // # This actually does the validation
 // Validate the order of the children
 if (!$was_error && count($dtd_children)) {
    $children_list = implode(',', $children);
    $regex = $this->dtd->getPcreRegex($name);
    if (!preg_match('/^'.$regex.'$/', $children_list)) {
        $dtd_regex = $this->dtd->getDTDRegex($name);
        $this->_errors("In element <$name> the children list found:\n'$children_list', ".
                       "does not conform the DTD definition: '$dtd_regex'", $lineno);
    }
 }
 --
 // # This figures out the PcreRegex
 //$ch is a string of the allowed childs
 $children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
 // check for parsed character data special case
 if (in_array('#PCDATA', $children)) {
    $content = '#PCDATA';
    if (count($children) == 1) {
        $children = array();
        break;
    }
 }
 // $children is not used after this
 $this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
 // Convert the DTD regex language into PCRE regex format
 $reg = str_replace(',', ',?', $ch);
 $reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
 $this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;
 --
 We can probably loot and steal all of this. The brilliance of this code is
 amazing. I'm lovin' it!
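
The conversion in the excerpt can be tried in isolation. The sketch below wraps the same two substitutions in standalone functions; the function names are made up for this example, and it assumes the content model is written without whitespace.

```php
<?php
// Converts a DTD content model such as "(caption?,tr+)" into a PCRE
// pattern, using the same two substitutions as the XML_DTD excerpt.
function contentModelToPcre($model)
{
    // Make the separating commas optional.
    $regex = str_replace(',', ',?', $model);
    // Wrap every element name in a group with an optional leading comma.
    $regex = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $regex);
    return '/^' . $regex . '$/';
}

// Checks a list of child tag names against a content model.
function childrenMatch($model, array $children)
{
    $list = implode(',', $children);
    return preg_match(contentModelToPcre($model), $list) === 1;
}
```

For example, childrenMatch('(caption?,tr+)', array('caption', 'tr', 'tr')) passes, while array('tr', 'caption') is rejected because the order is wrong. Element names containing a dot are not escaped here, so the dot acts as a regex wildcard.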
 So, the way we define these cases should work like this:
 class ChildDef with validateChildren($children_tags)
@@ -201,26 +138,6 @@ parent.
 --
 Another few problems: EXCLUSIONS!
 a
    must not contain other a elements.
 pre
    must not contain the img, object, big, small, sub, or sup elements.
 button
    must not contain the input, select, textarea, label, button, form, fieldset,
    iframe or isindex elements.
 label
    must not contain other label elements.
 form
    must not contain other form elements. 
 Normative exclusions straight from the horse's mouth. These are SGML style,
 not XML style, so we need to modify the ruleset slightly. However, the DTD
 may have done this for us already.
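
One possible way to encode these exclusions is a lookup table checked against every open ancestor, since SGML-style exclusions apply at any depth and not only to direct children. This is a sketch only; the function names and the ancestor-list argument are assumptions for illustration.

```php
<?php
// Exclusion table transcribed from the list above (hypothetical function).
function htmlExclusions()
{
    return array(
        'a'      => array('a'),
        'pre'    => array('img', 'object', 'big', 'small', 'sub', 'sup'),
        'button' => array('input', 'select', 'textarea', 'label', 'button',
                          'form', 'fieldset', 'iframe', 'isindex'),
        'label'  => array('label'),
        'form'   => array('form'),
    );
}

// Returns true when $tag is forbidden by any element in $ancestors,
// which lists the currently open elements from outermost to innermost.
function isExcluded($tag, array $ancestors)
{
    $exclusions = htmlExclusions();
    foreach ($ancestors as $ancestor) {
        if (isset($exclusions[$ancestor])
            && in_array($tag, $exclusions[$ancestor], true)) {
            return true;
        }
    }
    return false;
}
```

With this table, an a nested anywhere below another a is flagged, for instance isExcluded('a', array('div', 'a', 'b')) is true, while isExcluded('em', array('a')) is false.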
 --
 Also, what do we do with elements if they're not allowed somewhere? We need
 some sort of default behavior. I reckon that we should be allowed to:
@@ -240,20 +157,13 @@ to text when PCDATA is allowed.
 --
-Note that generic child definitions are not usually desirable: we should
-implement custom handlers for each one that specify the stuff correctly.
-
+ins/del are allowed in block and inline content, but it is
+inappropriate to include block content within an ins element
+occurring in inline content. How would we fix this?
 --
 <!--
  ins/del are allowed in block and inline content, but it is
  inappropriate to include block content within an ins element
  occurring in inline content.
 -->
 == STAGE 4 - check attributes ==
-    STATUS: N (not started)
+    STATUS: F (currently implementing core/i18n)
 While we're doing all this nesting hocus-pocus, attributes are also being
 checked. The reason why we need this to be done with the nesting stuff
@@ -262,10 +172,10 @@ replace it with data). Fortunantely, this is rare enough that we only have
 to worry about it for certain things:
 * ! bdo - dir > replace with span, preserve attributes
 * ! img - src, alt > if only alt is missing, insert filename, else remove img
 * basefont - size
 * param - name
 * applet - width, height
 * ! img - src, alt > if only alt is missing, insert filename, else remove img
 * map - id
 * area - alt
 * form - action