Rewrite FixNesting implementation to be tree-based.

This mega-patch rips out the FixNesting implementation and the related ChildDef components.

The primary algorithmic change is a conversion from tokens to tree nodes, which are far more amenable to the style of processing that FixNesting uses. Additionally, FixNesting now goes bottom-up rather than top-down, which avoids the need to implement backtracking. This simplifies a good deal of the relevant logic, since we no longer need to continually recalculate the nesting structure while processing.

However, the conversion to the tree format incurs some overhead, so for small inputs these changes are not a win. One way to greatly reduce the constant factors would be to switch entirely to libxml's representation and never serialize tokens; this would require rewriting the injectors, however.

The iterative post-order traversal in FixNesting is a bit subtle, but we have essentially reified the stack and continuations.

We've removed support for %Core.EscapeInvalidChildren.

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
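
The iterative post-order traversal mentioned in the commit message can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this patch: `SketchNode` and `sketch_post_order` are hypothetical names, and the frame here carries only the node and the index of the next child, whereas the real loop also tracks inline and exclusion state.

```php
<?php
// Hypothetical minimal node type, for illustration only; the patch itself
// operates on HTML Purifier's own node objects.
class SketchNode
{
    public $name;
    public $children;

    public function __construct($name, array $children = array())
    {
        $this->name = $name;
        $this->children = $children;
    }
}

// Visit every node after all of its children, without recursion.
// Each stack frame is array($node, $ix), where $ix is the index of the
// next child to descend into (the "continuation" of the recursive call).
function sketch_post_order(SketchNode $root)
{
    $order = array();
    $stack = array(array($root, 0));
    while (!empty($stack)) {
        list($node, $ix) = array_pop($stack);
        if (isset($node->children[$ix])) {
            // Re-push the parent so it resumes at the following child,
            // then descend into the current child.
            $stack[] = array($node, $ix + 1);
            $stack[] = array($node->children[$ix], 0);
            continue;
        }
        // All children are done; validation of $node would go here.
        $order[] = $node->name;
    }
    return $order;
}

$tree = new SketchNode('div', array(
    new SketchNode('p', array(new SketchNode('b'))),
    new SketchNode('span'),
));
echo implode(' ', sketch_post_order($tree)), "\n"; // prints "b p span div"
```

Because a parent is only handled once every child has been popped, a change to a child is already visible when the parent is validated, which is why no backtracking is needed.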
2025-10-16 22:46:06 +02:00 · 2013-10-20 22:18:59 -07:00
parent b3640e1af6
commit 0767bbc12d
22 changed files with 358 additions and 698 deletions
--- a/library/HTMLPurifier/Strategy/FixNesting.php
+++ b/library/HTMLPurifier/Strategy/FixNesting.php
@@ -10,12 +10,12 @@
 * document type definitions, such as the chameleon nature of ins/del
 * tags and global child exclusions.
 *
- * The first major objective of this strategy is to iterate through all the
- * nodes (not tokens) of the list of tokens and determine whether or not
- * their children conform to the element's definition.  If they do not, the
- * child definition may optionally supply an amended list of elements that
- * is valid or require that the entire node be deleted (and the previous
- * node rescanned).
+ * The first major objective of this strategy is to iterate through all
+ * the nodes and determine whether or not their children conform to the
+ * element's definition.  If they do not, the child definition may
+ * optionally supply an amended list of elements that is valid or
+ * require that the entire node be deleted (and the previous node
+ * rescanned).
 *
 * The second objective is to ensure that explicitly excluded elements of
 * an element do not appear in its children.  Code that accomplishes this
@@ -25,23 +25,8 @@
 * @note Whether or not unrecognized children are silently dropped or
 *       translated into text depends on the child definitions.
 *
- * @todo Enable nodes to be bubbled out of the structure.
- *
- * @warning This algorithm (though it may be hard to see) proceeds from
- *          a top-down fashion.  Thus, parents are processed before
- *          children.  This is easy to implement and has a nice effiency
- *          benefit, in that if a node is removed, we never waste any
- *          time processing it, but it also means that if a child
- *          changes in a non-encapsulated way (e.g. it is removed), we
- *          need to go back and reprocess the parent to see if those
- *          changes resulted in problems for the parent.  See
- *          [BACKTRACK] for an example of this.  In the current
- *          implementation, this backtracking can only be triggered when
- *          a node is removed and if that node was the sole node, the
- *          parent would need to be removed.  As such, it is easy to see
- *          that backtracking only incurs constant overhead.  If more
- *          sophisticated backtracking is implemented, care must be
- *          taken to avoid nontermination or exponential blowup.
+ * @todo Enable nodes to be bubbled out of the structure.  This is
+ *       easier with our new algorithm.
 */

 class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
@@ -55,23 +40,19 @@ class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
     */
    public function execute($tokens, $config, $context)
    {
+
        //####################################################################//
        // Pre-processing

-        //$node = HTMLPurifier_Arborize::arborize($tokens, $config, $context);
-        //$new_tokens = HTMLPurifier_Arborize::flatten($node, $config, $context);
+        // O(n) pass to convert to a tree, so that we can efficiently
+        // refer to substrings
+        $top_node = HTMLPurifier_Arborize::arborize($tokens, $config, $context);

        // get a copy of the HTML definition
        $definition = $config->getHTMLDefinition();

        $excludes_enabled = !$config->get('Core.DisableExcludes');

-        // insert implicit "parent" node, will be removed at end.
-        // DEFINITION CALL
-        $parent_name = $definition->info_parent;
-        array_unshift($tokens, new HTMLPurifier_Token_Start($parent_name));
-        $tokens[] = new HTMLPurifier_Token_End($parent_name);
-
        // setup the context variable 'IsInline', for chameleon processing
        // is 'false' when we are not inline, 'true' when it must always
        // be inline, and an integer when it is inline for a certain
@@ -85,278 +66,115 @@ class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
        //####################################################################//
        // Loop initialization

-        // stack that contains the indexes of all parents,
-        // $stack[count($stack)-1] being the current parent
-        $stack = array();
-
        // stack that contains all elements that are excluded
        // it is organized by parent elements, similar to $stack,
        // but it is only populated when an element with exclusions is
        // processed, i.e. there won't be empty exclusions.
-        $exclude_stack = array();
+        $exclude_stack = array($definition->info_parent_def->excludes);

        // variable that contains the start token while we are processing
        // nodes. This enables error reporting to do its job
-        $start_token = false;
-        $context->register('CurrentToken', $start_token);
+        $node = $top_node;
+        // dummy token
+        list($token, $d) = $node->toTokenPair();
+        $context->register('CurrentNode', $node);
+        $context->register('CurrentToken', $token);

        //####################################################################//
        // Loop

-        // iterate through all start nodes. Determining the start node
-        // is complicated so it has been omitted from the loop construct
-        for ($i = 0, $size = count($tokens); $i < $size;) {
+        // We need to implement a post-order traversal iteratively, to
+        // avoid running into stack space limits.  This is pretty tricky
+        // to reason about, so we just manually stack-ify the recursive
+        // variant:
+        //
+        //  function f($node) {
+        //      foreach ($node->children as $child) {
+        //          f($child);
+        //      }
+        //      validate($node);
+        //  }
+        //
+        // Thus, we will represent a stack frame as array($node,
+        // $is_inline, $excludes, $ix), where $ix is the index of the
+        // next child of $node to process; children before $ix have
+        // already been processed.

-            //################################################################//
-            // Gather information on children
+        $parent_def = $definition->info_parent_def;
+        $stack = array(
+            array($top_node,
+                  $parent_def->descendants_are_inline,
+                  $parent_def->excludes, // exclusions
+                  0)
+            );

-            // child token accumulator
-            $child_tokens = array();
-
-            // scroll to the end of this node, report number, and collect
-            // all children
-            for ($j = $i, $depth = 0; ; $j++) {
-                if ($tokens[$j] instanceof HTMLPurifier_Token_Start) {
-                    $depth++;
-                    // skip token assignment on first iteration, this is the
-                    // token we currently are on
-                    if ($depth == 1) {
-                        continue;
-                    }
-                } elseif ($tokens[$j] instanceof HTMLPurifier_Token_End) {
-                    $depth--;
-                    // skip token assignment on last iteration, this is the
-                    // end token of the token we're currently on
-                    if ($depth == 0) {
-                        break;
-                    }
+        while (!empty($stack)) {
+            list($node, $is_inline, $excludes, $ix) = array_pop($stack);
+            // recursive call
+            $go = false;
+            $def = empty($stack) ? $definition->info_parent_def : $definition->info[$node->name];
+            while (isset($node->children[$ix])) {
+                $child = $node->children[$ix++];
+                if ($child instanceof HTMLPurifier_Node_Element) {
+                    $go = true;
+                    $stack[] = array($node, $is_inline, $excludes, $ix);
+                    $stack[] = array($child,
+                        // ToDo: I don't think it matters if it's def or
+                        // child_def, but double check this...
+                        $is_inline || $def->descendants_are_inline,
+                        empty($def->excludes) ? $excludes
+                                              : array_merge($excludes, $def->excludes),
+                        0);
+                    break;
                }
-                $child_tokens[] = $tokens[$j];
-            }
-
-            // $i is index of start token
-            // $j is index of end token
-
-            $start_token = $tokens[$i]; // to make token available via CurrentToken
-
-            //################################################################//
-            // Gather information on parent
-
-            // calculate parent information
-            if ($count = count($stack)) {
-                $parent_index = $stack[$count - 1];
-                $parent_name = $tokens[$parent_index]->name;
-                if ($parent_index == 0) {
-                    $parent_def = $definition->info_parent_def;
+            }
+            if ($go) continue;
+            list($token, $d) = $node->toTokenPair();
+            // base case
+            if ($excludes_enabled && isset($excludes[$node->name])) {
+                $node->dead = true;
+                if ($e) $e->send(E_ERROR, 'Strategy_FixNesting: Node excluded');
+            } else {
+                // XXX I suppose it would be slightly more efficient to
+                // avoid the allocation here and have children
+                // strategies handle it
+                $children = array();
+                foreach ($node->children as $child) {
+                    if (!$child->dead) $children[] = $child;
+                }
+                $result = $def->child->validateChildren($children, $config, $context);
+                if ($result === true) {
+                    // nop
+                    $node->children = $children;
+                } elseif ($result === false) {
+                    $node->dead = true;
+                    if ($e) $e->send(E_ERROR, 'Strategy_FixNesting: Node removed');
                } else {
-                    $parent_def = $definition->info[$parent_name];
-                }
-            } else {
-                // processing as if the parent were the "root" node
-                // unknown info, it won't be used anyway, in the future,
-                // we may want to enforce one element only (this is
-                // necessary for HTML Purifier to clean entire documents
-                $parent_index = $parent_name = $parent_def = null;
-            }
-
-            // calculate context
-            if ($is_inline === false) {
-                // check if conditions make it inline
-                if (!empty($parent_def) && $parent_def->descendants_are_inline) {
-                    $is_inline = $count - 1;
-                }
-            } else {
-                // check if we're out of inline
-                if ($count === $is_inline) {
-                    $is_inline = false;
-                }
-            }
-
-            //################################################################//
-            // Determine whether element is explicitly excluded SGML-style
-
-            // determine whether or not element is excluded by checking all
-            // parent exclusions. The array should not be very large, two
-            // elements at most.
-            $excluded = false;
-            if (!empty($exclude_stack) && $excludes_enabled) {
-                foreach ($exclude_stack as $lookup) {
-                    if (isset($lookup[$tokens[$i]->name])) {
-                        $excluded = true;
-                        // no need to continue processing
-                        break;
+                    $node->children = $result;
+                    if ($e) {
+                        // XXX This will miss mutations of internal nodes. Perhaps defer to the child validators
+                        if (empty($result) && !empty($children)) {
+                            $e->send(E_ERROR, 'Strategy_FixNesting: Node contents removed');
+                        } else if ($result != $children) {
+                            $e->send(E_WARNING, 'Strategy_FixNesting: Node reorganized');
+                        }
                    }
                }
            }
-
-            //################################################################//
-            // Perform child validation
-
-            if ($excluded) {
-                // there is an exclusion, remove the entire node
-                $result = false;
-                $excludes = array(); // not used, but good to initialize anyway
-            } else {
-                // DEFINITION CALL
-                if ($i === 0) {
-                    // special processing for the first node
-                    $def = $definition->info_parent_def;
-                } else {
-                    $def = $definition->info[$tokens[$i]->name];
-
-                }
-
-                if (!empty($def->child)) {
-                    // have DTD child def validate children
-                    $result = $def->child->validateChildren(
-                        $child_tokens,
-                        $config,
-                        $context
-                    );
-                } else {
-                    // weird, no child definition, get rid of everything
-                    $result = false;
-                }
-
-                // determine whether or not this element has any exclusions
-                $excludes = $def->excludes;
-            }
-
-            // $result is now a bool or array
-
-            //################################################################//
-            // Process result by interpreting $result
-
-            if ($result === true || $child_tokens === $result) {
-                // leave the node as is
-
-                // register start token as a parental node start
-                $stack[] = $i;
-
-                // register exclusions if there are any
-                if (!empty($excludes)) {
-                    $exclude_stack[] = $excludes;
-                }
-
-                // move cursor to next possible start node
-                $i++;
-
-            } elseif ($result === false) {
-                // remove entire node
-
-                if ($e) {
-                    if ($excluded) {
-                        $e->send(E_ERROR, 'Strategy_FixNesting: Node excluded');
-                    } else {
-                        $e->send(E_ERROR, 'Strategy_FixNesting: Node removed');
-                    }
-                }
-
-                // calculate length of inner tokens and current tokens
-                $length = $j - $i + 1;
-
-                // perform removal
-                array_splice($tokens, $i, $length);
-
-                // update size
-                $size -= $length;
-
-                // there is no start token to register,
-                // current node is now the next possible start node
-                // unless it turns out that we need to do a double-check
-
-                // this is a rought heuristic that covers 100% of HTML's
-                // cases and 99% of all other cases. A child definition
-                // that would be tricked by this would be something like:
-                // ( | a b c) where it's all or nothing. Fortunately,
-                // our current implementation claims that that case would
-                // not allow empty, even if it did
-                if (!$parent_def->child->allow_empty) {
-                    // we need to do a double-check [BACKTRACK]
-                    $i = $parent_index;
-                    array_pop($stack);
-                }
-
-                // PROJECTED OPTIMIZATION: Process all children elements before
-                // reprocessing parent node.
-
-            } else {
-                // replace node with $result
-
-                // calculate length of inner tokens
-                $length = $j - $i - 1;
-
-                if ($e) {
-                    if (empty($result) && $length) {
-                        $e->send(E_ERROR, 'Strategy_FixNesting: Node contents removed');
-                    } else {
-                        $e->send(E_WARNING, 'Strategy_FixNesting: Node reorganized');
-                    }
-                }
-
-                // perform replacement
-                array_splice($tokens, $i + 1, $length, $result);
-
-                // update size
-                $size -= $length;
-                $size += count($result);
-
-                // register start token as a parental node start
-                $stack[] = $i;
-
-                // register exclusions if there are any
-                if (!empty($excludes)) {
-                    $exclude_stack[] = $excludes;
-                }
-
-                // move cursor to next possible start node
-                $i++;
-            }
-
-            //################################################################//
-            // Scroll to next start node
-
-            // We assume, at this point, that $i is the index of the token
-            // that is the first possible new start point for a node.
-
-            // Test if the token indeed is a start tag, if not, move forward
-            // and test again.
-            $size = count($tokens);
-            while ($i < $size and !$tokens[$i] instanceof HTMLPurifier_Token_Start) {
-                if ($tokens[$i] instanceof HTMLPurifier_Token_End) {
-                    // pop a token index off the stack if we ended a node
-                    array_pop($stack);
-                    // pop an exclusion lookup off exclusion stack if
-                    // we ended node and that node had exclusions
-                    if ($i == 0 || $i == $size - 1) {
-                        // use specialized var if it's the super-parent
-                        $s_excludes = $definition->info_parent_def->excludes;
-                    } else {
-                        $s_excludes = $definition->info[$tokens[$i]->name]->excludes;
-                    }
-                    if ($s_excludes) {
-                        array_pop($exclude_stack);
-                    }
-                }
-                $i++;
-            }
-
        }

        //####################################################################//
        // Post-processing

-        // remove implicit parent tokens at the beginning and end
-        array_shift($tokens);
-        array_pop($tokens);
-
        // remove context variables
        $context->destroy('IsInline');
+        $context->destroy('CurrentNode');
        $context->destroy('CurrentToken');

        //####################################################################//
        // Return
-        return $tokens;
+
+        return HTMLPurifier_Arborize::flatten($node, $config, $context);
    }
 }