1
0
mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-07-31 19:30:21 +02:00

Rewrite FixNesting implementation to be tree-based.

This mega-patch rips out the FixNesting implementation and the related
ChildDef components.  The primary algorithmic change is to convert from
use of tokens to tree nodes, which are far more amenable to the style
of processing that FixNesting uses.  Additionally, FixNesting has been
changed to go bottom-up rather than top-down, in order to avoid needing
to implement backtracking.

This patch simplifies a good deal of the relevant logic, since we no
longer need to continually recalculate the nesting structure when
processing things.  However, the conversion to the alternate format
incurs some overhead, so for small inputs these changes are not a win.
One possibility to greatly reduce the constant factors here is to switch
to entirely using libxml's representation, and never serializing tokens;
this would require one to rewrite injectors, however.

The iterative post-order traversal in FixNesting is a bit subtle, but
we have essentially reified the stack and continuations.

We've removed support for %Core.EscapeInvalidChildren.

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
This commit is contained in:
Edward Z. Yang
2013-10-20 22:18:59 -07:00
parent b3640e1af6
commit 0767bbc12d
22 changed files with 358 additions and 698 deletions

View File

@@ -10,12 +10,12 @@
* document type definitions, such as the chameleon nature of ins/del
* tags and global child exclusions.
*
* The first major objective of this strategy is to iterate through all the
* nodes (not tokens) of the list of tokens and determine whether or not
* their children conform to the element's definition. If they do not, the
* child definition may optionally supply an amended list of elements that
* is valid or require that the entire node be deleted (and the previous
* node rescanned).
* The first major objective of this strategy is to iterate through all
* the nodes and determine whether or not their children conform to the
* element's definition. If they do not, the child definition may
* optionally supply an amended list of elements that is valid or
* require that the entire node be deleted (and the previous node
* rescanned).
*
* The second objective is to ensure that explicitly excluded elements of
* an element do not appear in its children. Code that accomplishes this
@@ -25,23 +25,8 @@
* @note Whether or not unrecognized children are silently dropped or
* translated into text depends on the child definitions.
*
* @todo Enable nodes to be bubbled out of the structure.
*
* @warning This algorithm (though it may be hard to see) proceeds from
* a top-down fashion. Thus, parents are processed before
* children. This is easy to implement and has a nice effiency
* benefit, in that if a node is removed, we never waste any
* time processing it, but it also means that if a child
* changes in a non-encapsulated way (e.g. it is removed), we
* need to go back and reprocess the parent to see if those
* changes resulted in problems for the parent. See
* [BACKTRACK] for an example of this. In the current
* implementation, this backtracking can only be triggered when
* a node is removed and if that node was the sole node, the
* parent would need to be removed. As such, it is easy to see
* that backtracking only incurs constant overhead. If more
* sophisticated backtracking is implemented, care must be
* taken to avoid nontermination or exponential blowup.
* @todo Enable nodes to be bubbled out of the structure. This is
* easier with our new algorithm.
*/
class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
@@ -55,23 +40,19 @@ class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
*/
public function execute($tokens, $config, $context)
{
//####################################################################//
// Pre-processing
//$node = HTMLPurifier_Arborize::arborize($tokens, $config, $context);
//$new_tokens = HTMLPurifier_Arborize::flatten($node, $config, $context);
// O(n) pass to convert to a tree, so that we can efficiently
// refer to substrings
$top_node = HTMLPurifier_Arborize::arborize($tokens, $config, $context);
// get a copy of the HTML definition
$definition = $config->getHTMLDefinition();
$excludes_enabled = !$config->get('Core.DisableExcludes');
// insert implicit "parent" node, will be removed at end.
// DEFINITION CALL
$parent_name = $definition->info_parent;
array_unshift($tokens, new HTMLPurifier_Token_Start($parent_name));
$tokens[] = new HTMLPurifier_Token_End($parent_name);
// setup the context variable 'IsInline', for chameleon processing
// is 'false' when we are not inline, 'true' when it must always
// be inline, and an integer when it is inline for a certain
@@ -85,278 +66,115 @@ class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
//####################################################################//
// Loop initialization
// stack that contains the indexes of all parents,
// $stack[count($stack)-1] being the current parent
$stack = array();
// stack that contains all elements that are excluded
// it is organized by parent elements, similar to $stack,
// but it is only populated when an element with exclusions is
// processed, i.e. there won't be empty exclusions.
$exclude_stack = array();
$exclude_stack = array($definition->info_parent_def->excludes);
// variable that contains the start token while we are processing
// nodes. This enables error reporting to do its job
$start_token = false;
$context->register('CurrentToken', $start_token);
$node = $top_node;
// dummy token
list($token, $d) = $node->toTokenPair();
$context->register('CurrentNode', $node);
$context->register('CurrentToken', $token);
//####################################################################//
// Loop
// iterate through all start nodes. Determining the start node
// is complicated so it has been omitted from the loop construct
for ($i = 0, $size = count($tokens); $i < $size;) {
// We need to implement a post-order traversal iteratively, to
// avoid running into stack space limits. This is pretty tricky
// to reason about, so we just manually stack-ify the recursive
// variant:
//
// function f($node) {
// foreach ($node->children as $child) {
// f($child);
// }
// validate($node);
// }
//
// Thus, we will represent a stack frame as array($node,
// $is_inline, stack of children)
// e.g. array_reverse($node->children) - already processed
// children.
//################################################################//
// Gather information on children
$parent_def = $definition->info_parent_def;
$stack = array(
array($top_node,
$parent_def->descendants_are_inline,
$parent_def->excludes, // exclusions
0)
);
// child token accumulator
$child_tokens = array();
// scroll to the end of this node, report number, and collect
// all children
for ($j = $i, $depth = 0; ; $j++) {
if ($tokens[$j] instanceof HTMLPurifier_Token_Start) {
$depth++;
// skip token assignment on first iteration, this is the
// token we currently are on
if ($depth == 1) {
continue;
}
} elseif ($tokens[$j] instanceof HTMLPurifier_Token_End) {
$depth--;
// skip token assignment on last iteration, this is the
// end token of the token we're currently on
if ($depth == 0) {
break;
}
while (!empty($stack)) {
list($node, $is_inline, $excludes, $ix) = array_pop($stack);
// recursive call
$go = false;
$def = empty($stack) ? $definition->info_parent_def : $definition->info[$node->name];
while (isset($node->children[$ix])) {
$child = $node->children[$ix++];
if ($child instanceof HTMLPurifier_Node_Element) {
$go = true;
$stack[] = array($node, $is_inline, $excludes, $ix);
$stack[] = array($child,
// ToDo: I don't think it matters if it's def or
// child_def, but double check this...
$is_inline || $def->descendants_are_inline,
empty($def->excludes) ? $excludes
: array_merge($excludes, $def->excludes),
0);
break;
}
$child_tokens[] = $tokens[$j];
}
// $i is index of start token
// $j is index of end token
$start_token = $tokens[$i]; // to make token available via CurrentToken
//################################################################//
// Gather information on parent
// calculate parent information
if ($count = count($stack)) {
$parent_index = $stack[$count - 1];
$parent_name = $tokens[$parent_index]->name;
if ($parent_index == 0) {
$parent_def = $definition->info_parent_def;
};
if ($go) continue;
list($token, $d) = $node->toTokenPair();
// base case
if ($excludes_enabled && isset($excludes[$node->name])) {
$node->dead = true;
if ($e) $e->send(E_ERROR, 'Strategy_FixNesting: Node excluded');
} else {
// XXX I suppose it would be slightly more efficient to
// avoid the allocation here and have children
// strategies handle it
$children = array();
foreach ($node->children as $child) {
if (!$child->dead) $children[] = $child;
}
$result = $def->child->validateChildren($children, $config, $context);
if ($result === true) {
// nop
$node->children = $children;
} elseif ($result === false) {
$node->dead = true;
if ($e) $e->send(E_ERROR, 'Strategy_FixNesting: Node removed');
} else {
$parent_def = $definition->info[$parent_name];
}
} else {
// processing as if the parent were the "root" node
// unknown info, it won't be used anyway, in the future,
// we may want to enforce one element only (this is
// necessary for HTML Purifier to clean entire documents
$parent_index = $parent_name = $parent_def = null;
}
// calculate context
if ($is_inline === false) {
// check if conditions make it inline
if (!empty($parent_def) && $parent_def->descendants_are_inline) {
$is_inline = $count - 1;
}
} else {
// check if we're out of inline
if ($count === $is_inline) {
$is_inline = false;
}
}
//################################################################//
// Determine whether element is explicitly excluded SGML-style
// determine whether or not element is excluded by checking all
// parent exclusions. The array should not be very large, two
// elements at most.
$excluded = false;
if (!empty($exclude_stack) && $excludes_enabled) {
foreach ($exclude_stack as $lookup) {
if (isset($lookup[$tokens[$i]->name])) {
$excluded = true;
// no need to continue processing
break;
$node->children = $result;
if ($e) {
// XXX This will miss mutations of internal nodes. Perhaps defer to the child validators
if (empty($result) && !empty($children)) {
$e->send(E_ERROR, 'Strategy_FixNesting: Node contents removed');
} else if ($result != $children) {
$e->send(E_WARNING, 'Strategy_FixNesting: Node reorganized');
}
}
}
}
//################################################################//
// Perform child validation
if ($excluded) {
// there is an exclusion, remove the entire node
$result = false;
$excludes = array(); // not used, but good to initialize anyway
} else {
// DEFINITION CALL
if ($i === 0) {
// special processing for the first node
$def = $definition->info_parent_def;
} else {
$def = $definition->info[$tokens[$i]->name];
}
if (!empty($def->child)) {
// have DTD child def validate children
$result = $def->child->validateChildren(
$child_tokens,
$config,
$context
);
} else {
// weird, no child definition, get rid of everything
$result = false;
}
// determine whether or not this element has any exclusions
$excludes = $def->excludes;
}
// $result is now a bool or array
//################################################################//
// Process result by interpreting $result
if ($result === true || $child_tokens === $result) {
// leave the node as is
// register start token as a parental node start
$stack[] = $i;
// register exclusions if there are any
if (!empty($excludes)) {
$exclude_stack[] = $excludes;
}
// move cursor to next possible start node
$i++;
} elseif ($result === false) {
// remove entire node
if ($e) {
if ($excluded) {
$e->send(E_ERROR, 'Strategy_FixNesting: Node excluded');
} else {
$e->send(E_ERROR, 'Strategy_FixNesting: Node removed');
}
}
// calculate length of inner tokens and current tokens
$length = $j - $i + 1;
// perform removal
array_splice($tokens, $i, $length);
// update size
$size -= $length;
// there is no start token to register,
// current node is now the next possible start node
// unless it turns out that we need to do a double-check
// this is a rought heuristic that covers 100% of HTML's
// cases and 99% of all other cases. A child definition
// that would be tricked by this would be something like:
// ( | a b c) where it's all or nothing. Fortunately,
// our current implementation claims that that case would
// not allow empty, even if it did
if (!$parent_def->child->allow_empty) {
// we need to do a double-check [BACKTRACK]
$i = $parent_index;
array_pop($stack);
}
// PROJECTED OPTIMIZATION: Process all children elements before
// reprocessing parent node.
} else {
// replace node with $result
// calculate length of inner tokens
$length = $j - $i - 1;
if ($e) {
if (empty($result) && $length) {
$e->send(E_ERROR, 'Strategy_FixNesting: Node contents removed');
} else {
$e->send(E_WARNING, 'Strategy_FixNesting: Node reorganized');
}
}
// perform replacement
array_splice($tokens, $i + 1, $length, $result);
// update size
$size -= $length;
$size += count($result);
// register start token as a parental node start
$stack[] = $i;
// register exclusions if there are any
if (!empty($excludes)) {
$exclude_stack[] = $excludes;
}
// move cursor to next possible start node
$i++;
}
//################################################################//
// Scroll to next start node
// We assume, at this point, that $i is the index of the token
// that is the first possible new start point for a node.
// Test if the token indeed is a start tag, if not, move forward
// and test again.
$size = count($tokens);
while ($i < $size and !$tokens[$i] instanceof HTMLPurifier_Token_Start) {
if ($tokens[$i] instanceof HTMLPurifier_Token_End) {
// pop a token index off the stack if we ended a node
array_pop($stack);
// pop an exclusion lookup off exclusion stack if
// we ended node and that node had exclusions
if ($i == 0 || $i == $size - 1) {
// use specialized var if it's the super-parent
$s_excludes = $definition->info_parent_def->excludes;
} else {
$s_excludes = $definition->info[$tokens[$i]->name]->excludes;
}
if ($s_excludes) {
array_pop($exclude_stack);
}
}
$i++;
}
}
//####################################################################//
// Post-processing
// remove implicit parent tokens at the beginning and end
array_shift($tokens);
array_pop($tokens);
// remove context variables
$context->destroy('IsInline');
$context->destroy('CurrentNode');
$context->destroy('CurrentToken');
//####################################################################//
// Return
return $tokens;
return HTMLPurifier_Arborize::flatten($node, $config, $context);
}
}