THE UNIVERSAL DESIGN PATTERN: PROPERTIES
Steve Yegge

Implementation:
    get(name)
    put(name, value)
    has(name)
    remove(name)
    iteration, with filtering [this will be our namespaces]
    parent

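The interface listed above can be sketched in a few lines of PHP. This is
only an illustration: the class name SketchPlist and its internals are
assumptions, not HTML Purifier code. PHP arrays keep insertion order, so a
plain array stands in for the LinkedHashMap.

```php
<?php
// Hypothetical sketch; SketchPlist is not an HTML Purifier class.
class SketchPlist
{
    /** Local properties, kept in insertion order. */
    private $data = array();
    /** Plist consulted when a key is missing locally (or null). */
    private $parent;

    public function __construct(?SketchPlist $parent = null)
    {
        $this->parent = $parent;
    }

    public function get($name)
    {
        // Walk up the parent chain iteratively; the first hit wins.
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                return $p->data[$name];
            }
        }
        return null;
    }

    public function put($name, $value)
    {
        // Writes always land locally; parents are never modified.
        $this->data[$name] = $value;
    }

    public function has($name)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                return true;
            }
        }
        return false;
    }

    public function remove($name)
    {
        // Removes the local entry only; an inherited value shows through again.
        unset($this->data[$name]);
    }
}

$defaults = new SketchPlist();
$defaults->put('HTML.Doctype', 'XHTML 1.0 Transitional');

$config = new SketchPlist($defaults);
$config->put('HTML.Doctype', 'HTML 4.01 Strict');
```

Reads fall through to the parent while writes stay local, which is the
read/write asymmetry the notes mention.
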
Representations:
    - Keys are strings
    - It's nice to not need to quote keys (if we formulate our own language,
      consider this)
    - Property not present representation (key missing)
    - With frequent removal/re-add, marking the key with null may help. If
      null is a valid value, use another marker. (PHP semantics are weird
      here)

Data structures:
    - LinkedHashMap is wonderful (O(1) access and maintains order)
    - Using a special property that points to the parent is usual
    - Multiple inheritance possible, need rules for which to look up first
    - Iterative inheritance is best
    - Consider performance!

Deletion
    - Tricky problem with inheritance
    - Distinguish between "not found" and "look in my parent for the property"
      [Maybe HTML Purifier won't allow deletion]

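One possible way to handle the deletion problem is a tombstone: instead of
unsetting the key, store a unique marker locally so the lookup stops there
rather than falling through to the parent. The sketch below assumes a
hypothetical TombstonePlist class; the marker is an object, so null remains
usable as an ordinary value.

```php
<?php
// Hypothetical sketch of deletion under inheritance using a tombstone marker.
class TombstonePlist
{
    private $data = array();
    private $parent;
    /** Unique marker object meaning "deleted here, do not ask the parent". */
    private static $deleted = null;

    public function __construct(?TombstonePlist $parent = null)
    {
        $this->parent = $parent;
    }

    private static function tombstone()
    {
        if (self::$deleted === null) {
            self::$deleted = new stdClass();
        }
        return self::$deleted;
    }

    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    public function remove($name)
    {
        // Record the deletion locally instead of unsetting the key.
        $this->data[$name] = self::tombstone();
    }

    public function has($name)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                // A tombstone ends the search with "not found".
                return $p->data[$name] !== self::tombstone();
            }
        }
        return false;
    }

    public function get($name, $default = null)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                $value = $p->data[$name];
                return $value === self::tombstone() ? $default : $value;
            }
        }
        return $default;
    }
}

$base = new TombstonePlist();
$base->put('Core.Encoding', 'utf-8');

$child = new TombstonePlist($base);
$child->remove('Core.Encoding'); // hidden in $child, still present in $base
```

A missing key means "look in my parent", while a tombstone means "not
found", which is the distinction the notes ask for.
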
Read/write asymmetry (it's correct!)

Read-only plists
    - Allow ability to freeze [this is what we have already]
    - Don't overuse it

Performance:
    - Intern strings (PHP does this already)
    - Don't be case-insensitive
    - If all properties in a plist are known a-priori, you can use a "perfect"
      hash function. Often overkill.
    - Copy-on-read caching "plundering" reduces lookup, but uses memory and can
      grow stale. Use as last resort.
    - Refactoring to fields. Watch for API compatibility, system complexity,
      and lack of flexibility.
    - Refrigerator: external data-structure to hold plists

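The copy-on-read ("plundering") item can be illustrated roughly as follows.
PlunderingPlist and flushCache() are hypothetical names invented for this
sketch; it only shows why the cache saves parent lookups and how it can go
stale.

```php
<?php
// Hypothetical sketch of copy-on-read caching ("plundering").
class PlunderingPlist
{
    private $data = array();
    /** Values copied down from ancestors on first read. */
    private $cache = array();
    private $parent;

    public function __construct(?PlunderingPlist $parent = null)
    {
        $this->parent = $parent;
    }

    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    public function get($name)
    {
        if (array_key_exists($name, $this->data)) {
            return $this->data[$name];
        }
        if (array_key_exists($name, $this->cache)) {
            return $this->cache[$name]; // no parent walk needed
        }
        for ($p = $this->parent; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                // Plunder: keep a local copy of the inherited value.
                $this->cache[$name] = $p->data[$name];
                return $this->cache[$name];
            }
        }
        return null;
    }

    public function flushCache()
    {
        $this->cache = array();
    }
}

$root = new PlunderingPlist();
$root->put('Cache.SerializerPath', '/tmp/a');

$leaf = new PlunderingPlist($root);
$first = $leaf->get('Cache.SerializerPath');  // copied into $leaf's cache

$root->put('Cache.SerializerPath', '/tmp/b');
$stale = $leaf->get('Cache.SerializerPath');  // still the cached copy

$leaf->flushCache();
$fresh = $leaf->get('Cache.SerializerPath');  // re-read from $root
```

The stale read after the parent changes is the cost the notes warn about,
hence "use as last resort".
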
Transient properties:
    [Don't need to worry about this]
    - Use a separate plist for transient properties
    - Non-numeric override; numeric should ADD
    - Deletion: removeTransientProperty() and transientlyRemoveProperty()

Persistence:
    - XML/JSON are good
    - Text-based is good for readability, maintainability and bootstrapping
    - Compressed binary format for network transport [not necessary]
    - RDBMS or XML database

Querying: [not relevant]
    - XML database is nice for XPath/XQuery
    - jQuery for JSON
    - Just load it all into a program

Backfills/Data integrity:
    - Use usual methods
    - Lazy backfill is a nice hack

Type systems:
    - Flags: ReadOnly, Permanent, DontEnum
    - Typed properties aren't that useful [It's also Not-PHP]
    - Separate meta-list of directive properties IS useful
    - Duck typing is useful for systems designed fully around properties pattern

Trade-off:
    + Flexibility
    + Extensibility
    + Unit-testing/prototype-speed
    - Performance
    - Data integrity
    - Navigability/Query-ability
    - Reversibility (hard to go back)

HTML Purifier

We are not happy with our current system of defining configuration directives,
because it has become clear that things will get a lot nicer if we allow
multiple namespaces, and there are some features that naturally lend themselves
to inheritance, which we do not really support well.

One of the considered implementation changes would be to go from a structure
like:

    array(
        'Namespace' => array(
            'Directive' => 'val1',
            'Directive2' => 'val2',
        )
    )

to:

    array(
        'Namespace.Directive' => 'val1',
        'Namespace.Directive2' => 'val2',
    )

The latter implementation takes more memory, however, and it makes it a bit
complicated to grab all values from a namespace.

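With the flat layout, one way to grab all values from a namespace is to
filter keys by prefix. The helper below is a hypothetical sketch
(sketch_get_batch is not an existing function); it assumes dotted string
keys as in the second array above.

```php
<?php
// Hypothetical helper: collect the entries of one namespace from a flat plist.
function sketch_get_batch(array $plist, $namespace)
{
    // Include the dot so that "HTML" does not also match "HTMLExtra.*".
    $prefix = $namespace . '.';
    $len = strlen($prefix);
    $batch = array();
    foreach ($plist as $key => $value) {
        if (strncmp((string) $key, $prefix, $len) === 0) {
            // Strip the namespace so the result is keyed by directive name.
            $batch[substr((string) $key, $len)] = $value;
        }
    }
    return $batch;
}

$plist = array(
    'HTML.Doctype'   => 'XHTML 1.0 Strict',
    'HTML.TidyLevel' => 'medium',
    'HTMLExtra.Flag' => true,
    'URI.Disable'    => false,
);

$html = sketch_get_batch($plist, 'HTML');
```

Every call walks the whole array, so the cost grows with the total number
of directives rather than with the size of the namespace.
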
The alternate implementation choice is to allow nested plists. This keeps
iteration easy, but is problematic for inheritance (it would be difficult
to distinguish a plist from an array) and retrieval (when specifying multiple
namespaces we would need some multiple de-referencing).

----

We can take the performance hit, and just do iteration with a filter
(the strncmp call should be relatively cheap). Then, users should be able
to optimize by doing something like:

    $config = HTMLPurifier_Config::createDefault();
    if (!file_exists('config.php')) {
        // set up $config
        $config->save('config.php');
    } else {
        $config->load('config.php');
    }

Or maybe memcache, or something. This means that "// set up $config" must
not have any dynamic parts, or the user has to invalidate the cache when
they do update it. We have to think about this a little more carefully; the
file call might be more expensive than setting up the configuration directly.

----

This might get expensive, however, when we actually care about iterating
over the configuration and want the actual values. So what about nesting the
lists?

    "ns.sub.directive" => values['ns']['sub']['directive']

We can distinguish between plists and arrays by using ArrayObjects for the
plists, and regular arrays for the arrays? Alternatively, use ArrayObjects
for the arrays, and regular arrays for the plists.

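A rough sketch of the nested lookup, taking the first option (ArrayObject
for plists, plain arrays for values). The function name sketch_nested_get
is an assumption made for illustration only.

```php
<?php
// Hypothetical sketch: nested plists are ArrayObjects, plain arrays are values.
function sketch_nested_get(ArrayObject $root, $key)
{
    $node = $root;
    foreach (explode('.', $key) as $part) {
        // Only ArrayObjects are plists that may be descended into.
        if (!($node instanceof ArrayObject) || !$node->offsetExists($part)) {
            return null;
        }
        $node = $node[$part];
    }
    return $node;
}

$values = new ArrayObject(array(
    'ns' => new ArrayObject(array(
        'sub' => new ArrayObject(array(
            // A plain array is a directive value, not a nested plist.
            'directive' => array('a', 'b'),
        )),
    )),
));

$value = sketch_nested_get($values, 'ns.sub.directive');
```

Because the directive value is a plain array, a key such as
"ns.sub.directive.0" does not descend into it; the instanceof check is what
tells a plist apart from an array-valued directive.
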
----

Implementation demands, and what has caused them:

1. DefinitionCache, the HTML, CSS and URI namespaces have caches attached to them
   Results:
    - getBatchSerial()
    - getBatch() : in general, the ability to traverse just a namespace

2. AutoFormat/Filter, this is a plugin architecture, directives not hard-coded
    - getBatch()

3. Configuration form
    - Namespaces used to organize directives

Other than that, we have a pure plist. PERHAPS we should maintain separate things
for these different demands.

Issue 2: Directives for configuring the plugins are regular plists, but
when enabling them, while it's "plist-ish", what you're really doing is adding
them to an array of "autoformatters"/"filters" to enable. We can set up
magic BC as well as in the new interface, but there should also be an
add('AutoFormat', 'AutoParagraph'); which does the right thing.

One thing to consider is whether or not inheritance rules will apply to these.
I'd say yes. That means that they're still plisty; in fact, the underlying
implementation will probably be a plist. However, they will get their OWN
plists, and will NOT support nesting.

Issue 1: Our current implementation is generally not efficient; md5(serialize($foo))
is pretty expensive. So, I don't think there will be any problems if it
gets "less" efficient, as long as we give users a properly fast alternative;
DefinitionRev gives us a way to do this, by simply telling the user they must
update it whenever they update Configuration directives as well. (There are
obvious BC concerns here).

In such a case, we simply iterate over our plist (performing full retrievals
for each value), grab the entries we care about, and then serialize and hash.
It's going to be slow either way, due to the ability of plists to inherit.
If we ksort(), we don't have to traverse the entire array; however, the
cost of a ksort() call may not be worth it.

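The iterate, filter, serialize and hash steps might look roughly like this.
It is a sketch under assumptions: sketch_batch_serial is a made-up name, and
the input is taken to be a flat array in which inheritance has already been
resolved. It uses the ksort() variant, so the loop can stop once it has
passed the namespace's run of keys.

```php
<?php
// Hypothetical sketch of a per-namespace serial over a flat, resolved plist.
function sketch_batch_serial(array $flat, $namespace)
{
    $prefix = $namespace . '.';
    $len = strlen($prefix);

    // Sorts the local copy; with SORT_STRING, keys sharing a prefix
    // end up next to each other.
    ksort($flat, SORT_STRING);

    $batch = array();
    $seen = false;
    foreach ($flat as $key => $value) {
        if (strncmp((string) $key, $prefix, $len) === 0) {
            $batch[$key] = $value;
            $seen = true;
        } elseif ($seen) {
            break; // past the namespace's contiguous run of keys
        }
    }
    return md5(serialize($batch));
}

$a = array('URI.Base' => 'x', 'HTML.B' => 2, 'HTML.A' => 1);
$b = array('HTML.A' => 1, 'Core.X' => true, 'HTML.B' => 2);

$serialA = sketch_batch_serial($a, 'HTML');
$serialB = sketch_batch_serial($b, 'HTML');
```

Both arrays hold the same HTML directives, so the two serials match even
though insertion order and the other namespaces differ. Whether the sort
pays for itself depends on how large the plist is.
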
At this point, last time, I started worrying about the performance implications
of allowing inheritance, and wondering whether or not I wanted to squash
the plist. At first blush, our code might be under the assumption that
accessing properties is cheap; but actually we prefer to copy out the value
into a member variable if it's going to be used many times. With this in mind,
I don't think CPU consumption from a few nested function calls is going to
be a problem. We *are* going to enforce a function-only interface.

The next issue at hand is how we're going to manage the "special" plists,
which should still be able to be inherited. Basically, it means that multiple
plists would be attached to the configuration object, which is not the
best for memory performance. The alternative is to keep them all in one
big plist, and then eat the one-time cost of traversing the entire plist
to grab the appropriate values.

I think at this point we can write the generic interface, and then set up separate
plists if that ends up being necessary for performance (it probably won't.) Now
let's code our generic plist implementation.

----

Iterating over the plist presents some problems. The way we've chosen to solve
this is to squash all of the parents.

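Squashing could be sketched as below: collect the chain of parents, then
overlay them from the root down so that children override what they
inherit. SquashablePlist and squash() are hypothetical names, not the
actual implementation.

```php
<?php
// Hypothetical sketch of squashing a plist and its parents into one array.
class SquashablePlist
{
    private $data = array();
    private $parent;

    public function __construct(?SquashablePlist $parent = null)
    {
        $this->parent = $parent;
    }

    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    public function squash()
    {
        // Collect every layer from this plist up to the root.
        $layers = array();
        for ($p = $this; $p !== null; $p = $p->parent) {
            $layers[] = $p->data;
        }
        // Apply the root first so that children overwrite inherited values.
        $flat = array();
        foreach (array_reverse($layers) as $layer) {
            foreach ($layer as $key => $value) {
                $flat[$key] = $value;
            }
        }
        return $flat;
    }
}

$root = new SquashablePlist();
$root->put('a', 1);
$root->put('b', 2);

$child = new SquashablePlist($root);
$child->put('b', 3);
$child->put('c', 4);

$flat = $child->squash();
```

The result is an ordinary array, so a plain foreach works on it, at the
cost of one pass over every ancestor.
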
----

But I don't need iteration.

    vim: et sw=4 sts=4