mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-17 14:08:15 +01:00
Update dev-includes.txt with our evil master plan.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1506 48356398-32a2-884e-a903-53898d9a118a
parent c85fd83d2b
commit c43c0660f5
5
TODO
@@ -12,7 +12,10 @@ amount of effort to implement, it may get endlessly delayed. Do not be
 afraid to cast your vote for the next feature to be implemented!
 
 IMPORTANT
-- Ensure all configuration goes through HTMLPurifier_Config
+- Put our new configuration thing into effect
+- Get everything into configuration objects (filters, I'm looking at you)
+- Factor demo.php into a set of Printer classes, and then create a stub
+file for users here
 
 3.1 release [Error'ed]
 # Error logging for filtering/cleanup procedures
@@ -20,7 +20,9 @@ when there are not. Unfortunately, these two goals seem contrary to each
 other.
 
 A peripheral issue is the performance of ConfigSchema, which has been
-shown to take a large, constant amount of initialization time.
+shown to take a large, constant amount of initialization time, and is
+intricately linked to the issue of includes due to its pervasive use
+in our plugin architecture.
 
 Pros and Cons
 -------------
@@ -47,13 +49,222 @@ Include it all:
 - Classes that inherit from external libraries will cause compile
 errors
 
-Build an include stub:
+Build an include stub (Let's do this!):
 Pros:
 - Only necessary code is included
 - Plays nicely with opcode caches and standalone version
 - require (without once) can be used, see above
+- Could further extend as a compilation to one file
 Cons:
 - Not implemented yet
 - Requires user intervention and use of a command line script
 - Standalone script must be chained to this
 - More complex and compiled-language-like
+- Requires a whole new class of system-wide configuration directives,
+as configuration objects can be reused
+- Determining what needs to be included can be complex (see above)
+- No way of autodetecting dynamically instantiated classes
+- Might be slow
+
+Include stubs
+-------------
+
+This solution may be "just right" for users who are heavily oriented
+towards performance. However, there are a number of picky implementation
+details to work out beforehand.
+
+The number one concern is how to make the HTML Purifier files "work
+out of the box", while still being able to easily get them into a form
+that works with this setup. As the codebase stands right now, it would
+be necessary to strip out all of the require_once calls. The only way
+we could get rid of the require_once calls is to use __autoload or
+use the stub for all cases (which might not be a bad idea).
+
+Aside
+-----
+An important thing to remember, however, is that these require_once's
+are valuable data about what classes a file needs. Unfortunately, there's
+no distinction between whether or not the file is needed all the time,
+or whether or not it is one of our "optional" files. Thus, it is
+effectively useless.
+
+Deprecated
+----------
+One of the things I'd like to do is have the code search for any classes
+that are explicitly mentioned in the code. If a class isn't mentioned, I
+get to assume that it is "optional," i.e. included via introspection.
+The choice is either to use PHP's tokenizer or use regexps; regexps would
+be faster but a tokenizer would be more correct. If this ends up being
+unfeasible, adding dependency comments isn't a bad idea. (This could
+even be done automatically by search/replacing require_once, although
+we'd have to manually inspect the results for the optional requires.)
+
+NOTE: This ends up not being necessary, as we're going to make the user
+figure out all the extra classes they need, and only include the core
+which is predetermined.
+
+Using the autoload framework with include stubs works nicely with
+introspective classes: instead of having to have require_once inside
+the function, we can let autoload do the work; we simply need to
+new $class or accept the object straight from the caller. Handling filters
+becomes a simple matter of ticking off configuration directives, and
+if ConfigSchema spits out errors, adding the necessary includes. We could
+also use the autoload framework as a fallback, in case the user forgets
+to make the include, but doesn't really care about performance.
+
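The fallback autoload idea could be sketched roughly as follows. This is only an illustration: it uses spl_autoload_register rather than a bare __autoload function, and the function names, the library/ directory, and the rule that underscores in a class name map to directory separators are all assumptions, not something the notes above specify.

```php
<?php
// Hypothetical helper: map an underscore-delimited class name to a relative
// path, e.g. HTMLPurifier_Filter_YouTube -> HTMLPurifier/Filter/YouTube.php.
// Returns null for classes outside the HTMLPurifier prefix.
function example_class_to_path(string $class): ?string
{
    if ($class !== 'HTMLPurifier' && strncmp($class, 'HTMLPurifier_', 13) !== 0) {
        return null;
    }
    return str_replace('_', '/', $class) . '.php';
}

// Fallback handler: PHP only calls this for classes that were NOT already
// loaded by the include stub, so performance-minded users never pay for it.
function example_autoload(string $class): void
{
    $path = example_class_to_path($class);
    if ($path === null) {
        return; // not ours; let other autoloaders try
    }
    $file = __DIR__ . '/library/' . $path; // assumed directory layout
    if (is_file($file)) {
        require $file; // plain require: autoload fires at most once per class
    }
}

spl_autoload_register('example_autoload');
```

With a handler like this registered, code that does new $class keeps working even if the user forgot an include, at the cost of a file lookup on first use.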
+Insight
+-------
+All of this talk is merely a natural extension of what our current
+standalone functionality does. However, instead of having our code
+perform the includes, or attempting to inline everything that possibly
+could be used, we boot the issue to the user, making them include
+everything or set up the fallback autoload handler.
+
+Configuration Schema
+--------------------
+
+A common deficiency for all of the conditional include setups (including
+the dynamically built include PHP stub) is that if one of these
+conditionally included files includes a configuration directive, it
+is not accessible to configdoc. A stopgap solution for this problem is
+to have it piggy-back off of the data in the merge-library.php script
+to figure out what extra files it needs to include, but if the file also
+inherits classes that don't exist, we're in big trouble.
+
+I think it's high time we centralized the configuration documentation.
+However, the type checking has been a great boon for the library, and
+I'd like to keep that. The compromise is to use some other source, and
+then parse it into the ConfigSchema internal format (sans all of those
+nasty documentation strings which we really don't need at runtime) and
+serialize that for future use.
+
+The next question is that of format. XML is very verbose, and the prospect
+of setting defaults in it gives me the willies. However, this may be necessary.
+Splitting up the file into manageable chunks may alleviate this trouble,
+and we may even want to create our own format optimized for specifying
+configuration. It might look like (based off the PHPT format, which is
+nicely compact yet unambiguous and human-readable):
+
+    Core.HiddenElements
+    TYPE: lookup
+    DEFAULT: array('script', 'style') // auto-converted during processing
+    --ALIASES--
+    Core.InvisibleElements, Core.StupidElements
+    --DESCRIPTION--
+    <p>
+    Blah blah
+    </p>
+
+The first line is the directive name; the lines after that, prior to the
+first --HEADER-- block, are single-line values; the multiline values
+follow in the --HEADER-- blocks. No value is restricted to a particular
+format: DEFAULT could very well be multiline if that would be easier.
+This would also make it insanely easy to add arbitrary extra parameters,
+like:
+
+    VERSION: 3.0.0
+    ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array()
+    EXTERNAL: CSSTidy // this would be documented somewhere else with a URL
+
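A parser for the format described above could be quite small. The sketch below is an assumption about how such a parser might look, not an existing HTML Purifier API: the function name is made up, the directive name is stored under an invented ID key, and values are kept as raw strings (trailing // comments and array() conversion are deliberately left to a later processing step).

```php
<?php
// Hypothetical parser for one directive file in the proposed PHPT-like format.
// Returns an associative array of raw string values keyed by field name.
function example_parse_directive(string $text): array
{
    $lines = preg_split('/\R/', trim($text));
    // First line is the directive name.
    $result = ['ID' => trim(array_shift($lines))];
    $block = null; // name of the current --HEADER-- block, if we are inside one

    foreach ($lines as $line) {
        // A line such as --DESCRIPTION-- opens a multiline block.
        if (preg_match('/^--([A-Z]+)--$/', trim($line), $m)) {
            $block = $m[1];
            $result[$block] = '';
            continue;
        }
        if ($block !== null) {
            // Inside a block: accumulate lines verbatim.
            $result[$block] .= ($result[$block] === '' ? '' : "\n") . $line;
            continue;
        }
        // Before the first block: single-line "KEY: value" pairs.
        if (preg_match('/^([A-Z]+):\s*(.*)$/', $line, $m)) {
            $result[$m[1]] = $m[2];
        }
    }
    return $result;
}
```

Because unknown keys are simply stored, extra parameters such as VERSION or EXTERNAL need no parser changes; only the later translation into the ConfigSchema format would have to know about them.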
+The final loss would be that you wouldn't know what file the directive
+was used in; with some clever regexps it should be possible to
+figure out where $config->get($ns, $d); occurs. Reflective calls to
+the configuration object are mitigated by the fact that getBatch is
+used, so we can simply talk about that in the namespace definition page.
+This might be slow, but it would only happen when we are creating
+the documentation for consumption, and is sugar.
+
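As a rough sketch of the "clever regexps" idea, the following scans a source string for $config->get() calls whose two arguments are string literals. The function name is hypothetical, and the pattern assumes the variable is literally named $config and that directive names are alphanumeric; calls with variable arguments are skipped, which is exactly the reflective case the text says must be documented separately.

```php
<?php
// Hypothetical scanner: list the "Namespace.Directive" pairs that a piece of
// PHP source requests through literal $config->get('Ns', 'Name') calls.
function example_find_directive_uses(string $source): array
{
    // \1 and \3 force each argument's closing quote to match its opening quote.
    $pattern = '/\$config->get\(\s*([\'"])([A-Za-z0-9]+)\1\s*,\s*([\'"])([A-Za-z0-9]+)\3\s*\)/';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        $found[] = $m[2] . '.' . $m[4];
    }
    // De-duplicate while keeping first-seen order.
    return array_values(array_unique($found));
}
```

Running this over each library file and inverting the result would give a directive-to-files map for the documentation build.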
+We can put this in a schema/ directory, outside of HTML Purifier. The serialized
+data gets treated like entities.ser.
+
+The final thing that needs to be handled is user defined configurations.
+They can be added at runtime using ConfigSchema::registerDirectory()
+which globs the directory and grabs all of the directives to be incorporated
+in. Then, the result is saved. We may want to take advantage of the
+DefinitionCache framework, although it is not altogether certain what
+configuration directives would be used to generate our key (meta-directives!).
+
+Further thoughts
+----------------
+Our master configuration schema will only need to be updated once
+every new version, so it's easily versionable. User specified
+schema files are far more volatile, but it's far too expensive
+to check the filemtimes of all the files, so a DefinitionRev style
+mechanism works better. However, we can uniquely identify the
+schema based on the directories they loaded, so there's no need
+for a DefinitionId until we give them full programmatic control.
+
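One way to picture the cache key this implies: combine the library version, a user-bumped revision counter (the DefinitionRev style mechanism), and the list of registered schema directories. The function name, parameter order, and the use of md5 are assumptions for illustration only; directory order is treated as significant here in case later directories override earlier ones.

```php
<?php
// Hypothetical cache key for a serialized schema. Changing the version, the
// revision counter, or the set/order of directories yields a different key,
// so a stale cache file is simply never looked up again.
function example_schema_cache_key(string $version, array $directories, int $revision = 1): string
{
    $normalized = [];
    foreach ($directories as $dir) {
        // Normalize separators and trailing slashes so "schema/" == "schema".
        $normalized[] = rtrim(str_replace('\\', '/', $dir), '/');
    }
    return md5($version . ',' . $revision . ',' . implode(',', $normalized));
}
```

No file modification times are read, which matches the concern above that checking filemtimes for every schema file would be too expensive.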
+These variables should be directly incorporated into ConfigSchema,
+and ConfigSchema should handle serialization. Some refactoring will be
+necessary for the DefinitionCache classes, as they are built with
+Config in mind. If the user changes something, the cache file gets
+rebuilt. If the version changes, the cache file gets rebuilt. Since
+our unit tests flush the caches before we start, and the operation is
+pretty fast, this will not negatively impact unit testing.
+
+One last thing: certain configuration directives require that files
+get added. They may even be specified dynamically. It is not a good idea
+for the HTMLPurifier_Config object to be used directly for such matters.
+Instead, the userland code should explicitly perform the includes. We may
+put in something like:
+
+    REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks
+
+To indicate that if that class doesn't exist, and the user is attempting
+to use the directive, we should fatally error out. The stub includes the core files,
+and the user includes everything else. Any reflective things like new
+$class would be required to tie in with the configuration.
+
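The REQUIRES check might be sketched like this. Everything here is illustrative: the function names are invented, the schema is modeled as a plain array mapping directive names to required class names, and an uncaught exception stands in for "fatally error out".

```php
<?php
// Hypothetical check: for every directive the user has actually set, verify
// that the classes named in its REQUIRES entry have been included.
// $requires maps "Namespace.Directive" => list of class names.
function example_missing_requirements(array $requires, array $activeDirectives): array
{
    $missing = [];
    foreach ($activeDirectives as $directive) {
        foreach ($requires[$directive] ?? [] as $class) {
            // class_exists() also gives a registered autoloader a chance to load it.
            if (!class_exists($class)) {
                $missing[$directive][] = $class;
            }
        }
    }
    return $missing;
}

// Abort on the first unmet requirement; if nobody catches the exception,
// PHP terminates the script, which serves as the fatal error.
function example_enforce_requirements(array $requires, array $activeDirectives): void
{
    foreach (example_missing_requirements($requires, $activeDirectives) as $directive => $classes) {
        throw new RuntimeException(
            "Directive $directive requires missing class(es): " . implode(', ', $classes)
        );
    }
}
```

Directives the user never touches are not checked at all, so rarely used options cost nothing unless they are switched on.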
+It would work very well with rarely used configuration options, but it
+wouldn't be so good for "core" parts that can be disabled. In such cases
+the core include file would need to be modified, and the only way
+to properly do this is to use the configuration object. Once again, our
+ability to create cache keys saves the day: we can create arbitrary
+stub files for arbitrary configurations and include those. They could
+even be single-file affairs. The only thing we'd need to include,
+then, would be HTMLPurifier_Config! Then, the configuration object would
+load the library.
+
+An aside...
+-----------
+One questions, however, the wisdom of letting PHP files write other PHP
+files. It seems like a recipe for disaster, or at least lots of headaches
+in highly secured setups, where PHP does not have the ability to write
+to its root. In such cases, we could use sticky bits or tell the user
+to manually generate the file.
+
+The other troublesome bit is actually doing the calculations necessary.
+For certain cases, it's simple (such as URIScheme), but for AttrDef
+and HTMLModule the dependency trees are very complex in relation to
+%HTML.Allowed and friends. I think that this idea should be shelved
+and revisited at a later, less insane date.
+
+An interesting dilemma presents itself when a configuration form is offered
+to the user. Normally, the configuration object is not accessible without
+editing PHP code; this facility changes things. The sensible thing to do
+is stipulate that all classes required by the directives you allow must
+be included.
+
+Unit testing
+------------
+
+Setting up the parsing and translation into our existing format would not
+be difficult to do. It might represent a good time for us to rethink our
+tests for these facilities; as creative as they are, they are often hacky
+and require public visibility for things that ought to be protected.
+This is especially applicable for our DefinitionCache tests.
+
+Migration
+---------
+
+Because we are not *adding* anything essentially new, it should be trivial
+to write a script to take our existing data and dump it into the new format.
+Well, not trivial, but fairly easy to accomplish. Primary implementation
+difficulties would probably involve formatting the file nicely.
+
+Backwards-compatibility
+-----------------------
+
+I expect that the ConfigSchema methods should stick around for a little bit,
+but display E_USER_NOTICE warnings that they are deprecated. This will
+require documentation!
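A deprecated method that keeps working while raising E_USER_NOTICE could look roughly like the sketch below. The class Example_ConfigSchema and its define() signature are stand-ins invented for this illustration; they are not the real ConfigSchema API.

```php
<?php
// Hypothetical stand-in for a legacy schema class whose old entry point
// still works but warns callers that it is on its way out.
class Example_ConfigSchema
{
    /** @var array<string, mixed> defaults keyed by "Namespace.Name" */
    public static $defaults = [];

    public static function define(string $namespace, string $name, $default): void
    {
        // Notice-level error: visible in logs, but execution continues.
        trigger_error(
            'Example_ConfigSchema::define() is deprecated; declare the directive in a schema file instead',
            E_USER_NOTICE
        );
        // Keep the old behavior so existing callers do not break.
        self::$defaults["$namespace.$name"] = $default;
    }
}
```

Since notices are often hidden by error_reporting settings, the deprecation would still need to be spelled out in the documentation, as noted above.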