Release 1.5.0, merged in r688-867.

- LanguageFactory::instance() declared static
- HTMLModuleManagerTest pass by reference bug fixed, merge back into trunk scheduled

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/strict@869 48356398-32a2-884e-a903-53898d9a118a
2025-10-25 02:26:32 +02:00 · 2007-03-24 01:04:06 +00:00
parent cec7a1c087
commit dd2fd06591
130 changed files with 4324 additions and 1385 deletions
--- a/docs/dev-advanced-api.html
+++ b/docs/dev-advanced-api.html
@@ -0,0 +1,188 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+<meta name="description" content="Functional specification for HTML Purifier's advanced API for defining custom filtering behavior." />
+<link rel="stylesheet" type="text/css" href="style.css" />
+
+<title>Advanced API - HTML Purifier</title>
+
+</head><body>
+
+<h1>Advanced API</h1>
+
+<div id="filing">Filed under Development</div>
+<div id="index">Return to the <a href="index.html">index</a>.</div>
+<div id="home"><a href="http://hp.jpsband.org/">HTML Purifier</a> End-User Documentation</div>
+
+<p>It makes no sense to adopt a <q>one-size-fits-all</q> approach to
+filtersets: therefore, users must be able to define their own sets of
+<q>allowed</q> elements, as well as switch between doctypes of HTML.</p>
+
+<p>Our goals are to let the user:</p>
+
+<dl>
+    <dt>Select</dt>
+    <dd><ul>
+        <li>Doctype</li>
+        <li>Filtersets: Rich / Plain / Full ...</li>
+        <li>Mode: Lenient / Correctional</li>
+        <li>Collections (?): Safe / Unsafe</li>
+        <li>Modules / Tags / Attributes</li>
+    </ul></dd>
+    <dt>Customize</dt>
+    <dd><ul>
+        <li>Tags / Attributes / Attribute Types</li>
+        <li>Filtersets</li>
+        <li>Root Node</li>
+    </ul></dd>
+    <dt>Create</dt>
+    <dd><ul>
+        <li>Modules / Tags / Attributes / Attribute Types</li>
+        <li>Filtersets</li>
+        <li>Doctype</li>
+    </ul></dd>
+</dl>
+
+<h2>Select</h2>
+
+<h3>Selecting a Doctype</h3>
+
+<p>By default, users will use a doctype-based, permissive but secure
+whitelist.  They must define a <strong>doctype</strong>, and this serves
+as the first method of determining a filterset.</p>
+
+<p class="technical">This identifier is based
+on the name the W3C has given to the document type and <em>not</em>
+the DTD identifier.</p>
+
+<p>This parameter is set via the configuration object:</p>
+
+<pre>$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');</pre>
+
+<h3>Selecting a Filterset</h3>
+
+<p>However, selecting this doctype doesn't mean much, because if we
+adhered exactly to the definition we would be letting XSS and other
+nasties through. HTML Purifier must instead allow only a subset
+of the doctype, which we shall call a <strong>filterset</strong>.</p>
+
+<p>By default, HTML Purifier will use the <strong>Rich</strong>
+filterset, which allows as many elements as possible with untrusted
+sources. Other possible filtersets could be:</p>
+
+<dl>
+    <dt>Full</dt>
+    <dd>Allows the full span of elements in the doctype, good if you want
+        HTML Purifier to work as a Tidy substitute but not to strip
+        anything out.</dd>
+    <dt>Plain</dt>
+    <dd>Provides a minimum set of tags for semantic markup of things
+        like blog comments.</dd>
+</dl>
+
+<p>Extension-authors would be able to define custom filtersets for
+other users to use.</p>
+
+<p>A possible call to select a filterset would be:</p>
+
+<pre>$config->set('HTML', 'Filterset', 'Rich');</pre>
+
+<h3>Selecting Mode</h3>
+
+<p>Within filtersets, there are various <strong>modes</strong> of operation.
+These indicate variant behaviors that, while not strictly changing the
+allowed set of elements and attributes, will definitely affect the output.
+Currently, we have two modes, which may be used together:</p>
+
+<dl>
+    <dt>Lenient</dt>
+    <dd>Deprecated elements and attributes will be transformed into
+        standards-compliant alternatives when explicitly disallowed. For
+        example, in the XHTML 1.0 Strict doctype, a <code>center</code>
+        tag would be turned into a <code>div</code> with the CSS property
+        <code>text-align:center;</code>, but in XHTML 1.0 Transitional
+        the tag would be preserved. This mode is on by default.</dd>
+    <dt>Correctional</dt>
+    <dd>Deprecated elements and attributes will be transformed into
+        standards-compliant alternatives whenever possible. Referring
+        back to the previous example, the <code>center</code> tag would
+        be transformed in both cases. However, tags without a
+        reasonable standards-compliant alternative will be preserved
+        in their original form. This mode is on by default. It may have
+        various levels of operation.</dd>
+</dl>
+
+<p>A possible call to select modes would be:</p>
+
+<pre>$config->set('HTML', 'Mode', array('correctional', 'lenient'));</pre>
+
+<p>If modes have extra parameters, a hash might work well:</p>
+
+<pre>$config->set('HTML', 'Mode', array(
+    'correctional' => 9, // strongest level
+    'lenient' => true // this one's just boolean
+));</pre>
+
+<p>Modes may possibly be wrapped up with the filterset declaration:</p>
+
+<pre>$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');</pre>
+
+<p>Further investigation in this field is necessary.</p>
+
+<h3>Selecting Modules / Tags / Attributes</h3>
+
+<p>If this cookie cutter approach doesn't appeal to a user, they may
+decide to roll their own filterset by selecting modules, tags and
+attributes to allow.</p>
+
+<p class="technical">This would make use of the same facilities
+as a filterset author would use, except that it would go under an
+<q>anonymous</q> filterset that would be auto-selected if any of the
+relevant module/tag/attribute selection configuration directives were
+non-null.</p>
+
+<p>On the highest level, a user will usually be most interested in
+directly specifying which elements and attributes are desired. For
+example:</p>
+
+<pre>$config->set('HTML', 'AllowedElements', 'a,b,em,p,blockquote,code,i');</pre>
+
+<p>Attribute declarations could be merged into this declaration as such:</p>
+
+<pre>$config->set('HTML', 'Allowed', 'a[href,title],b,em,p[class],blockquote[cite],code,i');</pre>
+
+<p>...or be kept separate:</p>
+
+<pre>$config->set('HTML', 'AllowedAttributes', 'a.href,a.title,p.class,blockquote.cite');</pre>
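The merged and separate forms carry the same information, so one can be derived from the other. The sketch below is illustrative only: `parseAllowed()` is a hypothetical helper, not part of HTML Purifier, and the bracket syntax is still just a proposal in this document. It splits the merged declaration into an element list and an `element.attribute` list.

```php
<?php
// Hypothetical helper (not an HTML Purifier API): split a merged
// declaration such as "a[href,title],b" into elements and attributes.
function parseAllowed($list)
{
    $elements = array();
    $attributes = array();
    // Each match is an element name with an optional [attr,attr] suffix.
    preg_match_all('/([a-z0-9]+)(?:\[([^\]]*)\])?/i', $list, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $element = $match[1];
        $elements[$element] = true;
        // Group 2 is absent when the element has no bracketed attributes.
        if (isset($match[2]) && $match[2] !== '') {
            foreach (explode(',', $match[2]) as $attr) {
                $attributes[$element . '.' . trim($attr)] = true;
            }
        }
    }
    return array($elements, $attributes);
}

list($elements, $attributes) = parseAllowed(
    'a[href,title],b,em,p[class],blockquote[cite],code,i'
);
```

Keying both arrays by name makes later lookups cheap; whether the real implementation would store them this way is an open question.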
+
+<p class="technical">Considering that, internally speaking, as mandated by
+the XHTML 1.1 Modularization specification, we have organized our
+elements around modules, considerable gymnastics will be needed to
+get this sort of functionality working.</p>
+
+<p>A user may also specify a module to load a class of elements and attributes
+into their filterset:</p>
+
+<pre>$config->set('HTML', 'Allowed', 'Hypertext,Core');</pre>
+
+<p class="fixme">The granularity of these modules is too coarse for
+the average user (for example, the core module loads everything from
+the essential <code>p</code> tag to the not-so-safe <code>h1</code>
+tag). How do we make this still a viable solution?</p>
+
+<h3>Unified selector</h3>
+
+<p>Because selecting each and every one of these configuration options
+is a chore, we may wish to offer a specialized configuration method
+for selecting a filterset. Possibility:</p>
+
+<pre>function selectFilter($doctype, $filterset, $mode)</pre>
+
+<p>...which is simply a light wrapper over the individual configuration
+calls. A custom config file format or text format could also be adopted.</p>
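A minimal sketch of such a wrapper, under stated assumptions: `SketchConfig` is a hypothetical stand-in that only records `set()` calls so the example runs on its own, the extra `$config` parameter is not in the signature proposed above, and the `Filterset` and `Mode` directives are the proposals from this document rather than existing configuration.

```php
<?php
// Hypothetical stand-in for the configuration object; it only records
// the values passed to set() so the sketch is self-contained.
class SketchConfig
{
    public $values = array();

    public function set($namespace, $directive, $value)
    {
        $this->values[$namespace . '.' . $directive] = $value;
    }
}

// Light wrapper over the three individual configuration calls.
// The $config parameter is an assumption added for this sketch.
function selectFilter($config, $doctype, $filterset, $mode)
{
    $config->set('HTML', 'Doctype', $doctype);
    $config->set('HTML', 'Filterset', $filterset);
    $config->set('HTML', 'Mode', $mode);
    return $config;
}

$config = selectFilter(
    new SketchConfig(),
    'XHTML 1.0 Strict',
    'Rich',
    array('correctional', 'lenient')
);
```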
+
+<div id="version">$Id$</div>
+
+</body></html>
--- a/docs/enduser-overview.txt
+++ b/docs/enduser-overview.txt
@@ -36,7 +36,7 @@ forgiving lexer.  You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
 with malformed input.

-In summary:
+In summary (see corresponding classes for more details):

 1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
--- a/docs/enduser-security.txt
+++ b/docs/enduser-security.txt
@@ -6,45 +6,17 @@ through negligence of people. This class will do its job: no more, no less,
 and it's up to you to provide it the proper information and proper context
 to be effective. Things to remember:

-1. Character Encoding: UTF-8.
-    This segment will soon be obsoleted by enduser-utf8.html
-Currently, the parser runs under the assumption that it is dealing
-with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
-character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, make sure you configure HTML Purifier or switch
-to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
-the parser will mangle it badly (though it won't be a security risk if you're
-outputting it as UTF-8 though).  Character encoding is, in general, a knotty
-issue, but do yourself a favor and learn about it:
-<http://www.joelonsoftware.com/articles/Unicode.html>
+1. Character Encoding: see enduser-utf8.html for more info.

-2. Doctype: XHTML 1.0 Transitional
-This is what the parser is outputting. For the most
-part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
-that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
-has waaaay too many quirks for a little parser to handle.  We did not select
-strict in order to prevent ourselves from being too draconic on users, but
-this may be configurable in the future.  Do you want standards compliance?
-The doctype is a good place to start.
+2. Doctype: document pending feature completion
+Not strictly necessary, actually. More in-depth discussion once we figure
+out how to get strict loose mode working.

-3. IDs
-    This segment is obsoleted by enduser-id.html
-They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
-needs to be set: we may want to consider disallowing IDs by default to
-save lazy programmers.
+3. IDs: see enduser-id.html for more info

-4. [PROJECTED] Links
-We're not going to try for spam protection (although
-some hooks for such a module might be nice) but we may offer the ability to
-only accept relative URLs. Pick the one that's right for you.
+4. Links: document pending feature completion
+Rudimentary blacklisting, we should also allow only relative URIs. We
+need a doc to explain the stuff.

-5. CSS
-While we can prevent the most flagrant cases from affecting your
-layout (such as absolutely positioned elements), no amount of code is going
-to protect your pages from being attacked by garish colors and plain old
-bad taste.  A neat feature would be the ability to define acceptable colors
-in a document, but that's not likely to be implemented for a while.  In the
-meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) can't mess up your layout. Once again, we may want to
-disable this by default to protect lazy developers.
+5. CSS: document pending
+Explain which CSS styles we blocked and why.
--- a/docs/enduser-utf8.html
+++ b/docs/enduser-utf8.html
@@ -10,7 +10,7 @@
    .minor td {font-style:italic;}
 </style>

-<title>UTF-8 - HTML Purifier</title>
+<title>UTF-8: The Secret of Character Encoding - HTML Purifier</title>

 <!-- Note to users: this document, though professing to be UTF-8, attempts
 to use only ASCII characters, because most webservers are configured
@@ -19,21 +19,27 @@ own advice for sake of portability.  -->

 </head><body>

-<h1>UTF-8</h1>
+<h1>UTF-8: The Secret of Character Encoding</h1>

 <div id="filing">Filed under End-User</div>
 <div id="index">Return to the <a href="index.html">index</a>.</div>
 <div id="home"><a href="http://hp.jpsband.org/">HTML Purifier</a> End-User Documentation</div>

-<p>Character encoding and character sets, in truth, are not that
-difficult to understand. But if you don't understand them, you are going
-to be caught by surprise by some of HTML Purifier's behavior, namely
-the fact that it operates UTF-8 or the limitations of the character
-encoding transformations it does. This document will walk you through
+<p>Character encoding and character sets are not that
+difficult to understand, but so many people blithely stumble
+through the worlds of programming without knowing what to actually
+do about it, or say &quot;Ah, it's a job for those <em>internationalization</em>
+experts.&quot; No, it is not! This document will walk you through
 determining the encoding of your system and how you should handle
 this information. It will stay away from excessive discussion on
-the internals of character encoding, but offer the information in
-asides that can easily be skipped.</p>
+the internals of character encoding.</p>
+
+<p>This document need not be read in its entirety. It introduces
+concepts that build on each other, so you will have learned something
+new well before you reach the bottom. However, I strongly
+recommend you read all the way to <strong>Why UTF-8?</strong>, because by
+that point any decision not to migrate will at least be a conscious one.
+Migration can be a rewarding (but difficult) task.</p>

 <blockquote class="aside">
 <div class="label">Asides</div>
@@ -43,6 +49,50 @@ asides that can easily be skipped.</p>
    with a greater understanding of the underlying issues.</p>
 </blockquote>

+<h2>Table of Contents</h2>
+
+<ol id="toc">
+    <li><a href="#findcharset">Finding the real encoding</a></li>
+    <li><a href="#findmetacharset">Finding the embedded encoding</a></li>
+    <li><a href="#fixcharset">Fixing the encoding</a><ol>
+        <li><a href="#fixcharset-none">No embedded encoding</a></li>
+        <li><a href="#fixcharset-diff">Embedded encoding disagrees</a></li>
+        <li><a href="#fixcharset-server">Changing the server encoding</a><ol>
+            <li><a href="#fixcharset-server-php">PHP header() function</a></li>
+            <li><a href="#fixcharset-server-phpini">PHP ini directive</a></li>
+            <li><a href="#fixcharset-server-nophp">Non-PHP</a></li>
+            <li><a href="#fixcharset-server-htaccess">.htaccess</a></li>
+            <li><a href="#fixcharset-server-ext">File extensions</a></li>
+        </ol></li>
+        <li><a href="#fixcharset-xml">XML</a></li>
+        <li><a href="#fixcharset-internals">Inside the process</a></li>
+    </ol></li>
+    <li><a href="#whyutf8">Why UTF-8?</a><ol>
+        <li><a href="#whyutf8-i18n">Internationalization</a></li>
+        <li><a href="#whyutf8-user">User-friendly</a></li>
+        <li><a href="#whyutf8-forms">Forms</a><ol>
+            <li><a href="#whyutf8-forms-urlencoded">application/x-www-form-urlencoded</a></li>
+            <li><a href="#whyutf8-forms-multipart">multipart/form-data</a></li>
+        </ol></li>
+        <li><a href="#whyutf8-support">Well supported</a></li>
+        <li><a href="#whyutf8-htmlpurifier">HTML Purifier</a></li>
+    </ol></li>
+    <li><a href="#migrate">Migrate to UTF-8</a><ol>
+        <li><a href="#migrate-db">Configuring your database</a><ol>
+            <li><a href="#migrate-db-legit">Legit method</a></li>
+            <li><a href="#migrate-db-binary">Binary</a></li>
+        </ol></li>
+        <li><a href="#migrate-editor">Text editor</a></li>
+        <li><a href="#migrate-bom">Byte Order Mark (headers already sent!)</a></li>
+        <li><a href="#migrate-fonts">Fonts</a><ol>
+            <li><a href="#migrate-fonts-obscure">Obscure scripts</a></li>
+            <li><a href="#migrate-fonts-occasional">Occasional use</a></li>
+        </ol></li>
+        <li><a href="#migrate-variablewidth">Dealing with variable width in functions</a></li>
+    </ol></li>
+    <li><a href="#externallinks">Further Reading</a></li>
+</ol>
+
 <h2 id="findcharset">Finding the real encoding</h2>

 <p>In the beginning, there was ASCII, and things were simple. But they
@@ -275,7 +325,7 @@ your own php.ini file, ask your support for details. Use:</p>

 <h4 id="fixcharset-server-nophp">Non-PHP</h4>

-<p>You may, for whatever reason, may need to set the character encoding
+<p>You may, for whatever reason, need to set the character encoding
 on non-PHP files, usually plain ol' HTML files. Doing this
 is more of a hit-or-miss process: depending on the software being
 used as a webserver and the configuration of that software, certain
@@ -386,8 +436,8 @@ processing instructions. They look like:</p>

 <p>For XHTML, this processing instruction theoretically
 overrides the <code>META</code> tag. In reality, this happens only when the
-XHTML is actually served as legit XML and not HTML, which is almost
-always never due to Internet Explorer's lack of support for 
+XHTML is actually served as legit XML and not HTML, which almost never
+happens, due to Internet Explorer's lack of support for 
 <code>application/xhtml+xml</code> (even though doing so is often
 argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good practice</a>).</p>

@@ -398,10 +448,10 @@ for XML files is UTF-8, which often butts heads with more common
 ISO-8859-1 encoding (you see this in garbled RSS feeds).</p>

 <p>In short, if you use XHTML and have gone through the
-trouble of adding the XML header, be sure to make sure it jives
+trouble of adding the XML header, make sure it jibes
 with your <code>META</code> tags and HTTP headers.</p>

-<h3>Inside the process</h3>
+<h3 id="fixcharset-internals">Inside the process</h3>

 <p>This section is not required reading,
 but may answer some of your questions on what's going on in all
@@ -572,7 +622,7 @@ Each method has deficiencies, especially the former.</p>
 the page, you still have the trouble of what to do with characters
 that are outside of the character encoding's range. The behavior, once
 again, varies: Firefox 2.0 entity-izes them while Internet Explorer
-7.0 mangles them beyond intelligibility. For serious I18N purposes,
+7.0 mangles them beyond intelligibility. For serious internationalization purposes,
 this is not an option.</p>

 <p>The other possibility is to set Accept-Encoding to UTF-8, which
@@ -604,22 +654,374 @@ hounding you about broken pages.</p>

 <h3 id="whyutf8-htmlpurifier">HTML Purifier</h3>

-<p>And finally, we get to HTML Purifier.</p>
+<p>And finally, we get to HTML Purifier.  HTML Purifier is built to
+deal with UTF-8: any indications otherwise are the result of an
+encoder that converts text from your preferred encoding to UTF-8, and
+back again.  HTML Purifier never touches anything else, and leaves
+it up to the module iconv to do the dirty work.</p>
+
+<p>This approach, however, is not perfect. iconv is blithely unaware
+of HTML character entities. HTML Purifier, in order to
+protect against sophisticated escaping schemes, normalizes all character
+and numeric entities before processing the text. This leads to
+one important ramification:</p>
+
+<p><strong>Any character that is not supported by the target character
+set, regardless of whether or not it is in the form of a character
+entity or a raw character, will be silently ignored.</strong></p>
+
+<p>Here is an example of this principle at work: say you have <code>&amp;theta;</code>
+in your HTML, but the output is in Latin-1 (which, understandably,
+does not understand Greek). The following process will occur (assuming you've
+set the encoding correctly using %Core.Encoding):</p>
+
+<ul>
+    <li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8
+        (note that theta is preserved since it doesn't actually use
+        any non-ASCII characters): <code>&amp;theta;</code></li>
+    <li>The <code>EntityParser</code> will transform all named and numeric
+        character entities to their corresponding raw UTF-8 equivalents:
+        <code>&theta;</code></li>
+    <li>HTML Purifier processes the code: <code>&theta;</code></li>
+    <li>The <code>Encoder</code> now transforms the text back from UTF-8
+        to ISO 8859-1. Since Greek is not supported by ISO 8859-1, it
+        will be either ignored or replaced with a question mark:
+        <code>?</code></li>
+</ul>
+
+<p>This behaviour is quite unsatisfactory. It is a deal-breaker for
+international applications, and it can be mildly annoying for the provincial
+soul who occasionally needs a special character. Since 1.4.0, HTML
+Purifier has provided a slightly more palatable workaround using
+%Core.EscapeNonASCIICharacters. The process now looks like:</p>
+
+<ul>
+    <li>The <code>Encoder</code> transforms encoding to UTF-8: <code>&amp;theta;</code></li>
+    <li>The <code>EntityParser</code> transforms entities: <code>&theta;</code></li>
+    <li>HTML Purifier processes the code: <code>&theta;</code></li>
+    <li>The <code>Encoder</code> replaces all non-ASCII characters
+        with numeric entities: <code>&amp;#952;</code></li>
+    <li>For good measure, <code>Encoder</code> transforms encoding back to
+        original (which is strictly unnecessary for 99% of encodings
+        out there): <code>&amp;#952;</code> (remember, it's all ASCII!)</li>
+</ul>
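The escaping step in the list above can be sketched as follows. This is a minimal illustration, not HTML Purifier's actual Encoder code: `escapeNonAscii()` and `utf8CodePoint()` are hypothetical names, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character to its code point.
function utf8CodePoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0]; // plain ASCII
    }
    // The leading byte keeps (7 - $count) payload bits; each
    // continuation byte contributes its low 6 bits.
    $code = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }
    return $code;
}

// Hypothetical helper: replace every non-ASCII character with a
// numeric entity, leaving ASCII untouched. Assumes valid UTF-8 input.
function escapeNonAscii($text)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($match) {
            return '&#' . utf8CodePoint($match[0]) . ';';
        },
        $text
    );
}

$escaped = escapeNonAscii("\xCE\xB8 = 30"); // "\xCE\xB8" is theta in UTF-8
```

Because the result is pure ASCII, converting it back to almost any legacy encoding afterwards cannot lose characters.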
+
+<p>...which means that this is only good for an occasional foray into
+the land of Unicode characters, and is totally unacceptable for Chinese
+or Japanese texts. The even bigger kicker is that, supposing the
+input encoding was actually ISO-8859-7, which <em>does</em> support
+theta, the character would get entity-ized anyway! (The Encoder does
+not discriminate).</p>
+
+<p>The current functionality is about where HTML Purifier will be for
+the rest of eternity. HTML Purifier could attempt to preserve the original
+form of the entities so that they could be substituted back in, only the
+DOM extension kills them off irreversibly. HTML Purifier could also attempt
+to be smart and only convert non-ASCII characters that weren't supported
+by the target encoding, but that would require reimplementing iconv
+with HTML awareness, something I will not do.</p>
+
+<p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm
+not being sarcastic here: some people couldn't care less about other languages)</p>

 <h2 id="migrate">Migrate to UTF-8</h2>

-<h3 id="migrate-editor">Text editor</h3>
+<p>So, you've decided to bite the bullet, and want to migrate to UTF-8.
+Note that this is not for the faint-hearted, and you should expect
+the process to take longer than you think it will take.</p>
+
+<p>The general idea is that you convert all existing text to UTF-8,
+and then you set all the headers and META tags we discussed earlier
+to UTF-8. There are many ways of going about this: you could
+write a conversion script that runs through the database and re-encodes
+everything as UTF-8 or you could do the conversion on the fly when someone
+reads the page. The details depend on your system, but I will cover
+some of the more subtle points of migration that may trip you up.</p>

 <h3 id="migrate-db">Configuring your database</h3>

-<h3 id="migrate-convert">Convert old text</h3>
+<p>Most modern databases, the most prominent open-source ones being MySQL
+4.1+ and PostgreSQL, support character encodings. If you're switching
+to UTF-8, logically speaking, you'd want to make sure your database
+knows about the change too. There are some caveats though:</p>
+
+<h4 id="migrate-db-legit">Legit method</h4>
+
+<p>Standardization in terms of SQL syntax for specifying character
+encodings is notoriously spotty. Refer to your respective database's
+documentation on how to do this properly.</p>
+
+<p>For <a href="http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
+character encoding conversion for you. However, you have
+to make sure that the text inside the column is what it says it is:
+if you had put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle
+the text when you try to convert it to UTF-8. You'll have to convert
+it to a binary field, convert it to a Shift-JIS field (the real encoding),
+and then finally to UTF-8. Many a website has had pages irreversibly mangled
+because its owners didn't realize that they'd been deluding themselves about
+the character encoding all along. Don't become the next victim.</p>
+
+<p>For <a href="http://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
+encoding of a database (as of 8.2). You will have to dump the data, and then reimport
+it into a new table. Make sure that your client encoding is set properly:
+this is how PostgreSQL knows to perform an encoding conversion.</p>
+
+<p>Many times, you will also be asked about the &quot;collation&quot; of
+the new column. Collation is how a DBMS sorts text, like ordering
+B, C and A into A, B and C (the problem gets surprisingly complicated
+when you get to languages like Thai and Japanese). If in doubt,
+going with the default setting is usually a safe bet.</p>
+
+<p>Once the conversion is all said and done, you still have to remember
+to set the client encoding (your encoding) properly on each database
+connection using <code>SET NAMES</code> (which is standard SQL and is
+usually supported).</p>
+
+<h4 id="migrate-db-binary">Binary</h4>
+
+<p>Due to the abovementioned compatibility issues, a more interoperable
+way of storing UTF-8 text is to stuff it in a binary datatype.
+<code>CHAR</code> becomes <code>BINARY</code>, <code>VARCHAR</code> becomes
+<code>VARBINARY</code> and <code>TEXT</code> becomes <code>BLOB</code>.
+Doing so can save you some huge headaches:</p>
+
+<ul>
+    <li>The syntax for binary data types is very portable,</li>
+    <li>MySQL 4.0 has <em>no</em> support for character encodings, so
+        if you want to support it you <em>have</em> to use binary,</li>
+    <li>MySQL, as of 5.1, has no support for four byte UTF-8 characters,
+        which represent characters beyond the basic multilingual
+        plane, and</li>
+    <li>You will never have to worry about your DBMS being too smart
+        and attempting to convert your text when you don't want it to.</li>
+</ul>
+
+<p>MediaWiki, a very prominent international application, uses binary fields
+for storing their data because of point three.</p>
+
+<p>There are drawbacks, of course:</p>
+
+<ul>
+    <li>Database tools like phpMyAdmin won't be able to offer you inline
+        text editing, since it is declared as binary,</li>
+    <li>It's not semantically correct: it's really text not binary
+        (lying to the database),</li>
+    <li>Unless you use the not-very-portable wizardry mentioned above,
+        you have to change the encoding yourself (usually, you'd do
+        it on the fly), and</li>
+    <li>You will not have collation.</li>
+</ul>
+
+<p>Choose based on your circumstances.</p>
+
+<h3 id="migrate-editor">Text editor</h3>
+
+<p>For more flat-file oriented systems, you will often be tasked with
+converting reams of existing text and HTML files into UTF-8, as well as
+making sure that all new files uploaded are properly encoded. Once again,
+I can only point vaguely in the right direction for converting your
+existing files: make sure you back up, make sure you use
+<a href="http://php.net/ref.iconv">iconv</a>(), and
+make sure you know what the original character encoding of the files
+is (or are, depending on the tidiness of your system).</p>
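As a rough sketch of the conversion step, the snippet below re-encodes a single string with `iconv()`. The wrapper name `convertToUtf8()` is hypothetical, and the source encoding is an assumption you have to supply yourself: `iconv()` cannot detect it, and a wrong guess produces mangled text without any error.

```php
<?php
// Hypothetical wrapper around iconv(): convert text from a known
// source encoding to UTF-8, failing loudly instead of returning false.
function convertToUtf8($text, $sourceEncoding)
{
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding");
    }
    return $converted;
}

$latin1 = "caf\xE9"; // "cafe" with e-acute, encoded as ISO-8859-1
$utf8 = convertToUtf8($latin1, 'ISO-8859-1');
```

For a batch job the same function would be applied to the contents of each file, writing the results to a copy rather than over the originals.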
+
+<p>However, I can proffer more specific advice on the subject of
+text editors. Many text editors have notoriously spotty Unicode support.
+To find out how your editor is doing, you can check out <a
+href="http://www.alanwood.net/unicode/utilities_editors.html">this list</a>
+or <a href="http://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a>
+I personally use Notepad++, which works like a charm when it comes to UTF-8.
+Usually, you will have to <strong>explicitly</strong> tell the editor through some dialogue
+(usually Save as or Format) what encoding you want it to use. An editor
+will often offer &quot;Unicode&quot; as a method of saving, which is
+ambiguous. Make sure you know whether or not they really mean UTF-8
+or UTF-16 (which is another flavor of Unicode).</p>
+
+<p>The two things to look out for are whether or not the editor
+supports <strong>font mixing</strong> (multiple
+fonts in one document) and whether or not it adds a <strong>BOM</strong>.
+Font mixing is important because fonts rarely have support for every
+language known to mankind: in order to be flexible, an editor must
+be able to take a little from here and a little from there, otherwise
+all your Chinese characters will come out as nice boxes. We'll discuss
+BOM below.</p>

 <h3 id="migrate-bom">Byte Order Mark (headers already sent!)</h3>

+<p>The BOM, or <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
+Order Mark</a>, is a magical, invisible character placed at
+the beginning of UTF-8 files to tell people what the encoding is and
+what the endianness of the text is. It is also unnecessary.</p>
+
+<p>Because it's invisible, it often
+catches people by surprise when it starts doing things it shouldn't
+be doing. For example, this PHP file:</p>
+
+<pre><strong>BOM</strong>&lt;?php
+header('Location: index.php');
+?&gt;</pre>
+
+<p>...will fail with the all too familiar <strong>Headers already sent</strong>
+PHP error. And because the BOM is invisible, this culprit will go unnoticed.
+My suggestion is to only use ASCII in PHP pages, but if you must, make
+sure the page is saved WITHOUT the BOM.</p>
+
+<blockquote class="aside">
+    <p>The headers the error is referring to are <strong>HTTP headers</strong>,
+       which are sent to the browser before any HTML to tell it various
+       information. The moment any regular text (and yes, a BOM counts as
+       ordinary text) is output, the headers must be sent, and you are
+       not allowed to send any more. Thus, the error.</p>
+</blockquote>
+
+<p>If you are reading in text files to insert into the middle of another
+page, it is strongly advised (but not strictly necessary) that you strip the UTF-8 byte 
+sequence for the BOM, <code>&quot;\xEF\xBB\xBF&quot;</code>, from the text before inserting it,
+via:</p>
+
+<pre>$text = str_replace(&quot;\xEF\xBB\xBF&quot;, '', $text);</pre>
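Note that `str_replace()` removes the byte sequence wherever it occurs in the text. If you would rather touch only a BOM at the very start of the text, a variant along these lines may do; `stripLeadingBom()` is a hypothetical helper name used for illustration.

```php
<?php
// Hypothetical helper: remove a UTF-8 BOM only when it is the first
// three bytes of the text, leaving any later occurrence alone.
function stripLeadingBom($text)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        // The cast guards against older PHP versions returning false
        // when the text consists of nothing but the BOM.
        return (string) substr($text, 3);
    }
    return $text;
}

$clean = stripLeadingBom("\xEF\xBB\xBF<p>Hello</p>");
```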
+
+<h3 id="migrate-fonts">Fonts</h3>
+
+<p>Generally speaking, people who are having trouble with fonts fall
+into two categories:</p>
+
+<ul>
+<li>Those who want to
+use an extremely obscure language for which there is very little
+support even among native speakers of the language, and</li>
+<li>Those where the primary language of the text is
+well-supported but there are occasional characters
+that aren't supported.</li>
+</ul>
+
+<p>Yes, there's always a chance where an English user happens across
+a Sinhalese website and doesn't have the right font. But an English user
+who happens not to have the right fonts probably has no business reading Sinhalese
+anyway. So we'll deal with the other two edge cases.</p>
+
+<h4 id="migrate-fonts-obscure">Obscure scripts</h4>
+
+<p>If you run a Bengali website, you may get comments from users who
+would like to read your website but get heaps of question marks or
+other meaningless characters. Fixing this problem requires the
+installation of a font or language pack which is often highly
+dependent on what the language is. <a href="http://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_help">Here is an example</a>
+of such a help file for the Bengali language; I am sure there are
+others out there too. You just have to point users to the appropriate
+help file.</p>
+
+<h4 id="migrate-fonts-occasional">Occasional use</h4>
+
+<p>A prime example of when you'll see some very obscure Unicode
+characters embedded in what otherwise would be very bland ASCII are
+letters of the
+<a href="http://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
+Phonetic Alphabet (IPA)</a>, used to designate pronunciations in a very standard
+manner (you probably see them all the time in your dictionary). Your
+average font probably won't have support for all of the IPA characters
+like &#664; (bilabial click) or &#658; (voiced postalveolar fricative).
+So what's a poor browser to do? Font mix! Smart browsers like Mozilla Firefox
+and Internet Explorer 7 will borrow glyphs from other fonts in order
+to make sure that all the characters display properly.</p>
+
+<p>But what happens when the browser isn't smart and happens to be the
+most widely used browser in the entire world? Microsoft IE 6
+is not smart enough to borrow from other fonts when a character isn't
+present, so more often than not you'll be slapped with a nice big &#65533;.
+To get things to work, MSIE 6 needs a little nudge. You could configure it
+to use a different font to render the text, but you can achieve the same
+effect by selectively changing the font for blocks of special characters
+to known good Unicode fonts.</p>
+
+<p>Fortunately, the folks over at Wikipedia have already done all the
+heavy lifting for you. Get the CSS from the horse's mouth here:
+<a href="http://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
+and search for &quot;.IPA&quot;. There is also a smattering of
+other classes you can use for other purposes; check out
+<a href="http://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
+for more details. For you lazy ones, this should work:</p>
+
+<pre>.Unicode {
+        font-family: Code2000, &quot;TITUS Cyberbit Basic&quot;, &quot;Doulos SIL&quot;,
+            &quot;Chrysanthi Unicode&quot;, &quot;Bitstream Cyberbit&quot;,
+            &quot;Bitstream CyberBase&quot;, Thryomanes, Gentium, GentiumAlt,
+            &quot;Lucida Grande&quot;, &quot;Arial Unicode MS&quot;, &quot;Microsoft Sans Serif&quot;,
+            &quot;Lucida Sans Unicode&quot;;
+        font-family /**/:inherit; /* resets fonts for everyone but IE6 */
+}</pre>
+
+<p>The standard usage goes along the lines of <code>&lt;span class=&quot;Unicode&quot;&gt;Crazy
+Unicode stuff here&lt;/span&gt;</code>. Characters in the
+<a href="http://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
+usually don't need to be fixed, but for anything else you probably
+want to play it safe. Unless, of course, you don't care about IE6
+users.</p>
+
 <h3 id="migrate-variablewidth">Dealing with variable width in functions</h3>

+<p>When people claim that PHP6 will solve all our Unicode problems, they're
+misinformed. It will not fix any of the abovementioned troubles. It will,
+however, fix the problem we are about to discuss: processing UTF-8 text
+in PHP.</p>
+
+<p>PHP (as of PHP5) is blithely unaware of the existence of UTF-8 (with a few
+notable exceptions). Sometimes this will cause problems; other times
+it won't. So far, we've avoided discussing the architecture of
+UTF-8, so we must first ask: what is UTF-8? Yes, it supports Unicode,
+and yes, it is variable width. Other traits:</p>
+
+<ul>
+    <li>Every character's byte sequence is unique and will never be found
+        inside the byte sequence of another character,</li>
+    <li>UTF-8 may use up to four bytes to encode a character,</li>
+    <li>UTF-8 text must be checked for well-formedness,</li>
+    <li>Pure ASCII is also valid UTF-8, and</li>
+    <li>Binary sorting will sort UTF-8 in the same order as Unicode.</li>
+</ul>
+
+<p>Each of these traits affects different domains of text processing
+in different ways. It is beyond the scope of this document to explain
+what precisely these implications are. PHPWact provides
+a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
+on what to expect from each function, although coverage is spotty in
+some areas. Their more general notes on
+<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
+are also worth looking at for information on UTF-8. Some rules of thumb
+when dealing with Unicode text:</p>
+
+<ul>
+    <li>Do not EVER use functions that:<ul>
+        <li>...convert case (strtolower, strtoupper, ucfirst, ucwords)</li>
+        <li>...claim to be case-insensitive (str_ireplace, stristr, strcasecmp)</li>
+    </ul></li>
+    <li>Think twice before using functions that:<ul>
+        <li>...count characters (strlen will return bytes, not characters;
+            str_split and wordwrap may corrupt)</li>
+        <li>...entity-ize things (UTF-8 doesn't need entities)</li>
+        <li>...do very complex string processing (*printf)</li>
+    </ul></li>
+</ul>
+
+<p>...and always think in bytes, not characters. If you use strpos()
+to find the position of a character, it will be in bytes, but this
+usually won't matter since substr() also operates with byte indices!</p>
+
+<p>You'll also need to make sure your UTF-8 is well-formed and will
+probably need replacements for some of these functions. I recommend
+using Harry Fuecks' <a href="http://phputf8.sourceforge.net/">PHP
+UTF-8</a> library, rather than using mbstring directly. HTML Purifier
+also defines a few useful UTF-8 compatible functions: check out
+<code>Encoder.php</code> in the <code>/library/HTMLPurifier/</code>
+directory.</p>
+
 <h2 id="externallinks">Further Reading</h2>

+<p>Well, that's it. Hopefully this document has served as a very
+practical springboard into knowledge of how UTF-8 works.  You may have
+decided that you don't want to migrate yet: that's fine, just know
+what will happen to your output and what bug reports you may receive.</p>
+
 <p>Many other developers have already discussed the subject of Unicode,
 UTF-8 and internationalization, and I would like to defer to them for
 a more in-depth look into character sets and encodings.</p>
--- a/docs/fixquotes.htc
+++ b/docs/fixquotes.htc
@@ -0,0 +1,6 @@
+<public:attach event="oncontentready" onevent="init();" />
+<script>
+function init() {
+  element.innerHTML = '&#8220;'+element.innerHTML+'&#8221;';
+}
+</script>
--- a/docs/index.html
+++ b/docs/index.html
@@ -31,7 +31,7 @@ information for casual developers using HTML Purifier.</p>
 <dt><a href="enduser-slow.html">Speeding up HTML Purifier</a></dt>
 <dd>Explains how to speed up HTML Purifier through caching or inbound filtering.</dd>

-<dt><a href="enduser-utf8.html">UTF-8</a></dt>
+<dt><a href="enduser-utf8.html">UTF-8: The Secret of Character Encoding</a></dt>
 <dd>Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch.</dd>

 </dl>
@@ -54,6 +54,10 @@ conventions.</p>
 <dt><a href="dev-optimization.html">Optimization</a></dt>
 <dd>Discusses possible methods of optimizing HTML Purifier.</dd>

+<dt><a href="dev-advanced-api.html">Advanced API</a></dt>
+<dd>Functional specification for HTML Purifier's advanced API for defining
+custom filtering behavior.</dd>
+
 </dl>

 <h2>Proposals</h2>
--- a/docs/proposal-config.txt
+++ b/docs/proposal-config.txt
@@ -7,7 +7,7 @@ value is used for.  This means decentralized configuration declarations that
 are nevertheless error checking and a centralized configuration object.

 Directives are divided into namespaces, indicating the major portion of
-functionality they cover (although there may be overlaps.  Please consult
+functionality they cover (although there may be overlaps).  Please consult
 the documentation in ConfigDef for more information on these namespaces.

 Since configuration is dependant on context, internal classes require a
@@ -36,4 +36,5 @@ the definition, you'd have to force reconstruction.

 In practice, the pulling directives from the config object are
 solely need-based, and the flex points are littered throughout the
-setup() function.  Some sort of refactoring is likely in order.
+setup() function.  Some sort of refactoring is likely in order. See
+ref-xhtml-1.1.txt for more info.
--- a/docs/proposal-language.txt
+++ b/docs/proposal-language.txt
@@ -1,42 +1,6 @@
 We are going to model our I18N/L10N off of MediaWiki's system.  Their's is
 obviously quite complicated, so we're going to simplify it a bit for our needs.

-== Structure ==
-
-First, you have a Language object.  This object contains all the localisable
-message strings, as well as other important language-specific settings and
-custom behavior (uppercasing, lowercasing, printing dates, formatting
-numbers, etc.)
-
-The object is constructed from two sources: subclassed versions of itself
-(classes) and Message files (messages).
-
-== General use ==
-
-You load a language object by calling the Language::factory() function. 
-This function the class file for the object (taking in account fallback 
-languages by using the fallback langauge's object but overloading the 
-language key) and returns that object. Nothing else happens.
-
-When a message/etc is requested, a lazy load initializor is called.  Now the
-real work starts.  We're first going to take the scenario that the language
-is not cached.  The system loads the Messages file by:
-
-    require( $filename );
-    $cache = compact( self::$mLocalisationKeys );	
-
-...where self::$mLocalisationKeys is the name of variables that could be used
-in the localization file. This lets you use things like:
-
-    $fallback = false;
-    $rtl = false;
-
-...and easily siphon them into arrays.
-
-Then, we load the $fallback language (if not set, English) to fill in the gaps in
-the messages.  There is specialized behavior for certain keys, as they can be
-mergeable maps, lists or alias lists (not sure what the last one is).
-
 == Caching ==

 MediaWiki has lots of caching mechanisms built in, which make the code somewhat
--- a/docs/ref-loose-vs-strict.txt
+++ b/docs/ref-loose-vs-strict.txt
@@ -32,6 +32,6 @@ A tag's attribute 'target' (for selecting frames) cut
    current behavior: no substitute, just delete when in strict, allow in loose
 Attribute 'name' deprecated in favor of 'id'
    current behavior: dropped silently
-    projected behavior: create proper AttrTransform (currently not allowed at all)
+    projected behavior: create proper AttrTransform
 [done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
    current behavior: disallow as usual
--- a/docs/ref-xhtml-1.1.txt
+++ b/docs/ref-xhtml-1.1.txt
@@ -1,21 +1,187 @@

-Getting XHTML 1.1 Working
-
-It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
+XHTML 1.1 and HTML Purifier

+Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
 1. Scratch lang entirely in favor of xml:lang
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

-...but that's only an informative section. More things to do:
+HTML Purifier uses the modularization of XHTML
+<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
+of HTMLDefinition in a more manageable and extensible fashion. Rather
+than have one super-object, HTMLDefinition is split into HTMLModules,
+each of which is responsible for defining elements, their attributes,
+and other properties (for more in-depth coverage, see
+/library/HTMLPurifier/HTMLModule.php's docblock comments).

-1. Scratch style attribute (it's deprecated)
-2. Be module-aware (this might entail intelligent grouping in the definition
-   and allowing users to specifically remove certain modules (see 5))
-3. Cross-reference minimal content models with existing DTDs and determine
-   changes (todo)
-4. Watch out for the Legacy Module
-<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#s_legacymodule>
-5. Let users specify their own custom modules
-6. Study Modularization document
-<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
+The modules that W3C defines and we support are:
+
+    * 5.1. Attribute Collections (technically not a module)
+    * 5.2. Core Modules
+          o 5.2.2. Text Module
+          o 5.2.3. Hypertext Module
+          o 5.2.4. List Module
+    * 5.4. Text Extension Modules
+          o 5.4.1. Presentation Module
+          o 5.4.2. Edit Module
+          o 5.4.3. Bi-directional Text Module
+    * 5.6. Table Modules
+          o 5.6.2. Tables Module
+    * 5.7. Image Module
+    * 5.18. Style Attribute Module
+
+Modules that we don't support but could support are:
+
+    * 5.6. Table Modules
+          o 5.6.1. Basic Tables Module [?]
+    * 5.8. Client-side Image Map Module [?]
+    * 5.9. Server-side Image Map Module [?]
+    * 5.12. Target Module [?]
+    * 5.21. Name Identification Module [deprecated]
+    * 5.22. Legacy Module [deprecated]
+
+These modules will not be implemented due to their dangerousness or
+inapplicability as an XHTML fragment:
+
+    * 5.2. Core Modules
+          o 5.2.1. Structure Module
+    * 5.3. Applet Module
+    * 5.5. Forms Modules
+          o 5.5.1. Basic Forms Module
+          o 5.5.2. Forms Module
+    * 5.10. Object Module
+    * 5.11. Frames Module
+    * 5.13. Iframe Module
+    * 5.14. Intrinsic Events Module
+    * 5.15. Metainformation Module
+    * 5.16. Scripting Module
+    * 5.17. Style Sheet Module
+    * 5.19. Link Module
+    * 5.20. Base Module
+
+We will not be using W3C's XML Schemas or DTDs directly due to the lack
+of robust tools for handling them (the main problem is that all the
+current parsers are usually PHP 5 only and solely-validating, not
+correcting).
+
+The abstraction of the HTMLDefinition creation process will also
+contribute to a need for a caching system. Cache invalidation would be
+difficult, but could be done by comparing the HTML and Attr config
+namespaces with a copy that was packaged along with the serialized
+HTMLDefinition object.
+
+== General Use-Case ==
+
+The outward API of HTMLDefinition has been largely preserved, not
+only for backwards-compatibility but also by design. In addition,
+HTMLDefinition can be retrieved "raw", in which case it loads a structure
+that closely resembles the modules of XHTML 1.1. This structure is very
+dynamic, making it easy to make cascading changes to global content
+sets or remove elements in bulk.
+
+However, once HTML Purifier needs the actual definition, it retrieves
+a finalized version of HTMLDefinition. The finalized definition involves
+processing the modules into a form that is optimized for multiple
+calls. This final version is immutable and, even if editable, would
+be extremely hard to change.
+
+So, some code taking advantage of the XHTML modularization may look
+like this:
+
+<?php
+    $config = HTMLPurifier_Config::createDefault();
+    $def =& $config->getHTMLDefinition(true); // reference to raw
+    unset($def->modules['Hypertext']); // rm ''a'' link
+    $purifier = new HTMLPurifier($config);
+    $purifier->purify($html); // now the definition is finalized
+?>
+
+== Inclusions ==
+
+One of the nice features of HTMLDefinition is that piggy-backing off
+of global attribute and content sets is extremely easy to do.
+
+=== Attributes ===
+
+HTMLModule->elements[$element]->attr stores attribute information for the
+specific attributes of $element. This is quite close to the final
+API that HTML Purifier interfaces with, but there's an important
+extra feature: attr may also contain an array with a member index zero.
+
+<?php
+    HTMLModule->elements[$element]->attr[0] = array('AttrSet');
+?>
+
+Rather than map the attribute key 0 to an array (which should be
+an AttrDef), it defines a number of attribute collections that should
+be merged into this element's attribute array.
+
+Furthermore, the value of an attribute key, attribute value pair need
+not be a fully fledged AttrDef object. It can also be a string, which
+signifies an AttrDef that is looked up from a centralized registry,
+AttrTypes. This allows more concise attribute definitions that look
+more like W3C's declarations, as well as offering a centralized point
+for modifying the behavior of one attribute type. And, of course, the
+old method of manually instantiating an AttrDef still works.
+
+=== Attribute Collections ===
+
+Attribute collections are stored and processed in the AttrCollections
+object, which is responsible for performing the inclusions signified
+by the 0 index. These attribute collections, too, are mutable, by
+using HTMLModule->attr_collections. You may add new attributes
+to a collection or define an entirely new collection for your module's
+use. Inclusions can also be cumulative.
+
+Attribute collections allow us to get rid of so-called "global attributes"
+(which actually aren't so global).
+
+=== Content Models and ChildDef ===
+
+An implementation of the above-mentioned attributes and attribute
+collections was applied to the ChildDef system. HTML Purifier uses
+a proprietary system called ChildDef for performance and flexibility
+reasons, but this does not line up very well with W3C's notion of
+regexps for defining the allowed children of an element.
+
+HTMLPurifier->elements[$element]->content_model and 
+HTMLPurifier->elements[$element]->content_model_type store information
+about the final ChildDef that will be stored in
+HTMLPurifier->elements[$element]->child (we use a different variable
+because the two forms are sufficiently different).
+
+$content_model is an abstract, string representation of the internal
+state of ChildDef, while $content_model_type is a string identifier
+of which ChildDef subclass to instantiate. $content_model is processed
+by substituting all content set identifiers (capitalized element names)
+with their contents. It is then parsed and passed into the appropriate
+ChildDef class, as defined by the ContentSets->getChildDef() or the
+custom fallback HTMLModule->getChildDef() for custom child definitions
+not in the core.
+
+You'll need to use these facilities if you plan on referencing a content
+set like "Inline" or "Block", and using them is recommended even if you're
+not due to their conciseness.
+
+A few notes on $content_model: its structure can be as complicated
+as you want, but the pipe symbol (|) is reserved for defining possible
+choices, due to the content sets implementation. For example, a content
+model that looks like:
+
+"Inline -> Block -> a"
+
+...when the Inline content set is defined as "span | b" and the Block
+content set is defined as "div | blockquote", will expand into:
+
+"span | b -> div | blockquote -> a"
+
+The custom HTMLModule->getChildDef() function will need to be able to
+then feed this information to ChildDef in a usable manner.
+
+=== Content Sets ===
+
+Content sets can be altered using HTMLModule->content_sets, an associative
+array of content set names to content set contents. If the content set
+already exists, your values are appended on to it (great for, say,
+registering the font tag as an inline element), otherwise it is
+created. They are substituted into content_model.
--- a/docs/style.css
+++ b/docs/style.css
@@ -42,3 +42,27 @@ blockquote .label {font-weight:bold; font-size:1em; margin:0 0 .1em;

 /* Contains, without exception, $Id$, for SVN version info. */
 #version {text-align:right; font-style:italic; margin:2em 0;}
+
+#toc ol ol {list-style-type:lower-roman;}
+#toc ol {list-style-type:decimal;}
+#toc {list-style-type:upper-alpha;}
+
+q {
+  behavior: url(fixquotes.htc); /* IE fix */
+  quotes: '\201C' '\201D' '\2018' '\2019';
+}
+q:before {
+  content: open-quote;
+}
+q:after {
+  content: close-quote;
+}
+
+/* Marks off implementation details interesting only to the person writing
+   the class described in the spec. */
+.technical {margin-left:2em; }
+.technical:before {content:"Technical note: "; font-weight:bold; color:#061; }
+
+/* Marks off sections that are lacking. */
+.fixme {margin-left:2em; }
+.fixme:before {content:"Fix me: "; font-weight:bold; color:#C00; }