1
0
mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-10-24 18:16:19 +02:00

Merged r608-621 for 1.3.2 release from trunk.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.3@622 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
Edward Z. Yang
2006-12-26 17:10:29 +00:00
parent 3b979ee846
commit 54f615f1d3
14 changed files with 405 additions and 36 deletions

179
docs/enduser-youtube.html Normal file
View File

@@ -0,0 +1,179 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Explains how to safely allow the embedding of flash from trusted sites in HTML Purifier." />
<link rel="stylesheet" type="text/css" href="./style.css" />
<title>Embedding YouTube Videos - HTML Purifier</title>
</head><body>
<h1 class="subtitled">Embedding YouTube Videos</h1>
<div class="subtitle">...as well as other dangerous active content</div>
<div id="filing">Filed under End-User</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<p>Clients like their YouTube videos. It gives them a warm fuzzy feeling when
they see a neat little embedded video player on their websites that can play
the latest clips from their documentary &quot;Fido and the Bones of Spring&quot;.
All joking aside, the ability to embed YouTube videos or other active
content in their pages is something that a lot of people like.</p>
<p>This is a <em>bad</em> idea. The moment you embed anything untrusted,
you will definitely be slammed by a manner of nasties that can be
embedded in things from your run of the mill Flash movie to
<a href="http://blog.spywareguide.com/2006/12/myspace_phish_attack_leads_use.html">Quicktime movies</a>.
Even <code>img</code> tags, which HTML Purifier allows by default, can be
dangerous. Be distrustful of anything that tells a browser to load content
from another website automatically.</p>
<p>Luckily for us, however, whitelisting saves the day. Sure, letting users
include any old random flash file could be dangerous, but if it's
from a specific website, it probably is okay. If no amount of pleading will
convince the people upstairs that they should just settle with just linking
to their movies, you may find this technique very useful.</p>
<h2>Sample</h2>
<p>Below is custom code that allows users to embed
YouTube videos. This is not favoritism: this trick can easily be adapted for
other forms of embeddable content.</p>
<p>Usually, websites like YouTube give us boilerplate code that you can insert
into your documents. YouTube's code goes like this:</p>
<pre>
&lt;object width=&quot;425&quot; height=&quot;350&quot;&gt;
&lt;param name=&quot;movie&quot; value=&quot;http://www.youtube.com/v/AyPzM5WK8ys&quot; /&gt;
&lt;param name=&quot;wmode&quot; value=&quot;transparent&quot; /&gt;
&lt;embed src=&quot;http://www.youtube.com/v/AyPzM5WK8ys&quot;
type=&quot;application/x-shockwave-flash&quot;
wmode=&quot;transparent&quot; width=&quot;425&quot; height=&quot;350&quot; /&gt;
&lt;/object&gt;
</pre>
<p>There are two things to note about this code:</p>
<ol>
<li><code>&lt;embed&gt;</code> is not recognized by W3C, so if you want
standards-compliant code, you'll have to get rid of it.</li>
<li>The code is exactly the same for all instances, except for the
identifier <tt>AyPzM5WK8ys</tt> which tells us which movie file
to retrieve.</li>
</ol>
<p>What point 2 means is that if we have code like <code>&lt;span
class=&quot;embed-youtube&quot;&gt;AyPzM5WK8ys&lt;/span&gt;</code> your
application can reconstruct the full object from this small snippet that
passes through HTML Purifier <em>unharmed</em>.</p>
<pre>
&lt;?php
class HTMLPurifierX_PreserveYouTube extends HTMLPurifier
{
function purify($html, $config = null) {
$pre_regex = '#&lt;object[^&gt;]+&gt;.+?'.
'http://www.youtube.com/v/([A-Za-z0-9]+).+?&lt;/object&gt;#';
$pre_replace = '&lt;span class=&quot;youtube-embed&quot;&gt;\1&lt;/span&gt;';
$html = preg_replace($pre_regex, $pre_replace, $html);
$html = parent::purify($html, $config);
$post_regex = '#&lt;span class=&quot;youtube-embed&quot;&gt;([A-Za-z0-9]+)&lt;/span&gt;#';
$post_replace = '&lt;object width=&quot;425&quot; height=&quot;350&quot; '.
'data=&quot;http://www.youtube.com/v/\1&quot;&gt;'.
'&lt;param name=&quot;movie&quot; value=&quot;http://www.youtube.com/v/\1&quot;&gt;&lt;/param&gt;'.
'&lt;param name=&quot;wmode&quot; value=&quot;transparent&quot;&gt;&lt;/param&gt;'.
'&lt;!--[if IE]&gt;'.
'&lt;embed src=&quot;http://www.youtube.com/v/\1&quot;'.
'type=&quot;application/x-shockwave-flash&quot;'.
'wmode=&quot;transparent&quot; width=&quot;425&quot; height=&quot;350&quot; /&gt;'.
'&lt;![endif]--&gt;'.
'&lt;/object&gt;';
$html = preg_replace($post_regex, $post_replace, $html);
return $html;
}
}
$purifier = new HTMLPurifierX_PreserveYouTube();
$html_still_with_youtube = $purifier->purify($html_with_youtube);
?&gt;
</pre>
<p>There is a bit going on here, so let's explain.</p>
<ol>
<li>The class uses the prefix <code>HTMLPurifierX</code> because it's
userspace code. Don't use <code>HTMLPurifier</code> in front of your
class, since it might clobber another class in the library.</li>
<li>In order to keep the interface compatible, we've extended HTMLPurifier
into a new class that preserves the YouTube videos. This means that
all you have to do is replace all instances of
<code>new HTMLPurifier</code> to <code>new
HTMLPurifierX_PreserveYouTube</code>. There's other ways to go about
doing this: if you were calling a function that wrapped HTML Purifier,
you could paste the PHP right there. If you wanted to be really
fancy, you could make a decorator for HTMLPurifier.</li>
<li>The first preg_replace call replaces any YouTube code users may have
embedded into the benign span tag. Span is used because it is inline,
and objects are inline too. We are very careful to be extremely
restrictive on what goes inside the span tag, as if an errant code
gets in there it could get messy.</li>
<li>The HTML is then purified as usual.</li>
<li>Then, another preg_replace replaces the span tag with a fully fledged
object. Note that the embed is removed, and, in its place, a data
attribute was added to the object. This makes the tag standards
compliant! It also breaks Internet Explorer, so we add in a bit of
conditional comments with the old embed code to make it work again.
It's all quite convoluted but works.</li>
</ol>
<h2>Warning</h2>
<p>There are a number of possible problems with the code above, depending
on how you look at it.</p>
<h3>Cannot change width and height</h3>
<p>The width and height of the final YouTube movie cannot be adjusted. This
is because I am lazy. If you really insist on letting users change the size
of the movie, what you need to do is package up the attributes inside the
span tag (along with the movie ID). It gets complicated though: a malicious
user can specify an outrageously large height and width and attempt to crash
the user's operating system/browser. You need to either cap it by limiting
the amount of digits allowed in the regex or using a callback to check the
number.</p>
<h3>Trusts media's host's security</h3>
<p>By allowing this code onto our website, we are trusting that YouTube has
tech-savvy enough people not to allow their users to inject malicious
code into the Flash files. An exploit on YouTube means an exploit on your
site. Even though YouTube is run by the reputable Google, it
<a href="http://ha.ckers.org/blog/20061213/google-xss-vuln/">doesn't</a>
mean they are
<a href="http://ha.ckers.org/blog/20061208/xss-in-googles-orkut/">invulnerable.</a>
You're putting a certain measure of the job on an external provider (just as
you have by entrusting your user input to HTML Purifier), and
it is important that you are cognizant of the risk.</p>
<h3>Poorly written adaptations compromise security</h3>
<p>This should go without saying, but if you're going to adapt this code
for Google Video or the like, make sure you do it <em>right</em>. It's
extremely easy to allow a character too many in the final section and
suddenly you're introducing XSS into HTML Purifier's XSS free output. HTML
Purifier may be well written, but it cannot guard against vulnerabilities
introduced after it has finished.</p>
<h2>Future plans</h2>
<p>It would probably be a good idea if this code was added to the core
library. Look out for the inclusion of this into the core as a decorator
or the like.</p>
</body>
</html>

View File

@@ -23,7 +23,10 @@ information for casual developers using HTML Purifier.</p>
<dl>
<dt><a href="enduser-id.html">IDs</a></dt>
<dd>Explains various methods for allowing IDs in documents safely in HTML Purifier.</dd>
<dd>Explains various methods for allowing IDs in documents safely.</dd>
<dt><a href="enduser-youtube.html">Embedding YouTube videos</a></dt>
<dd>Explains how to safely allow the embedding of flash from trusted sites.</dd>
</dl>

View File

@@ -10,7 +10,8 @@ It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
...but that's only an informative section. More things to do:
1. Scratch style attribute (it's deprecated)
2. Be module-aware
2. Be module-aware (this might entail intelligent grouping in the definition
and allowing users to specifically remove certain modules (see 5))
3. Cross-reference minimal content models with existing DTDs and determine
changes (todo)
4. Watch out for the Legacy Module