@@ -96,7 +96,7 @@ which can be a rewarding (but difficult) task.</p>
<h2 id="findcharset">Finding the real encoding</h2>
<p>In the beginning, there was ASCII, and things were simple. But they
weren't good, for no one could write in Cryllic or Thai. So there
weren't good, for no one could write in Cyrillic or Thai. So there
exploded a proliferation of character encodings to remedy the problem
by extending the characters ASCII could express. This ridiculously
simplified version of the history of character encodings shows us that
@@ -138,7 +138,7 @@ browser:</p>
<dd>View > Encoding: bulleted item is unofficial name</dd>
</dl>
<p>Internet Explorer won't give you the mime (i.e. useful/real) name of the
<p>Internet Explorer won't give you the MIME (i.e. useful/real) name of the
character encoding, so you'll have to look it up using their description.
Some common ones:</p>
@@ -216,6 +216,12 @@ if your <code>META</code> tag claims that either:</p>
<h2 id="fixcharset">Fixing the encoding</h2>
<p class="aside">The advice given here is for pages being served as
vanilla <code>text/html</code>. Different practices must be used
for <code>application/xml</code> or <code>application/xhtml+xml</code>; see
<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
document on XHTML media types</a> for more information.</p>
<p>If your <code>META</code> encoding and your real encoding match,
savvy! You can skip this section. If they don't...</p>
@@ -302,7 +308,8 @@ languages</a>. The appropriate code is:</p>
<p>...replacing UTF-8 with whatever your embedded encoding is.
This code must come before any output, so be careful about
stray whitespace in your application.</p>
stray whitespace in your application (i.e., any whitespace before
output excluding whitespace within <?php ?> tags).</p>
<h4 id="fixcharset-server-phpini">PHP ini directive</h4>
@@ -313,8 +320,8 @@ header call: <code><a href="http://php.net/ini.core#ini.default-charset">default
<p>...will also do the trick. If PHP is running as an Apache module (and
not as FastCGI, consult
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess do apply this property
globally:</p>
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
across many PHP files:</p>
<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset "UTF-8"</pre>
@@ -360,10 +367,11 @@ to send anything at all:</p>
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
<p>...making your <code>META</code> tags the sole source of
character encoding information. In these cases, it is
<em>especially</em> important to make sure you have valid <code>META</code>
tags on your pages and all the text before them is ASCII.</p>
<p>...making your internal charset declaration (usually the <code>META</code> tags)
the sole source of character encoding
information. In these cases, it is <em>especially</em> important to make
sure you have valid <code>META</code> tags on your pages and all the
text before them is ASCII.</p>
<blockquote class="aside"><p>These directives can also be
placed in httpd.conf file for Apache, but
@@ -428,28 +436,30 @@ IIS to change character encodings, I'd be grateful.</p>
<p><code>META</code> tags are the most common source of embedded
encodings, but they can also come from somewhere else: XML
processing instructions. They look like:</p>
Declarations. They look like:</p>
<pre><?xml version="1.0" encoding="UTF-8"?></pre>
<p>...and are most often found in XML documents (including XHTML).</p>
<p>For XHTML, this processing instruction theoretically
<p>For XHTML, this XML Declaration theoretically
overrides the <code>META</code> tag. In reality, this happens only when the
XHTML is actually served as legit XML and not HTML, which almost
never happens due to Internet Explorer's lack of support for
<code>application/xhtml+xml</code> (even though doing so is often
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good practice</a>).</p>
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good
practice</a> and is required by the XHTML 1.1 specification).</p>
<p>For XML, however, this processing instruction is extremely important.
<p>For XML, however, this XML Declaration is extremely important.
Since most webservers are not configured to send charsets for .xml files,
this is the only thing a parser has to go on. Furthermore, the default
for XML files is UTF-8, which often butts heads with more common
ISO-8859-1 encoding (you see this in garbled RSS feeds).</p>
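<p>The clash between the UTF-8 default and ISO-8859-1 content can be reproduced in a few lines. The sketch below is in Python rather than PHP, and the sample text and function name are made up for illustration. A strict XML parser would reject the bad bytes outright; <code>errors="replace"</code> is used here only to make the damage visible.</p>

```python
# Hypothetical feed text; the names below are illustrative only.
original = "café"

# The author saves the file as ISO-8859-1 but omits the XML Declaration.
latin1_bytes = original.encode("iso-8859-1")


def parse_as_utf8(data: bytes) -> str:
    """Decode bytes the way a consumer assuming the UTF-8 default would."""
    return data.decode("utf-8", errors="replace")


# The lone 0xE9 byte is not valid UTF-8, so it becomes U+FFFD.
garbled = parse_as_utf8(latin1_bytes)

# The reverse mistake: UTF-8 bytes read as ISO-8859-1 turn into mojibake.
mojibake = original.encode("utf-8").decode("iso-8859-1")

print(repr(garbled))
print(repr(mojibake))
```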
<p>In short, if you use XHTML and have gone through the
trouble of adding the XML header, make sure it jives
with your <code>META</code> tags and HTTP headers.</p>
trouble of adding the XML Declaration, make sure it jives
with your <code>META</code> tags (which should only be present
if served in text/html) and HTTP headers.</p>
<h3 id="fixcharset-internals">Inside the process</h3>
@@ -506,7 +516,7 @@ usage in one language sometimes requires the occasional special character
that, without surprise, is not available in your character set. Sometimes
developers get around this by adding support for multiple encodings: when
using Chinese, use Big5, when using Japanese, use Shift-JIS, when
using Greek, etc. Other times, they use character entities with great
using Greek, etc. Other times, they use character references with great
zeal.</p>
<p>UTF-8, however, obviates the need for any of these complicated
@@ -520,14 +530,14 @@ you don't have to use those user-unfriendly entities.</p>
<p>Websites encoded in Latin-1 (ISO-8859-1) which occasionally need
a special character outside of their scope often will use a character
entity to achieve the desired effect. For instance, θ can be
entity reference to achieve the desired effect. For instance, θ can be
written <code>&theta;</code>, regardless of the character encoding's
support of Greek letters.</p>
<p>This works nicely for limited use of special characters, but
say you wanted this sentence of Chinese text: 激光,
這兩個字是甚麼意思.
The entity-ized version would look like this:</p>
The ampersand encoded version would look like this:</p>
<pre>&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;</pre>
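<p>For a sense of the overhead, here is a minimal Python sketch (not PHP, and not part of any library discussed here) that produces the numeric references for the first two characters of that sentence and compares their size with the raw UTF-8 bytes.</p>

```python
# First two characters of the sample sentence above.
text = "激光"

# Python's built-in "xmlcharrefreplace" error handler writes every
# character the target codec cannot represent as a decimal reference.
refs = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

utf8_size = len(text.encode("utf-8"))  # bytes when stored as raw UTF-8
refs_size = len(refs)                  # bytes when stored as references

print(refs)
print(utf8_size, refs_size)
```

<p>Each of these characters costs three bytes in UTF-8 but eight as a numeric reference, and the reference form is unreadable in the page source.</p>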
@@ -545,7 +555,7 @@ an application that originally used ISO-8859-1 but switched to UTF-8
when it became far too cumbersome to support foreign languages. Bots
will now actually go through articles and convert character entities
to their corresponding real characters for the sake of user-friendliness
and searcheability. See
and searchability. See
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
page on special characters</a> for more details.
</p></blockquote>
@@ -593,7 +603,7 @@ browser you're using, they might:</p>
<ul>
<li>Replace the unsupported characters with useless question marks,</li>
<li>Attempt to fix the characters (example: smart quotes to regular quotes),</li>
<li>Replace the character with a character entity, or</li>
<li>Replace the character with a character entity reference, or</li>
<li>Send it anyway as a different character encoding mixed in
with the original encoding (usually Windows-1252 rather than
iso-8859-1 or UTF-8 interspersed in 8-bit)</li>
@@ -609,7 +619,7 @@ since UTF-8 supports every character.</p>
<h4 id="whyutf8-forms-multipart"><code>multipart/form-data</code></h4>
<p>Multipart form submission takes a way a lot of the ambiguity
<p>Multipart form submission takes away a lot of the ambiguity
that percent-encoding had: the server now can explicitly ask for
certain encodings, and the client can explicitly tell the server
during the form submission what encoding the fields are in.</p>
@@ -622,9 +632,9 @@ Each method has deficiencies, especially the former.</p>
<p>If you tell the browser to send the form in the same encoding as
the page, you still have the trouble of what to do with characters
that are outside of the character encoding's range. The behavior, once
again, varies: Firefox 2.0 entity-izes them while Internet Explorer
7.0 mangles them beyond intelligibility. For serious internationalization purposes,
this is not an option.</p>
again, varies: Firefox 2.0 converts them to character entity references
while Internet Explorer 7.0 mangles them beyond intelligibility. For
serious internationalization purposes, this is not an option.</p>
<p>The other possibility is to set <code>accept-charset</code> on the form to UTF-8, which
begs the question: Why aren't you using UTF-8 for everything then?
@@ -664,12 +674,12 @@ it up to the module iconv to do the dirty work.</p>
<p>This approach, however, is not perfect. iconv is blithely unaware
of HTML character entities. HTML Purifier, in order to
protect against sophisticated escaping schemes, normalizes all character
and numeric entities before processing the text. This leads to
and numeric entity references before processing the text. This leads to
one important ramification:</p>
<p><strong>Any character that is not supported by the target character
set, regardless of whether or not it is in the form of a character
entity or a raw character, will be silently ignored.</strong></p>
entity reference or a raw character, will be silently ignored.</strong></p>
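<p>The effect can be imitated outside HTML Purifier. The following Python sketch is an analogy only: it uses Python's standard library, not HTML Purifier's classes or iconv, and the sample markup is invented. It resolves a named reference to a raw character and then converts to Latin-1 while discarding whatever does not fit, which is roughly what the rule above describes. The last line shows the alternative of escaping unsupported characters instead of dropping them.</p>

```python
import html

# Invented sample input containing a named character reference.
markup = "&theta; = 90"

# Step 1: resolve references to raw characters, as a normalizer would.
decoded = html.unescape(markup)

# Step 2: convert to ISO-8859-1, discarding anything it cannot represent.
# Theta is not in Latin-1, so it vanishes without any error.
dropped = decoded.encode("iso-8859-1", errors="ignore").decode("iso-8859-1")

# Alternative: write unsupported characters as numeric references.
escaped = decoded.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(repr(decoded))
print(repr(dropped))
print(repr(escaped))
```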
<p>Example of this principle at work: say you have <code>&theta;</code>
in your HTML, but the output is in Latin-1 (which, understandably,
@@ -678,7 +688,7 @@ set the encoding correctly using %Core.Encoding):</p>
<ul>
<li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8
(note that theta is preserved since it doesn't actually use
(note that theta is preserved here since it doesn't actually use
any non-ASCII characters): <code>&theta;</code></li>
<li>The <code>EntityParser</code> will transform all named and numeric
character entities to their corresponding raw UTF-8 equivalents:
@@ -701,7 +711,7 @@ Purifier has provided a slightly more palatable workaround using
<li>The <code>EntityParser</code> transforms entities: <code>θ</code></li>
<li>HTML Purifier processes the code: <code>θ</code></li>
<li>The <code>Encoder</code> replaces all non-ASCII characters
with numeric entities: <code>&#952;</code></li>
with numeric entity references: <code>&#952;</code></li>
<li>For good measure, <code>Encoder</code> transforms encoding back to
original (which is strictly unnecessary for 99% of encodings
out there): <code>&#952;</code> (remember, it's all ASCII!)</li>
@@ -711,19 +721,19 @@ Purifier has provided a slightly more palatable workaround using
the land of Unicode characters, and is totally unacceptable for Chinese
or Japanese texts. The even bigger kicker is that, supposing the
input encoding was actually ISO-8859-7, which <em>does</em> support
theta, the character would get entity-ized anyway! (The Encoder does
not discriminate).</p>
theta, the character would get converted into a character entity reference
anyway! (The Encoder does not discriminate).</p>
<p>The current functionality is about where HTML Purifier will be for
the rest of eternity. HTML Purifier could attempt to preserve the original
form of the entities so that they could be substituted back in, only the
form of the character references so that they could be substituted back in, only the
DOM extension kills them off irreversibly. HTML Purifier could also attempt
to be smart and only convert non-ASCII characters that weren't supported
by the target encoding, but that would require reimplementing iconv
with HTML awareness, something I will not do.</p>
<p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm
not being sarcastic here: some people could care less about other languages)</p>
not being sarcastic here: some people couldn't care less about other languages).</p>
<h2 id="migrate">Migrate to UTF-8</h2>
@@ -985,7 +995,7 @@ and yes, it is variable width. Other traits:</p>
in different ways. It is beyond the scope of this document to explain
what precisely these implications are. PHPWact provides
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
on what to expect from each functions, although coverage is spotty in
on what to expect from each function, although coverage is spotty in
some areas. Their more general notes on
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
are also worth looking at for information on UTF-8. Some rules of thumb
@@ -999,7 +1009,7 @@ when dealing with Unicode text:</p>
<li>Think twice before using functions that:<ul>
<li>...count characters (strlen will return bytes, not characters;
str_split and wordwrap may corrupt)</li>
<li>...entity-ize things (UTF-8 doesn't need entities)</li>
<li>...convert characters to entity references (UTF-8 doesn't need entities)</li>
<li>...do very complex string processing (*printf)</li>
</ul></li>
</ul>
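<p>The byte-versus-character distinction behind the first rule can be seen in a short Python sketch. This is an analogy, not PHP: Python's <code>len()</code> on a string counts characters, so the byte-oriented behavior of PHP's <code>strlen</code> and <code>str_split</code> is imitated here by working on the encoded bytes. The helper names are made up for this example.</p>

```python
text = "日本語"
data = text.encode("utf-8")

char_count = len(text)  # what a character-aware function reports
byte_count = len(data)  # what a byte-oriented length function reports


def split_bytes(raw: bytes, size: int) -> list:
    """Cut a byte string into fixed-size pieces, ignoring character boundaries."""
    return [raw[i:i + size] for i in range(0, len(raw), size)]


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the chunk decodes cleanly as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every character here is three bytes long, so two-byte pieces
# always cut through the middle of a character.
chunks = split_bytes(data, 2)
broken = [c for c in chunks if not is_valid_utf8(c)]

print(char_count, byte_count, len(chunks), len(broken))
```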