mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-08-08 15:16:54 +02:00
Release 1.5.0, merged in r688-867.
- LanguageFactory::instance() declared static - HTMLModuleManagerTest pass by reference bug fixed, merge back into trunk scheduled git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/strict@869 48356398-32a2-884e-a903-53898d9a118a
@@ -10,7 +10,7 @@
|
||||
.minor td {font-style:italic;}
|
||||
</style>
|
||||
|
||||
<title>UTF-8 - HTML Purifier</title>
|
||||
<title>UTF-8: The Secret of Character Encoding - HTML Purifier</title>
|
||||
|
||||
<!-- Note to users: this document, though professing to be UTF-8, attempts
|
||||
to use only ASCII characters, because most webservers are configured
|
||||
@@ -19,21 +19,27 @@ own advice for sake of portability. -->
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>UTF-8</h1>
|
||||
<h1>UTF-8: The Secret of Character Encoding</h1>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://hp.jpsband.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Character encoding and character sets, in truth, are not that
|
||||
difficult to understand. But if you don't understand them, you are going
|
||||
to be caught by surprise by some of HTML Purifier's behavior, namely
|
||||
the fact that it operates UTF-8 or the limitations of the character
|
||||
encoding transformations it does. This document will walk you through
|
||||
<p>Character encoding and character sets are not that
|
||||
difficult to understand, but so many people blithely stumble
|
||||
through the worlds of programming without knowing what to actually
|
||||
do about it, or say "Ah, it's a job for those <em>internationalization</em>
|
||||
experts." No, it is not! This document will walk you through
|
||||
determining the encoding of your system and how you should handle
|
||||
this information. It will stay away from excessive discussion on
|
||||
the internals of character encoding, but offer the information in
|
||||
asides that can easily be skipped.</p>
|
||||
the internals of character encoding.</p>
|
||||
|
||||
<p>This document is not designed to be read in its entirety: it will
|
||||
slowly introduce concepts that build on each other: you need not get to
|
||||
the bottom to have learned something new. However, I strongly
|
||||
recommend you read all the way to <strong>Why UTF-8?</strong>, because at least
|
||||
at that point you'd have made a conscious decision not to migrate,
|
||||
which can be a rewarding (but difficult) task.</p>
|
||||
|
||||
<blockquote class="aside">
|
||||
<div class="label">Asides</div>
|
||||
@@ -43,6 +49,50 @@ asides that can easily be skipped.</p>
|
||||
with a greater understanding of the underlying issues.</p>
|
||||
</blockquote>
|
||||
|
||||
<h2>Table of Contents</h2>
|
||||
|
||||
<ol id="toc">
|
||||
<li><a href="#findcharset">Finding the real encoding</a></li>
|
||||
<li><a href="#findmetacharset">Finding the embedded encoding</a></li>
|
||||
<li><a href="#fixcharset">Fixing the encoding</a><ol>
|
||||
<li><a href="#fixcharset-none">No embedded encoding</a></li>
|
||||
<li><a href="#fixcharset-diff">Embedded encoding disagrees</a></li>
|
||||
<li><a href="#fixcharset-server">Changing the server encoding</a><ol>
|
||||
<li><a href="#fixcharset-server-php">PHP header() function</a></li>
|
||||
<li><a href="#fixcharset-server-phpini">PHP ini directive</a></li>
|
||||
<li><a href="#fixcharset-server-nophp">Non-PHP</a></li>
|
||||
<li><a href="#fixcharset-server-htaccess">.htaccess</a></li>
|
||||
<li><a href="#fixcharset-server-ext">File extensions</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#fixcharset-xml">XML</a></li>
|
||||
<li><a href="#fixcharset-internals">Inside the process</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#whyutf8">Why UTF-8?</a><ol>
|
||||
<li><a href="#whyutf8-i18n">Internationalization</a></li>
|
||||
<li><a href="#whyutf8-user">User-friendly</a></li>
|
||||
<li><a href="#whyutf8-forms">Forms</a><ol>
|
||||
<li><a href="#whyutf8-forms-urlencoded">application/x-www-form-urlencoded</a></li>
|
||||
<li><a href="#whyutf8-forms-multipart">multipart/form-data</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#whyutf8-support">Well supported</a></li>
|
||||
<li><a href="#whyutf8-htmlpurifier">HTML Purifier</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#migrate">Migrate to UTF-8</a><ol>
|
||||
<li><a href="#migrate-db">Configuring your database</a><ol>
|
||||
<li><a href="#migrate-db-legit">Legit method</a></li>
|
||||
<li><a href="#migrate-db-binary">Binary</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#migrate-editor">Text editor</a></li>
|
||||
<li><a href="#migrate-bom">Byte Order Mark (headers already sent!)</a></li>
|
||||
<li><a href="#migrate-fonts">Fonts</a><ol>
|
||||
<li><a href="#migrate-fonts-obscure">Obscure scripts</a></li>
|
||||
<li><a href="#migrate-fonts-occasional">Occasional use</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#migrate-variablewidth">Dealing with variable width in functions</a></li>
|
||||
</ol></li>
|
||||
<li><a href="#externallinks">Further Reading</a></li>
|
||||
</ol>
|
||||
|
||||
<h2 id="findcharset">Finding the real encoding</h2>
|
||||
|
||||
<p>In the beginning, there was ASCII, and things were simple. But they
|
||||
@@ -275,7 +325,7 @@ your own php.ini file, ask your support for details. Use:</p>
|
||||
|
||||
<h4 id="fixcharset-server-nophp">Non-PHP</h4>
|
||||
|
||||
<p>You may, for whatever reason, may need to set the character encoding
|
||||
<p>You may, for whatever reason, need to set the character encoding
|
||||
on non-PHP files, usually plain ol' HTML files. Doing this
|
||||
is more of a hit-or-miss process: depending on the software being
|
||||
used as a webserver and the configuration of that software, certain
|
||||
@@ -386,8 +436,8 @@ processing instructions. They look like:</p>
|
||||
|
||||
<p>For XHTML, this processing instruction theoretically
|
||||
overrides the <code>META</code> tag. In reality, this happens only when the
|
||||
XHTML is actually served as legit XML and not HTML, which is almost
|
||||
always never due to Internet Explorer's lack of support for
|
||||
XHTML is actually served as legit XML and not HTML, which almost never
happens, due to Internet Explorer's lack of support for
|
||||
<code>application/xhtml+xml</code> (even though doing so is often
|
||||
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good practice</a>).</p>
|
||||
|
||||
@@ -398,10 +448,10 @@ for XML files is UTF-8, which often butts heads with more common
|
||||
ISO-8859-1 encoding (you see this in garbled RSS feeds).</p>
|
||||
|
||||
<p>In short, if you use XHTML and have gone through the
|
||||
trouble of adding the XML header, be sure to make sure it jives
|
||||
trouble of adding the XML header, make sure it jibes
|
||||
with your <code>META</code> tags and HTTP headers.</p>
|
||||
|
||||
<h3>Inside the process</h3>
|
||||
<h3 id="fixcharset-internals">Inside the process</h3>
|
||||
|
||||
<p>This section is not required reading,
|
||||
but may answer some of your questions on what's going on in all
|
||||
@@ -572,7 +622,7 @@ Each method has deficiencies, especially the former.</p>
|
||||
the page, you still have the trouble of what to do with characters
|
||||
that are outside of the character encoding's range. The behavior, once
|
||||
again, varies: Firefox 2.0 entity-izes them while Internet Explorer
|
||||
7.0 mangles them beyond intelligibility. For serious I18N purposes,
|
||||
7.0 mangles them beyond intelligibility. For serious internationalization purposes,
|
||||
this is not an option.</p>
|
||||
|
||||
<p>The other possibility is to set <code>accept-charset</code> to UTF-8, which
|
||||
@@ -604,22 +654,374 @@ hounding you about broken pages.</p>
|
||||
|
||||
<h3 id="whyutf8-htmlpurifier">HTML Purifier</h3>
|
||||
|
||||
<p>And finally, we get to HTML Purifier.</p>
|
||||
<p>And finally, we get to HTML Purifier. HTML Purifier is built to
|
||||
deal with UTF-8: any indications otherwise are the result of an
|
||||
encoder that converts text from your preferred encoding to UTF-8, and
|
||||
back again. HTML Purifier never touches anything else, and leaves
|
||||
it up to the module iconv to do the dirty work.</p>
|
||||
|
||||
<p>This approach, however, is not perfect. iconv is blithely unaware
|
||||
of HTML character entities. HTML Purifier, in order to
|
||||
protect against sophisticated escaping schemes, normalizes all character
|
||||
and numeric entities before processing the text. This leads to
|
||||
one important ramification:</p>
|
||||
|
||||
<p><strong>Any character that is not supported by the target character
|
||||
set, regardless of whether or not it is in the form of a character
|
||||
entity or a raw character, will be silently ignored.</strong></p>
|
||||
|
||||
<p>Example of this principle at work: say you have <code>&theta;</code>
|
||||
in your HTML, but the output is in Latin-1 (which, understandably,
|
||||
does not understand Greek), the following process will occur (assuming you've
|
||||
set the encoding correctly using %Core.Encoding):</p>
|
||||
|
||||
<ul>
|
||||
<li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8
|
||||
(note that theta is preserved since it doesn't actually use
|
||||
any non-ASCII characters): <code>&theta;</code></li>
|
||||
<li>The <code>EntityParser</code> will transform all named and numeric
|
||||
character entities to their corresponding raw UTF-8 equivalents:
|
||||
<code>θ</code></li>
|
||||
<li>HTML Purifier processes the code: <code>θ</code></li>
|
||||
<li>The <code>Encoder</code> now transforms the text back from UTF-8
|
||||
to ISO 8859-1. Since Greek is not supported by ISO 8859-1, it
|
||||
will be either ignored or replaced with a question mark:
|
||||
<code>?</code></li>
|
||||
</ul>
|
||||
|
||||
<p>This behaviour is quite unsatisfactory. It is a deal-breaker for
|
||||
international applications, and it can be mildly annoying for the provincial
|
||||
soul who occasionally needs a special character. Since 1.4.0, HTML
|
||||
Purifier has provided a slightly more palatable workaround using
|
||||
%Core.EscapeNonASCIICharacters. The process now looks like:</p>
|
||||
|
||||
<ul>
|
||||
<li>The <code>Encoder</code> transforms encoding to UTF-8: <code>&theta;</code></li>
|
||||
<li>The <code>EntityParser</code> transforms entities: <code>θ</code></li>
|
||||
<li>HTML Purifier processes the code: <code>θ</code></li>
|
||||
<li>The <code>Encoder</code> replaces all non-ASCII characters
|
||||
with numeric entities: <code>&#952;</code></li>
|
||||
<li>For good measure, <code>Encoder</code> transforms encoding back to
|
||||
original (which is strictly unnecessary for 99% of encodings
|
||||
out there): <code>&#952;</code> (remember, it's all ASCII!)</li>
|
||||
</ul>
|
||||
|
||||
<p>...which means that this is only good for an occasional foray into
|
||||
the land of Unicode characters, and is totally unacceptable for Chinese
|
||||
or Japanese texts. The even bigger kicker is that, supposing the
|
||||
input encoding was actually ISO-8859-7, which <em>does</em> support
|
||||
theta, the character would get entity-ized anyway! (The Encoder does
|
||||
not discriminate).</p>
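<p>To see what that escaping step amounts to, here is a minimal sketch
using PHP's mbstring extension. It is not the code the <code>Encoder</code>
actually runs: the helper name <code>escape_non_ascii</code> is made up for
illustration, and it assumes mbstring is installed and the input is
already valid UTF-8.</p>

<pre>// Hypothetical helper, not part of HTML Purifier.
function escape_non_ascii($utf8)
{
    // Convert every code point from U+0080 to U+10FFFF into a decimal
    // numeric entity; plain ASCII is left untouched.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\xCE\xB8" is the UTF-8 byte sequence for theta (code point 952),
// so it comes out as the decimal entity for 952 between "a" and "b".
echo escape_non_ascii("a\xCE\xB8b"), "\n";</pre>

<p>Note that, just like the Encoder, this sketch does not look at the
target encoding at all: every non-ASCII character is escaped, whether or
not the target could have represented it.</p>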
|
||||
|
||||
<p>The current functionality is about where HTML Purifier will be for
|
||||
the rest of eternity. HTML Purifier could attempt to preserve the original
|
||||
form of the entities so that they could be substituted back in, only the
|
||||
DOM extension kills them off irreversibly. HTML Purifier could also attempt
|
||||
to be smart and only convert non-ASCII characters that weren't supported
|
||||
by the target encoding, but that would require reimplementing iconv
|
||||
with HTML awareness, something I will not do.</p>
|
||||
|
||||
<p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm
|
||||
not being sarcastic here: some people couldn't care less about other languages)</p>
|
||||
|
||||
<h2 id="migrate">Migrate to UTF-8</h2>
|
||||
|
||||
<h3 id="migrate-editor">Text editor</h3>
|
||||
<p>So, you've decided to bite the bullet, and want to migrate to UTF-8.
|
||||
Note that this is not for the faint-hearted, and you should expect
|
||||
the process to take longer than you think it will take.</p>
|
||||
|
||||
<p>The general idea is that you convert all existing text to UTF-8,
|
||||
and then you set all the headers and META tags we discussed earlier
|
||||
to UTF-8. There are many ways going about doing this: you could
|
||||
write a conversion script that runs through the database and re-encodes
|
||||
everything as UTF-8 or you could do the conversion on the fly when someone
|
||||
reads the page. The details depend on your system, but I will cover
|
||||
some of the more subtle points of migration that may trip you up.</p>
|
||||
|
||||
<h3 id="migrate-db">Configuring your database</h3>
|
||||
|
||||
<h3 id="migrate-convert">Convert old text</h3>
|
||||
<p>Most modern databases, the most prominent open-source ones being MySQL
|
||||
4.1+ and PostgreSQL, support character encodings. If you're switching
|
||||
to UTF-8, logically speaking, you'd want to make sure your database
|
||||
knows about the change too. There are some caveats though:</p>
|
||||
|
||||
<h4 id="migrate-db-legit">Legit method</h4>
|
||||
|
||||
<p>Standardization in terms of SQL syntax for specifying character
|
||||
encodings is notoriously spotty. Refer to your respective database's
|
||||
documentation on how to do this properly.</p>
|
||||
|
||||
<p>For <a href="http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
|
||||
character encoding conversion for you. However, you have
|
||||
to make sure that the text inside the column is what it says it is:
|
||||
if you had put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle
|
||||
the text when you try to convert it to UTF-8. You'll have to convert
|
||||
it to a binary field, convert it to a Shift-JIS field (the real encoding),
|
||||
and then finally to UTF-8. Many a website had pages irreversibly mangled
|
||||
because they didn't realize that they'd been deluding themselves about
|
||||
the character encoding all along. Don't become the next victim.</p>
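<p>The binary detour described above can be sketched as three
<code>ALTER</code> statements. The helper below only builds the SQL
strings; the function name, the table <code>posts</code> and the column
<code>body</code> are hypothetical, and it assumes a <code>TEXT</code>
column. Back up your data and check the statements against the MySQL
manual for your version before running anything like this.</p>

<pre>// Hypothetical helper: returns the statements in the order to run them.
function reencode_column_statements($table, $column, $realCharset)
{
    return array(
        // 1. Binary type: MySQL stops interpreting the bytes.
        "ALTER TABLE `$table` MODIFY `$column` BLOB",
        // 2. Label the bytes with their real encoding (no conversion yet).
        "ALTER TABLE `$table` MODIFY `$column` TEXT CHARACTER SET $realCharset",
        // 3. Now MySQL knows the source encoding and can convert to UTF-8.
        "ALTER TABLE `$table` MODIFY `$column` TEXT CHARACTER SET utf8",
    );
}

// sjis is MySQL's name for Shift-JIS.
foreach (reencode_column_statements('posts', 'body', 'sjis') as $sql) {
    echo $sql, ";\n";
}</pre>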
|
||||
|
||||
<p>For <a href="http://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
|
||||
encoding of a database (as of 8.2). You will have to dump the data, and then reimport
|
||||
it into a new table. Make sure that your client encoding is set properly:
|
||||
this is how PostgreSQL knows to perform an encoding conversion.</p>
|
||||
|
||||
<p>Many times, you will also be asked about the "collation" of
|
||||
the new column. Collation is how a DBMS sorts text, like ordering
|
||||
B, C and A into A, B and C (the problem gets surprisingly complicated
|
||||
when you get to languages like Thai and Japanese). If in doubt,
|
||||
going with the default setting is usually a safe bet.</p>
|
||||
|
||||
<p>Once the conversion is all said and done, you still have to remember
|
||||
to set the client encoding (your encoding) properly on each database
|
||||
connection using <code>SET NAMES</code> (which is standard SQL and is
|
||||
usually supported).</p>
|
||||
|
||||
<h4 id="migrate-db-binary">Binary</h4>
|
||||
|
||||
<p>Due to the abovementioned compatibility issues, a more interoperable
|
||||
way of storing UTF-8 text is to stuff it in a binary datatype.
|
||||
<code>CHAR</code> becomes <code>BINARY</code>, <code>VARCHAR</code> becomes
|
||||
<code>VARBINARY</code> and <code>TEXT</code> becomes <code>BLOB</code>.
|
||||
Doing so can save you some huge headaches:</p>
|
||||
|
||||
<ul>
|
||||
<li>The syntax for binary data types is very portable,</li>
|
||||
<li>MySQL 4.0 has <em>no</em> support for character encodings, so
|
||||
if you want to support it you <em>have</em> to use binary,</li>
|
||||
<li>MySQL, as of 5.1, has no support for four byte UTF-8 characters,
|
||||
which represent characters beyond the basic multilingual
|
||||
plane, and</li>
|
||||
<li>You will never have to worry about your DBMS being too smart
|
||||
and attempting to convert your text when you don't want it to.</li>
|
||||
</ul>
|
||||
|
||||
<p>MediaWiki, a very prominent international application, uses binary fields
|
||||
for storing their data because of point three.</p>
|
||||
|
||||
<p>There are drawbacks, of course:</p>
|
||||
|
||||
<ul>
|
||||
<li>Database tools like PHPMyAdmin won't be able to offer you inline
|
||||
text editing, since it is declared as binary,</li>
|
||||
<li>It's not semantically correct: it's really text, not binary
|
||||
(lying to the database),</li>
|
||||
<li>Unless you use the not-very-portable wizardry mentioned above,
|
||||
you have to change the encoding yourself (usually, you'd do
|
||||
it on the fly), and</li>
|
||||
<li>You will not have collation.</li>
|
||||
</ul>
|
||||
|
||||
<p>Choose based on your circumstances.</p>
|
||||
|
||||
<h3 id="migrate-editor">Text editor</h3>
|
||||
|
||||
<p>For more flat-file oriented systems, you will often be tasked with
|
||||
converting reams of existing text and HTML files into UTF-8, as well as
|
||||
making sure that all new files uploaded are properly encoded. Once again,
|
||||
I can only point vaguely in the right direction for converting your
|
||||
existing files: make sure you back up, make sure you use
|
||||
<a href="http://php.net/ref.iconv">iconv</a>(), and
|
||||
make sure you know what the original character encoding of the files
|
||||
is (or are, depending on the tidiness of your system).</p>
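<p>As a rough illustration of the <code>iconv()</code> step, here is a
minimal sketch. The helper name <code>to_utf8</code> is made up, and it
assumes you already know the real source encoding (ISO-8859-1 in this
example); guessing wrong will quietly produce garbage rather than an
error.</p>

<pre>// Hypothetical helper; $from must be the text's real encoding.
function to_utf8($bytes, $from)
{
    $converted = iconv($from, 'UTF-8', $bytes);
    if ($converted === false) {
        // iconv() returns false when it cannot convert the input.
        throw new RuntimeException("Could not convert from $from");
    }
    return $converted;
}

$latin1 = "caf\xE9";                  // "cafe" with an e-acute, in ISO-8859-1
$utf8   = to_utf8($latin1, 'ISO-8859-1');
echo bin2hex($utf8), "\n";            // 636166c3a9</pre>

<p>For files, you would wrap this between <code>file_get_contents()</code>
and <code>file_put_contents()</code>, writing to a new location so the
originals stay intact until you have checked the results.</p>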
|
||||
|
||||
<p>However, I can proffer more specific advice on the subject of
|
||||
text editors. Many text editors have notoriously spotty Unicode support.
|
||||
To find out how your editor is doing, you can check out <a
|
||||
href="http://www.alanwood.net/unicode/utilities_editors.html">this list</a>
|
||||
or <a href="http://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list</a>.
|
||||
I personally use Notepad++, which works like a charm when it comes to UTF-8.
|
||||
Usually, you will have to <strong>explicitly</strong> tell the editor through some dialogue
|
||||
(usually Save as or Format) what encoding you want it to use. An editor
|
||||
will often offer "Unicode" as a method of saving, which is
|
||||
ambiguous. Make sure you know whether or not they really mean UTF-8
|
||||
or UTF-16 (which is another flavor of Unicode).</p>
|
||||
|
||||
<p>The two things to look out for are whether or not the editor
|
||||
supports <strong>font mixing</strong> (multiple
|
||||
fonts in one document) and whether or not it adds a <strong>BOM</strong>.
|
||||
Font mixing is important because fonts rarely have support for every
|
||||
language known to mankind: in order to be flexible, an editor must
|
||||
be able to take a little from here and a little from there, otherwise
|
||||
all your Chinese characters will come as nice boxes. We'll discuss
|
||||
BOM below.</p>
|
||||
|
||||
<h3 id="migrate-bom">Byte Order Mark (headers already sent!)</h3>
|
||||
|
||||
<p>The BOM, or <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
|
||||
Order Mark</a>, is a magical, invisible character placed at
|
||||
the beginning of UTF-8 files to tell people what the encoding is and
|
||||
what the endianness of the text is. It is also unnecessary.</p>
|
||||
|
||||
<p>Because it's invisible, it often
|
||||
catches people by surprise when it starts doing things it shouldn't
|
||||
be doing. For example, this PHP file:</p>
|
||||
|
||||
<pre><strong>BOM</strong><?php
|
||||
header('Location: index.php');
|
||||
?></pre>
|
||||
|
||||
<p>...will fail with the all too familiar <strong>Headers already sent</strong>
|
||||
PHP error. And because the BOM is invisible, this culprit will go unnoticed.
|
||||
My suggestion is to only use ASCII in PHP pages, but if you must, make
|
||||
sure the page is saved WITHOUT the BOM.</p>
|
||||
|
||||
<blockquote class="aside">
|
||||
<p>The headers the error is referring to are <strong>HTTP headers</strong>,
|
||||
which are sent to the browser before any HTML to tell it various
|
||||
information. The moment any regular text (and yes, a BOM counts as
|
||||
ordinary text) is output, the headers must be sent, and you are
|
||||
not allowed to send any more. Thus, the error.</p>
|
||||
</blockquote>
|
||||
|
||||
<p>If you are reading in text files to insert into the middle of another
|
||||
page, it is strongly advised (but not strictly necessary) that you strip out the UTF-8 byte
|
||||
sequence for BOM <code>"\xEF\xBB\xBF"</code> before inserting it in,
|
||||
via:</p>
|
||||
|
||||
<pre>$text = str_replace("\xEF\xBB\xBF", '', $text);</pre>
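<p>The one-liner above removes the byte sequence wherever it appears.
If you would rather touch only a BOM at the very start of the text,
a variant along these lines should do; the function name
<code>strip_leading_bom</code> is just an illustration.</p>

<pre>// Hypothetical helper: removes a UTF-8 BOM only if the text starts with it.
function strip_leading_bom($text)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        // The cast guards against older PHP versions returning false
        // when the text consists of nothing but the BOM.
        return (string) substr($text, 3);
    }
    return $text;
}

$text = strip_leading_bom("\xEF\xBB\xBFHello");
echo $text, "\n"; // Hello</pre>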
|
||||
|
||||
<h3 id="migrate-fonts">Fonts</h3>
|
||||
|
||||
<p>Generally speaking, people who are having trouble with fonts fall
|
||||
into two categories:</p>
|
||||
|
||||
<ul>
|
||||
<li>Those who want to
|
||||
use an extremely obscure language for which there is very little
|
||||
support even among native speakers of the language, and</li>
|
||||
<li>Those where the primary language of the text is
|
||||
well-supported but there are occasional characters
|
||||
that aren't supported.</li>
|
||||
</ul>
|
||||
|
||||
<p>Yes, there's always a chance that an English user happens across
|
||||
a Sinhalese website and doesn't have the right font. But an English user
|
||||
who happens not to have the right fonts probably has no business reading Sinhalese
|
||||
anyway. So we'll deal with the other two edge cases.</p>
|
||||
|
||||
<h4 id="migrate-fonts-obscure">Obscure scripts</h4>
|
||||
|
||||
<p>If you run a Bengali website, you may get comments from users who
|
||||
would like to read your website but get heaps of question marks or
|
||||
other meaningless characters. Fixing this problem requires the
|
||||
installation of a font or language pack which is often highly
|
||||
dependent on what the language is. <a href="http://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_help">Here is an example</a>
|
||||
of such a help file for the Bengali language; I am sure there are
|
||||
others out there too. You just have to point users to the appropriate
|
||||
help file.</p>
|
||||
|
||||
<h4 id="migrate-fonts-occasional">Occasional use</h4>
|
||||
|
||||
<p>A prime example of when you'll see some very obscure Unicode
|
||||
characters embedded in what otherwise would be very bland ASCII are
|
||||
letters of the
|
||||
<a href="http://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
|
||||
Phonetic Alphabet (IPA)</a>, used to designate pronunciations in a very standard
|
||||
manner (you probably see them all the time in your dictionary). Your
|
||||
average font probably won't have support for all of the IPA characters
|
||||
like ʘ (bilabial click) or ʒ (voiced postalveolar fricative).
|
||||
So what's a poor browser to do? Font mix! Smart browsers like Mozilla Firefox
|
||||
and Internet Explorer 7 will borrow glyphs from other fonts in order
|
||||
to make sure that all the characters display properly.</p>
|
||||
|
||||
<p>But what happens when the browser isn't smart and happens to be the
|
||||
most widely used browser in the entire world? Microsoft IE 6
|
||||
is not smart enough to borrow from other fonts when a character isn't
|
||||
present, so more often than not you'll be slapped with a nice big �.
|
||||
To get things to work, MSIE 6 needs a little nudge. You could configure it
|
||||
to use a different font to render the text, but you can achieve the same
|
||||
effect by selectively changing the font for blocks of special characters
|
||||
to known good Unicode fonts.</p>
|
||||
|
||||
<p>Fortunately, the folks over at Wikipedia have already done all the
|
||||
heavy lifting for you. Get the CSS from the horse's mouth here:
|
||||
<a href="http://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
|
||||
and search for ".IPA". There is also a smattering of
|
||||
other classes you can use for other purposes; check out
|
||||
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
|
||||
for more details. For you lazy ones, this should work:</p>
|
||||
|
||||
<pre>.Unicode {
|
||||
font-family: Code2000, "TITUS Cyberbit Basic", "Doulos SIL",
|
||||
"Chrysanthi Unicode", "Bitstream Cyberbit",
|
||||
"Bitstream CyberBase", Thryomanes, Gentium, GentiumAlt,
|
||||
"Lucida Grande", "Arial Unicode MS", "Microsoft Sans Serif",
|
||||
"Lucida Sans Unicode";
|
||||
font-family /**/:inherit; /* resets fonts for everyone but IE6 */
|
||||
}</pre>
|
||||
|
||||
<p>The standard usage goes along the lines of <code><span class="Unicode">Crazy
|
||||
Unicode stuff here</span></code>. Characters in the
|
||||
<a href="http://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
|
||||
usually don't need to be fixed, but for anything else you probably
|
||||
want to play it safe. Unless, of course, you don't care about IE6
|
||||
users.</p>
|
||||
|
||||
<h3 id="migrate-variablewidth">Dealing with variable width in functions</h3>
|
||||
|
||||
<p>When people claim that PHP6 will solve all our Unicode problems, they're
|
||||
misinformed. It will not fix any of the abovementioned troubles. It will,
|
||||
however, fix the problem we are about to discuss: processing UTF-8 text
|
||||
in PHP.</p>
|
||||
|
||||
<p>PHP (as of PHP5) is blithely unaware of the existence of UTF-8 (with a few
|
||||
notable exceptions). Sometimes this will cause problems; other times,
|
||||
this won't. So far, we've avoided discussing the architecture of
|
||||
UTF-8, so, we must first ask, what is UTF-8? Yes, it supports Unicode,
|
||||
and yes, it is variable width. Other traits:</p>
|
||||
|
||||
<ul>
|
||||
<li>Every character's byte sequence is unique and will never be found
|
||||
inside the byte sequence of another character,</li>
|
||||
<li>UTF-8 may use up to four bytes to encode a character,</li>
|
||||
<li>UTF-8 text must be checked for well-formedness,</li>
|
||||
<li>Pure ASCII is also valid UTF-8, and</li>
|
||||
<li>Binary sorting will sort UTF-8 in the same order as Unicode.</li>
|
||||
</ul>
|
||||
|
||||
<p>Each of these traits affects different domains of text processing
|
||||
in different ways. It is beyond the scope of this document to explain
|
||||
what precisely these implications are. PHPWact provides
|
||||
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
|
||||
on what to expect from each function, although coverage is spotty in
|
||||
some areas. Their more general notes on
|
||||
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
|
||||
are also worth looking at for information on UTF-8. Some rules of thumb
|
||||
when dealing with Unicode text:</p>
|
||||
|
||||
<ul>
|
||||
<li>Do not EVER use functions that:<ul>
|
||||
<li>...convert case (strtolower, strtoupper, ucfirst, ucwords)</li>
|
||||
<li>...claim to be case-insensitive (str_ireplace, stristr, strcasecmp)</li>
|
||||
</ul></li>
|
||||
<li>Think twice before using functions that:<ul>
|
||||
<li>...count characters (strlen will return bytes, not characters;
|
||||
str_split and wordwrap may corrupt)</li>
|
||||
<li>...entity-ize things (UTF-8 doesn't need entities)</li>
|
||||
<li>...do very complex string processing (*printf)</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
|
||||
<p>...and always think in bytes, not characters. If you use strpos()
|
||||
to find the position of a character, it will be in bytes, but this
|
||||
usually won't matter since substr() also operates with byte indices!</p>
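<p>A small sketch of what thinking in bytes looks like in practice. The
sample string is written with escape sequences so this file stays ASCII;
the variable names are just for illustration.</p>

<pre>// "cafe ole" with e-acutes; each e-acute is the two bytes C3 A9 in UTF-8.
$text = "caf\xC3\xA9 ol\xC3\xA9";

$bytes = strlen($text);                   // 10: strlen() counts bytes
$chars = preg_match_all('/./us', $text, $m); // 8: the /u modifier counts characters

// strpos() returns a byte offset, and substr() expects one,
// so the two functions work together without any trouble.
$pos   = strpos($text, ' ');              // 5
$first = substr($text, 0, $pos);          // "caf" plus both bytes of the e-acute

// A hand-picked offset is another matter: this cuts the e-acute in
// half and leaves malformed UTF-8 behind.
$broken = substr($text, 0, 4);

echo $bytes, ' ', $chars, ' ', $pos, "\n"; // 10 8 5</pre>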
|
||||
|
||||
<p>You'll also need to make sure your UTF-8 is well-formed and will
|
||||
probably need replacements for some of these functions. I recommend
|
||||
using Harry Fuecks' <a href="http://phputf8.sourceforge.net/">PHP
|
||||
UTF-8</a> library, rather than use mbstring directly. HTML Purifier
|
||||
also defines a few useful UTF-8 compatible functions: check out
|
||||
<code>Encoder.php</code> in the <code>/library/HTMLPurifier/</code>
|
||||
directory.</p>
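<p>If all you need is a quick well-formedness check, one common trick
relies on PCRE: with the <code>u</code> modifier, <code>preg_match()</code>
validates its subject as UTF-8 before matching and returns false when it
is malformed. The function name below is made up, and this is a sketch,
not what HTML Purifier itself does.</p>

<pre>// Hypothetical helper: true if $text is well-formed UTF-8.
function is_well_formed_utf8($text)
{
    // The empty pattern matches any valid subject, so the result is 1
    // for well-formed input and false for malformed input.
    return preg_match('//u', $text) === 1;
}

var_dump(is_well_formed_utf8("\xCE\xB8")); // bool(true): theta
var_dump(is_well_formed_utf8("\xCE"));     // bool(false): truncated sequence</pre>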
|
||||
|
||||
<h2 id="externallinks">Further Reading</h2>
|
||||
|
||||
<p>Well, that's it. Hopefully this document has served as a very
|
||||
practical springboard into knowledge of how UTF-8 works. You may have
|
||||
decided that you don't want to migrate yet: that's fine, just know
|
||||
what will happen to your output and what bug reports you may receive.</p>
|
||||
|
||||
<p>Many other developers have already discussed the subject of Unicode,
|
||||
UTF-8 and internationalization, and I would like to defer to them for
|
||||
a more in-depth look into character sets and encodings.</p>
|
||||
|