diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html index 32526f55..8a24f7ac 100644 --- a/docs/enduser-utf8.html +++ b/docs/enduser-utf8.html @@ -5,12 +5,17 @@ + UTF-8 - HTML Purifier + +

UTF-8

@@ -24,19 +29,24 @@ to be caught by surprise by some of HTML Purifier's behavior, namely the fact that it operates UTF-8 or the limitations of the character encoding transformations it does. This document will walk you through determining the encoding of your system and how you should handle -this information.

+this information. It will stay away from excessive discussion on +the internals of character encoding, but offer the information in +asides that can easily be skipped.

-
Text in this formatting is an aside, +
+
Asides
+

Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out - with a greater understanding of the underlying issues.

+ with a greater understanding of the underlying issues.

+
-

Finding the real encoding

+

Finding the real encoding

In the beginning, there was ASCII, and things were simple. But they weren't good, for no one could write in Cyrillic or Thai. So there exploded a proliferation of character encodings to remedy the problem -by extending the characters ASCII could express. This is ridiculously +by extending the characters ASCII could express. This ridiculously simplified version of the history of character encodings shows us that there are now many character encodings floating around.

@@ -85,8 +95,8 @@ Some common ones:

IE's Description Mime Name - Windows + Windows Arabic (Windows)Windows-1256 Baltic (Windows)Windows-1257 Central European (Windows)Windows-1250 @@ -98,22 +108,22 @@ Some common ones:

Vietnamese (Windows)Windows-1258 Western European (Windows)Windows-1252 - ISO - Arabic (ISO)ISO-8859-6 - Baltic (ISO)ISO-8859-4 - Central European (ISO)ISO-8859-2 - Cyrillic (ISO)ISO-8859-5 - Estonian (ISO)ISO-8859-13 - Greek (ISO)ISO-8859-7 - Hebrew (ISO-Logical)ISO-8859-8-l - Hebrew (ISO-Visual)ISO-8859-8 - Latin 9 (ISO)ISO-8859-15 - Turkish (ISO)ISO-8859-9 - Western European (ISO)ISO-8859-1 - - Other + ISO + Arabic (ISO)ISO-8859-6 + Baltic (ISO)ISO-8859-4 + Central European (ISO)ISO-8859-2 + Cyrillic (ISO)ISO-8859-5 + Estonian (ISO)ISO-8859-13 + Greek (ISO)ISO-8859-7 + Hebrew (ISO-Logical)ISO-8859-8-l + Hebrew (ISO-Visual)ISO-8859-8 + Latin 9 (ISO)ISO-8859-15 + Turkish (ISO)ISO-8859-9 + Western European (ISO)ISO-8859-1 + + Other Chinese Simplified (GB18030)GB18030 Chinese Simplified (GB2312)GB2312 Chinese Simplified (HZ)HZ @@ -130,7 +140,7 @@ character encodings, and having to lookup the real names with a table is a pain, so I recommend using Mozilla Firefox to find out your character encoding.

-

Finding the embedded encoding

+

Finding the embedded encoding

At this point, you may be asking, "Didn't we already find out our encoding?" Well, as it turns out, there are multiple places where @@ -152,12 +162,12 @@ if your META tag claims that either:

  • There is no META tag at all! (horror, horror!)
  • -

    Fixing the embedded encoding

    +

    Fixing the encoding

    If your META encoding and your real encoding match, savvy! You can skip this section. If they don't...

    -

    I have no embedded encoding!

    +

    No embedded encoding

    If this is the case, you'll want to add in the appropriate META tag to your website. It's as simple as copy-pasting @@ -175,12 +185,242 @@ of your real encoding.

    exploit.

    You might be able to get away with not specifying a character encoding with the META tag as long as your webserver - sends the right Content-Type header, but why risk it?

    + sends the right Content-Type header, but why risk it? Besides, if + the user downloads the HTML file, there is no longer any webserver + to define the character encoding.

    -

    Huh? The embedded encoding disagrees!

    +

    Embedded encoding disagrees

    -

    Further Reading

    +

    This is an extremely common mistake: another source is telling +the browser what the +character encoding is and is overriding the embedded encoding. This +source is usually the Content-Type HTTP header that the webserver (e.g. +Apache) sends. A typical Content-Type header sent with a page might +look like this:

    + +
    Content-Type: text/html; charset=ISO-8859-1
    + +

    Notice how there is a charset parameter: this is the webserver's +way of telling a browser what the character encoding is, much like +the META tags we touched upon previously.
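    To make the charset parameter concrete, here is a minimal Python sketch (Python rather than PHP purely for illustration) that pulls the charset out of a Content-Type header value. The helper name charset_from_content_type is made up for this example; it leans on the standard library's email.message module, which returns the charset in lowercase.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only; it is not part of HTML Purifier.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() reads the charset parameter and lowercases it.
    return msg.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/html")
# with_charset is "iso-8859-1"; without_charset is None, which is the case
# where the browser has to look elsewhere (for instance, at the META tag).
```

    If the result is None, the webserver is not announcing any encoding, and the embedded encoding is all the browser has to go on.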

    + +

    In fact, the META tag is +designed as a substitute for the HTTP header for contexts where +sending headers is impossible (such as locally stored files without +a webserver). Thus the name http-equiv (HTTP equivalent). +

    + +

    There are two ways to go about fixing this: changing the META +tag to match the HTTP header, or changing the HTTP header to match +the META tag. How do we know which to do? It depends +on the website's content: after all, headers and tags are only ways of +describing the actual characters on the web page.

    + +

    If your website:

    + +
    +
    ...only uses ASCII characters,
    +
    Either way is fine, but I recommend switching both to + UTF-8 (more on this later).
    +
    ...uses special characters, and they display + properly,
    +
    Change the embedded encoding to the server encoding.
    +
    ...uses special characters, but users often complain that + they come out garbled,
    +
    Change the server encoding to the embedded encoding.
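    The three cases above follow from what happens when bytes are read with the wrong encoding. The following Python sketch (the function name misread is invented for this example) encodes text one way and decodes it another, which is roughly what a browser does when the header and the real encoding disagree.

```python
def misread(text, actual, assumed):
    """Encode text as `actual`, then decode the resulting bytes as `assumed`.

    Illustrative only: this mimics a browser trusting the wrong charset label.
    """
    return text.encode(actual).decode(assumed, errors="replace")


# UTF-8 bytes labelled as ISO-8859-1: the accented letter becomes two characters.
garbled = misread("caf\u00e9", "utf-8", "iso-8859-1")       # "caf\u00c3\u00a9"

# ISO-8859-1 bytes labelled as UTF-8: the lone byte is invalid and is replaced.
replaced = misread("caf\u00e9", "iso-8859-1", "utf-8")      # "caf\ufffd"

# Pure ASCII survives either way, which is why ASCII-only sites never notice.
unchanged = misread("cafe", "utf-8", "iso-8859-1")          # "cafe"
```

    This is why an ASCII-only site can pick either fix, while a site with special characters has to keep whichever label matches the bytes actually being served.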
    +
    + +

    Changing a META tag is easy: just swap out the old encoding +for the new. Changing the server (HTTP header) encoding, however, +is slightly more difficult.

    + +

    Changing the server encoding

    + +

    PHP header() function

    + +

    The simplest way to handle this problem is to send the encoding +yourself, via your programming language. Since you're using HTML +Purifier, I'll assume PHP, although it's not too difficult to do +similar things in +other +languages. The appropriate code is:

    + +
    header('Content-Type:text/html; charset=UTF-8');
    + +

    ...replacing UTF-8 with whatever your embedded encoding is. +This code must come before any output, so be careful about +stray whitespace in your application.

    + +

    Non-PHP

    + +

    You may, for whatever reason, need to set the character encoding +on non-PHP files, usually plain ol' HTML files. Doing this +is more of a hit-or-miss process: depending on the software being +used as a webserver and the configuration of that software, certain +techniques may work, or may not work.

    + +

    .htaccess

    + +

    On Apache, you can use an .htaccess file to change the character +encoding. I'll defer to +W3C +for the in-depth explanation, but it boils down to creating a file +named .htaccess with the contents:

    + +
    AddCharset UTF-8 .html
    + +

    Where UTF-8 is replaced with the character encoding you want to +use and .html is a file extension that this will be applied to. This +character encoding will then be set for any file directly in +or in the subdirectories of the directory you place this file in.

    + +

    If you're feeling particularly courageous, you can use:

    + +
    AddDefaultCharset UTF-8
    + +

    ...which changes the character set Apache adds to any document that +doesn't have any Content-Type parameters. This directive, which the +default configuration file sets to iso-8859-1 for security +reasons, is probably why your headers mismatch +with the META tag. If you would prefer Apache not to be +butting in on your character encodings, you can tell it not +to send anything at all:

    + +
    AddDefaultCharset Off
    + +

    ...making your META tags the sole source of +character encoding information. In these cases, it is +especially important to make sure you have valid META +tags on your pages and all the text before them is ASCII.

    + +

    These directives can also be +placed in the httpd.conf file for Apache, but +in most shared hosting situations you won't be able to edit this file. +

    + +

    File extensions

    + +

    If you're not allowed to use .htaccess files, you can often +piggy-back off of Apache's default AddCharset declarations to get +your files in the proper extension. Here are Apache's default +character set declarations:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Charset           File extension(s)
    ISO-8859-1        .iso8859-1 .latin1
    ISO-8859-2        .iso8859-2 .latin2 .cen
    ISO-8859-3        .iso8859-3 .latin3
    ISO-8859-4        .iso8859-4 .latin4
    ISO-8859-5        .iso8859-5 .latin5 .cyr .iso-ru
    ISO-8859-6        .iso8859-6 .latin6 .arb
    ISO-8859-7        .iso8859-7 .latin7 .grk
    ISO-8859-8        .iso8859-8 .latin8 .heb
    ISO-8859-9        .iso8859-9 .latin9 .trk
    ISO-2022-JP       .iso2022-jp .jis
    ISO-2022-KR       .iso2022-kr .kis
    ISO-2022-CN       .iso2022-cn .cis
    Big5              .Big5 .big5 .b5
    WINDOWS-1251      .cp-1251 .win-1251
    CP866             .cp866
    KOI8-r            .koi8-r .koi8-ru
    KOI8-ru           .koi8-uk .ua
    ISO-10646-UCS-2   .ucs2
    ISO-10646-UCS-4   .ucs4
    UTF-8             .utf8
    GB2312            .gb2312 .gb
    utf-7             .utf7
    EUC-TW            .euc-tw
    EUC-JP            .euc-jp
    EUC-KR            .euc-kr
    shift_jis         .sjis
    + +

    So, for example, a file named page.utf8.html or +page.html.utf8 will probably be sent with the UTF-8 charset +attached, the difference being that if there is an +AddCharset charset .html declaration, it will override +the .utf8 extension in page.utf8.html (precedence moves +from right to left). By default, Apache has no such declaration.

    + +

    Microsoft IIS

    + +

    If anyone can contribute information on how to configure Microsoft +IIS to change character encodings, I'd be grateful.

    + +

    XML

    + +

    META tags are the most common source of embedded +encodings, but they can also come from somewhere else: XML +processing instructions. They look like:

    + +
    <?xml version="1.0" encoding="UTF-8"?>
    + +

    ...and are most often found in XML documents (including XHTML).

    + +

    For XHTML, this processing instruction theoretically +overrides the META tag. In reality, this happens only when the +XHTML is actually served as legit XML and not HTML, which almost +never happens due to Internet Explorer's lack of support for +application/xhtml+xml (even though doing so is often +argued to be good practice).

    + +

    For XML, however, this processing instruction is extremely important. +Since most webservers are not configured to send charsets for .xml files, +this is the only thing a parser has to go on. Furthermore, the default +for XML files is UTF-8, which often butts heads with the more common +ISO-8859-1 encoding (you see this in garbled RSS feeds).
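    As a rough illustration of what a parser has to go on, this Python sketch reads the encoding from the start of an XML document and falls back to UTF-8 when none is declared. The name xml_declared_encoding is hypothetical, and the sketch deliberately ignores byte order marks and UTF-16, which a real XML parser also has to handle.

```python
import re

# Matches the encoding pseudo-attribute at the very start of the document.
# Simplified on purpose: assumes an ASCII-compatible encoding and no BOM.
XML_ENCODING = re.compile(
    rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(raw):
    """Return the declared encoding in lowercase, or "utf-8" if none is given."""
    match = XML_ENCODING.search(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    # No encoding declared: the XML default applies.
    return "utf-8"


declared = xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><feed/>')
defaulted = xml_declared_encoding(b'<?xml version="1.0"?><feed/>')
# declared is "iso-8859-1"; defaulted is "utf-8".
```

    A feed saved as ISO-8859-1 but missing the encoding attribute lands in the second case, which is how the garbled RSS feeds mentioned above come about.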

    + +

    In short, if you use XHTML and have gone through the +trouble of adding the XML header, make sure it jibes +with your META tags and HTTP headers.

    + +

    Inside the process

    + +

    This section is not required reading, +but may answer some of your questions on what's going on in all +this character encoding hocus pocus. If you're interested in +moving on to the next phase, skip this section.

    + +

    A logical question that follows all of our wheeling and dealing +with multiple sources of character encodings is "Why are there +so many options?" To answer this question, we have to turn +back to our definition of character encodings: they allow a program +to interpret bytes into human-readable characters.

    + +

    Thus, a chicken-egg problem: a character encoding +is necessary to interpret the +text of a document. A META tag is in the text of a document. +The META tag gives the character encoding. How can we +determine the contents of a META tag, inside the text, +if we don't know its character encoding? And how do we figure out +the character encoding, if we don't know the contents of the +META tag?

    + +

    Fortunately for us, the characters we need to write the +META tag are in ASCII, which is shared by pretty much +every character encoding that is in common use today. So, +all the web-browser has to do is parse all the way down until +it gets to the Content-Type tag, extract the character encoding, +then re-parse the document according to this new information.
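    The parse-then-re-parse idea can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm any real browser uses: the names sniff_meta_charset and decode_with_meta, the 1024-byte scan limit, and the ISO-8859-1 fallback are all assumptions made for the example.

```python
import re

# Works on raw bytes, relying on the META tag itself being plain ASCII.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw, scan_limit=1024):
    """First pass: look for a charset inside a META tag near the top of the bytes."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_with_meta(raw, fallback="iso-8859-1"):
    """Second pass: decode the whole document using the sniffed charset."""
    charset = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The META tag named an encoding Python does not know about.
        return raw.decode(fallback, errors="replace")


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head>'
        b'<body>caf\xc3\xa9</body></html>')
text = decode_with_meta(page)  # the body decodes to "caf\u00e9"
```

    Note that the first pass only works because everything before and inside the META tag is ASCII, which is the same reason the tag should appear as early in the document as possible.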

    + +

    Obviously this is complicated, so browsers prefer the simpler +and more efficient solution: get the character encoding from +somewhere other than the document itself, i.e. the HTTP headers, +much to the chagrin of HTML authors who can't set these headers.

    + +

    Why UTF-8?

    + +

    So, you've gone through all the trouble of ensuring that...

    + +

    Needs completion!

    + +

    Many other developers have already discussed the subject of Unicode, UTF-8 and internationalization, and I would like to defer to them for diff --git a/docs/style.css b/docs/style.css index bc7e85a4..f60b333c 100644 --- a/docs/style.css +++ b/docs/style.css @@ -23,6 +23,8 @@ h4 {font-family:sans-serif; font-size:0.9em; font-weight:bold; } /* Marks off asides, discussions on why something is the way it is */ .aside {margin-left:2em; font-family:sans-serif; font-size:0.9em; } +blockquote .label {font-weight:bold; font-size:1em; margin:0 0 .1em; + border-bottom:1px solid #CCC;} /* A regular table */ .table {border-collapse:collapse; border-bottom:2px solid #888; margin-left:2em; } @@ -37,4 +39,4 @@ h4 {font-family:sans-serif; font-size:0.9em; font-weight:bold; } #index {font-size:smaller; } /* Contains, without exception, $Id$, for SVN version info. */ -#version {text-align:right; font-style:italic; margin:2em 0;} \ No newline at end of file +#version {text-align:right; font-style:italic; margin:2em 0;}