From 2d22c0aa55787ccc5789ceae25be3231861a9436 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
This document is not designed to be read from top to bottom: it will
+slowly introduce concepts that build on each other: you need not get to
+the bottom to have learned something new. However, I strongly
+recommend you read all the way to Why UTF-8?, because at least
+at that point you'd have made a conscious decision not to migrate,
+which can be a difficult but rewarding task.
+Asides
@@ -43,6 +49,50 @@ asides that can easily be skipped.
 with a greater understanding of the underlying issues.
In the beginning, there was ASCII, and things were simple. But they
@@ -401,7 +451,7 @@ ISO-8859-1 encoding (you see this in garbled RSS feeds).
trouble of adding the XML header, make sure it jibes with your META
tags and HTTP headers.
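One way to keep the three declarations from drifting apart is to derive them all from a single value. The sketch below is only an illustration: `encoding_declarations()` is a hypothetical helper name, not something defined by this document or by HTML Purifier, and it assumes the page is served as `text/html`.

```php
<?php
// Hypothetical helper: builds the HTTP header, XML declaration and META
// tag from one charset value so that all three always agree.
function encoding_declarations($charset = 'UTF-8')
{
    return array(
        'http' => 'Content-Type: text/html; charset=' . $charset,
        'xml'  => '<?xml version="1.0" encoding="' . $charset . '"?>',
        'meta' => '<meta http-equiv="Content-Type" content="text/html; charset='
                . $charset . '" />',
    );
}
```

A page would then call `header($declarations['http'])` before any output and echo the other two strings where they belong in the markup.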
-This section is not required reading, but may answer some of your questions on what's going on in all
@@ -781,7 +831,7 @@ To find out how your editor is doing, you can check out this list or Wikipedia's list. I personally use Notepad++, which works like a charm when it comes to UTF-8.
-You will usually have to explicitly tell the editor through some dialogue
+Usually, you will have to explicitly tell the editor through some dialogue
 (usually Save as or Format) what encoding you want it to use. An editor will often offer "Unicode" as a method of saving, which is ambiguous. Make sure you know whether or not they really mean UTF-8
@@ -825,15 +875,153 @@ sure the page is saved WITHOUT the BOM.
If you are reading in text files to insert into the middle of another
-page, it is strongly advised that you replace out the UTF-8 byte
-sequence for BOM "\xEF\xBB\xBF"
before inserting it in.
"\xEF\xBB\xBF"
before inserting it in,
+via:
+
+$text = str_replace("\xEF\xBB\xBF", '', $text);
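Note that `str_replace()` removes the byte sequence wherever it occurs in the string. If you would rather touch only a BOM at the very start of the text, a minimal sketch could look like the following; `strip_leading_bom()` is a hypothetical name used for illustration, not an existing function.

```php
<?php
// Hypothetical helper: removes the UTF-8 BOM only when it is the first
// three bytes of the string, leaving any later occurrence untouched.
function strip_leading_bom($text)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        // Cast guards against older PHP versions returning false here.
        return (string) substr($text, 3);
    }
    return $text;
}
```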
Generally speaking, people who are having trouble with fonts fall
+into two categories:
+
+Yes, there's always a chance that an English user happens across
+a Sinhalese website and doesn't have the right font. But an English user
+who happens not to have the right fonts probably has no business reading Sinhalese
+anyway. So we'll deal with the other two edge cases.
+
+If you run a Bengali website, you may get comments from users who
+would like to read your website but get heaps of question marks or
+other meaningless characters. Fixing this problem requires the
+installation of a font or language pack, which is often highly
+dependent on what the language is. Here is an example
+of such a help file for the Bengali language; I am sure there are
+others out there too. You just have to point users to the appropriate
+help file.
+
+A prime example of when you'll see some very obscure Unicode
+characters embedded in what otherwise would be very bland ASCII are
+letters of the
+International
+Phonetic Alphabet (IPA), used to designate pronunciations in a standard
+manner (you probably see them all the time in your dictionary). Your
+average font probably won't have support for all of the IPA characters
+like ʘ (bilabial click) or ʒ (voiced postalveolar fricative).
+So what's a poor browser to do? Font mix! Smart browsers like Mozilla Firefox
+and Internet Explorer 7 will borrow glyphs from other fonts in order
+to make sure that all the characters display properly.
+
+But what happens when the browser isn't smart and happens to be the
+most widely used browser in the entire world? Microsoft IE 6
+is not smart enough to borrow from other fonts when a character isn't
+present, so more often than not you'll be slapped with a nice big �.
+To get things to work, MSIE 6 needs a little nudge. You could configure it
+to use a different font to render the text, but you can achieve the same
+effect by selectively changing the font for blocks of special characters
+to known good Unicode fonts.
+
+Fortunately, the folks over at Wikipedia have already done all the
+heavy lifting for you. Get the CSS from the horse's mouth here:
+Common.css,
+and search for ".IPA". There is also a smattering of
+other classes you can use for other purposes; check out
+this page
+for more details. For you lazy ones, this should work:
+
+.Unicode {
+    font-family: Code2000, "TITUS Cyberbit Basic", "Doulos SIL",
+        "Chrysanthi Unicode", "Bitstream Cyberbit",
+        "Bitstream CyberBase", Thryomanes, Gentium, GentiumAlt,
+        "Lucida Grande", "Arial Unicode MS", "Microsoft Sans Serif",
+        "Lucida Sans Unicode";
+    font-family /**/:inherit; /* resets fonts for everyone but IE6 */
+}
+
The standard usage goes along the lines of <span class="Unicode">Crazy
+Unicode stuff here</span>. Characters in the
+Windows Glyph List
+usually don't need to be fixed, but for anything else you probably
+want to play it safe. Unless, of course, you don't care about IE6
+users.
When people claim that PHP6 will solve all our Unicode problems, they're
+misinformed. It will not fix any of the above-mentioned troubles. It will,
+however, fix the problem we are about to discuss: processing UTF-8 text
+in PHP.
+
+PHP (as of PHP5) is blithely unaware of the existence of UTF-8 (with a few
+notable exceptions). Sometimes this will cause problems; other times
+it won't. So far, we've avoided discussing the architecture of
+UTF-8, so we must first ask: what is UTF-8? Yes, it supports Unicode,
+and yes, it is variable width. Other traits:
+
+Each of these traits affects different domains of text processing
+in different ways. It is beyond the scope of this document to explain
+what precisely these implications are. PHPWact provides
+a very good reference document
+on what to expect from each function, although coverage is spotty in
+some areas. Their more general notes on
+character sets
+are also worth looking at for information on UTF-8. Some rules of thumb
+when dealing with Unicode text:
+
+...and always think in bytes, not characters. If you use strpos()
+to find the position of a character, it will be in bytes, but this
+usually won't matter since substr() also operates on byte indices!
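To make the byte-versus-character distinction concrete, here is a small sketch using only the built-in byte-oriented string functions. The sample string and variable names are made up for illustration.

```php
<?php
// "é" is two bytes in UTF-8 (0xC3 0xA9), so byte offsets and character
// positions diverge for everything that follows it.
$text = "caf\xC3\xA9 au lait";

$byteLength = strlen($text);   // 13 bytes, although there are 12 characters
$pos = strpos($text, 'au');    // 6: a byte offset (the character index is 5)
$tail = substr($text, $pos);   // "au lait": substr() takes the same byte offset

// Counting in characters is where things go wrong: asking for the first
// "four characters" by byte count cuts the two-byte "é" in half.
$broken = substr($text, 0, 4); // "caf\xC3", no longer well-formed UTF-8
```

As long as offsets come from one byte-oriented function and go into another, the results stay consistent; trouble starts when a byte count is treated as a character count.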
+ +You'll also need to make sure your UTF-8 is well-formed and will
+probably need replacements for some of these functions. I recommend
+using Harry Fuecks' PHP
+UTF-8 library, rather than using mbstring directly. HTML Purifier
+also defines a few useful UTF-8 compatible functions: check out
+Encoder.php
in the /library/HTMLPurifier/
+directory.
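As a rough illustration of a well-formedness check, the sketch below leans on PCRE's `u` modifier, which makes `preg_match()` fail when the subject is not valid UTF-8. This is not how the PHP UTF-8 library or HTML Purifier does it; `is_well_formed_utf8()` is a hypothetical name, and the approach assumes PCRE was compiled with UTF-8 support.

```php
<?php
// Hypothetical helper: reports whether $text is well-formed UTF-8.
// The empty pattern matches any valid subject, so the result depends
// only on whether PCRE accepts the bytes as UTF-8.
function is_well_formed_utf8($text)
{
    return preg_match('//u', $text) === 1;
}
```

For example, `is_well_formed_utf8("caf\xC3\xA9")` should return true, while a Latin-1 encoded `"caf\xE9"` should return false.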
Well, that's it. Hopefully this document has served as a very
+practical springboard into knowledge of how UTF-8 works. You may have
+decided that you don't want to migrate yet: that's fine; just know
+what will happen to your output and what bug reports you may receive.
+Many other developers have already discussed the subject of Unicode, UTF-8 and internationalization, and I would like to defer to them for a more in-depth look into character sets and encodings.
diff --git a/docs/style.css b/docs/style.css
index 03bf1702..811b0103 100644
--- a/docs/style.css
+++ b/docs/style.css
@@ -42,3 +42,7 @@ blockquote .label {font-weight:bold; font-size:1em; margin:0 0 .1em;
 /* Contains, without exception, $Id$, for SVN version info. */
 #version {text-align:right; font-style:italic; margin:2em 0;}
+
+#toc ol ol {list-style-type:lower-roman;}
+#toc ol {list-style-type:decimal;}
+#toc {list-style-type:upper-alpha;}