Add note about patchwork utf8

Phil Sturgeon
2014-07-15 16:55:55 +01:00
parent 7d44093b75
commit ac548011c0

@@ -18,9 +18,8 @@ for a brief, practical summary.
The basic string operations, like concatenating two strings and assigning strings to variables, don't need anything
special for UTF-8. However most string functions, like `strpos()` and `strlen()`, do need special consideration. These
functions often have an `mb_*` counterpart: for example, `mb_strpos()` and `mb_strlen()`. Together, these counterpart
functions are called the Multibyte String Functions. The multibyte string functions are specifically designed to
operate on Unicode strings.
functions often have an `mb_*` counterpart: for example, `mb_strpos()` and `mb_strlen()`. These `mb_*` functions are made
available to you via the [Multibyte String Extension], and are specifically designed to operate on Unicode strings.
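As a rough illustration of the difference, the sketch below compares the byte-oriented functions with their `mb_*` counterparts on a short UTF-8 string. It assumes the `mbstring` extension is loaded; the variable names are only for illustration.

```php
<?php
// "naïve" written with explicit UTF-8 bytes, so the result does not
// depend on how this source file is saved. The "ï" takes two bytes.
$word = "na\xC3\xAFve";

$byteLength = strlen($word);                   // counts bytes: 6
$charLength = mb_strlen($word, 'UTF-8');       // counts characters: 5

$bytePos = strpos($word, 'v');                 // byte offset: 4
$charPos = mb_strpos($word, 'v', 0, 'UTF-8');  // character offset: 3
```

The two families disagree as soon as a multibyte character appears before the position being measured.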
You must use the `mb_*` functions whenever you operate on a Unicode string. For example, if you use `substr()` on a
UTF-8 string, there's a good chance the result will include some garbled half-characters. The correct function to use
@@ -32,17 +31,21 @@ string has a chance of being garbled during further processing.
Not all string functions have an `mb_*` counterpart. If there isn't one for what you want to do, then you might be out
of luck.
Additionally, you should use the `mb_internal_encoding()` function at the top of every PHP script you write (or at the
You should use the `mb_internal_encoding()` function at the top of every PHP script you write (or at the
top of your global include script), and the `mb_http_output()` function right after it if your script is outputting to
a browser. Explicitly defining the encoding of your strings in every script will save you a lot of headaches down the
road.
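A minimal sketch of that setup, assuming the `mbstring` extension is loaded and that the snippet sits at the top of a script or a shared include file:

```php
<?php
// Tell mbstring that strings in this script are UTF-8.
mb_internal_encoding('UTF-8');

// Tell mbstring to emit UTF-8 when the script outputs to a browser.
mb_http_output('UTF-8');
```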
Finally, many PHP functions that operate on strings have an optional parameter letting you specify the character
Additionally, many PHP functions that operate on strings have an optional parameter letting you specify the character
encoding. You should always explicitly indicate UTF-8 when given the option. For example, `htmlentities()` has an
option for character encoding, and you should always specify UTF-8 if dealing with such strings.
option for character encoding, and you should always specify UTF-8 if dealing with such strings. Note that as of PHP 5.4.0, UTF-8 is the default encoding for `htmlentities()` and `htmlspecialchars()`.
Note that as of PHP 5.4.0, UTF-8 is the default encoding for `htmlentities()` and `htmlspecialchars()`.
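For instance, passing the encoding explicitly to `htmlentities()` might look like the sketch below. The sample string and variable names are made up for illustration.

```php
<?php
// "café <b>" written with explicit UTF-8 bytes for the "é".
$input = "caf\xC3\xA9 <b>";

// The third argument names the character encoding explicitly.
$escaped = htmlentities($input, ENT_QUOTES, 'UTF-8');
// $escaped is now "caf&eacute; &lt;b&gt;"
```

Spelling out the encoding keeps the behaviour the same on PHP versions older than 5.4.0, where UTF-8 was not the default.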
Finally, if you are building a distributed application and cannot be certain that the `mbstring` extension will be
enabled, consider using the [patchwork/utf8] Composer package. It
will use `mbstring` if it is available, and fall back to non-UTF-8 functions if not.
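To show the general idea of such a fallback, here is a hand-rolled sketch. It is not the patchwork/utf8 API: `example_strlen()` is a hypothetical helper that only checks for `mbstring` at runtime and otherwise uses the byte-oriented function.

```php
<?php
// Hypothetical helper, for illustration only.
function example_strlen($string)
{
    if (extension_loaded('mbstring')) {
        // Character-aware length when mbstring is available.
        return mb_strlen($string, 'UTF-8');
    }

    // Byte-oriented fallback; wrong for multibyte characters.
    return strlen($string);
}
```

For plain ASCII input both branches return the same value; they only diverge once multibyte characters are involved.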
[Multibyte String Extension]: http://php.net/manual/en/book.mbstring.php
[patchwork/utf8]: https://packagist.org/packages/patchwork/utf8
### UTF-8 at the Database level