From ac548011c06a69a1a1e343e24d56154aada0d845 Mon Sep 17 00:00:00 2001
From: Phil Sturgeon
Date: Tue, 15 Jul 2014 16:55:55 +0100
Subject: [PATCH] Add note about patchwork utf8

---
 _posts/05-05-01-PHP-and-UTF8.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/_posts/05-05-01-PHP-and-UTF8.md b/_posts/05-05-01-PHP-and-UTF8.md
index 96952e4..c0a979c 100644
--- a/_posts/05-05-01-PHP-and-UTF8.md
+++ b/_posts/05-05-01-PHP-and-UTF8.md
@@ -18,9 +18,8 @@ for a brief, practical summary.

 The basic string operations, like concatenating two strings and assigning strings to variables, don't need anything
 special for UTF-8. However most string functions, like `strpos()` and `strlen()`, do need special consideration. These
-functions often have an `mb_*` counterpart: for example, `mb_strpos()` and `mb_strlen()`. Together, these counterpart
-functions are called the Multibyte String Functions. The multibyte string functions are specifically designed to
-operate on Unicode strings.
+functions often have an `mb_*` counterpart: for example, `mb_strpos()` and `mb_strlen()`. These `mb_*` functions are made
+available to you via the [Multibyte String Extension], and are specifically designed to operate on Unicode strings.

 You must use the `mb_*` functions whenever you operate on a Unicode string. For example, if you use `substr()` on a
 UTF-8 string, there's a good chance the result will include some garbled half-characters. The correct function to use
@@ -32,17 +31,21 @@ string has a chance of being garbled during further processing.
 Not all string functions have an `mb_*` counterpart. If there isn't one for what you want to do, then you might be out
 of luck.
-Additionally, you should use the `mb_internal_encoding()` function at the top of every PHP script you write (or at the
+You should use the `mb_internal_encoding()` function at the top of every PHP script you write (or at the
 top of your global include script), and the `mb_http_output()` function right after it if your script is outputting to a
 browser. Explicitly defining the encoding of your strings in every script will save you a lot of headaches down the
 road.

-Finally, many PHP functions that operate on strings have an optional parameter letting you specify the character
+Additionally, many PHP functions that operate on strings have an optional parameter letting you specify the character
 encoding. You should always explicitly indicate UTF-8 when given the option. For example, `htmlentities()` has an
-option for character encoding, and you should always specify UTF-8 if dealing with such strings.
+option for character encoding, and you should always specify UTF-8 if dealing with such strings. Note that as of PHP 5.4.0, UTF-8 is the default encoding for `htmlentities()` and `htmlspecialchars()`.

-Note that as of PHP 5.4.0, UTF-8 is the default encoding for `htmlentities()` and `htmlspecialchars()`.
+Finally, if you are building a distributed application and cannot be certain that the `mbstring` extension will be
+enabled, then consider using the [patchwork/utf8] Composer package. This
+will use `mbstring` if it is available, and fall back to non-UTF-8 functions if not.

+[Multibyte String Extension]: http://php.net/manual/en/book.mbstring.php
+[patchwork/utf8]: https://packagist.org/packages/patchwork/utf8

 ### UTF-8 at the Database level