What are Unicode, UCS and UTF-8?
The Universal Character Set (UCS) described in ISO/IEC 10646 consists of a large number of characters. Each of them has a unique name and a code point, which is an integer. Unicode, which is an industry standard, complements the Universal Character Set with further information about the characters' properties and alternative character encodings. More information on Unicode can be found on the Unicode Consortium's website. One of the Unicode encodings is the 8-bit Unicode Transformation Format (UTF-8). It encodes each character with one to four bytes, aiming for maximum compatibility with the American Standard Code for Information Interchange (ASCII), which is a 7-bit encoding of a relatively small subset of the UCS.
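To make the "one to four bytes" point concrete, here is a minimal sketch that prints how many bytes UTF-8 needs for a few sample characters. It assumes PHP 7.0 or later (for the `\u{...}` escape syntax) and a source file saved as UTF-8; the sample characters and array keys are chosen purely for illustration.

```php
<?php
// Sketch: count the bytes UTF-8 uses for characters from different ranges.
// strlen() counts bytes, not characters, which is exactly what we want here.
$samples = [
    'A'       => 'A',          // U+0041, inside ASCII
    'e-acute' => "\u{00E9}",   // U+00E9, Latin small letter e with acute
    'euro'    => "\u{20AC}",   // U+20AC, euro sign
    'emoji'   => "\u{1F600}",  // U+1F600, outside the Basic Multilingual Plane
];

foreach ($samples as $label => $char) {
    // bin2hex() shows the raw bytes of the encoded character.
    printf("%-8s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```

The ASCII letter occupies a single byte with the same value it has in ASCII, which is why pure ASCII text is already valid UTF-8.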
6.i. Standardisation
Reason:
phpBB is one of the most translated open-source projects, with the current stable version being available in over 60 localisations. Whilst the ad hoc approach to the naming of language packs has worked, for phpBB3 and beyond we hope to make this process saner, which will allow for better interoperation with current and future web browsers.
Encoding:
With phpBB3, the output encoding for the forum is now UTF-8, a universal character encoding defined by the Unicode Consortium that is by design a superset of US-ASCII and whose character repertoire also covers ISO-8859-1. Using one character set that simultaneously supports all scripts which previously required different encodings (e.g. ISO-8859-1 to ISO-8859-15 for Latin, Greek, Cyrillic, Thai, Hebrew and Arabic; GB2312 for Simplified Chinese; Big5 for Traditional Chinese; EUC-JP for Japanese; EUC-KR for Korean; VISCII for Vietnamese; et cetera) removes the need to convert between encodings and improves the accessibility of multilingual forums.
The impact is that the language files for phpBB must now also be encoded as UTF-8, with the caveat that the files must not contain a BOM, for compatibility with non-Unicode-aware versions of PHP. For those with forums using the Latin character set (i.e. most European languages), this change is largely transparent, since UTF-8 is byte-compatible with US-ASCII; only characters outside ASCII, such as accented letters, are encoded differently from ISO-8859-1.
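As an illustration of the BOM caveat, the following sketch checks whether a string starts with the three-byte UTF-8 byte order mark (`EF BB BF`) and strips it. The helper names `has_utf8_bom()` and `strip_utf8_bom()` are hypothetical and not part of phpBB; treat this as one possible way to check a language file, not as the project's own tooling.

```php
<?php
// Hypothetical helper (not part of phpBB): reports whether a string starts
// with the three-byte UTF-8 byte order mark, EF BB BF.
function has_utf8_bom(string $contents): bool
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: returns the string without a leading BOM, if any.
function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? substr($contents, 3) : $contents;
}

// Simulated file contents; a real check would read them with file_get_contents().
$with_bom    = "\xEF\xBB\xBF" . 'language file contents';
$without_bom = 'language file contents';

var_dump(has_utf8_bom($with_bom));                      // BOM present
var_dump(has_utf8_bom($without_bom));                   // no BOM
var_dump(strip_utf8_bom($with_bom) === $without_bom);   // identical after stripping
```

Most text editors offer a "UTF-8 without BOM" (or similarly named) save option, which avoids the problem at the source.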
Language's local name
Authors information
iso.txt is automatically generated by the language pack submission system on phpBB.com. You don't have to create this file yourself if you plan on releasing your language pack on phpBB.com, but do keep in mind that phpBB itself does require this file to be present.
Because language tags themselves are meant to be machine read, they can be rather obtuse to humans, which is why descriptive strings as provided by iso.txt are needed. Whilst en-US could be fairly easily deduced to be "English as used in the United States", de-CH is more difficult unless one happens to know that de is from "Deutsch", German for "German", and CH is the abbreviation of the official Latin name for Switzerland, "Confoederatio Helvetica".
For the localised language description, just translate the English version, but use whatever punctuation is typical for your own locale, assuming the language uses punctuation at all.
Unicode bi-directional considerations:
Because phpBB is now UTF-8, all translators must take into account that certain strings may be shown when the directionality of the document is either opposite to normal or is ambiguous.
The various Unicode control characters for bi-directional text and their HTML equivalents where appropriate are as follows:
For iso.txt, the directionality of the text can be explicitly set using special Unicode characters via any of the three methods provided by left-to-right/right-to-left markers, embeds and overrides. Without them, the ordering of characters will be incorrect.
In choosing which of the three methods to use: in the majority of cases, it is sufficient to use LRM or RLM to place a "strong" character so that an ambiguous punctuation character is fully enclosed and thus inherits the correct directionality.
In some cases, there may be mixed scripts with both left-to-right and right-to-left directions, so using LRE and RLE with PDF may be more appropriate. Lastly, in very rare instances where directionality must be forced, use LRO and RLO with PDF.
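The control characters named above can be written directly in PHP source. The sketch below defines them as constants using their Unicode code points and shows the two most common uses: appending a mark to trailing punctuation, and wrapping a run in an embedding. The constant names and the `embed_ltr()` helper are hypothetical, the sample strings are invented for illustration, and the `\u{...}` syntax assumes PHP 7.0 or later.

```php
<?php
// Unicode bi-directional control characters, written as code point escapes.
const LRM = "\u{200E}"; // Left-to-Right Mark
const RLM = "\u{200F}"; // Right-to-Left Mark
const LRE = "\u{202A}"; // Left-to-Right Embedding
const RLE = "\u{202B}"; // Right-to-Left Embedding
const PDF = "\u{202C}"; // Pop Directional Formatting (closes an embed or override)
const LRO = "\u{202D}"; // Left-to-Right Override
const RLO = "\u{202E}"; // Right-to-Left Override

// Hypothetical helper: wrap a left-to-right run so that it keeps its internal
// ordering when placed inside right-to-left text.
function embed_ltr(string $text): string
{
    return LRE . $text . PDF;
}

// The closing parenthesis is directionally neutral; in a right-to-left
// document it may be displayed on the wrong side. A trailing LRM gives it a
// strong left-to-right neighbour on both sides.
$description = 'British English (UK)' . LRM;

// The marks are invisible, so inspect the bytes instead of the rendered text.
echo bin2hex(LRM), "\n";
echo bin2hex(embed_ltr('phpBB 3')), "\n";
```

Because these characters are invisible in most editors, writing them as escapes (or as `&lrm;` and `&rlm;` in HTML) tends to be easier to review than pasting the raw characters.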
For further information on authoring techniques of bi-directional text, please see the W3C tutorial on authoring techniques for XHTML pages with bi-directional text.
Working with placeholders:
As phpBB is translated into languages with different ordering rules to that of English, it is possible to show specific values in any order deemed appropriate. Take for example the extremely simple "Page X of Y". Whilst in English this could just be coded as:
... 'PAGE_OF' => 'Page %s of %s', /* Just grabbing the replacements as they come and hope they are in the right order */ ...
… a clearer way to show explicit replacement ordering is to do:
... 'PAGE_OF' => 'Page %1$s of %2$s', /* Explicit ordering of the replacements, even if they are the same order as English */ ...
Why bother at all? Because in some languages, the string translated back to English might read something like:
... 'PAGE_OF' => 'Total of %2$s pages, currently on page %1$s', /* Explicit ordering of the replacements, reversed compared to English as the total comes first */ ...
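The effect of numbered placeholders can be seen with PHP's `sprintf()`, which supports the `%1$s` / `%2$s` argument-swapping syntax. In this sketch the two arrays stand in for language files; `$lang_en` and `$lang_xx` are hypothetical names, and how phpBB itself passes the values is not shown here.

```php
<?php
// Hypothetical language entries; in phpBB these would live in language files.
// Single quotes matter: in double quotes PHP would try to interpolate "$s".
$lang_en = ['PAGE_OF' => 'Page %1$s of %2$s'];
$lang_xx = ['PAGE_OF' => 'Total of %2$s pages, currently on page %1$s'];

// The calling code always passes the values in one fixed order:
// first the current page, then the total number of pages.
$current = 3;
$total   = 10;

echo sprintf($lang_en['PAGE_OF'], $current, $total), "\n";
// Page 3 of 10
echo sprintf($lang_xx['PAGE_OF'], $current, $total), "\n";
// Total of 10 pages, currently on page 3
```

The calling code is identical in both cases; only the translated string decides where each value appears.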
6.iii. Writing Style
Miscellaneous tips & hints:
As the language files are PHP files, where the various strings for phpBB are stored within an array which in turn is used for display within an HTML page, the rules of syntax for both must be considered. Potentially problematic characters are: ' (straight quote/apostrophe), " (straight double quote), < (less-than sign), > (greater-than sign) and & (ampersand).
// Bad - The unescaped straight quote/apostrophe will throw a PHP parse error
... 'CONV_ERROR_NO_AVATAR_PATH' => 'Note to developer: you must specify $convertor['avatar_path'] to use %s.', ...
// Good - Literal straight quotes should be escaped with a backslash, i.e. \
... 'CONV_ERROR_NO_AVATAR_PATH' => 'Note to developer: you must specify $convertor[\'avatar_path\'] to use %s.', ...
However, because phpBB3 now uses UTF-8 as its sole encoding, we can actually use this to our advantage and not have to remember to escape a straight quote when we don't have to:
// Bad - The unescaped straight quote/apostrophe will throw a PHP parse error
... 'USE_PERMISSIONS' => 'Test out user's permissions', ...
// Okay - However, non-programmers wouldn't type "user\'s" automatically
... 'USE_PERMISSIONS' => 'Test out user\'s permissions', ...
// Best - Use the Unicode Right-Single-Quotation-Mark character
... 'USE_PERMISSIONS' => 'Test out user’s permissions', ...
The " (straight double quote), < (less-than sign) and > (greater-than sign) characters can all be used as displayed glyphs or as part of HTML markup, for example:
// Bad - Invalid HTML, as segments not part of elements are not entitised
... 'FOO_BAR' => 'PHP version < 4.3.3.<br /> Visit "Downloads" at <a href="http://www.php.net/">www.php.net</a>.', ...
// Okay - No more invalid HTML, but "&quot;" is rather clumsy
... 'FOO_BAR' => 'PHP version &lt; 4.3.3.<br /> Visit &quot;Downloads&quot; at <a href="http://www.php.net/">www.php.net</a>.', ...
// Best - No more invalid HTML, and usage of correct typographical quotation marks
... 'FOO_BAR' => 'PHP version &lt; 4.3.3.<br /> Visit “Downloads” at <a href="http://www.php.net/">www.php.net</a>.', ...
Lastly, the & (ampersand) must always be entitised regardless of where it is used:
// Bad - Invalid HTML, none of the ampersands are entitised
... 'FOO_BAR' => '<a href="http://somedomain.tld/?foo=1&bar=2">Foo & Bar</a>.', ...
// Good - Valid HTML, ampersands are correctly entitised in all cases
... 'FOO_BAR' => '<a href="http://somedomain.tld/?foo=1&amp;bar=2">Foo &amp; Bar</a>.', ...
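A translator could check strings for stray ampersands before submitting a language pack. The following is a minimal sketch, assuming a hypothetical helper `has_bare_ampersand()` (not part of phpBB) that treats an ampersand as "bare" when it does not begin a named or numeric entity. It only checks the shape of the entity, not whether the entity name actually exists.

```php
<?php
// Hypothetical helper: returns true if the text contains an ampersand that
// does not start something shaped like an HTML entity (&name; &#123; &#x1F;).
function has_bare_ampersand(string $text): bool
{
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/';
    return preg_match($pattern, $text) === 1;
}

$bad  = '<a href="http://somedomain.tld/?foo=1&bar=2">Foo & Bar</a>.';
$good = '<a href="http://somedomain.tld/?foo=1&amp;bar=2">Foo &amp; Bar</a>.';

var_dump(has_bare_ampersand($bad));   // contains unescaped ampersands
var_dump(has_bare_ampersand($good));  // every ampersand starts an entity

// For plain text (no markup), htmlspecialchars() performs the escaping.
echo htmlspecialchars('Foo & Bar'), "\n";
```

Note that `htmlspecialchars()` is only suitable for text that contains no intended markup, since it also escapes `<` and `>`; strings that mix HTML and text, like the examples above, still need their ampersands entitised by hand.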
How these characters are entered depends very much on the choice of operating system, the current language locale/keyboard configuration and the native abilities of the text editor used to edit phpBB language files. Please see http://en.wikipedia.org/wiki/Unicode#Input_methods for more information.