diff --git a/_posts/05-06-01-Internationalization-and-Localization.md b/_posts/05-06-01-Internationalization-and-Localization.md new file mode 100644 index 0000000..c52b3a9 --- /dev/null +++ b/_posts/05-06-01-Internationalization-and-Localization.md @@ -0,0 +1,416 @@ +--- +title: Internationalization and Localization +isChild: true +anchor: i18n_l10n +--- + +## Internationalization (i18n) and Localization (l10n) {#i18n_l10n_title} + +_Disclaimer for newcomers: i18n and l10n are numeronyms, a kind of abbreviation where numbers are used to shorten +words - in our case, internationalization becomes i18n and localization, l10n._ + +First of all, we need to define those two similar concepts and other related things: + +- **Internationalization** is when you organize your code so it can be adapted to different languages or regions +without refactorings. This is usually done once - preferably, in the beginning of the project, or else you'll probably +need some huge changes in the source! +- **Localization** happens when you adapt the interface (mainly) by translating contents, based on the i18n work done +before. It usually is done every time a new language or region needs support and is updated when new interface pieces +are added, as they need to be available in all supported languages. +- **Pluralization** defines the rules needed between different languages to interoperate strings containing numbers and +counters. For instance, in English when you have only one item, it's singular, and anything different from that is +called plural; plural in this language is indicated by adding an S after some words, and sometimes changes parts of it. +In other languages, such as Russian or Serbian, there are two plural forms in addition to the singular - you may even +find languages with a total of four, five or six forms, such as Slovenian, Irish or Arabic. + +## Common ways to implement +The easiest way to internationalize PHP software is by using array files and using those strings in templates, such as +`

`. This is, however, hardly a recommended way for serious projects, as it poses +some maintenance issues along the road - some might appear in the very beginning, such as pluralization. So, please, +don't try this if your project will contain more than a couple of pages. + +The most classic way and often taken as reference for i18n and l10n is a [Unix tool called `gettext`][gettext]. It dates +back to 1995 and is still a complete implementation for translating software. It is pretty easy to get running, while +it still sports powerful supporting tools. It's about Gettext we will be talking here. Also, to help you not get messy +over the command-line, we will be presenting a great GUI application that can be used to easily update your l10n source +files. + +### Other tools + +There are common libraries used that support Gettext and other implementations of i18n. Some of them may seem easier to +install or sport additional features or i18n file formats. In this document, we focus on the tools provided with the +PHP core, but here we list others for completion: + +- [oscarotero/Gettext][oscarotero]: Gettext support with an OO interface; includes improved helper functions, powerful +extractors for several file formats (some of them not supported natively by the `gettext` command), and can also export +to other formats besides `.mo/.po` files. Can be useful if you need to integrate your translation files into other parts +of the system, like a JavaScript interface. +- [symfony/translation][symfony]: supports a lot of different formats, but recommends using verbose XLIFF's. Doesn't +include helper functions nor a built-in extractor, but supports placeholders using `strtr()` internally. +- [zend/i18n][zend]: supports array and INI files, or Gettext formats. Implements a caching layer to save you from +reading the filesystem every time. It also includes view helpers, and locale-aware input filters and validators. +However, it has no message extractor. + +Other frameworks also include i18n modules, but those are not available outside of their codebases: +- [Laravel] supports basic array files, has no automatic extractor but includes a `@lang` helper for template files. +- [Yii] supports array, Gettext, and database-based translation, and includes a messages extractor. It is backed by the +[`Intl`][intl] extension, available since PHP 5.3, and based on the [ICU project]; this enables Yii to run powerful +replacements, like spelling out numbers, formatting dates, times, intervals, currency, and ordinals. + +If you decide to go for one of the libraries that provide no extractors, you may want to use the gettext formats, so +you can use the original gettext toolchain (including Poedit) as described in the rest of the chapter. + +## Gettext + +### Installation +You might need to install Gettext and the related PHP library by using your package manager, like `apt-get` or `yum`. +After installed, enable it by adding `extension=gettext.so` (Linux/Unix) or `extension=php_gettext.dll` (Windows) to +your `php.ini`. + +Here we will also be using [Poedit] to create translation files. You will probably find it in your system's package +manager; it's available for Unix, Mac, and Windows, and can be [downloaded for free on their website][poedit_download] +as well. + +### Structure + +#### Types of files +There are three files you usually deal with while working with gettext. The main ones are PO (Portable Object) and +MO (Machine Object) files, the first being a list of readable "translated objects" and the second, the corresponding +binary to be interpreted by gettext when doing localization. There's also a POT (Template) file, that simply contains +all existing keys from your source files, and can be used as a guide to generate and update all PO files. Those template +files are not mandatory: depending on the tool you're using to do l10n, you can go just fine with only PO/MO files. +You'll always have one pair of PO/MO files per language and region, but only one POT per domain. + +### Domains +There are some cases, in big projects, where you might need to separate translations when the same words convey +different meaning given a context. In those cases, you split them into different _domains_. They're basically named +groups of POT/PO/MO files, where the filename is the said _translation domain_. Small and medium-sized projects usually, +for simplicity, use only one domain; its name is arbitrary, but we will be using "main" for our code samples. +In [Symfony] projects, for example, domains are used to separate the translation for validation messages. + +#### Locale code +A locale is simply a code that identifies one version of a language. It's defined following the [ISO 639-1][639-1] and +[ISO 3166-1 alpha-2][3166-1] specs: two lower-case letters for the language, optionally followed by an underline and two +upper-case letters identifying the country or regional code. For [rare languages][rare], three letters are used. + +For some speakers, the country part may seem redundant. In fact, some languages have dialects in different +countries, such as Austrian German (`de_AT`) or Brazilian Portuguese (`pt_BR`). The second part is used to distinguish +between those dialects - when it's not present, it's taken as a "generic" or "hybrid" version of the language. + +### Directory structure +To use Gettext, we will need to adhere to a specific structure of folders. First, you'll need to select an arbitrary +root for your l10n files in your source repository. Inside it, you'll have a folder for each needed locale, and a fixed +`LC_MESSAGES` folder that will contain all your PO/MO pairs. Example: + +{% highlight console %} + + ├─ src/ + ├─ templates/ + └─ locales/ + ├─ forum.pot + ├─ site.pot + ├─ de/ + │ └─ LC_MESSAGES/ + │ ├─ forum.mo + │ ├─ forum.po + │ ├─ site.mo + │ └─ site.po + ├─ es_ES/ + │ └─ LC_MESSAGES/ + │ └─ ... + ├─ fr/ + │ └─ ... + ├─ pt_BR/ + │ └─ ... + └─ pt_PT/ + └─ ... +{% endhighlight %} + +### Plural forms +As we said in the introduction, different languages might sport different plural rules. However, gettext saves us from +this trouble once again. When creating a new `.po` file, you'll have to declare the [plural rules][plural] for that +language, and translated pieces that are plural-sensitive will have a different form for each of those rules. When +calling Gettext in code, you'll have to specify the number related to the sentence, and it will work out the correct +form to use - even using string substitution if needed. + +Plural rules include the number of plurals available and a boolean test with `n` that would define in which rule the +given number falls (starting the count with 0). For example: + +- Japanese: `nplurals=1; plural=0` - only one rule +- English: `nplurals=2; plural=(n != 1);` - two rules, first if N is one, second rule otherwise +- Brazilian Portuguese: `nplurals=2; plural=(n > 1);` - two rules, second if N is bigger than one, first otherwise + +Now that you understood the basis of how plural rules works - and if you didn't, please look at a deeper explanation +on the [LingoHub tutorial](lingohub_plurals) -, you might want to copy the ones you need from a [list][plural] instead +of writing them by hand. + +When calling out Gettext to do localization on sentences with counters, you'll have to give him the +related number as well. Gettext will work out what rule should be in effect and use the correct localized version. +You will need to include in the `.po` file a different sentence for each plural rule defined. + +### Sample implementation +After all that theory, let's get a little practical. Here's an excerpt of a `.po` file - don't mind with its format, +but instead the overall content, you'll learn how to edit it easily later: + +{% highlight po %} +msgid "" +msgstr "" +"Language: pt_BR\n" +"Content-Type: text/plain; charset=UTF-8\n" +"Plural-Forms: nplurals=2; plural=(n > 1);\n" + +msgid "We're now translating some strings" +msgstr "Nós estamos traduzindo algumas strings agora" + +msgid "Hello %1$s! Your last visit was on %2$s" +msgstr "Olá %1$s! Sua última visita foi em %2$s" + +msgid "Only one unread message" +msgid_plural "%d unread messages" +msgstr[0] "Só uma mensagem não lida" +msgstr[1] "%d mensagens não lidas" +{% endhighlight %} + +The first section works like a header, having the `msgid` and `msgstr` especially empty. It describes the file encoding, +plural forms and other things that are less relevant. +The second section translates a simple string from English to +Brazilian Portuguese, and the third does the same, but leveraging string replacement from [`sprintf`](sprintf) so the +translation may contain the user name and visit date. +The last section is a sample of pluralization forms, displaying +the singular and plural version as `msgid` in English and their corresponding translations as `msgstr` 0 and 1 +(following the number given by the plural rule). There, string replacement is used as well so the number can be seen +directly in the sentence, by using `%d`. The plural forms always have two `msgid` (singular and plural), so it's +advised to not use a complex language as the source of translation. + +### Discussion on l10n keys +As you might have noticed, we're using as source ID the actual sentence in English. That `msgid` is the same used +throughout all your `.po` files, meaning other languages will have the same format and the same `msgid` fields but +translated `msgstr` lines. + +Talking about translation keys, there are two main "schools" here: + +1. _`msgid` as a real sentence_. + The main advantages are: + - if there are pieces of the software untranslated in any given language, the key displayed will still maintain some + meaning. Example: if you happen to translate by heart from English to Spanish but need help to translate to French, + you might publish the new page with missing French sentences, and parts of the website would be displayed in English + instead; + - it's much easier for the translator to understand what's going on and make a proper translation based on the + `msgid`; + - it gives you "free" l10n for one language - the source one; + - The only disadvantage: if you need to change the actual text, you would need to replace the same `msgid` + across several language files. + +2. _`msgid` as a unique, structured key_. +It would describe the sentence role in the application in a structured way, including the template or part where the +string is located instead of its content. + - it's a great way to have the code organized, separating the text content from the template logic. + - however, that could bring problems to the translator that would miss the context. A source language file would be + needed as a basis for other translations. Example: the developer would ideally have an `en.po` file, that + translators would read to understand what to write in `fr.po` for instance. + - missing translations would display meaningless keys on screen (`top_menu.welcome` instead of `Hello there, User!` + on the said untranslated French page). That's good it as would force translation to be complete before publishing - + but bad as translation issues would be really awful in the interface. Some libraries, though, include an option to + specify a given language as "fallback", having a similar behavior as the other approach. + +The [Gettext manual][manual] favors the first approach as, in general, it's easier for translators and users in +case of trouble. That's how we will be working here as well. However, the [Symfony documentation][symfony-keys] favors +keyword-based translation, to allow for independent changes of all translations without affecting templates as well. + +### Everyday usage +In a common application, you would use some Gettext functions while writing static text in your pages. Those sentences +would then appear in `.po` files, get translated, compiled into `.mo` files and then, used by Gettext when rendering +the actual interface. Given that, let's tie together what we have discussed so far in a step-by-step example: + +#### 1. A sample template file, including some different gettext calls +{% highlight php %} + + + +

+

+{% endhighlight %} + +- [`gettext()`][func] simply translates a `msgid` into its corresponding `msgstr` for a given language. There's also +the shorthand function `_()` that works the same way; +- [`ngettext()`][n_func] does the same but with plural rules; +- there's also [`dgettext()`][d_func] and [`dngettext()`][dn_func], that allows you to override the domain for a single +call. More on domain configuration in the next example. + +#### 2. A sample setup file (`i18n_setup.php` as used above), selecting the correct locale and configuring Gettext +{% highlight php %} + '_']); }); + foreach ($langs as $browser_lang) { + if (valid($browser_lang)) { + $lang = $browser_lang; + break; + } + } +} + +// here we define the global system locale given the found language +putenv("LANG=$lang"); + +// this might be useful for date functions (LC_TIME) or money formatting (LC_MONETARY), for instance +setlocale(LC_ALL, $lang); + +// this will make Gettext look for ../locales//LC_MESSAGES/main.mo +bindtextdomain('main', '../locales'); + +// indicates in what encoding the file should be read +bind_textdomain_codeset('main', 'UTF-8'); + +// if your application has additional domains, as cited before, you should bind them here as well +bindtextdomain('forum', '../locales'); +bind_textdomain_codeset('forum', 'UTF-8'); + +// here we indicate the default domain the gettext() calls will respond to +textdomain('main'); + +// this would look for the string in forum.mo instead of main.mo +// echo dgettext('forum', 'Welcome back!'); +?> +{% endhighlight %} + +#### 3. Preparing translation for the first run +To make matters easier - and one of the powerful advantages Gettext has over custom framework i18n packages - is its +custom file type. "Oh man, that's quite hard to understand and edit by hand, a simple array would be easier!" Make no +mistake, applications like [Poedit] are here to help - _a lot_. You can get the program from +[their website](poedit_download), it's free and available for all platforms. It's a pretty easy tool to get used to, +and a very powerful one at the same time - using all powerful features Gettext has available. + +In the first run, you should select "File > New Catalog" from the menu. There you'll have a small screen where we will +set the terrain so everything else runs smoothly. You'll be able to find those settings later through +"Catalog > Properties": + +- Project name and version, Translation Team and email address: useful information that goes in the `.po` file header; +- Language: here you should use that format we mentioned before, such as `en_US` or `pt_BR`; +- Charsets: UTF-8, preferably; +- Source charset: set here the charset used by your PHP files - probably UTF-8 as well, right? +- plural forms: here go those rules we mentioned before - there's a link in there with samples as well; +- Source paths: here you must include all folders from the project where `gettext()` (and siblings) will happen - this +is usually your templates folder(s) +- Source keywords: this last part is filled by default, but you might need to alter it later - and is one of the +powerful points of Gettext. The underlying software knows how the `gettext()` calls look like in several programming +languages, but you might as well create your own translation forms. This will be discussed later in the "Tips" section. + +After setting those points you'll be prompted to save the file - using that directory structure we mentioned as well, +and then it will run a scan through your source files to find the localization calls. They'll be fed empty into the +translation table, and you'll start typing in the localized versions of those strings. Save it and a `.mo` file will be +(re)compiled into the same folder and ta-dah: your project is internationalized. + +#### 4. Translating strings +As you may have noticed before, there are two main types of localized strings: simple ones and the ones with plural +forms. The first ones have simply two boxes: source and localized string. The source string can't be modified as +Gettext/Poedit do not include the powers to alter your source files - you should change the source itself and rescan +the files. Tip: you may right-click a translation line and it will hint you with the source files and lines where that +string is being used. +On the other hand, plural form strings include two boxes to show the two source strings, and tabs so you can configure +the different final forms. + +Whenever you change your sources and need to update the translations, just hit Refresh and Poedit will rescan the code, +removing non-existent entries, merging the ones that changed and adding new ones. It may also try to guess some +translations, based on other ones you did. Those guesses and the changed entries will receive a "Fuzzy" marker, +indicating it needs review, being highlighted in the list. It's also useful if you have a translation team and someone +tries to write something they're not sure about: just mark Fuzzy and someone else will review later. + +Finally, it's advised to leave "View > Untranslated entries first" marked, as it will help you _a lot_ to not forget +any entry. From that menu, you can also open parts of the UI that allow you to leave contextual information for +translators if needed. + +### Tips & Tricks + +#### Possible caching issues +If you're running PHP as a module on Apache (`mod_php`), you might face issues with the `.mo` file being cached. It +happens the first time it's read, and then, to update it, you might need to restart the server. On Nginx and PHP5 it +usually takes only a couple of page refreshes to refresh the translation cache, and on PHP7 it is rarely needed. + +#### Additional helper functions +As preferred by many people, it's easier to use `_()` instead of `gettext()`. Many custom i18n libraries from +frameworks use something similar to `t()` as well, to make translated code shorter. However, that's the only function +that sports a shortcut. You might want to add in your project some others, such as `__()` or `_n()` for `ngettext()`, +or maybe a fancy `_r()` that would join `gettext()` and `sprintf()` calls. Other libraries, such as +[oscarotero's Gettext][oscarotero] also provide helper functions like these. + +In those cases, you'll need to instruct the Gettext utility on how to extract the strings from those new functions. +Don't be afraid, it's very easy. It's just a field in the `.po` file, or a Settings screen on Poedit. In the editor, +that option is inside "Catalog > Properties > Source keywords". You need to include there the specifications of those +new functions, following [a specific format](func_format): + +- if you create something like `t()` that simply returns the translation for a string, you can specify it as `t`. +Gettext will know the only function argument is the string to be translated; +- if the function has more than one argument, you can specify in which one the first string is - and if needed, the +plural form as well. For instance, if we call our function like this: `__('one user', '%d users', $number)`, the +specification would be `__:1,2`, meaning the first form is the first argument, and the second form is the second +argument. If your number comes as the first argument instead, the spec would be `__:2,3`, indicating the first form is +the second argument, and so on. + +After including those new rules in the `.po` file, a new scan will bring in your new strings just as easy as before. + +### References + +* [Wikipedia: i18n and l10n](https://en.wikipedia.org/wiki/Internationalization_and_localization) +* [Wikipedia: Gettext](https://en.wikipedia.org/wiki/Gettext) +* [LingoHub: PHP internationalization with gettext tutorial](lingohub) +* [PHP Manual: Gettext](http://php.net/manual/en/book.gettext.php) +* [Gettext Manual][manual] + +[Poedit]: https://poedit.net +[poedit_download]: https://poedit.net/download +[lingohub]: https://lingohub.com/blog/2013/07/php-internationalization-with-gettext-tutorial/ +[lingohub_plurals]: https://lingohub.com/blog/2013/07/php-internationalization-with-gettext-tutorial/#Plurals +[plural]: http://docs.translatehouse.org/projects/localization-guide/en/latest/l10n/pluralforms.html +[gettext]: https://en.wikipedia.org/wiki/Gettext +[manual]: http://www.gnu.org/software/gettext/manual/gettext.html +[639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes +[3166-1]: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 +[rare]: http://www.gnu.org/software/gettext/manual/gettext.html#Rare-Language-Codes +[func_format]: https://www.gnu.org/software/gettext/manual/gettext.html#Language-specific-options +[oscarotero]: https://github.com/oscarotero/Gettext +[symfony]: https://symfony.com/doc/current/components/translation.html +[zend]: https://docs.zendframework.com/zend-i18n/translation +[laravel]: https://laravel.com/docs/master/localization +[yii]: http://www.yiiframework.com/doc-2.0/guide-tutorial-i18n.html +[intl]: http://br2.php.net/manual/en/intro.intl.php +[ICU project]: http://www.icu-project.org +[symfony-keys]: https://symfony.com/doc/current/components/translation/usage.html#creating-translations + +[sprintf]: http://php.net/manual/en/function.sprintf.php +[func]: http://php.net/manual/en/function.gettext.php +[n_func]: http://php.net/manual/en/function.ngettext.php +[d_func]: http://php.net/manual/en/function.dgettext.php +[dn_func]: http://php.net/manual/en/function.dngettext.php