## Code base changes: let's talk about failures
I've worked with localized code bases before and there are some nice solutions out there for websites and apps. I figured I'd see if Transifex and Mozilla's Pontoon could be used, because that's what I've used in the past, but this is where traditional localization solutions break down.
Transifex is a translation service that lets you define your project in terms of key-value mappings, where you use the keys in your own content, and then you do a replacement for the values based on the locale data that you get from the transifex servers. This works really well for web apps, and general UI, but things get tricky for content like articles. In articles, where the content is structured in paragraphs and the ordering matters for the tone of the text, asking localizers to translate paragraphs or even single sentences fully detached from what comes before or after is almost guaranteed to give weird translations.
To make things easier in that respect, Mozilla's Pontoon project is an "on-page translation" localization system, where you load a .js file that turns all your on-page content into "double click and translate, and pontoon saves that translation for you", and while that sounds really nice, it turns out that setting it up for "not mozilla sites" is a little bit of work (and you want to, because your translations are not for a mozilla project, so you need to run your own copy of pontoon on something like a heroku app). However, even if you get that to work, you still have the problem that transifex has too: your localizers might have an easier time, but as an author you're still stuck being unable to write text yourself without then having to convert it to a weird, unreadable "mess" of `getText('section1-paragraph1')`, so you have no idea what you wrote.
Worse, and all key/value localization systems suffer from this: changing the text in the authoritative locale (in this case, en-GB) invalidates all localizations of that text. What do you do?
The problems that come with these kinds of localization systems simply weighed too heavily. Either it didn't work for the localizers, or it didn't work for me as author, or it didn't work for the nature of the content, or any combination of those. So the traditional solutions for adding localization to a site were out. Not because they're unsuitable as localization solutions, but because for this project, they introduced more problems than they solved.
## Code base changes that work: Markdown
So I had to make this work, and the solution to the problem was super obvious in hindsight:
- I decided on a content format: sections would be an `index.js` for the JSX code, and a `content.en-GB.md` for my own English content.
- content would be pulled back into the JSX by... wait...
How do you pull markdown content into a JSX file? Unlike `js` or `jsx` or `json`, markdown content can't just be imported. And this is where things went a little differently than you might expect: instead of solving this problem so that I had a working "full circle", I realised that solving this problem first would prevent translations from happening. Consequently, **I stepped away from the problem and went back to my contributors**: I had a solution *for them*, and that came first.
Yes, it's nice to solve these things as they pop up, but the most important part is still to make sure contributors can start doing what they wanted to help you with. Once I figured out how to at least split up the JSX into "JSX for the skeleton" and "Markdown for the content", that was it: I had a solution that unblocked my contributors, so that *they* could at least get started translating and making progress, even if the system for rebuilding the content wasn't done yet and I wouldn't be able to immediately deploy whatever PR they were going to throw my way.
And so that's what happened: their needs came first. Once I figured out how to at least split the content as JSX and Markdown, I split up the article preface and first three sections into `content.en-GB.md` files and told the hopefully-still-willing-to-be-contributors that if they were still interested in helping out, they could now start on these files. All they had to do was copy each one to `content.zh-CN.md` or `content.ja-JP.md` and then modify that copy as best they knew how.
And while they were doing that, I'd have some time to implement getting the Markdown loaded back into the JSX files to generate a site that, for visitors, was identical to the monolithic English one.
The take-away here is primarily: **unblock your contributors before you unblock yourself**
## Reintegrating sections based on JSX and Markdown
While the contributors were working on their translations, I got back to work integrating Markdown into the JSX, and after a bit of thinking the solution to how to achieve that integration was remarkably simple: *you don't*.
I know, that sounds a bit silly, but it's not as silly as you might first imagine; I solved this problem using the classic problem solving approach of "if X is hard, which Y is easy, and how do you turn X into Y?". This is a general life skill when it comes to problem solving and I honestly don't practice it enough, but I practiced it here:
- Pulling markdown into JSX is hard,
- pulling JSX into JSX is trivial,
- how do I convert markdown into JSX?
Well, I'm good at programming, and both markdown and JSX are, at their core, just string data in a file. And converting string data into other string data is a pretty easy thing if you know how to program. So I wrote a script called `make-locales.js` which runs through the `./components/sections` directories looking for `content.*.md` files, filters the list of locales it finds that way down to a list of unique locales, and then for each locale in that list does something like:
```
for (locale of locales) {
  giantMarkdownCollection = getAllContentFilesBelongingTo(locale)
  sectionAndContentMap = convertMarkdown(giantMarkdownCollection)
  convertedToJS = JSON.stringify(sectionAndContentMap)
  filesystem.write(`./locales/${locale}/content.js`, convertedToJS);
}
```
Running this script builds, for each locale, a `content.js` file in exactly the form that any Node script (which is what JSX files are, in my codebase) can trivially import with a single `require('content')` statement. By further making sure the data inside `content.js` is keyed the same way the original code base organised its sections, I basically had a markdown-to-JSX conversion that the original code base didn't even notice was different. Everything basically worked the same as far as it was concerned.
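For the curious, the locale discovery step really is that mundane. A minimal sketch of it in plain Node (an illustration, not the actual `make-locales.js` code):

```
// scan ./components/sections/*/content.<locale>.md and collect the unique locales
const fs = require("fs");
const path = require("path");

const sectionsDir = "./components/sections";
const locales = new Set();

fs.readdirSync(sectionsDir).forEach(section => {
  const dir = path.join(sectionsDir, section);
  if (!fs.statSync(dir).isDirectory()) return;
  fs.readdirSync(dir).forEach(file => {
    const match = file.match(/^content\.([^.]+)\.md$/);
    if (match) locales.add(match[1]);
  });
});

console.log([...locales]); // e.g. [ "en-GB", "zh-CN", "ja-JP" ]
```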
### Further challenges: I'm not using *true* markdown
Of course, while the `get all content` and `stringify` operations are pretty easy, the crucial function to get right was that `convertMarkdown` function, which has to turn markdown syntax into JSX syntax. Thankfully, JSX syntax is basically JavaScript with embedded "HTML that follows XML rules", and converting markdown to HTML is super easy: just pick any of the twenty or so libraries that do this, and you're essentially done.
I picked the `marked` library to do the conversion for me, but there was one real challenge that needed to be tackled: the content I write is a mix of "mostly normal text", "some LaTeX, sometimes", "some divs with specific classes to mark bits as notes, how-to-code, and figures, sometimes" and some JSX for each interactive graphic... also sometimes. And being a pure markdown converter, `marked` kind of didn't like any of that except for the "normal text" parts, so how would one make `marked` convert things properly?
Same problem-solving process:
- Converting mixed Markdown content is hard,
- converting just plain markdown is trivial,
- only convert plain markdown and leave the other bits alone.
The crucial observation was that in the build system I already had, things like "LaTeX", "divs with classes for notes and howtos" and "JSX" already worked. So really the only thing that *needed* additional work was turning the markdown string sections into html string sections.
Easy-peasy: I know how to write tokenizers, lexers and grammar parsers in general so I wrote a simple chained chunker that takes a markdown file, and then runs a super simple "chop it up, if I know how to chop it up" action.
Start with:
```
data = a full markdown file,
chunks = empty list to fill with data chunks,
chunkers = a list of latex, div, JSX, and BadMarkDown chunkers.
```
define a function to act as recursion point:
```
function performChunking(data, chunks, chunker, moreChunkers) {
  if no chunker:
    if data isn't empty:
      chunks.push({ convert: true, data: data })
    return;
  // otherwise, if there is a chunker:
  chunker(data, chunks, moreChunkers);
}
```
and then finally, you just start blindly running through the data:
```
function chunkLatex(data, chunks, chunkMore) {
  // run through the data looking for LaTeX blocks
  while there is data left to examine:
    if there is no latex left:
      performChunking(data.substring(p), chunks, next, otherChunkers);
      exit the chunkLatex function
    if there is, get the start of the latex block.
    Then, parse the non-LaTeX data prior to it using the rest of the chunkers:
      performChunking(data.substring(...), chunks, next = chunkMore[0], chunkMore = chunkMore[1, ...])
    And then capture the LaTeX block itself as a "don't convert" block:
      chunks.push({ convert: false, type: "latex", start: ..., end: ..., data: ... });
}

function chunkDiv(data, chunks, chunkMore) {
  Same as above, except for <div> and </div> delimiters
}

function chunkJSX(data, chunks, chunkMore) {
  Same as above, except for <Graphic..../> lines
}

...
```
And so forth. This system ensures that a block that has no latex gets further analysed by the "div" code. Any divs are extracted, any non-div code is handed on to the JSX code, and so on and so on, until there are no chunking functions left to examine the data with. At that point, we know whatever is left is just plain markdown, and we record it as a "convert? yes!" data block.
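To make that chain a little more concrete, here is a small, self-contained JavaScript sketch of just the LaTeX pass. It is a simplified illustration rather than the real `make-locales.js` code, and it assumes LaTeX blocks are delimited by `\[` and `\]`, the standard convention that MathJax and KaTeX also honour:

```
// Simplified illustration of the chained chunking idea; only the LaTeX pass is shown.
function performChunking(data, chunks, chunker, moreChunkers) {
  if (!chunker) {
    // no chunkers left: whatever remains is plain markdown
    if (data.trim()) chunks.push({ convert: true, data: data });
    return;
  }
  chunker(data, chunks, moreChunkers);
}

function chunkLatex(data, chunks, chunkMore) {
  const next = chunkMore[0], rest = chunkMore.slice(1);
  let p = 0;
  while (p < data.length) {
    const start = data.indexOf("\\[", p);
    if (start === -1) {
      // no LaTeX left: hand the remainder on to the next chunker in the chain
      performChunking(data.substring(p), chunks, next, rest);
      return;
    }
    const end = data.indexOf("\\]", start) + 2;
    // everything before the LaTeX block goes through the rest of the chain...
    performChunking(data.substring(p, start), chunks, next, rest);
    // ...and the LaTeX block itself is captured as a "don't convert" chunk
    chunks.push({ convert: false, type: "latex", start: start, end: end, data: data.substring(start, end) });
    p = end;
  }
}

// usage (chunkDiv and chunkJSX would slot into the chain the same way):
const chunks = [];
const md = "Some text, then maths:\n\n\\[ B(t) = (1-t)^3 \\]\n\nand some more text.";
performChunking(md, chunks, chunkLatex, []);
// chunks now holds: plain markdown, a latex chunk, more plain markdown
```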
At the end of this process (which actually runs *really* quickly), we end up with an object that looks like this:
```
chunked = [
  // (illustrative entries)
  { convert: true,  data: "...some plain markdown..." },
  { convert: false, type: "latex", start: ..., end: ..., data: "\[ ... \]" },
  { convert: true,  data: "...more plain markdown..." },
  { convert: false, type: "jsx",   start: ..., end: ..., data: "<Graphic ... />" },
  { convert: true,  data: "...the rest of the markdown..." },
  ...
]
```
And so we simply run this through a quick `map` function, where any data that is marked as "convert? no!" is left alone, and any data that is marked as "convert? yes!" is converted by `marked` from markdown to HTML. Then we simply join all the blocks back up, and we end up with exactly the kind of JSX that the original monolithic English article was already using.
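In code, that final pass is about as small as it sounds; something along these lines (variable names assumed):

```
const marked = require("marked");

// convert only the plain-markdown chunks; LaTeX, div and JSX chunks pass through untouched
const sectionJSX = chunks
  .map(chunk => chunk.convert ? marked(chunk.data) : chunk.data)
  .join("");
```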
Winner!
### One last thing: JavaScript still needs to work
While the above procedure works *really* well, it left one problem: sections have interactive graphics, which are tied to individual React components. As long as sections were single JSX files that was not a problem, but by pulling the content out I needed a way to make sure that JSX code like `<Graphic setup={this.setup} draw={this.draw}/>` still had a correct understanding of which JavaScript object was supposed to be used when it made calls like `this.something()`.
The solution to this is actually the simplest, borderline trivial, a bit silly, but super effective: as each mapped chunk is, strictly speaking, already valid JSX, I just took the string data and wrapped it in more string data that turned it into a function, putting `function(handler) { return <section>` at the start and `</section>; }` at the end, and making sure the JSX chunker replaced any `this` with the word `handler` instead. The result is code like this:
```
content = {
  "whatis": function(handler) {
    return <section>
      ...
    </section>;
  },
  ...
};
```
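Behind the scenes, that wrapping is little more than string concatenation; a sketch of the step (the function name is assumed, and in the real script the JSX chunker takes care of the `this`-to-`handler` rewrite):

```
// wrap a converted chunk so it becomes a function of "handler"
function wrapAsSectionFunction(jsxString) {
  // rewrite `this` to `handler` so the bindings survive the move out of the component
  const body = jsxString.replace(/\bthis\./g, "handler.");
  return "function(handler) { return <section>" + body + "</section>; }";
}
```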
And there you have it. Rather than importing the generated `content.js` and then using its content directly, a section component can now import it and call the relevant function, passing itself in as the "handler":
```
var React = require("react");
var content = require("../../../locales/en-GB/content");  // (illustrative path to the generated content.js)

...

return React.createClass({
  ...
  render: function() {
    // pass this component in as the "handler", so this.setup / this.draw resolve correctly
    return content.whatis(this);
  }
});
```
Sorted: suddenly we have a code base that is super easy to localize. Just change the relevant `content.{locale}.md` file, and the `make-locales.js` script will take care of the rest. In fact, with an `npm` task that watches for changes in `.md` files so that `make-locales.js` gets retriggered, and a webpack task for `js` files in general, live development didn't even need any changing: it just works.
## So what about that LaTeX? Is maths notation universal?
### spoilers: not if there's English in it
Here's the thing about localisation: you need to update *all* the content to work for a specific locale. Translating all the English text to something like Chinese is great and all but if the graphics still have English in them, things get weird. And while at this point the contributors were being quite productive and translating sections at a time, the LaTeX blocks still had English in them.
Of course, if we had to run the full LaTeX-to-SVG conversion every single time we ran `npm run dev` or `npm start`, that would be impossibly slow, so there is a shortcut: any image for which the content hash already exists as an `.svg` file skips the whole conversion process. Instead, the latex-loader immediately grabs the `.svg` file for the width/height information and moves on.
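That shortcut is plain content-addressed caching; a minimal sketch of the idea (function and path names are assumed, this is not the actual loader code):

```
const crypto = require("crypto");
const fs = require("fs");

// hash the LaTeX source; if we already rendered it, reuse the existing .svg file
function cachedSvgFor(latex, renderToSvg) {
  const hash = crypto.createHash("sha256").update(latex).digest("hex");
  const file = `./images/latex/${hash}.svg`;
  if (!fs.existsSync(file)) {
    renderToSvg(latex, file); // the slow LaTeX-to-SVG conversion only runs on a cache miss
  }
  return file; // the loader then just reads the width/height from this file
}
```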
## So how do we switch languages?
The page is hosted on github through the gh-pages functionality, and so some of the traditional ways to switch locales are not actually as appealing. Putting the locale in the URL, for instance, is not quite as easy when you'd literally have to make a dir by that name and put a file in there.
So initially I figured I'd just use a URL query argument, `index.html?locale=en-GB`, but ultimately that's not what I went with.
So directories it is: it's not clean from a dir structure perspective, but ultimately it's not the repo dir structure that *readers* care about. If their needs are not met, then it doesn't matter how clean the dir structure is.
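Concretely, that just means the deployed site grows one directory per locale, along these lines (a sketch of the idea, not the literal tree):

```
index.html          (the default, English, version)
ja-JP/
    index.html      (the Japanese version)
zh-CN/
    index.html      (the Chinese version)
```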
Of course, you also need to be able to *switch* locales on the site, so I wrote a small component that offers users a compellingly simple choice:
> Read this in your own language: | **English** | **日本語** | **中文** |
> *Don't see your language listed? [Help translate this content!](https://github.com/Pomax/BezierInfo-2/wiki/localize)*
A simple list of languages that people can read if it's theirs, and a call-out to anyone who *wants* localized content but can't find it: this is an Open Source project, come help out!
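The switcher component itself can stay tiny; a sketch along these lines (the locale list and link scheme shown here are assumptions, not the actual component):

```
var React = require("react");

// the locales we have content for, and how to label them in the picker
var locales = [
  { code: "en-GB", label: "English" },
  { code: "ja-JP", label: "日本語" },
  { code: "zh-CN", label: "中文" }
];

module.exports = React.createClass({
  render: function() {
    return <div className="locale-switcher">
      <span>Read this in your own language:</span>
      { locales.map(l => <a key={l.code} href={"./" + l.code + "/"}>{l.label}</a>) }
      <a href="https://github.com/Pomax/BezierInfo-2/wiki/localize">Help translate this content!</a>
    </div>;
  }
});
```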
## Finally: document everything
So now the code base can be localized! Hurray!
Last question: can people read up on how to do that without needing to ask questions, and just *do it*? If not, you're not done. Yes, it's great to have everything in place but unless you also write the documentation that explains how to do what people want to do, they're going to need help, and you probably won't have time to spend on helping them, so: write the docs that take you out of the equation.
I ended up documenting [the steps necessary to do localization](https://github.com/Pomax/BezierInfo-2/wiki/localize) on the github repo wiki, with links to the docs from the README.md, so that anyone visiting the repo --linked prominently on the Primer itself-- will immediately be able to find out what is involved and how they can make that work for them.
## So what's left?
There is still one area of localization left untackled: localizing the actual interactive graphics. The problem with these is that their text comes from the browser, which requires having access to a font to draw text with. For English, that's no big deal because fonts are only a few tens of kilobytes, and you can pick any number of webfonts to make sure everyone sees the same graphic, but for Chinese or Japanese this gets considerably harder. As languages with thousands of "letters" (glossing over what they really are for a moment), the fonts for these languages run in the megabytes each, and it is impossible to serve these up in an acceptable way, unless you use something like the `woff2` format with exact unicode ranges specifically for the text that needs to be typeset. That requires mining the graphics instructions for string calls and checking which exact "letters" are being used, so that everything else can be pruned from a common open source font (say, Noto Sans CJK) and the webfont literally only covers the text used on the site and nothing else.
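Just to make that mining idea concrete, here is a rough sketch of what the analysis could look like; this part of the work doesn't exist yet, so everything below is hypothetical:

```
// collect every code point used by the graphics' text calls...
function usedCodePoints(strings) {
  const points = new Set();
  for (const s of strings) {
    for (const ch of s) points.add(ch.codePointAt(0));
  }
  return [...points].sort((a, b) => a - b);
}

// ...and express them as a CSS unicode-range, so a subsetting tool (e.g. fonttools'
// pyftsubset) can prune everything else from a font like Noto Sans CJK
function toUnicodeRange(codePoints) {
  return codePoints.map(cp => "U+" + cp.toString(16).toUpperCase()).join(", ");
}

// usage: feed it whatever strings the graphics actually draw
console.log(toUnicodeRange(usedCodePoints(["起点", "终点", "曲线"])));
```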