Diff for "BashFAQ/098"

Differences between revisions 10 and 12 (spanning 2 versions)

How to add localization support to your bash scripts

Looking for examples of how to add simple localization to your bash scripts, and how to do testing? This is probably what you need....

Contents

How to add localization support to your bash scripts

First, some variables you must understand

Before we can even begin, we have to understand all the locale environment variables. This is fundamental, and extremely under-documented in the places where people actually look for documentation (man pages, etc.). Some of these variables may not apply to your system, because there seem to be various competing standards and extensions....

On recent GNU systems, the variables are used in this order:

If LANGUAGE is set, use that, unless LANG is set to C, in which case LANGUAGE is ignored. Also, some programs simply don't use LANGUAGE at all.
Otherwise, if LC_ALL is set, use that.
Otherwise, if the specific LC_* variable that covers this usage is set, use that. (For example, LC_MESSAGES covers error messages.)
Otherwise, use LANG.

That means you first have to check your current environment to see which of these, if any, are already set. If they are set, and you don't know about them, they may interfere with your testing, leaving you befuddled.

$ env | egrep 'LC|LANG'
LANG=en_US.UTF-8
LANGUAGE=en_US:en_GB:en

Here's an example from a Debian system. In this case, the LANGUAGE variable is set, which means any testing we do that involves changing LANG is likely to fail, unless we also change LANGUAGE. Now here's another example from another Debian system:

$ env | egrep 'LC|LANG'
LANG=en_US.utf8

In that case, changing LANG would actually work. A user on that system, writing a document on how to perform localization testing, might create instructions that would fail to work for the user on the first system....

So, go ahead and play around with your own system and see what works and what doesn't. You may not have a LANGUAGE variable at all (especially if you are not on GNU/Linux), so setting it may do nothing for you. You may need to use locale -a to see what locale settings are available. You may need to specify a character set in the LANG variable (e.g. es_ES.utf8 instead of es_ES). You may have to "generate locales" on your operating system (a process which is beyond the scope of this page, but which on Debian consists of running dpkg-reconfigure locales and answering questions) in order to make them work.

Try to get to the point where you can produce error messages in at least two languages:

$ wc -q
wc: invalid option -- 'q'
Try `wc --help' for more information.
$ LANGUAGE=es_ES wc -q
wc: opción inválida -- q
Pruebe `wc --help' para más información.

Once you can do that reliably, you can begin the actual work of producing a bash script with localisation.

Marking strings as translatable

This is the simplest part, at least to understand. Any string in $"..." is translated using the system's native language support (NLS) facilities. Find all the constant strings in your program that you want to translate, and mark them accordingly. Don't mark strings that contain variables or other substitutions. For example,

#!/bin/bash
echo $"Hello, world"

(As you can see, we're starting with very simple material here.)

Bash (at least up through 4.0) performs locale expansion before other substitutions. Thus, in a case like this:

echo $"The answer is $answer"

The literal string $answer will become part of the marked string. The translation should also contain $answer, and bash will perform the variable substitution on the translated string. The order in which bash does these substitutions introduces a potential security hole which we will not cover here just yet. (A patch has been submitted, but it's still too early....)

Generating and/or merging PO files

Next, generate what are called a "PO files" from your program. These contain the strings we've marked, and their translations (which we'll fill in later).

We start by creating a *.pot file, which is a template.

bash --dump-po-strings hello > hello.pot

This produces output which looks like:

#: hello:5
msgid "Hello, world"
msgstr ""

The name of your file (without the .pot extension) is called the domain of your translatable text. A domain in this context is similar to a package name. For example, the GNU coreutils package contains lots of little programs, but they're all distributed together; and so it makes sense for all their translations to be together as well. In our example, we're using a domain of hello. In a larger example containing lots of programs in a suite, we'd probably use the name of the whole suite.

This template will be copied once for each language we want to support. Let's suppose we wanted to support Spanish and French translations of our program. We'll be creating two PO files (one for each translation), so let's make two subdirectories, and copy the template into each one:

mkdir es fr
cp hello.pot es/hello.po
cp hello.pot fr/hello.po

This is what we do the first time through. If there were already some partially- or even fully-translated PO files in place, we wouldn't want to overwrite them. Instead, we would merge the new translatable material into the old PO file. We use a special tool for that called msgmerge. Let's suppose we add some more code (and translatable strings) to our program:

vi hello
bash --dump-po-strings hello > hello.pot
msgmerge --update es/hello.po hello.pot
msgmerge --update fr/hello.po hello.pot

The original author of this page created some notes which I am leaving intact here. Maybe they'll be helpful...?

# step 5: try to merge existing po with new updates
# remove duplicated strings by hand or with sed or something else
# awk '/^msgid/&&!seen[$0]++;!/^msgid/' lang/nl.pot > lang/nl.pot.new
msgmerge lang/nl.po lang/nl.pot

# step 5.1: try to merge existing po with new updates
cp --verbose lang/pct-scanner-script-nl.po lang/pct-scanner-script-nl.po.old
awk '/^msgid/&&!seen[$0]++;!/^msgid/' lang/pct-scanner-script-nl.pot > lang/pct-scanner-script-nl.pot.new
msgmerge lang/pct-scanner-script-nl.po.old lang/pct-scanner-script-nl.pot.new > lang/pct-scanner-script-nl.po

# step 5.2: try to merge existing po with new updates
touch lang/pct-scanner-script-process-nl.po lang/pct-scanner-script-process-nl.po.old
awk '/^msgid/&&!seen[$0]++;!/^msgid/' lang/pct-scanner-script-process-nl.pot > lang/pct-scanner-script-process-nl.pot.new
msgmerge lang/pct-scanner-script-process-nl.po.old lang/pct-scanner-script-process-nl.pot.new > lang/pct-scanner-script-process-nl.po

Translate the strings

This is a step which is 100% human labor. Edit each language's PO file and fill in the blanks.

#: hello:5
msgid "Hello, world"
msgstr "Hola el mundo"

#: hello:6
msgid "How are you?"
msgstr ""

Install MO files

Your operating system, if it has gotten you this far, probably already has some localized programs, with translation catalogs installed in some location such as /usr/share/locale (or elsewhere). If you want your translations to be installed there as well, you'll have to have superuser privileges, and you'll have to manage your translation domain (namespace) in such a way as to avoid collision with any OS packages.

If you're going to use the standard system location for your translations, then you only need to worry about making one change to your program: setting the TEXTDOMAIN variable.

#!/bin/bash
TEXTDOMAIN=hello

echo $"Hello, world"
echo $"How are you?"

This tells bash and the system libraries which MO file to use, from the standard location. If you're going to use a nonstandard location, then you have to set that as well, in a variable called TEXTDOMAINDIR:

#!/bin/bash
TEXTDOMAINDIR=/usr/local/share/locale
TEXTDOMAIN=hello

echo $"Hello, world"
echo $"How are you?"

Use one of these two depending on your needs.

Now, an MO file is essentially a compiled PO file. A program called msgfmt is responsible for this compilation. We just have to tell it where the PO file is, and where to write the MO file.

msgfmt -o /usr/share/locale/es/LC_MESSAGES/hello.mo es/hello.po
msgfmt -o /usr/share/locale/fr/LC_MESSAGES/hello.mo fr/hello.po

or

mkdir -p /usr/local/share/locale/{es,fr}/LC_MESSAGES
msgfmt -o /usr/local/share/locale/es/LC_MESSAGES/hello.mo es/hello.po
msgfmt -o /usr/local/share/locale/fr/LC_MESSAGES/hello.mo fr/hello.po

(If we had more than two translations to support, we might choose to mimic the structure of /usr/share/locale in order to facilitate mass-copying of MO files from the local directory to the operating system's repository. This is left as an exercise.)

Test!

Remember what we said earlier about setting locale environment variables... the examples here may or may not work for your system.

The gettext program can be used to retrieve individual translations from the catalog:

$ LANGUAGE=es_ES gettext -d hello -s "Hello, world"
Hola el mundo

Any untranslated strings will be left alone:

$ LANGUAGE=es_ES gettext -d hello -s "How are you?"
How are you?

And, finally, there is no substitute for actually running the program itself:

wooledg@wooledg:~$ LANGUAGE=es_ES ./hello
Hola el mundo
How are you?

As you can see, there's still some more translation to be done for our example. Back to work....

-  ⇤ ← Revision 10 as of 2010-03-16 18:02:44 → 
  Size: 10550
  Editor: GreyCat
  Comment: link to [[locale]], plus tiny grammar fix
+   ← Revision 12 as of 2012-03-11 00:08:39 → ⇥
  Size: 10794
  Editor: host-216-57-70-194
  Comment: Actually use the syntax we're demonstrating
-Deletions are marked like this.
+Additions are marked like this.
 Line 13:
-. If LANG is C or POSIX, this overrides everything.
 1. Otherwise, if LANGUAGE is set, use that.
+. If [[http://www.gnu.org/software/hello/manual/gettext/The-LANGUAGE-variable.html|LANGUAGE]] is set, use that, ''unless'' LANG is set to C, in which case LANGUAGE is ignored.  Also, some programs simply don't use LANGUAGE at all.
-Line 16:
+Line 15:
+. Otherwise, if the specific LC_* variable that covers this usage is set, use that.  (For example, LC_MESSAGES covers error messages.)
 Line 18:
-That means, you first have to check your current environment to see which of these, if any, are already set.  If they are set, and you don't know about them, they may interfere with your testing, leaving you befuddled.
+That means you first have to check your current environment to see which of these, if any, are already set.  If they are set, and you don't know about them, they may interfere with your testing, leaving you befuddled.
 Line 26:
-Here's an example from a Debian system.  In this case, the `LANGUAGE` variable is set, which means any testing we do that involves changing `LANG` or `LC_ALL` is likely to fail, unless we also change `LANGUAGE`.  Now here's another example from another Debian system:
+Here's an example from a Debian system.  In this case, the `LANGUAGE` variable is set, which means any testing we do that involves changing `LANG` is likely to fail, unless we also change `LANGUAGE`.  Now here's another example from another Debian system:
 Line 33:
-In that case, setting `LANG` or `LC_ALL` would actually work.  A user using that system, then writing a document on how to perform localization testing, might create instructions that would fail to work for the user on the first system....

So, go ahead and play around with your own system and see what works and what doesn't.  You may not have a `LANGUAGE` variable at all (especially if you are not on GNU/Linux), so setting it may do nothing for you.  You may need to use `locale -a` to see what locale settings are available.  You may need to specify a character set in the `LANG` variable (e.g. `es_ES.utf8` instead of `es_ES`).  You may have to "generate locales" on your operating system (a process which is beyond the scope of this page, but which on Debian consists of running `dpkg-reconfigure locales` and answering questions) in order to make them work.
+In that case, changing `LANG` would actually work.  A user on that system, writing a document on how to perform localization testing, might create instructions that would fail to work for the user on the first system....

So, go ahead and play around with your own system and see what works and what doesn't.  You may not have a `LANGUAGE` variable at all (especially if you are not on GNU/Linux), so setting it may do nothing for you.  You may need to use `locale -a` to see what [[locale]] settings are available.  You may need to specify a character set in the `LANG` variable (e.g. `es_ES.utf8` instead of `es_ES`).  You may have to "generate locales" on your operating system (a process which is beyond the scope of this page, but which on Debian consists of running `dpkg-reconfigure locales` and answering questions) in order to make them work.
 Line 63:
-echo "The answer is $answer"
+echo $"The answer is $answer"