Locale
1. Character encodings
Computers can't actually store letters and symbols; they only store numbers. There are innumerable ways to represent human language characters (like the letter A, the plus sign, etc.) as numbers, and the fact that so many different ways were actually implemented has led to chaos.
Early computers (at least in the United States) converged on two standards for mapping US English characters to numbers and back again: ASCII and EBCDIC. The latter had mostly died out by the end of the 20th century, leaving ASCII (the American Standard Code for Information Interchange) as the primary standard.
The problem with ASCII is that it's too narrow for languages other than English, or even for certain English words which, depending on the writer's stylistic preferences, may be spelled with diacritics (e.g., rôle, naïve). ASCII only covers the 26 letters of the English alphabet (capital and lower-case), the digits 0 through 9, and some basic punctuation -- typically, what you see on a US computer keyboard. On Unix machines, you can type man ascii to see a table.
Most computers use an 8-bit byte as their unit of storage (a range from 0 to 255 when represented as nonnegative integers). ASCII defines characters using only 7 bits (0 to 127), a legacy of days when long-distance data communications were considerably slower and more error-prone. Since ASCII uses only half the range of a byte, this left space for people to define their own sets of characters within a single byte.
(Many Microsoft DOS/Windows users believe ASCII covers the entire range from 0 to 255, with smiley faces and line-drawing characters and so on. This is incorrect. The well-known DOS character set is actually IBM code page 437, which is one of many supersets of ASCII. ASCII itself stops at character 127.)
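To see the numbers behind a few ASCII characters, one can dump the bytes with od. This is a minimal sketch, assuming a POSIX shell with the standard printf, od and xargs utilities; the helper name byte_values is made up for this illustration.

```shell
# byte_values is a hypothetical helper: print the decimal value of each
# byte of its argument, separated by single spaces.
byte_values() {
    # od prints the values padded with blanks; xargs (which runs echo by
    # default) collapses that padding into single spaces.
    printf '%s' "$1" | od -An -v -tu1 | xargs
}

byte_values 'A+a'    # prints: 65 43 97
```

All three values are below 128, as they must be for ASCII characters.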
Computer users in countries outside the US were unable to represent their written languages using ASCII, so while US programmers were using the space from 128 to 255 to make line-drawing characters and mathematical symbols, Europeans were using it to add their extra letters, and their accented letters. This led to more chaos (not surprisingly), out of which another set of standards evolved: ISO 8859. Note that these are supersets of ASCII. ISO-8859-1, also known as Latin-1, became the dominant standard for Unix workstations in North America and western Europe.
However, this still left some unresolved issues. First, there were still competing standards; eastern Europe has very different alphabets than western Europe does, and the various ISO 8859 standards are incompatible. Second, Asian countries have radically different ways of writing compared to European/American countries, and their character sets don't even fit within a single byte (which only allows 256 different symbols).
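The incompatibility between the ISO 8859 standards can be demonstrated with a single byte. The sketch below assumes an iconv command that accepts the encoding names ISO-8859-1, ISO-8859-2 and UTF-8 (the spelling of these names varies between systems), and a terminal that displays UTF-8.

```shell
# The byte 0xB1 (octal 261) lies above the ASCII range, so its meaning
# depends on which ISO 8859 part is assumed. Convert it to UTF-8 under
# two different assumptions.
printf '\261' | iconv -f ISO-8859-1 -t UTF-8; echo    # plus-minus sign
printf '\261' | iconv -f ISO-8859-2 -t UTF-8; echo    # lower-case a with ogonek
```

The same stored byte is a mathematical symbol to a western European reader and a letter to a Polish one, which is why the encoding must be known before the data can be interpreted.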
The Unicode standard tries to address this: instead of defining only 256 symbols, it defines well over a hundred thousand. If a computer were to store each symbol of a document as a raw Unicode code point, it would need at least three bytes per symbol (code points run up to 0x10FFFF, which takes 21 bits), making simple English documents take three times as much space as they did before, with most of that space being occupied by zeroes.
So, to attempt to preserve some efficiency (as well as some compatibility with existing data files), various encodings of Unicode characters were created; of these, currently the most popular (among English speakers, at least) is UTF-8, which is a variable-width encoding. A simple ASCII document is also a valid UTF-8 document; single-byte characters from ASCII are represented using the same byte in UTF-8. However, UTF-8 also offers multi-byte sequences capable of representing all of Unicode (using up to four bytes per character in some cases).
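The variable width of UTF-8 is easy to observe by counting bytes. The following is a rough sketch, assuming a POSIX shell; utf8_len is a hypothetical helper name, and the characters are written as octal byte escapes so that the example itself stays plain ASCII.

```shell
# utf8_len is a hypothetical helper: count the bytes in its argument.
utf8_len() {
    # wc -c counts bytes, not characters; tr strips the padding that some
    # implementations of wc add.
    printf '%s' "$1" | wc -c | tr -d ' '
}

utf8_len "$(printf 'A')"                   # 1 byte:  U+0041, same as ASCII
utf8_len "$(printf '\303\251')"            # 2 bytes: U+00E9, e with acute accent
utf8_len "$(printf '\342\202\254')"        # 3 bytes: U+20AC, euro sign
utf8_len "$(printf '\360\237\230\200')"    # 4 bytes: U+1F600, an emoji
```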
As of 2009, UTF-8 is the emerging standard for Linux distributions, although there are still many problems with implementations.
2. Locales
So, what's a "locale"? Since there are so many standards out there, and so many different types of computers, some of which only support some of the standards, it's important to be able to say which standard you're working with. This is where locales come in.
A locale is a set of rules determining how information is presented and processed, with respect to human beings. It covers character encodings (which we've talked about in the first part of this page), as well as the order in which those characters are sorted, the format for displaying dates and times, the rules for representing large numbers and numbers with a decimal component, etc.
Examples: an American might write "the third day of January, A.D. 2009" as 1/3/09, while an Englishman may write the same date as 3/1/09. A computer programmer would probably use 2009-01-03. Meanwhile, our American friend writes the number "ten thousand and one one-hundredth" as 10,000.01, much to the distress of his German colleague, who writes 10 000,01 instead.
A Unix system has a command named locale which is used to show which locale a user (or more precisely, a process) is using at the moment, and to list all available locales. For example,
    imadev:~$ locale
    LANG=en_US.iso88591
    LC_CTYPE="en_US.iso88591"
    LC_COLLATE="en_US.iso88591"
    LC_MONETARY="en_US.iso88591"
    LC_NUMERIC="en_US.iso88591"
    LC_TIME=POSIX
    LC_MESSAGES="en_US.iso88591"
    LC_ALL=
This shows the locale which is currently in use. To see which ones might be chosen instead:
    imadev:~$ locale -a
    C
    POSIX
    C.iso88591
    C.utf8
    univ.utf8
    ar_DZ.arabic8
    ar_SA.arabic8
    ar_SA.iso88596
    bg_BG.iso88595
    cs_CZ.iso88592
    da_DK.iso88591
    da_DK.roman8
    nl_NL.iso88591
    nl_NL.roman8
    en_GB.iso88591
    en_GB.roman8
    en_US.iso88591
    en_US.roman8
    fi_FI.iso88591
    fi_FI.roman8
    fr_CA.iso88591
    fr_CA.roman8
    fr_FR.iso88591
    fr_FR.roman8
    de_DE.iso88591
    de_DE.roman8
    el_GR.greek8
    el_GR.iso88597
    iw_IL.hebrew8
    iw_IL.iso88598
    hu_HU.iso88592
    is_IS.iso88591
    is_IS.roman8
    it_IT.iso88591
    it_IT.roman8
    no_NO.iso88591
    no_NO.roman8
    pl_PL.iso88592
    pt_PT.iso88591
    pt_PT.roman8
    ro_RO.iso88592
    ru_RU.iso88595
    hr_HR.iso88592
    sk_SK.iso88592
    sl_SI.iso88592
    es_ES.iso88591
    es_ES.roman8
    sv_SE.iso88591
    sv_SE.roman8
    th_TH.tis620
    tr_TR.iso88599
    tr_TR.turkish8
    C.iso885915
    da_DK.iso885915@euro
    de_DE.iso885915@euro
    en_GB.iso885915@euro
    es_ES.iso885915@euro
    fi_FI.iso885915@euro
    fr_CA.iso885915
    fr_FR.iso885915@euro
    is_IS.iso885915@euro
    it_IT.iso885915@euro
    nl_NL.iso885915@euro
    no_NO.iso885915@euro
    pt_PT.iso885915@euro
    sv_SE.iso885915@euro
    zh_CN.hp15CN
    zh_TW.eucTW
At this point, the reader should appreciate why the first part of this page was devoted to character set encodings. Without understanding what "iso885915" means, this list would be somewhat cryptic.
A locale name has three components. The first component, which is two lower-case letters, shows the language being used. en, for example, means English; es is Spanish; de is German; and so on. These are the two-letter language codes defined by ISO 639-1, which are distinct from country codes.
The second component (after the underscore) is the actual country the user is in (or whose locale rules the user wants enforced), and is primarily used for different dialects of a language. en_US and en_GB have a few differences in spelling, different currency symbols, and so forth.
The third component (after the period) is the character encoding. Note that the spelling of the encoding name is not quite standardized across systems. On the example system, iso885915 is the spelling for the ISO-8859-15 encoding, but other systems may require ISO8859-15 (for example). You must use locale -a to see what is available, and how it's spelled, on your system.
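Because the spelling differs between systems, a script may want to check for a locale name before relying on it. Here is a small sketch, assuming the locale command is installed; have_locale is a hypothetical helper, and en_US.UTF-8 is only an example name that may well be spelled differently (or be absent) on a given machine.

```shell
# have_locale is a hypothetical helper: succeed only if the exact name
# given appears as a whole line in the output of "locale -a".
have_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

if have_locale en_US.UTF-8; then
    echo "en_US.UTF-8 is available"
else
    echo "en_US.UTF-8 is not listed; check locale -a for the local spelling"
fi
```

The match is deliberately exact, since a locale named en_US.utf8 on one system would not be found under the name en_US.UTF-8 by this check.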
The special names C and POSIX are an exception to this. They are required everywhere, and synonymous; they mean (basically) "ASCII, US English, don't apply any special rules". Output under this locale typically conforms to ISO and RFC standards for dates/times/etc., omits thousands separators entirely, uses the actual ASCII encoding values for sorting characters, and so on (generally defaulting to "traditional US computing rules").
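The numeric conventions of the C locale can be inspected directly, since the locale command also accepts keyword names as arguments. This sketch assumes a locale command that follows the POSIX specification in that respect.

```shell
# Ask for two LC_NUMERIC keywords while forcing the C locale.
LC_ALL=C locale decimal_point    # prints a period
LC_ALL=C locale thousands_sep    # prints an empty line: no separator at all
```

Running the same two commands without LC_ALL=C shows what the current session uses instead.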
You specify which locale you want to use by setting environment variables. (See DotFiles for a discussion of how and where to set environment variables for your interactive sessions.) The various LC_* variables, if set, define specific rules to follow; the LANG variable defines the fallback for whichever LC_* variables aren't set. In the most common cases, you will only set LANG.
To get the settings we saw on our example system, we might use something like:
    imadev:~$ cat .profile
    ...
    LANG=en_US.iso88591; LC_TIME=POSIX
    export LANG LC_TIME
    ...
(Note: .profile is only the correct file for certain types of logins. See DotFiles if you don't know which file you need to edit or create.)
This gives us the "US English, with ISO-8859-1 encoding" rules for most things, but the POSIX rules for displaying dates and times.
Since these are just environment variables, we can explore what happens when we change things.
    imadev:~$ LC_TIME=POSIX date
    Thu Apr 16 10:32:03 EDT 2009
    imadev:~$ LC_TIME=en_US.iso88591 date
    Thu, Apr 16, 2009 10:32:13 AM
For details of what your system does with locales, you'll need to check your manuals (such things are very much open to interpretation by implementors). Debian systems have locale(7) (type man 7 locale to read it); HP-UX has lang(5); and so on.
Once you've decided how you want your session to work, and where you need to put variables, just set things however you prefer.
3. Writing locale-aware programs
When writing programs -- particularly shell scripts, but this applies to other forms of programming as well -- one must be aware of the potentially differing behavior of the target system based on locale selection.
We've already seen how the date command on one system changes its output in response to locales, with fields moved around, extra commas inserted, and a 12-hour clock used instead of a 24-hour clock. (Yours may not be quite as radical, or it could be even more so.) Error messages from other programs or from system libraries may be translated into other languages.
If you rely on the output of a program or library call to be in a standard format, you should override the locale environment variables, setting the locale to C, for the parts that require consistency. The LC_ALL variable has priority over the individual LC_* variables, which in turn have priority over LANG. Thus, you can get the behavior you expect by forcing LC_ALL=C at critical points.
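The priority order just described can be written out as a small shell function. This is only an illustrative sketch: effective_locale is a hypothetical helper, not a standard utility, and the final fallback to C when nothing at all is set is an assumption (what really happens in that case is up to the implementation).

```shell
# effective_locale is a hypothetical helper, for illustration only.
# Usage: effective_locale LC_TIME
# The argument must be a trusted variable name, since it is passed to eval.
effective_locale() {
    eval "category_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"            # LC_ALL overrides everything
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"    # then the specific LC_* variable
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"              # then the LANG fallback
    else
        printf 'C\n'                       # assumption: default to C
    fi
}

# Run in subshells so the test values do not leak into the session.
( LC_ALL=;  LANG=C; LC_TIME=POSIX; effective_locale LC_TIME )    # prints: POSIX
( LC_ALL=C; LANG=POSIX; LC_TIME=POSIX; effective_locale LC_TIME ) # prints: C
```

An empty LC_ALL counts as unset here, which matches how the example system's locale output shows "LC_ALL=" while the other variables remain in effect.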
Example:
    imadev:~$ echo Hello World | tr A-Z a-z
    hÉMMÓ wÓSMÐ
    imadev:~$ echo Hello World | LC_ALL=C tr A-Z a-z
    hello world
(That's one of my favorite examples, ever.)
Many commands offer locale-aware methods of replicating traditional behaviors. For example, tr has [:upper:] to replace A-Z, and so on. These should be preferred where available. Consider that [:upper:] may include things like Á which would not be in the C locale's A-Z. But in the end, as the programmer, you bear the responsibility for choosing what is most appropriate for your project.
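For comparison, the same conversion written with character classes might look like this (a sketch assuming a POSIX tr):

```shell
# Character classes are quoted so the shell does not treat them as globs.
echo 'Hello World' | tr '[:upper:]' '[:lower:]'    # prints: hello world
```

Unlike the A-Z range, this does not need LC_ALL=C to behave sensibly, and in a locale with accented capitals it would lower-case those as well.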
The behavior of globs is also locale-dependent; the LC_COLLATE variable defines the order in which names are sorted. The ls command also sorts its output by default, using the same locale-dependent ordering. Unfortunately, there are no standard ways to learn what the ordering is within a given locale. One must resort to brute force tricks. For instance,
    imadev:/tmp/greg$ for i in {1..255}; do eval touch \$\'\\x$(printf %02x $i)\'; done
    touch: cannot change times on /
    imadev:/tmp/greg$ ls -b
       8  Ä  C  É  G  Î  M  Ò  P  t  ü  ý  {  -  ;  ¶  ¥  _     \002  \014  \026  \200  \212  \224  \236
       9  ä  c  é  g  î  m  ò  p  U  V  ÿ  }  ×  :  §  ¤        \003  \015  \027  \201  \213  \225  \237
    0  A  Å  Ç  È  H  Ï  N  Ô  Q  u  v  Z  «  ÷  "  @  µ  ª     \004  \016  \030  \202  \214  \226  \177
    1  a  å  ç  è  h  ï  n  ô  q  Ú  W  z  »  ±  ¿  &  ^  º     \005  \017  \031  \203  \215  \227
    2  Á  Ã  D  Ê  I  J  Ñ  Ö  R  ú  w  Þ  <  ¬  ?  °  ~  ¹     \006  \020  \032  \204  \216  \230
    3  á  ã  d  ê  i  j  ñ  ö  r  Ù  X  þ  >  ¼  ¡  %  ´  ²     \007  \021  \033  \205  \217  \231
    4  À  Æ  Ð  Ë  Í  K  O  Õ  S  ù  x  (  `  ½  !  #  ¨  ³     \010  \022  \034  \206  \220  \232
    5  à  æ  ð  ë  í  k  o  õ  s  Û  Y  )  '  ¾  \  $  ¸  ©     \011  \023  \035  \207  \221  \233
    6  Â  B  E  F  Ì  L  Ó  Ø  ß  û  y  [  =  *  |  ¢  ·  ®     \012  \024  \036  \210  \222  \234
    7  â  b  e  f  ì  l  ó  ø  T  Ü  Ý  ]  +  ,  ¦  £  ¯  \001  \013  \025  \037  \211  \223  \235
This omits the / character, as we cannot create a file with that name, but it does show us the ordering of all the other characters in the HP-UX 10.20 implementation of en_US.iso88591, except for the one that I'm unable to paste into this web browser's textarea (the blank spot to the left of \003). The listing is sorted down each column, so the ordering is read column by column, from left to right. Of course, attempting this on a multi-byte encoding like UTF-8 poses a few logistical problems. (It's probably best explored in segments.)
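A much smaller experiment shows the same locale dependence for globs. This sketch assumes mktemp -d is available (it is common, but not required by POSIX) and uses three made-up file names.

```shell
# Create a scratch directory holding three sample files.
scratch=$(mktemp -d) || exit 1
touch "$scratch/a" "$scratch/B" "$scratch/c"

# Expand the same glob in a child shell, first under whatever locale is
# in effect, then with the C locale forced.
( cd "$scratch" && sh -c 'echo *' )             # order depends on the locale
( cd "$scratch" && LC_ALL=C sh -c 'echo *' )    # prints: B a c

rm -rf "$scratch"
```

Under C, the capital B comes first because its byte value (66) is lower than that of a (97). Many other locales would interleave the cases and print a B c.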
Since sorting is affected by locale, you may consider overriding LC_COLLATE if you require traditional "ASCIIbetical" order; but you should generally respect the user's locale choices whenever possible.
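One way to keep such an override from affecting anything else is to apply it to a single command. The sketch below assumes a POSIX sort; LC_ALL is emptied for that same command because, if it were set, it would take priority over LC_COLLATE.

```shell
# Sort four sample lines with only the collation rules forced to C.
printf '%s\n' b A a B | LC_ALL= LC_COLLATE=C sort
# prints A, B, a, b (one per line): all capitals sort before all
# lower-case letters, as in the ASCII table
```

The rest of the script, and any messages shown to the user, continue to follow the locale the user chose.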