Differences between revisions 28 and 29
Revision 28 as of 2019-05-31 21:40:03
Size: 17952
Editor: nchambers
Revision 29 as of 2020-07-06 20:52:07
Size: 17878
Editor: GreyCat
Comment: clarify
Deletions are marked like this. Additions are marked like this.
Line 141: Line 141:
    IFS=,; namesArray=( $names ); unset IFS # "Works" but is BAD code. Unsafe word splitting is taking place
                                            # (implicit p
athname expansion)

IFS=,; namesArray=( $names ); unset IFS # "Works" but is BAD code. Pathname expansion will still occur.
Line 145: Line 145:
    IFS=, read -ra namesArray <<< "$names"  # IFS is scoped only to 'read' and the names variable is split     IFS=, read -ra namesArray <<< "$names," # IFS is scoped only to 'read' and the names variable is split
Line 148: Line 148:


This topic describes what is probably the most important and most misunderstood topic about shell programming.

It is absolutely vital that you understand everything that is explained here thoroughly before you do any important work in the shell. Misunderstanding what arguments are and how word-splitting works will lead to unexpected bugs, even in code you have tested and appears to work well; and in worse cases, severe corruption and data loss.

Executing commands

A shell is an interface between you (or your script) and the kernel. It allows you to execute commands using simplified syntax as compared to invoking direct system calls.

What your shell really does for you, is translate its syntax into system calls. Because of this, it is important that we at least understand the basics of what happens when the shell is reading your orders and doing your bidding.

Executing commands happens through the execve(2) system call. This call needs three pieces of information:

  • The file to execute: This can be a binary program or a script.
  • An array of arguments: A list of strings that tell the program what to do.
  • An array of environment variables

To give context to a program and tell it what to do, we provide it with an array of arguments. That means, we give it a series of strings. Each of these strings can contain any character (byte) (except for a NUL-byte). That means, each argument can be a word, a sentence, or more.

If we, for example, wish to delete a certain ebook, we might invoke the rm program, providing it with the pathnames of the files we wish to remove:

    Execute File:   [ "/bin/rm" ]
    With arguments: [ "James, P.D. - Children of Men - Chapter 1.pdf" ]
                    [ "James, P.D. - Children of Men - Chapter 2.pdf" ]

The rm command will use these two arguments to determine what to delete. In its case, it will unlink(2) each argument.

(In reality, when invoking the execve(2) system call, we pass one extra argument. This zeroth argument is the name that we wish to give the process. It's only vaguely defined what exactly that means and what it's used for, and it's irrelevant to this topic. Suffice it to say that bash uses the first chunk after word splitting (the command name chunk) as the zero'th argument. I'll be omitting any mention of it here, henceforth.)

So remember: Arguments are strings of characters; each of them can contain any character (other than a NUL byte) and we can pass several arguments when we execute a file.

Shell Syntax

To make it easy for you to express yourself when asking the system to perform an operation, shells exist that translate their simplified syntax into system calls. It is imperative that we understand this syntax correctly if we are to avoid mistakes and bugs.

Shell syntax has been built to provide an intuitive way for us to communicate with the system. It uses techniques such as WordSplitting and english keywords to allow us to express our wishes in a language that closely resembles the way we would write to each other. Don't be fooled though: this syntax is very exact and shells are not humans; they can't guess at what you might mean if you don't express yourself clearly and unambiguously. Do not guess at shell syntax based on intuition. Understand it, and then write exactly what you mean.

To execute a simple rm command from the shell, we would use a statement as plain as:

    rm myfile myotherfile

This would instruct the shell to delete (remove) two files: myfile, and myotherfile. How does the shell know this? How does it convert a sentence into system calls? The key to this is Word Splitting.

Word Splitting

To a shell, whitespace is incredibly important. So don't be fooled into thinking a space or tab more or less won't make much of a difference. And don't assume that because whitespace isn't very relevant in C or Java, that the same goes for your shell. Whitespace is vital to allowing your shell to understand you.

The shell takes your line of code and cuts it up into bits wherever there are sequences of syntactical whitespace. The command above would be split up into the following:

    rm myfile myotherfile
      ^      ^


As you can see, all syntactical whitespace has been removed. There is no more whitespace left after word splitting is done with your line. We simply have three completely separate chunks of characters: One says rm, the other says myfile, and the last reads myotherfile. The shell now uses these chunks to build its execve(2) system call.

The shell builds an array of arguments to pass to the operating system. The elements (strings) in this array are rm, myfile and myotherfile. An execve(2) call is invoked passing this array. The operating system acts on that; it searches the PATH environment variable for a program named rm, and runs it with the arguments from the array. rm then unlink(2)s those files.

(BASH actually does its own PATH search first, and stores the location of rm in a hash. Other shells may not do that, and it's not really important at the moment.)

Internal Field Separator (`IFS`)

Bash has a special variable, IFS, which is used to determine what characters to use as word splitting delimiters. By default, IFS is unset, and acts as if it were set to space, tab, newline ($' \t\n'). The result of this is that you can use any of these forms of white space to delimit words.

IFS is only used during word splitting of data. For instance, this example does not split any data. There is only script code. A line of script code is always split only on whitespace:

    rm myfile myother:file
      ^      ^


On the other hand, if the files come from a variable that gets expanded, the word splitting is being performed on data that comes from a variable. In this case, IFS is used:

    # BAD code.

    files='myfile myother:file'
    rm $files

    # After parameter expansion:
    rm myfile myother:file
      ^              ^

    [myfile myother]

Note how we have two separate forms of word splitting going on here. First, the command is split where there's whitespace. Then, the expansion is split using IFS. In this case, the whitespace does NOT split into words; only the colon does.

When and why not to use a custom IFS

It is important that you understand the danger of changing the way the shell behaves. If you modify IFS, word splitting will happen in a non-default manner henceforth. Some will recommend you save IFS and reset it to the default later on. Others will recommend to unset IFS after you're done with your custom word splitting.

Personally, I prefer to recommend you to NOT modify IFS on the script level. Ever. First of all, it is very bad practice to actually USE data-level word-splitting. Word splitting is a very dodgy technique, especially because along with it comes unwanted pathname expansion (bash will search for any files that match the name of your newly expanded words and actually replace your words with any files that match it -- it takes time and you never actually want this happening to your words). Secondly, re-setting IFS is usually not enough to limit the scope of your custom wordsplitting. Consider the following example:

    # BAD code.

    for name in $names
    do ...; done
    unset IFS

Here, IFS needs to be set to , in order to split the names string into comma-delimited elements. However, doing so means your entire for loop's body will be running under a non-default IFS, which may cause it to exhibit undesirable effects.

The right way to do this would be either of these constructs:

    # Ex 1: Arrays
    names=( Tom Jack )
    for name in "${names[@]}"
    do ...; done

    # Ex 2: Safe word-splitting (no pathname expansion) and custom IFS limited to read command only
    IFS=, read -ra namesArray <<< "$names"
    for name in "${namesArray[@]}"
    do ...; done

    # Ex 3: A while-read loop
    while IFS= read -rd, name
    do ...; done <<< "$names,"

As you can see, setting a custom IFS can be handy, but is best done on the command, so that it only applies to that one command. Only a certain number of built-in bash commands can actually use IFS this way. read is one of them. You cannot scope unquoted expansion this way, but that's alright, because you really shouldn't be doing it in the first place!


    # BAD code.
    IFS=, namesArray=( $names )             # Doesn't work: You cannot scope IFS this way.  It actually sets
                                            # both IFS and namesArray for the whole script.

    IFS=,; namesArray=( $names ); unset IFS # "Works" but is BAD code.  Pathname expansion will still occur.

    # Good code.
    IFS=, read -ra namesArray <<< "$names," # IFS is scoped only to 'read' and the names variable is split
                                            # safely, without pathname expansion.


Now, let's come back to our first example: we wanted to delete the chapter files of our ebook. Doing this from a shell seems problematic, because the chapter filenames contain whitespace. This is not a problem whatsoever for the system call, but it is a big problem for the shell. The shell already uses whitespace for something very important: determining what chunks of our statement to pass as separate arguments.

If we were to, naively, tell the shell to delete the first chapter, without any thought or consideration for its syntax, this is what would happen:

    rm James, P.D. - Children of Men - Chapter 1.pdf
      ^      ^    ^ ^        ^  ^   ^ ^       ^


Your shell would be passing the rm program 9 filenames for deletion, none of which is the intended filename. rm would try to delete each filename. If you were unlucky, rm might delete some of your files that you never intended to delete by your accident.

From the Word Splitting section above, you know why the shell does this now. But how do we help the shell to understand what we really wish to accomplish?

The problem is that whitespace is syntax to the shell. That means the shell acts on it. We don't want the whitespace in our filename to mean anything to the shell; we just want it to be part of the chunk of data, just like any of the other characters. Just like a normal, plain, happy byte. We want our whitespace to be literal whitespace.

Changing something from syntax into literal data involves one of two processes: Quoting or Escaping. Quoting our bytes is done by wrapping quotation marks around them. Escaping is done by preceding each byte by a backslash.

Pay special notice: these quotation marks must not be literal; just like our whitespace above, these quotes must be unquoted and unescaped to remain syntactical (that is, to retain their special powers).

    # Escaped:
    rm James,\ P.D.\ -\ Children\ of\ Men\ -\ Chapter\ 1.pdf

    # Quoted:
    rm James," "P.D." "-" "Children" "of" "Men" "-" "Chapter" "1.pdf

    # We can make that prettier by just quoting all of the characters in the argument:
    rm "James, P.D. - Children of Men - Chapter 1.pdf"

    [James, P.D. - Children of Men - Chapter 1.pdf]

Every byte that is embedded in syntactical quotes is no longer considered syntactical (with some quote-specific exceptions I won't go into now). What that means, is that if we quote the string foo bar, each character in that string will lose any special purpose or meaning to the shell. The shell will see them as ordinary bytes and pass them along to the chunk it's working on.

Since the quotes (or backslashes) are syntactical, they are removed by the shell; just like that one syntactical space, they are not included in any data chunks. All literal bytes are included in chunks, however, which means we now get only two chunks: one with the command name, and another with the correct filename.

For more information on quotes and how exactly they behave, see Quotes.

Parameter Expansions

You should understand arguments and quotes well now. Let's introduce another concept that is very popular in shell scripts yet is frequently misunderstood.

Parameters are containers in memory that hold strings for us. We can later use these strings in shell commands without having to repeat the data: we "unload" the data from the memory containers into the statement. This "unloading" is called expansion, hence the term Parameter Expansion.

Variables are a common type of parameter. They are parameters with a distinct name and are easy to assign data to. The name of a variable contains only alphanumeric characters (and optionally, an underscore). It does not contain a dollar sign.

Expanding a parameter occurs by prefixing it with a dollar sign. The act of expansion causes the data in this parameter to be injected into the current statement, almost as though you replaced the parameter expansion with some sentence yourself.

    # BAD code.

    $ place=lawn
    $ echo Welcome to my $place.
    Welcome to my lawn.

It is vital to understand, however, that Quoting and Escaping are considered before parameter expansion happens, while Word Splitting is performed after. That means that it remains absolutely vital that we quote our parameter expansions, in case they may expand values that contain syntactical whitespace which will then in the next step be word-split.

It is almost never desirable to put syntactical whitespace in parameters. Perhaps you may want to include multiple chunks of data in one parameter; however, when this is necessary, it is important that you do NOT use a string (scalar) parameter, but use an array instead. (POSIX shells do not necessarily have arrays, but Bash and KornShell do.)

Here's what would happen if we expanded a parameter whose data contains whitespace, without quoting it:

    # BAD code.

    book="Children of Men.pdf"
    rm $book

    # After parameter expansion:
    rm Children of Men.pdf
      ^        ^  ^


Word Splitting Happens After PE

Quoting the parameter expansion causes its data to expand inside of a quoted context, meaning its whitespace will lose its syntactical value and will become literal:

    book="Children of Men.pdf"
    rm "$book"

    # After parameter expansion:
    rm [Children of Men.pdf]      # The [ and ] are pseudo-code; they are not really there but symbolize
      ^                           # that these bytes were marked as literal by the quotes above.

    [Children of Men.pdf]

Quoting Happens Before PE

Another common mistake many people make when they see word splitting errors is to try to include quotes inside their parameter data. This doesn't work, for the simple reason that these quotes inside parameters are literal quotes, not syntactic. By the time bash expands parameter values, it has already stopped treating quotes as syntactic elements.

    # BAD code.

    book='"Children of Men.pdf"'
    rm $book

    # After parameter expansion:
    rm "Children of Men.pdf"      # The quotes here are LITERAL quotes, NOT syntactical.  I wrote no [ and ]
      ^         ^  ^              # because there were no syntactical quotes in the rm command above.


Note that since you've expanded literal quotes, these quotes are now also part of the chunks, just like any other literal bytes. The whitespace, however, is not literal. Since word splitting on whitespace (actually, IFS) is done so late in the process, the shell still considers them to be delimiters, and produces the mess shown above.


This may be a bit much for you to grasp all at once, and grasp it well. Please bookmark this page if you think it will help you to come back and re-read it later.

To make things simple, consistent and safe for you; you should follow the following guidelines:

  • "Quote" any arguments that contain data which also happens to be shell syntax.

  • "$Quote" all parameter expansions in arguments. You never really know what a parameter might expand into; and even if you think it won't expand bytes that happen to be shell syntax, quoting will future-proof your code and make it safer and more consistent.

  • Don't try to put syntactical quotes inside parameters. It doesn't work.

And some additional related tips:

  • If you need to store multiple items together, use an array:

     files=( 1.pdf 2.pdf "1 and a half.pdf" )
     rm "${files[@]}"
  • Do NOT try to put commands inside parameters; you cannot properly quote the arguments. Use a function instead:

     search() { cd /foo || return 1; find . -name "$1"; }
     search '*.pdf'; search '*.jpg'


Arguments (last edited 2020-07-06 20:52:07 by GreyCat)