Differences between revisions 6 and 7
Revision 6 as of 2010-07-30 13:12:18
Size: 13632
Editor: GreyCat
Comment:
Revision 7 as of 2010-07-30 13:13:19
Size: 13655
Editor: GreyCat
Comment: oh, and a category tag too.

Arguments

This page describes what is probably the most important and most misunderstood topic in shell programming.

It is absolutely vital that you understand everything that is explained here thoroughly before you do any important work in the shell. Misunderstanding what arguments are and how word-splitting works will lead to unexpected bugs, even in code that you have tested and that appears to work well, and in worse cases to severe corruption and data loss.

Executing commands

A shell is an interface between you (or your script) and the kernel. It allows you to execute commands using simplified syntax as compared to invoking direct system calls.

What your shell really does for you is translate its syntax into system calls. Because of this, it is important that we at least understand the basics of what happens when the shell has finished reading your orders and begins doing your bidding.

Executing commands happens through the execve(2) system call. This call needs three pieces of information:

  • The file to execute: This can be a binary program or a script.
  • An array of arguments: A list of strings that tell the program what to do.
  • An array of environment variables

To give context to a program and tell it what to do, we provide it with an array of arguments; that is, we give it a series of strings. Each of these strings can contain any character (byte) except a NUL byte. That means each argument can be a word, a sentence, or more.

If we, for example, wish to delete a certain ebook, we might invoke the rm program, providing it with the pathnames of the files we wish to remove:

    Execute File:   [ "/bin/rm" ]
    With arguments: [ "James, P.D. - Children of Men - Chapter 1.pdf" ]
                    [ "James, P.D. - Children of Men - Chapter 2.pdf" ]
                    ...

The rm command will use these two arguments to determine what to delete. In its case, it will unlink(2) each argument.

(In reality, when invoking the execve(2) system call, we pass one extra argument. This zeroth argument is the name that we wish to give the process. It's only vaguely defined what exactly that means and what it's used for, and it's irrelevant to this topic. Suffice it to say that bash uses the first chunk after word splitting (the command name chunk) as the zeroth argument. I'll be omitting any mention of it from here on.)

So remember: Arguments are strings of characters; each of them can contain any character (other than a NUL byte) and we can pass several arguments when we execute a file.

Shell Syntax

To make it easy for you to express yourself when asking the system to perform an operation, shells exist that translate their simplified syntax into system calls. It is imperative that we understand this syntax correctly if we are to avoid mistakes and bugs.

Shell syntax has been built to provide an intuitive way for us to communicate with the system. It uses techniques such as WordSplitting and English keywords to allow us to express our wishes in a language that closely resembles the way we would write to each other. Don't be fooled though: this syntax is very exact and shells are not humans; they can't guess at what you might mean if you don't express yourself clearly and unambiguously. Do not guess at shell syntax based on intuition. Understand it, and then write exactly what you mean.

To execute a simple rm command from the shell, we would use a statement as plain as:

    rm myfile myotherfile

This would instruct the shell to delete (remove) two files: myfile, and myotherfile. How does the shell know this? How does it convert a sentence into system calls? The key to this is Word Splitting.

Word Splitting

To a shell, whitespace is incredibly important. So don't be fooled into thinking a space or tab more or less won't make much of a difference. And don't assume that because whitespace isn't very relevant in C or Java, that the same goes for your shell. Whitespace is vital to allowing your shell to understand you.

The shell takes your line of code and cuts it up into bits wherever there are sequences of syntactical whitespace. The command above would be split up into the following:

    rm myfile myotherfile
      ^      ^

    [rm]
    [myfile]
    [myotherfile]

As you can see, all syntactical whitespace has been removed. There is no more whitespace left after word splitting is done with your line. We simply have three completely separate chunks of characters: One says rm, the other says myfile, and the last reads myotherfile. The shell now uses these chunks to build its execve(2) system call.

The shell builds an array of arguments to pass to the operating system. The elements (strings) in this array are rm, myfile and myotherfile. The shell searches the directories listed in the PATH environment variable for a program named rm, and then invokes execve(2) with the pathname it found and this array. The operating system runs that program with the arguments from the array, and rm then unlink(2)s those files.

(BASH also stores the location of rm in a hash, so that it does not need to search PATH again. Other shells may not do that, and it's not really important at the moment.)
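
The splitting described above can be observed directly. The following is only a sketch: show_args is a helper name invented for this illustration (it is not a standard command), and it simply prints each argument it receives between brackets.

```shell
# show_args is a hypothetical helper: it prints how many arguments it
# received, then each argument on its own line inside brackets.
show_args() {
    printf '%d argument(s)\n' "$#"
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# The extra spaces between the two words are syntactical whitespace.
# The shell discards them while cutting the line into chunks, so the
# helper still receives exactly two arguments.
show_args myfile      myotherfile
```

However many spaces or tabs separate the words, the helper never sees them; it only sees the chunks.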

Quoting

Now, let's come back to our first example: we wanted to delete the chapter files of our ebook. Doing this from a shell seems problematic, because the chapter filenames contain whitespace. This is not a problem whatsoever for the system call, but it is a big problem for the shell. The shell already uses whitespace for something very important: determining what chunks of our statement to pass as separate arguments.

If we were to, naively, tell the shell to delete the first chapter, without any thought or consideration for its syntax, this is what would happen:

    rm James, P.D. - Children of Men - Chapter 1.pdf
      ^      ^    ^ ^        ^  ^   ^ ^       ^

    [rm]
    [James,]
    [P.D.]
    [-]
    [Children]
    [of]
    [Men]
    [-]
    [Chapter]
    [1.pdf]

Your shell would be passing the rm program 9 filenames for deletion, none of which is the intended filename. rm would try to delete each filename. If you were unlucky, rm might delete some files that you never intended to delete.

From the Word Splitting section above, you know why the shell does this now. But how do we help the shell to understand what we really wish to accomplish?

The problem is that whitespace is syntax to the shell. That means the shell acts on it. We don't want the whitespace in our filename to mean anything to the shell; we just want it to be part of the chunk of data, just like any of the other characters. Just like a normal, plain, happy byte. We want our whitespace to be literal whitespace.

Changing something from syntax into literal data involves one of two processes: Quoting or Escaping. Quoting our bytes is done by wrapping quotation marks around them. Escaping is done by preceding each byte by a backslash.

Pay special attention: these quotation marks must not be literal; just like our whitespace above, these quotes must be unquoted and unescaped to remain syntactical (that is, to retain their special powers).

    # Escaped:
    rm James,\ P.D.\ -\ Children\ of\ Men\ -\ Chapter\ 1.pdf

    # Quoted:
    rm James," "P.D." "-" "Children" "of" "Men" "-" "Chapter" "1.pdf
    # But also valid is the cleaner:
    rm "James, P.D. - Children of Men - Chapter 1.pdf"
      ^
    [rm]
    [James, P.D. - Children of Men - Chapter 1.pdf]

Every byte that is embedded in syntactical quotes is no longer considered syntactical (with some quote-specific exceptions I won't go into now). What that means is that if we quote the string foo bar, each character in that string will lose any special purpose or meaning to the shell. The shell will see them as ordinary bytes and pass them along to the chunk it's working on.

Since the quotes (or backslashes) are syntactical, they are removed by the shell; just like that one syntactical space, they are not included in any data chunks. All literal bytes are included in chunks, however, which means we now get only two chunks: one with the command name, and another with the correct filename.
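
To check this for yourself without deleting anything, you can count the chunks instead of passing them to rm. This is a minimal sketch; count_args is a made-up helper for this illustration, not a standard command.

```shell
# count_args is a hypothetical helper that reports how many arguments
# the shell passed to it.
count_args() { echo "$#"; }

# Unquoted: every run of whitespace separates two arguments.
unquoted=$(count_args James, P.D. - Children of Men - Chapter 1.pdf)

# Escaped or quoted: the whitespace is literal data, so the whole
# filename arrives as a single argument.
escaped=$(count_args James,\ P.D.\ -\ Children\ of\ Men\ -\ Chapter\ 1.pdf)
quoted=$(count_args "James, P.D. - Children of Men - Chapter 1.pdf")

echo "unquoted: $unquoted, escaped: $escaped, quoted: $quoted"
```

The escaped and the quoted forms are two spellings of the same single chunk; the backslashes and quotes themselves never reach the program.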

For more information on quotes and how exactly they behave, see Quotes.

Parameter Expansions

You should understand arguments and quotes well now. Let's introduce another concept that is very popular in shell scripts, yet misunderstood almost as often.

Parameters are containers in memory that hold strings for us. We can later use these strings in shell commands without having to repeat the data: we "unload" the data from the memory containers into the statement. This "unloading" is called expansion, hence the term Parameter Expansion.

Variables are a common type of parameter. They are parameters with a distinct name and are easy to assign data to. The name of a variable contains only letters, digits and underscores, and does not begin with a digit. It does not contain a dollar sign.

Expanding a parameter occurs by prefixing it with a dollar sign. The act of expansion causes the data in this parameter to be injected into the current statement, almost as though you replaced the parameter expansion with some sentence yourself.

    $ place=lawn
    $ echo Welcome to my $place.
    Welcome to my lawn.

It is vital to understand, however, that Quoting and Escaping are considered before parameter expansion happens, while Word Splitting is performed after. That means we must quote our parameter expansions, in case they expand to values that contain whitespace, which the next step would treat as syntactical and word-split.

It is almost never desirable to put syntactical whitespace in parameters. Perhaps you may want to include multiple chunks of data in one parameter; however, when this is necessary, it is important that you do NOT use a string (scalar) parameter, but use an array instead. (POSIX shells do not necessarily have arrays, but Bash and KornShell do.)

Here's what would happen if we expanded a parameter whose data contains whitespace, without quoting it:

    book="Children of Men.pdf"
    rm $book

    # After parameter expansion:
    rm Children of Men.pdf
      ^        ^  ^

    [rm]
    [Children]
    [of]
    [Men.pdf]

Word Splitting Happens After PE

Quoting the parameter expansion causes its data to expand inside of a quoted context, meaning its whitespace will lose its syntactical value and will become literal:

    book="Children of Men.pdf"
    rm "$book"

    # After parameter expansion:
    rm [Children of Men.pdf]      # The [ and ] are pseudo-code; they are not really there but symbolize
      ^                           # that these bytes were marked as literal by the quotes above.

    [rm]
    [Children of Men.pdf]
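
The difference between the unquoted and the quoted expansion can be measured safely by counting arguments rather than running rm. This sketch assumes the default IFS (space, tab, newline); count_args is a helper invented for the illustration.

```shell
# count_args is a hypothetical helper that reports how many arguments
# the shell passed to it.
count_args() { echo "$#"; }

book="Children of Men.pdf"

# Unquoted expansion: the expanded value is word-split afterwards.
# (Left unquoted on purpose, to show the problem.)
unquoted=$(count_args $book)

# Quoted expansion: the value expands inside quotes and stays whole.
quoted=$(count_args "$book")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```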

Quoting Happens Before PE

Another common mistake many people make when they see word splitting errors is to try to include quotes inside their parameter data. This doesn't work, for the simple reason that these quotes inside parameters are literal quotes, not syntactic. By the time bash expands parameter values, it has already stopped treating quotes as syntactic elements.

    book='"Children of Men.pdf"'
    rm $book

    # After parameter expansion:
    rm "Children of Men.pdf"      # The quotes here are LITERAL quotes, NOT syntactical.  I wrote no [ and ] because
      ^         ^  ^              # there were no quotes in the above rm command to tell bash to literalize any bytes.

    [rm]
    ["Children]
    [of]
    [Men.pdf"]

Note that since you've expanded literal quotes, these quotes are now also part of the chunks, just like any other literal bytes. The whitespace, however, is not literal. Since word splitting on whitespace (actually, IFS) is done so late in the process, the shell still treats that whitespace as a delimiter, and produces the mess shown above.
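
Here is a small sketch of both points: the literal quotes that end up inside the chunks, and the role of IFS in deciding where the split happens. show_args and the list variable are names made up for this illustration.

```shell
# show_args is a hypothetical helper that prints each argument it
# receives on its own line, inside brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# The double quotes are stored in the parameter as plain data.
book='"Children of Men.pdf"'

# Unquoted on purpose: the value is split on whitespace and the
# literal quote characters stay attached to the first and last chunk.
show_args $book

# Word splitting uses the characters in IFS as delimiters. The change
# is made inside a subshell so it does not leak into the rest of the
# script. With IFS set to a colon, the space no longer splits.
list="a:b c"
( IFS=:; show_args $list )
```

The first call prints three chunks with the quotes still attached; the second prints [a] and [b c], because only the colon is a delimiter there.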

Conclusion

This may be a bit much for you to grasp all at once, and grasp it well. Please bookmark this page if you think it will help you to come back and re-read it later.

To make things simple, consistent and safe for yourself, follow these guidelines:

  • "Quote" any arguments that contain data which also happens to be shell syntax.

  • "$Quote" all parameter expansions in arguments. You never really know what a parameter might expand into; and even if you think it won't expand bytes that happen to be shell syntax, quoting will future-proof your code and make it safer and more consistent.

  • Don't try to put syntactical quotes inside parameters. It doesn't work.

And some additional related tips:

  • If you need to store multiple items together, use an array:

     files=( 1.pdf 2.pdf "1 and a half.pdf" )
     rm "${files[@]}"

  • Do NOT try to put commands inside parameters; you cannot properly quote the arguments. Use a function instead:

     search() { cd /foo && find . -name "$1"; }   # run find only if the cd succeeded
     search '*.pdf'; search '*.jpg'
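
Both tips can be tried out safely in a scratch directory. The sketch below needs Bash (arrays and shopt are not POSIX sh); the directory comes from mktemp, and the names workdir, created and remaining are chosen just for this illustration. Unlike the search function above, this variant passes the directory to find instead of changing into it.

```shell
#!/usr/bin/env bash
shopt -s nullglob                 # an unmatched glob expands to nothing

# Work in a throwaway directory so nothing of value can be removed.
workdir=$(mktemp -d) || exit 1

# One array element per file, even when a name contains whitespace.
files=( "$workdir/1.pdf" "$workdir/2.pdf" "$workdir/1 and a half.pdf" )
touch "${files[@]}"
created=( "$workdir"/* )
echo "created ${#created[@]} files"

# A function takes the place of a "command stored in a parameter";
# its argument is quoted when used, so the pattern reaches find intact.
search() { find "$workdir" -name "$1"; }
found=$(search '*half*')
echo "found: $found"

# Each element expands to exactly one argument for rm.
rm "${files[@]}"
remaining=( "$workdir"/* )
echo "remaining ${#remaining[@]} files"

rmdir "$workdir"
```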


CategoryShell

Arguments (last edited 2024-06-03 03:52:03 by larryv)