Diff for "BashParser"

Differences between revisions 11 and 12

The Bash Parser

This page informally describes parsing, expansion, and argument handling, but blurs some important distinctions that depend upon the type of command being handled. See Bash grammar and Parsing and execution on bash-hackers for a more precise treatment. It is important to have a good understanding of how Bash reads your commands and parses them into executable code, but it is even more important to understand the grammar of the language than implementation-specific parser details.

Parsing

There are several stages of parsing which occur in multiple passes: on the level of entire script files; within individual commands; and line-by-line. Code undergoes several intermediary internal representations throughout the evaluation process, some of which can't be analyzed through Bash's debugging facilities.

During Bash's initial intake of code -- as it reads source files or interactive input -- commands are parsed both line-by-line and command-by-command. Certain aspects of parsing are tied closely with lines. HereDocument parsing, some error handling behaviours, and some details of metacharacter parsing (e.g. extglobs) are tied to newlines. The extent to which Bash deals in "lines" is unclear and there is considerable variation across different shells. For example, some shells will accept !; cmd or ! !; cmd, whereas Bash requires a real newline and can't handle a semicolon in this case. Still other shells can't handle either type of null pipeline even with a newline.

In other respects, Bash parses commands in chunks whose scope encompasses roughly that of the current compound command. Most people notice this when they accidentally forget a closing fi, or the semicolon before the closing curly brace of a command group.
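
One way to observe this chunked parsing is Bash's -n (noexec) option, which reads and parses commands without running them. The following is only a sketch; the snippets passed to bash are made up for illustration, and the variable names are hypothetical:

```shell
# bash -n parses the code without running it, so only syntax errors surface.
# A missing 'fi' is reported at end of input, because the parser keeps
# reading until the whole compound command is complete.
if bash -n -c 'if true; then echo hi' 2>/dev/null; then
  missing_fi=accepted
else
  missing_fi=rejected
fi

# A command group needs a ';' or newline before the closing brace;
# otherwise '}' is taken as just another argument to echo.
if bash -n -c '{ echo hi }' 2>/dev/null; then
  missing_semicolon=accepted
else
  missing_semicolon=rejected
fi

# The corrected forms parse cleanly.
if bash -n -c 'if true; then echo hi; fi; { echo hi; }'; then
  fixed=accepted
else
  fixed=rejected
fi

echo "$missing_fi $missing_semicolon $fixed"
```

Because nothing is executed, this kind of check is safe to run on code whose side effects you do not want.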

Aside from syntax errors, most of the time you don't need to think about this part of parsing. It's the actual evaluation of the commands (and intermediary parsing steps that happen at that time) that matters. Nevertheless, you may run across these considerations in some advanced cases when writing portability wrappers involving code that a particular shell implementation chokes on, or when Bash handles errors that are sensitive to newlines. Much of this behaviour is unspecified, some differs between Bash POSIX and normal mode, and a few are likely bugs or just coincidental behaviour.

Command splitting

Step 1: Read data to execute.
- Data is parsed as described above, with various details differing depending upon interactive mode, POSIX mode, and certain shell options. Lines that end in the middle of a context that allows continuation to the next line are considered as a whole. An incomplete list of examples: compound array assignments; lines ending in a pipeline operator or in a list operator other than ; or &; lines ending within a quoted context without a closing quote, in the middle of a compound command, or inside an unclosed command substitution; and a simple command ending in a backslash character.
  - Step Input:
    echo "What's your name?"
    read name; echo "$name"
    
    Step Output:
    echo "What's your name?"
    - and
    read name; echo "$name"
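
A few of the continuation contexts listed in step 1 can be sketched as follows. The variable names are arbitrary and chosen only for this illustration:

```shell
# Each command below spans several physical lines but is read as one unit.

# 1. A trailing backslash joins the next line (backslash-newline is removed).
joined=$(echo one \
  two)

# 2. A line ending in a pipeline operator continues on the next line.
piped=$(printf 'b\na\n' |
  sort | tr '\n' ' ')

# 3. A line ending in && (a list operator other than ; or &) also continues.
listed=$(true &&
  echo continued)

# 4. An unclosed quote keeps reading; the newline becomes part of the string.
quoted="first
second"

printf '%s\n' "$joined" "$piped" "$listed" "$quoted"
```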
Step 2: Process quotes.
- ( Note: the next three steps are somewhat interconnected and dependent on one another. Describing them separately is an approximation. )
- Roughly, once Bash has read in your line of data, it looks through it in search of quotes. Bash does its best during this phase to make exceptions for brace expansion and for other factors that disrupt simple quote nesting rules, like command substitutions and some nested parameter expansions.
  
  Aside from that, the first "bare" quote it finds triggers a quoted state for all characters that follow up until the next quote of the same type. Note that bash doesn't actually process the contents of the quoted regions at this stage except to the extent necessary to determine later command splitting steps.
  
  If the quoted state was triggered by a double quote ("..."), all characters except for $, " and \ lose any special meaning they might have. That includes single quotes, spaces and newlines, etc. If the quoted state was triggered by a single quote ('...'), all characters except for ' lose their special meaning. Yes, also $ and \. If the quoted state was triggered by $'...', then all characters except for \ and ' lose their special meaning. Therefore, the following command will produce literal output:
```
    $ echo 'Back\Slash $dollar "Quote"'
    Back\Slash $dollar "Quote"
```
  The fact that the backslash loses its ability to cancel out the meaning of the next character means that this will not work:
```
    $ echo 'Don\'t do this'
    >
```
  Bash asks for the next line of input because, contrary to what we intended, the second quote (the one we tried to escape with the backslash) actually closed our quoted state, so the t do this was not quoted. The last quote on the line then opened a new quoted state, and Bash asks for more input until it is closed again. It is still trying to finish step 1: it reads data until it finds an unescaped newline, and the open single quote state is escaping our newline.

  Now that Bash knows which of the characters in the line of data are escaped (stripped of their ability to mean anything special to Bash) and which are not, Bash removes the quotes that were used to determine this from the data and proceeds to the next step.
  - Step Input:
    echo "What's your name?"
    
    Step Output:
    echo What's your name?
    - (Note: Every character originally between the double quotes has been marked as escaped. I will mark escaped characters in these examples by making them italic.)
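
Given the quoting rules in step 2, there are a few common ways to get a literal single quote into an argument. This is a minimal sketch; the variable names a, b and c are placeholders:

```shell
# Three ways to get a literal single quote into an argument.

# Close the single-quoted string, add an escaped quote, reopen it.
a='Don'\''t do this'

# Use double quotes; a single quote has no special meaning inside them.
b="Don't do this"

# Use $'...' (Bash-specific), where backslash escapes are recognised.
c=$'Don\'t do this'

printf '%s\n' "$a" "$b" "$c"
```

All three assignments produce the same string; which form reads best depends on what else the string has to contain.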
Step 3: Split the read data into commands.
- Our line is now split up into separate commands using ;, &, ||, &&, and characters defined as metacharacters such as ( and ) as command separators (Bash doesn't always do this correctly). Remember from the previous step that any ; characters that were quoted or escaped do not have their special meaning anymore and will not be used for command splitting. They will just appear in the resulting command line literally:
```
    $ echo "What a lovely day; and sunny, too!"
    What a lovely day; and sunny, too!
```
  - Step Input:
    read name; echo $name
    
    Step Output:
    read name
    - and
    echo $name
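
The difference between quoted and unquoted separators in step 3 can be sketched with a small helper. The function step and the variable log are hypothetical names used only to record which commands actually ran:

```shell
# Unquoted operators separate commands; quoted ones are ordinary characters.
log=""
step() { log="$log$1 "; }

step one; step two          # ';' separates two commands
false || step three         # '||' runs the right side because false failed
true && step four           # '&&' runs the right side because true succeeded
step "five; six && seven"   # quoted: one command with a single argument

echo "$log"
```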

Command expansion and evaluation

The remaining steps are processed for each individual command.

Step 4: Parse special operators.
- Look through each command to see whether there are any special operators such as {..}, <(..), < ..., <<< .., .. | .., etc. These are all processed in a specific order. If the command is compound, then Redirection operators that apply to that command are evaluated, and the command is processed following rules specific to each compound command, with different expansion steps either suppressed or applied depending on context.
  
  If the command is simple, then both assignment statements preceding commands (unless set -k is enabled) and redirections anywhere within the command are removed and saved for processing after step 5.
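
The handling of redirections and preceding assignments in a simple command can be sketched like this. The variable names are made up for the example, and mktemp and sh are assumed to be available:

```shell
# A redirection may appear anywhere in a simple command; it is removed
# before the remaining words become the command name and arguments.
tmpfile=$(mktemp)
> "$tmpfile" echo hello world
contents=$(cat "$tmpfile")
rm -f "$tmpfile"

# An assignment in front of an external command is passed to that
# command's environment only; it does not change the shell's own variable.
greeting=outer
seen=$(greeting=inner sh -c 'echo "$greeting"')

echo "$contents / $seen / $greeting"
```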
Step 5: Perform Expansions.
- Bash has many operators that involve expansion. The simplest of these is $parameter. The dollar sign followed by the name of a parameter, which optionally might be surrounded by braces, is called Parameter Expansion. What Bash does here is basically just replace the Parameter Expansion operator with the contents of that parameter. As such, the command echo $USER is converted in this step to echo lhunath (in my case). Other expansions include Pathname Expansion (echo *.txt), Command Substitution (rm "$(which nano)"), etc.
  - Step Input:
    echo "$PWD has these files that match *.txt :" *.txt
    
    Step Output:
    echo /home/lhunath/docs has these files that match *.txt : bar.txt foo.txt
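
The step 5 example can be reproduced in a scratch directory so that the glob result is predictable. This is a sketch under the assumption that mktemp -d is available; the file and variable names are invented for the illustration:

```shell
# Create a scratch directory with two matching files and one that is not.
dir=$(mktemp -d)
touch "$dir/foo.txt" "$dir/bar.txt" "$dir/notes.md"

# Quoted: *.txt inside the double quotes stays literal.
# Unquoted: *.txt undergoes pathname expansion (results are sorted).
set -- "files that match *.txt :" "$dir"/*.txt
count=$#                       # 1 quoted word + 2 matching files
first=$1
matches="${2##*/} ${3##*/}"    # strip the directory part for display

# Command substitution is replaced by the command's output.
upper=$(echo "$matches" | tr 'a-z' 'A-Z')

rm -rf "$dir"
echo "$count | $first | $matches | $upper"
```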
Step 6: Execute the command.
- Now that the command has been parsed into a command name and a set of arguments, Bash executes the command, passing it the list of words generated in the previous step as its arguments. If the command is a function or builtin, it is executed by the same Bash process that just went through all these steps. Otherwise, Bash first forks (creates a new bash process), initializes the new process with the settings that were parsed out of this command (redirections, arguments, etc.) and executes the command in that child process. The parent (the Bash that did these steps) waits for the child to complete the command.
  - Step Input:
    sleep 5
    
    Causes:
    ├┬· 33321 lhunath -bash
    │├──· 46931 lhunath sleep 5
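
The distinction between commands run in the current shell and commands run in a child process can be sketched with type -t, a Bash builtin that reports how a name will be resolved. The function greet and the variable names are hypothetical; that sleep resolves to an external file is an assumption that holds on most systems:

```shell
# type -t reports how a command name will be resolved.
greet() { echo hello; }
kind_function=$(type -t greet)   # function
kind_builtin=$(type -t cd)       # builtin
kind_external=$(type -t sleep)   # "file" on most systems (an assumption)

# A builtin runs inside the current shell, so its effects persist...
cd /
after_builtin=$PWD

# ...while anything done in a child process cannot alter the parent.
counter=1
bash -c 'counter=2'

echo "$kind_function $kind_builtin $kind_external $after_builtin $counter"
```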

After these steps, the next command, or next line is processed. Once the end of the file is reached (end of the script or the interactive bash session is closed) bash stops and returns the exit code of the last command it has executed.

Graphical Example

For a simplified example of the process, see: http://stuff.lhunath.com/parser.png

Note that this graphic uses the term word-splitting (also WordSplitting), or field splitting, incorrectly and confuses it with argument splitting. Argument splitting is performed before expansions and is based upon whitespace (except in traditional Bourne shells). Field splitting is based upon the value of IFS and occurs after expansions, just before pathname expansion.
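
The two kinds of splitting can be told apart by counting arguments. The helper count_args is hypothetical, and the sketch assumes IFS has its default value when it starts:

```shell
# Helper (for illustration only) that reports how many arguments it got.
count_args() { echo $#; }

var="a:b:c d"

# Parsing splits the command line on literal whitespace before any
# expansion, regardless of IFS: two words are written here.
literal=$(IFS=:; count_args a:b:c d)

# Field splitting applies to the result of an unquoted expansion and
# uses the current IFS: with IFS=: the value becomes "a", "b" and "c d".
colon_split=$(IFS=:; count_args $var)

# With the default IFS (space, tab, newline) it becomes "a:b:c" and "d".
default_split=$(count_args $var)

# A quoted expansion is never field split.
quoted=$(count_args "$var")

echo "$literal $colon_split $default_split $quoted"
```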

Common Mistakes

These steps might seem like common sense after looking at them closely, but they can be counter-intuitive in certain specific cases. As an example, here are a few cases where people often misjudge how Bash will interpret their command:

start=1; end=5; for number in {$start..$end}: Sequence Expansion (a form of Brace Expansion) happens in step 4, while Parameter Expansion happens in step 5 (this ordering is Bash-specific). Brace Expansion tries to expand {$start..$end} but can't. It sees the $start and $end as strings, not Parameter Expansions, and gives up:
- Step 4 Results:
```
start=1
end=5
for number in {$start..$end}
```
  Step 5 Results:
```
start=1
end=5
for number in {1..5}
```
  And number is now assigned the literal string {1..5} instead of iterating from 1 to 5. No Brace Expansion has been performed.
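
A common way around this is a C-style arithmetic for loop, which evaluates the variables at run time. The sketch below is one possible workaround rather than the only one; broken and fixed are hypothetical names that collect what each loop saw:

```shell
start=1; end=5

# Brace expansion sees the literal text {$start..$end}, finds no valid
# sequence, and leaves it alone; parameter expansion then yields {1..5}.
broken=""
for number in {$start..$end}; do broken="$broken$number "; done

# A C-style arithmetic for loop (Bash-specific) evaluates the variables
# at run time, so it iterates as intended.
fixed=""
for ((number = start; number <= end; number++)); do
  fixed="$fixed$number "
done

echo "broken: $broken"
echo "fixed:  $fixed"
```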
[ $name = B. Foo ]: Word Splitting will break this example. The test program ([) looks for four arguments in this case. A left hand side, an operator, a right hand side, and a closing ]. To find out what's wrong with this command, do as Bash does: Chop the command up into arguments. Assuming name contains B. Foo:
- [
- B.
- Foo
- =
- B.
- Foo
- ]
- A whole lot more than four. You need to use Quotes to prevent the space between B. and Foo from causing Word Splitting. Quote both the B. Foo and the $name, so that when $name is expanded, the whitespace in its value is treated the same as on the right-hand side. It is important to remember that the result of an unquoted expansion in step 5 is itself split into words before the command is executed in step 6. That means that $name is not safe from having its result cut up, because the cutting up happens after $name is replaced by the value within name.
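
The effect of quoting on this test can be checked directly. In this sketch the variables unquoted and quoted are hypothetical and only record the outcome of each form:

```shell
name="B. Foo"

# Unquoted, [ receives: B. Foo = B. Foo ] and reports an error
# ("too many arguments"), which counts as a false result.
if [ $name = B. Foo ] 2>/dev/null; then unquoted=true; else unquoted=false; fi

# Quoted, [ receives exactly three operands: "B. Foo" = "B. Foo"
if [ "$name" = "B. Foo" ]; then quoted=true; else quoted=false; fi

echo "unquoted=$unquoted quoted=$quoted"
```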
Remember that parts of the language that evaluate their input as full bash expressions, such as eval, . / source, trap, mapfile, and several other features, open up another can of worms. The data given to them gets run through the full bash parser and is subject to all evaluation steps. The trouble is that, before your code even gets to be used by these features, it is often subjected to unavoidable and undesired evaluation steps that occur during the act of passing the data. Side effects that hurt code integrity are hard to control, especially when influenced by user input. It is also easy to violate good coding principles by mixing code stored in non-code data structures into code that is to be evaluated.

Languages that have data structures specifically designed for holding program code (function and object literals, and closures) suffer these problems to a much lesser extent than Bash.
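
The re-parsing problem can be sketched by comparing eval on a string with a command kept in an array. The helper show and the value of file are invented for this illustration; storing the command in an array is one commonly used alternative, not the only one:

```shell
# Helper (hypothetical) that prints each argument it receives in brackets.
show() { printf '[%s]' "$@"; echo; }

file="my file; echo INJECTED"

# Building code as a string and passing it to eval re-parses the data:
# the ';' inside the value becomes a command separator.
via_eval=$(eval "show $file")

# Keeping the command in an array (Bash-specific) avoids re-parsing;
# each element stays one argument no matter what it contains.
cmd=(show "$file")
via_array=$("${cmd[@]}")

printf '%s\n' "$via_eval" "--" "$via_array"
```

With eval, the data is split into two arguments and the text after the semicolon runs as a second command; with the array, the whole value arrives as a single argument.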

CategoryShell

-  ⇤ ← Revision 11 as of 2012-07-21 16:25:18 → 
  Size: 10074
  Editor: 178
  Comment: Fix misspelled word
+   ← Revision 12 as of 2013-02-06 10:27:27 → ⇥
  Size: 13187
  Editor: ormaaj
  Comment: Partial fixes to many issues. The article is going to be hard to salvage. Completely deleted step 6 because it's in the wrong order and about the wrong concept. Graphic is also wrong, but left it.
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-It is imperative that you have a good understanding of how Bash reads your commands in and parses them into executable code.  Knowing how Bash works with your code is the key to writing code that works well with Bash.
+This page informally describes parsing, expansion, and argument handling, but fuzzes some important distinctions that depend upon the type of command being handled. See [[http://wiki.bash-hackers.org/syntax/basicgrammar | Bash grammar]] and [[http://wiki.bash-hackers.org/syntax/grammar/parser_exec | Parsing and execution]] on bash-hackers for a better look at this.  It is important that you have a good understanding of how Bash reads your commands in and parses them into executable code, but even more important to understand the grammar of the language than implementation-specific parser details.
 Line 5:
+=== Parsing ===
There are several stages of parsing which occur in multiple passes: on the level of entire script files; within individual commands; and line-by-line. Code undergoes several intermediary internal representations throughout the evaluation process, some of which can't be analyzed through Bash's debugging facilities.

During Bash's initial intake of code -- as it reads source files or interactive input -- commands are parsed both line-by-line and command-by-command. Certain aspects of parsing are tied closely with lines. HereDocument parsing, some error handling behaviours, and some details of metacharacter parsing (e.g. extglobs) are tied to newlines. The extent to which Bash deals in "lines" is unclear and there is considerable variation across different shells. For example, some shells will accept `!; cmd` or `! !; cmd`, whereas Bash requires a real newline and can't handle a semicolon in this case. Still other shells can't handle either type of ''null pipeline'' even with a newline.

In other respects, Bash parses commands in chunks whose scope encompasses roughly that of the current ''compound command''. Most will notice this when they accidentally forget a closing `fi`, or semicolon before a closing curly-brace ''command group''.

Aside from syntax errors, most of the time you don't need to think about this part of parsing. It's the actual evaluation of the commands (and intermediary parsing steps that happen at that time) that matters. Nevertheless, you may run across these considerations in some advanced cases when writing portability wrappers involving code that a particular shell implementation chokes on, or when Bash handles errors that are sensitive to newlines. Much of this behaviour is unspecified, some differs between Bash POSIX and normal mode, and a few are likely bugs or just coincidental behaviour. 

=== Command splitting ===
-Line 6:
+Line 16:
-  . Bash always reads your script or commands on the bash command prompt ''line by line''.  If your line ends with a backslash character, bash reads another line before processing the command and appends that other line to the current, with a literal newline inbetween.
  ''(I will from here on refer to the chunk of data Bash read in as the '''line''' of data; even though it is technically possible that this line contains one or more newlines.)''
+  . Data is parsed as described above with various details differing depending upon interactive mode, POSIX mode, and certain shell options. Lines that end in the middle of a context that allows continuation to the next line are considered as a whole. An incomplete list of examples: compound array assignments; lines ending in a pipeline operator, list operator other than `;` or `&`; lines ending within a quoted context without a closing quote, in the middle of a ''compound command'', an unclosed command substitution, or a ''simple command'' ending in a backslash character.
-Line 13:
+Line 23:
-  . Once Bash has read in your line of data, it looks through it in search of quotes.  The first quote it finds triggers a quoted state for all characters that follow up until the next quote of the same type.  If the quoted state was triggered by a double quote (`"..."`), all characters except for `$`, `"` and `\` lose any special meaning they might have.  That includes single quotes, spaces and newlines, etc.  If the quoted state was triggered by a single quote (`'...'`), all characters except for `'` lose their special meaning.  Yes, also `$` and `\`.  Therefore, the following command will produce literal output:
+  ( Note: the next three steps are somewhat interconnected and dependent on one another. Describing them separately is an approximation. )

  . Roughly, once Bash has read in your line of data, it looks through in search of quotes.  Bash does its best during this phase to make exceptions for ''brace expansion'' and other factors that disrupt simple quote nesting rules like ''command substitutions'' and some nested parameter expansions. <<BR>> <<BR>>Aside from that, the first "bare" quote it finds triggers a quoted state for all characters that follow up until the next quote of the same type.  Note that bash doesn't actually process the contents of the quoted regions at this stage except to the extent necessary to determine later command splitting steps. <<BR>> <<BR>> If the quoted state was triggered by a double quote (`"..."`), all characters except for `$`, `"` and `\` lose any special meaning they might have.  That includes single quotes, spaces and newlines, etc.  If the quoted state was triggered by a single quote (`'...'`), all characters except for `'` lose their special meaning.  Yes, also `$` and `\`. If the quoted state was triggered by `$'...'`, then all characters except for `\` and `'` lose their special meaning. Therefore, the following command will produce literal output:
-Line 23:
+Line 35:
-  Bash will ask you for the next line of input because unlike what we ''thought'' we did, the '''second''' quote, the one we tried to escape with the backslash, actually '''closed our quoted state''' meaning the `t do this` was '''not''' quoted.  The last quote on the line then '''opened''' our quoted state again, and bash asks for more input until it is closed again (it tries to finish step 1: it reads data until it finds an unescaped newline.  The opened single quote state is escaping our newline). Now that bash knows which of the characters in the line of data are escaped (stripped of their ability to mean anything special to Bash) and which are not, Bash removes the quotes that were used to determine this from the data and proceeds to the next step.
+  Bash will ask you for the next line of input because unlike what we ''thought'' we did, the '''second''' quote, the one we tried to escape with the backslash, actually '''closed our quoted state''' meaning the `t do this` was '''not''' quoted.  The last quote on the line then '''opened''' our quoted state again, and bash asks for more input until it is closed again (it tries to finish step 1: it reads data until it finds an unescaped newline.  The opened single quote state is escaping our newline). Now that Bash knows which of the characters in the line of data are escaped (stripped of their ability to mean anything special to Bash) and which are not, Bash removes the quotes that were used to determine this from the data and proceeds to the next step.
-Line 29:
+Line 41:
-  . Our line is now split up into separate commands using `;` as a command separator.  Remember from the previous step that any `;` characters that were quoted or escaped do not have their special meaning anymore and will not be used for command splitting.  They will just appear in the resulting command line literally:
+  . Our line is now split up into separate commands using `;`, `&`, `||`, `&&`, and characters defined as ''metacharacters'' such as `(` and `)` as command separators (Bash doesn't always do this correctly).  Remember from the previous step that any `;` characters that were quoted or escaped do not have their special meaning anymore and will not be used for command splitting.  They will just appear in the resulting command line literally:
-Line 38:
+Line 50:
-The following steps are executed for each command that resulted from splitting up the line of data:
+=== Command expansion and evaluation ===
The remaining steps are processed for each individual command.
-Line 41:
+Line 54:
-  . Look through the command to see whether there are any special operators such as `{..}`, `<(..)`, `< ...`, `<<< ..`, `.. | ..`, etc.  These are all processed in a specific order.  Redirection operators are removed from the command line, other operators are replaced by their resulting expression (eg. `{a..c}` is replaced by `a b c`).
   . '''Step Input:'''<<BR>> `diff <(foo) <(bar)`<<BR>> <<BR>> '''Step Output:'''<<BR>> `diff /dev/fd/63 /dev/fd/62`<<BR>>
    . ('''Note:''' The `<(..)` operator starts a background process to execute the command `foo` (and one for `bar`, too) and sends the output to a file.  It then replaces itself with the pathname of that file.)<<BR>>
   <<BR>>
+  . Look through each command to see whether there are any special operators such as `{..}`, `<(..)`, `< ...`, `<<< ..`, `.. | ..`, etc.  These are all processed in a specific order. If the command is compound, then Redirection operators that apply to that command are evaluated, and the command is processed following rules specific to each compound command, with different expansion steps either suppressed or applied depending on context. <<BR>> <<BR>> If the command is simple, then both assignment statements preceding commands (unless `set -k` is enabled) and redirections anywhere within the command are removed and saved for processing after step 5.
-Line 51:
+Line 61:
- * '''Step 6: Split the command into a command name and arguments.'''
  . The name of the command Bash has to execute is always the '''first word''' in the line.  The rest of the command data is split into words which make the arguments. This process is called ''Word Splitting''.  Bash basically cuts the command line into pieces wherever it sees whitespace.  This whitespace is completely removed and the pieces are called ''words''.  Whitespace in this context means:  Any spaces, tabs or newlines that are '''not escaped'''. (Escaped spaces, such as spaces inside quotes, lose their special meaning of whitespace and are not used for splitting up the command line.  They appear literally in the resulting arguments.) As such, if the name of the command that you want to execute or one of the arguments you want to pass contains spaces that you don't want bash to use for cutting the command line into words, you can use quotes or the backslash character:
  {{{
   My Command /foo/bar   ## This will execute the command named 'My' because it is the first word.
   "My Command" /foo/bar ## This will execute the command named 'My Command' because the space inside the quotes has lost its special meaning allowing it to split words.
}}}
   . '''Step Input:'''<<BR>> `echo `"`/home/lhunath/docs has these files that match *.txt :`"` bar.txt foo.txt`<<BR>> <<BR>> '''Step Output:'''<<BR>> '''Command Name: ''' '`echo`'<<BR>> '''Argument 1: ''' '`/home/lhunath/docs has these files that match *.txt :`'<<BR>> '''Argument 2: ''' '`bar.txt`'<<BR>> '''Argument 3: ''' '`foo.txt`'<<BR>> <<BR>>

 * '''Step 7: Execute the command.'''
+ * '''Step 6: Execute the command.'''
-Line 67:
+Line 69:
-For a more simplified example of the process, see:
http://stuff.lhunath.com/parser.png
+For a simplified example of the process, see: http://stuff.lhunath.com/parser.png
-Line 70:
+Line 71:
+Note that [[http://wiki.bash-hackers.org/syntax/expansion/wordsplit | word-splitting]] (also [[WordSplitting]]) or ''field splitting'' is used incorrectly in this graphic and confused with argument splitting, which is performed before expansions and is based upon whitespace (except in traditional Bourne shells), rather than the value of [[IFS]] during field splitting, which occurs just before [[glob | pathname expansion]].
-Line 74:
+Line 76:
- * `start=1; end=5; for number in {$start..$end}`:  ''Brace Expansion'' happens in step 4, while ''Parameter Expansion'' happens in step 5.  ''Brace Expansion'' tries to expand `{$start..$end}` but can't.  It sees the `$start` and `$end` as strings, not ''Parameter Expansion''s and gives up:
+ * `start=1; end=5; for number in {$start..$end}`:  ''Sequence Expansion'' happens in step 4, while ''Parameter Expansion'' happens in step 5 (this Bash-specific).  ''Brace Expansion'' tries to expand `{$start..$end}` but can't.  It sees the `$start` and `$end` as strings, not ''Parameter Expansion''s and gives up:
-Line 100:
+Line 102:
+ * Remember that parts of the language that evaluate their input as full bash expressions such as [[http://wiki.bash-hackers.org/commands/builtin/eval | eval]], `.` / `source`, [[SignalTrap | trap]], [[http://wiki.bash-hackers.org/commands/builtin/mapfile#the_callback | mapfile]], and several other features open up another can of worms. The data given them gets run through the full bash parser and is subject to all evaluation steps. The trouble is that often times before your code even gets to be used by these features, it gets subjected to unavoidable undesired evaluation steps that occur during the act of passing the data. Side-effects that hurt code-integrity are hard to control especially when influenced by user-input. It is also easy to violate good coding principles by mixing code stored in non-code datastructures into code that's to be evaluated. <<BR>> <<BR>> Languages that have datastructures specifically designed for holding program code (function and object literals, and closures) suffer these problems to a much lesser extent than Bash.