
Avoiding code injection

Code injection is a type of bug in which a user's input (or other uncontrolled information) is executed as a command. Well-written programs will not be subject to such exploits. This page will describe ways to make sure your program is not vulnerable to them.

Explanation

Code injection typically occurs when you have multiple layers of code interpretation nested inside each other. For example, if you write a bash program that calls awk, you have two layers of interpretation: bash and awk. Awk receives its script as an argument from bash. If you enclose the awk script in single quotes (and if there are no other layers involved), there is no chance of code injection. However, if you enclose the awk script in double quotes, with some variable expansions inside it (because you are attempting to pass information from bash to awk), then you have a potential code injection vulnerability.

# BAD!  Code injection vulnerability.
read -rp 'Enter a search value: ' value
awk "/$value/{print \$1}" "$file"

In this example, the intent was to allow the user to specify some value (a string, or a regular expression), then to ask awk to print the first field of each line which matches the user's input. This "works" as long as the user follows the rules, and only gives you strings that you expect. For example, if the user types fred, then awk sees this as its script:

/fred/{print $1}

But the user could type anything. You didn't validate or sanity check the input at all. If the user's input contains punctuation characters that are meaningful to awk, then awk may see something like this:

//{system("echo HAHA")} /fred/{print $1}

Now awk will execute the shell command echo HAHA for every line of the file, in addition to printing the first field of each line matching fred. You've opened up a door for the user to cause all kinds of chaos.
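To see exactly what awk receives, it can help to print the script string instead of running it. The following is a minimal sketch: build_script is a hypothetical helper (not part of the original example) that performs the same double-quoted expansion as the vulnerable command, and the hostile value is hard-coded. Nothing is executed.

```bash
# build_script is a hypothetical helper for illustration only. It performs
# the same double-quoted expansion as the vulnerable example, but prints
# the resulting awk script instead of handing it to awk.
build_script() {
  printf '%s\n' "/$1/{print \$1}"
}

# A well-behaved user:
build_script 'fred'

# A hostile user, whose input closes the regex and adds an action:
build_script '/{system("echo HAHA")} /fred'
```

The second call prints the two-rule script shown above. The user's input has become awk code.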

The bash/awk layering is extremely common, but there are many other ways that multiple layers of interpretation can happen. Perhaps the most notorious is the eval command, which explicitly requests a second shell interpretation. There are valid reasons to do this, but it must be done with extreme care.

Other layering combination examples include find -exec, calling ssh to run a command remotely, or passing an SQL command to a database interface such as mysql or psql. The SQL example was illustrated (literally) in an xkcd comic.

The fundamental way to avoid code injection is to pass data separately from code. You have many choices for how to do this, depending on the data. You could use environment variables, or you could pass it on an open file descriptor (such as stdin), or you could put it in a file, or you could pass it as an additional argument, separate from the argument that contains the code.

Environment variables

Environment variables are a convenient way to pass small to medium amounts of data to a child process, regardless of the child process's language. There are system limitations on the size of the environment, so this isn't suitable for very large amounts of data, but for things like user input (with perhaps some sanity check on the size), it's hard to beat.

Virtually every programming language gives you some way to read environment variables. In bash, they appear just like regular string variables, primed and ready. In many other scripting languages, they appear in a hash, dictionary or array with a specific name.

Using awk as our example, we could write our program like this:

read -rp 'Enter a search value: ' value
# Reject absurdly long input before it goes into the environment.
((${#value} <= 1000)) || { echo "No, I don't think so" >&2; exit 1; }
search="$value" awk '$0 ~ ENVIRON["search"] {print $1}' "$file"

Here, we create an environment variable named "search" in the temporary execution environment of awk. This environment variable contains the user's input, from the bash variable value. The awk script finds the "search" element in the ENVIRON array (awk's arrays are like bash's associative arrays). This variable is used in a regular expression match against the entire input line ($0 in awk's syntax).

This approach is perfectly safe no matter what crazy input the user types.
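You can check this with the hostile input from the explanation above. The sketch below is a variation, not the example itself: first_field_literal is a hypothetical wrapper, and it assumes you want a literal substring match rather than a regular expression, so awk's index() function is swapped in for the ~ operator. The data still travels through ENVIRON in exactly the same way.

```bash
# first_field_literal is a hypothetical wrapper for illustration.
# It reads lines on stdin and prints field 1 of every line that contains
# the string in $1, taken literally (index() instead of a regex match).
first_field_literal() {
  search="$1" awk 'index($0, ENVIRON["search"]) {print $1}'
}

# Ordinary input: prints "fred".
printf '%s\n' 'fred 1' 'wilma 2' | first_field_literal 'fred'

# Hostile input: awk only ever sees it as a string to look for.
# No line contains it, so nothing is printed and nothing is executed.
printf '%s\n' 'fred 1' 'wilma 2' | first_field_literal '/{system("echo HAHA")} /fred'
```

The awk script is a fixed single-quoted string in both calls, so there is no layer in which the input could be parsed as code.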

The major limitation of the environment variable approach is that it only works when your next interpretation layer is a child process on the same system. If you need to pass data to a remote system (e.g. over ssh), this won't work at all.

Awk variables

As we've seen, awk can accept data in environment variables. However, for awk specifically, the most common way is to create an awk variable with awk's -v option.

read -rp 'Enter a search value: ' value
awk -v search="$value" '$0 ~ search {print $1}' "$file"

Here, we tell awk to create a variable named search, which has the user's input, in the argument that follows -v. The awk script uses the search variable in a regular expression match, just like in the environment variable example.
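One difference between the two approaches is worth knowing about: awk processes backslash escape sequences in a -v assignment, as if the value were an awk string literal, but it leaves ENVIRON values alone. The sketch below shows the difference using two hypothetical helper functions (len_via_v and len_via_environ are names made up for this illustration).

```bash
# Both helpers are hypothetical, for illustration only. Each reports how
# many characters awk ends up with for the same input string.
len_via_v()       { awk -v s="$1" 'BEGIN { print length(s) }'; }
len_via_environ() { s="$1" awk 'BEGIN { print length(ENVIRON["s"]) }'; }

# The input is four characters: a, backslash, t, b.
len_via_v 'a\tb'        # -v turns backslash-t into a single tab character
len_via_environ 'a\tb'  # ENVIRON keeps all four characters as they are
```

This is not a code injection, because the value is still only data, but if the user's input may contain backslashes and you need it preserved byte for byte, ENVIRON is the safer choice.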

Additional arguments

This approach is mostly used when your next interpretation layer is a shell that you invoke with bash -c or sh -c, though it can also be adapted to other situations. Instead of embedding some variable expansion inside a double-quoted script after the -c, you pass the variable as an additional argument to the shell.

As an example, let's suppose we wanted to do something like this:

# BAD!  Code injection vulnerability.
read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c "mv \"\$@\" \"$dest\"" x {} +

The intent here is to have find execute the following command:

mv "$@" "$dest"

However, this doesn't work safely. The user could type a double quote as part of the input. That literal double quote, plus whatever comes after it, would be embedded inside the script that sh runs.

Now, there are many ways to work around this, but for the purpose of illustrating the "additional arguments" approach, let's pass the destination directory as an extra argument to the script. Due to the limitations of find (which requires that {} be the last thing before the +), we have to put it before the {}. That leaves us with two choices. Either we pass "$dest" after the x that becomes $0, or we pass it as $0. Since we're not using $0 for anything else, we might as well use it for the destination directory.

read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c 'mv "$@" "$0"' "$dest" {} +

The other way is a bit longer, but it may be a useful example when you adapt it to other problems (e.g. if you need to send more than one variable):

find . -name '*.txt' -exec sh -c 'dest="$1"; shift; mv "$@" "$dest"' x "$dest" {} +
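If you want to convince yourself that this is safe without moving any files, here is a non-destructive sketch of the longer form. plan_moves is a hypothetical stand-in for the find command: it takes the destination and some file names as arguments, and the inner script prints what it would move instead of calling mv.

```bash
# plan_moves is a hypothetical helper for illustration only.
# Usage: plan_moves DEST FILE...
# The inner sh script is a fixed single-quoted string; the destination
# arrives as $1 and the file names as the remaining arguments.
plan_moves() {
  dest=$1
  shift
  sh -c 'dest=$1; shift
         for f in "$@"; do printf "%s -> %s\n" "$f" "$dest"; done' x "$dest" "$@"
}

# A destination containing quotes and a command stays plain data.
plan_moves '"; echo HAHA; "' a.txt 'b c.txt'
```

Each output line shows a file name followed by the destination exactly as it was typed, quotes and semicolons included. The echo HAHA is never run.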

Please note that it is not safe to pass additional arguments to an ssh command in the obvious way. ssh does not maintain your argument vector across the remote connection; instead, it mashes all the arguments together into a single string, and then passes that entire string as one argument to the remote shell for re-parsing. Bash FAQ 96 explains a technique that can be used instead, involving printf %q. I won't repeat it here, except to reiterate the requirement that the login shell of your account on the remote system must be bash, not any other shell. (printf %q produces output that is not portable.)

Standard input

Data can also be given to a program on standard input, or any other open file descriptor. This approach is commonly used when you need to send data to a remote system. Opening a new file descriptor just for this special data is a great idea, but there are many cases where only one file descriptor is available; for example, ssh only provides one. This creates limitations, but you have to work with the world as you find it.

Apart from a shortage of unused file descriptors, the main limitation of this approach is that you need to serialize the data in a stream which can be parsed by the receiving layer. If at all possible, the preferred way to serialize data is to put a NUL byte after each piece (variable). This requires that the receiver be able to handle such an input stream. Bash can do it, as we've seen previously. BSD/GNU xargs -0 can also handle it, although that should be considered a choice of last resort.

Since Bash FAQ 96 already goes over the basics, we'll need something a bit more elaborate to justify including it here.

Suppose we need to send three variables to a script on a remote host. The script doesn't need any other inputs from the client system, so it's OK if we tie up stdin for this purpose. We could do it like this:

printf '%s\0' "$v1" "$v2" "$v3" |
ssh user@host '
  IFS= read -rd "" v1
  IFS= read -rd "" v2
  IFS= read -rd "" v3
  fooscript "$v1" "$v2" "$v3"
'

For this to work, the login shell of your account on the remote host must be bash. read -d "" is a bash-specific feature, not available in (most) other shells. ssh will send us back the remote host's stdout and stderr, separately, and we can do whatever we like with those.
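The receiving half of this pipeline can be tried locally, without a remote host. The sketch below makes two substitutions that are assumptions for the sake of testing: bash -c stands in for ssh user@host, and a printf stands in for fooscript. send_three is a hypothetical wrapper name.

```bash
# send_three is a hypothetical wrapper for illustration only.
# It serializes three values with a NUL byte after each one, then has a
# child bash (standing in for the remote shell) read them back.
send_three() {
  printf '%s\0' "$1" "$2" "$3" |
  bash -c '
    IFS= read -rd "" v1
    IFS= read -rd "" v2
    IFS= read -rd "" v3
    # Stand-in for fooscript: show each value between angle brackets.
    printf "<%s>\n" "$v1" "$v2" "$v3"
  '
}

send_three 'two words' '$(echo HAHA)' 'semi; colon'
```

Each value comes out the other side unchanged, including the one that looks like a command substitution, because the receiving script never contains the data.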