Avoiding code injection

Code injection is a type of bug in which a user's input (or other uncontrolled information) is executed as a command. Well-written programs will not be subject to such exploits. This page will describe ways to make sure your program is not vulnerable to them.

Explanation

Code injection typically occurs when you have multiple layers of code interpretation nested inside each other. For example, if you write a bash program that calls awk, you have two layers of interpretation: bash and awk. Awk receives its script as an argument from bash. If you enclose the awk script in single quotes (and if there are no other layers involved), there is no chance of code injection. However, if you enclose the awk script in double quotes, with some variable expansions inside it (because you are attempting to pass information from bash to awk), then you have a potential code injection vulnerability.

# BAD!  Code injection vulnerability.
read -rp 'Enter a search value: ' value
awk "/$value/{print \$1}" "$file"

In this example, the intent was to allow the user to specify some value (a string, or a regular expression), then to ask awk to print the first field of each line which matches the user's input. This "works" as long as the user follows the rules, and only gives you strings that you expect. For example, if the user types fred, then awk sees this as its script:

/fred/{print $1}

But the user could type anything. You didn't validate or sanity check the input at all. If the user's input contains punctuation characters that are meaningful to awk, for example /{system("echo HAHA")} /fred, then awk may see something like this:

//{system("echo HAHA")} /fred/{print $1}

Now awk will execute the shell command echo HAHA for every line of the file, in addition to printing the first field of each line matching fred. You've opened up a door for the user to cause all kinds of chaos.
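To see the problem without running anything dangerous, it can help to print the string that bash would hand to awk. This is only a sketch; the function name show_awk_script is made up for illustration, and awk itself is never invoked:

```shell
#!/usr/bin/env bash
# show_awk_script is a hypothetical helper: it prints the awk script that
# the vulnerable command would build, without running awk at all.
show_awk_script() {
  local value=$1
  printf '%s\n' "/$value/{print \$1}"
}

show_awk_script 'fred'
show_awk_script '/{system("echo HAHA")} /fred'
```

The second call prints a script containing a system() call that the programmer never wrote, which is exactly what awk would have executed.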

The bash/awk layering is extremely common, but there are many other ways that multiple layers of interpretation can happen. Perhaps the most notorious is the eval command, which explicitly requests a second shell interpretation. There are valid reasons to do this, but it must be done with extreme care.

Other layering combination examples include find -exec, calling ssh to run a command remotely, or passing an SQL command to a database interface such as mysql or psql. The SQL example was illustrated (literally) in an xkcd comic.

The fundamental way that you avoid code injections is to pass data separately from code. You have many choices for how to do this, depending on the data. You could use environment variables, or you could pass it on an open file descriptor (such as stdin), or you could put it in a file, or you could pass it as an additional argument, separate from the argument that contains the code.

Environment variables

Environment variables are a convenient way to pass small to medium amounts of data to a child process, regardless of the child process's language. There are system limitations on the size of the environment, so this isn't suitable for very large amounts of data, but for things like user input (with perhaps some sanity check on the size), it's hard to beat.

Virtually every programming language gives you some way to read environment variables. In bash, they appear just like regular string variables, primed and ready. In many other scripting languages, they appear in a hash, dictionary or array with a specific name.

Using awk as our example, we could write our program like this:

read -rp 'Enter a search value: ' value
# die is assumed to be a function that prints its message and exits.
((${#value} <= 1000)) || die "No, I don't think so"
search="$value" awk '$0 ~ ENVIRON["search"] {print $1}' "$file"

Here, we create an environment variable named "search" in the temporary execution environment of awk. This environment variable contains the user's input, from the bash variable value. The awk script finds the "search" element in the ENVIRON array (awk's arrays are like bash's associative arrays). This variable is used in a regular expression match against the entire input line ($0 in awk's syntax).

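This approach is safe from code injection no matter what crazy input the user types: the input reaches awk as data, never as part of the script. (It is still interpreted as a regular expression, because that is what the script asks awk to do with it.)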

The major limitation of the environment variable approach is that it only works when your next interpretation layer is a child process on the same system. If you need to pass data to a remote system (e.g. over ssh), this won't work at all.
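The same idea works when the user's input should be treated as a literal string rather than a regular expression. Here is a sketch, assuming a hypothetical helper named first_fields_containing and an environment variable named needle; awk's index() function does a plain substring test:

```shell
#!/usr/bin/env bash
# first_fields_containing is a hypothetical helper name.
# Usage: first_fields_containing STRING < input
# Prints the first field of every line that contains STRING literally.
first_fields_containing() {
  needle="$1" awk 'index($0, ENVIRON["needle"]) {print $1}'
}

printf '%s\n' 'alice a.b' 'bob axb' |
  first_fields_containing 'a.b'
```

Only the first input line matches here, because the dot is compared literally instead of acting as a regular expression wildcard.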

Awk variables

As we've seen, awk can accept data in environment variables. However, for awk specifically, the most common way is to create an awk variable with awk's -v option.

Unfortunately, variables passed with awk -v undergo backslash interpretation by awk. In order to pass the content correctly, we need to double up any backslashes in the data.

read -rp 'Enter a search value: ' value
awk -v search="${value//\\/\\\\}" '$0 ~ search {print $1}' "$file"

Here, we tell awk to create a variable named search, which has the user's input, in the argument that follows -v. The awk script uses the search variable in a regular expression match, just like in the environment variable example.
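To see why the backslashes need to be doubled, compare what awk receives with and without the doubling. This is a small sketch; the sample value and the variable names raw and fixed are invented for the demonstration:

```shell
#!/usr/bin/env bash
value='C:\temp\new'   # sample input containing backslashes

# Without doubling, awk interprets \t and \n as tab and newline.
raw=$(awk -v s="$value" 'BEGIN {printf "%s", s}')

# With doubling, awk's escape processing restores the original string.
fixed=$(awk -v s="${value//\\/\\\\}" 'BEGIN {printf "%s", s}')

[[ $raw == "$value" ]]   && echo "raw: unchanged"   || echo "raw: altered"
[[ $fixed == "$value" ]] && echo "fixed: unchanged" || echo "fixed: altered"
```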

Additional arguments

This approach is mostly used when your next interpretation layer is a shell that you invoke with bash -c or sh -c, though it can also be used in other situations. Instead of embedding some variable expansion inside a double-quoted script after the -c, you pass the variable as an additional argument to the shell. Then you simply retrieve the argument as a parameter.

Let's suppose we wanted to do something like this:

# BAD!  Code injection vulnerability.
read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c "mv \"\$@\" \"$dest\"" x {} +

The intent here is to have find execute the following command:

mv "$@" "$dest"

However, this doesn't work safely. The user could type a double quote as part of the input. That literal double quote, plus whatever comes after it, would be embedded inside the script that sh runs.

Now, there are many ways to work around this, but for the purpose of illustrating the "additional arguments" approach, let's pass the destination directory as an extra argument to the script. Due to the limitations of find (which requires that {} be the last thing before the +), we have to put it before the {}. That leaves us with two choices. Either we pass "$dest" after the x that becomes $0, or we pass it as $0. Since we're not using $0 for anything else, we might as well use it for the destination directory.

read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c 'mv "$@" "$0"' "$dest" {} +

The other way is a bit longer, but it may be a useful example when you adapt it to other problems (e.g. if you need to send more than one variable):

find . -name '*.txt' -exec sh -c 'dest="$1"; shift; mv "$@" "$dest"' x "$dest" {} +
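The same pattern applies outside of find. Here is a sketch that passes two values as positional parameters to sh -c; the function name describe_move and the sample strings are hypothetical:

```shell
#!/usr/bin/env bash
# describe_move is a hypothetical helper. The script given to sh -c is a
# fixed, single-quoted string; the data arrives as $1 and $2.
describe_move() {
  sh -c 'printf "%s -> %s\n" "$1" "$2"' sh "$1" "$2"
}

describe_move 'notes.txt' 'my "docs"; $(echo oops)'
```

The quotes, semicolon and command substitution in the second argument are printed literally, because the inner shell only ever sees them as the value of a parameter.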

Please note that it is not safe to pass additional arguments to an ssh command in the obvious way. ssh does not maintain your argument vector across the remote connection; instead, it mashes all the arguments together into a single string, and then passes that entire string as one argument to the remote shell for re-parsing. Bash FAQ 96 explains a technique that can be used instead, involving printf %q. I won't repeat it here, except to reiterate the requirement that the login shell of your account on the remote system must be bash, not any other shell. (printf %q produces output that is not portable.)

Finally, awk can also retrieve arguments in a similar way:

read -rp 'Enter a search value: ' value
awk 'BEGIN {search = ARGV[1]; ARGV[1] = ""} $0 ~ search {print $1}' "$value" "$file"

(Most people would probably use the -v version instead.)

Standard input

Data can also be given to a program on standard input, or any other open file descriptor. This approach is commonly used when you need to send data to a remote system. Opening a new file descriptor just for this special data is a great idea, but there are many cases where only one file descriptor is available; for example, ssh only provides one. This creates limitations, but you have to work with the world as you find it.

Apart from a shortage of unused file descriptors, the main limitation of this approach is that you need to serialize the data in a stream which can be parsed by the receiving layer. If at all possible, the preferred way to serialize data is to put a NUL byte after each piece (variable). This requires that the receiver be able to handle such an input stream. Bash can do it, as we've seen previously. BSD/GNU xargs -0 can also handle it, although that should be considered a choice of last resort.

Since Bash FAQ 96 already goes over the basics, we'll need something a bit more elaborate to justify including it here.

Suppose we need to send three variables to a script on a remote host. The script doesn't need any other inputs from the client system, so it's OK if we tie up stdin for this purpose. We could do it like this:

printf '%s\0' "$v1" "$v2" "$v3" |
ssh user@host '
  IFS= read -rd "" v1
  IFS= read -rd "" v2
  IFS= read -rd "" v3
  fooscript "$v1" "$v2" "$v3"
'

For this to work, the login shell of your account on the remote host must be bash. read -d "" is a bash-specific feature, not available in (most) other shells. ssh will send us back the remote host's stdout and stderr, separately, and we can do whatever we like with those.

We could also use BSD or GNU xargs -0 for this example, because it's such a simple case. xargs -0 will read the input variables into memory, then pass them all at once (one hopes!) to a specified command. As long as the input variables are small enough, not exceeding the system's ARG_MAX, xargs -0 should run just the one command.

printf '%s\0' "$v1" "$v2" "$v3" |
ssh user@host 'xargs -0 fooscript'

This version has the advantage of not requiring a specific login shell on the remote host, but it requires an xargs command which accepts the nonstandard -0 option. You can't have it all.
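The NUL-delimited read technique can be tried locally, with bash -c standing in for the remote shell. This is a sketch under that assumption; the helper name roundtrip and the sample values are made up, and one value deliberately contains a newline:

```shell
#!/usr/bin/env bash
# roundtrip is a hypothetical helper: it sends its three arguments as a
# NUL-delimited stream to a second bash, which reads them back.
roundtrip() {
  printf '%s\0' "$1" "$2" "$3" |
  bash -c '
    IFS= read -rd "" a
    IFS= read -rd "" b
    IFS= read -rd "" c
    printf "<%s>\n" "$a" "$b" "$c"
  '
}

roundtrip 'first value' $'second\nvalue' 'third "value" $(date)'
```

Each value arrives intact in its own variable, including the embedded newline and the shell metacharacters, because NUL is the one byte that cannot appear inside a shell variable.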

SQL bind variables

When constructing an SQL command, often some part of it is based on data only known at run time. For example, you might need to retrieve the name, department, phone number and pager number of an employee, given the employee ID number. The employee ID number is only available in a variable at run time, so you can't simply put it inside the SQL command. If you try, you create a code injection vulnerability.

# BAD!  Code injection vulnerability.
sql="select last_name, first_name, dept, phone, pager
  from employees where id = $id"
psql -c "$sql"

As you may have guessed, a clever user (like Bobby's mom) could put SQL syntax in the id variable, and cause the psql command to do anything that the user has permissions to do. (Setting up database permissions is outside the scope of this document. Look for the GRANT command in your database documentation.)

Now, I'm afraid I've got some bad news for you. The solution to this problem is to use something called a bind variable. This is a feature of SQL application programming interfaces provided by and for each database. Bash does not have a database API. In the absence of a specialized tool designed to send queries with bind variables to a database, you cannot do this safely in bash. It is a bash weakness, and it's a reason to switch to a different language.

I've included this section here because too many people don't understand what bind variables are, or how to use them. Even though you can't use them in bash, they are an important concept that you need to understand. You'll probably work with a database at some point, in some language, and while the syntax may differ a bit, the basic ideas are the same. Sadly, most of the SQL SELECT examples you see on the Internet do not include bind variables. Let's do better.

Here's how Tcl 8.6 does it:

#!/usr/local/bin/tclsh8.6
package require tdbc::postgres
tdbc::postgres::connection create conn -db mydatabase

set sql {
    select last_name, first_name from employees
    where id = :id
}

set d [dict create id 1030]
puts [conn allrows -- $sql $d]

This is a real program talking to a real database on localhost; I changed the database name in this example, and nothing else. The output:

$ ./foo
{last_name Wooledge first_name Gregory}

If you aren't a Tcl fan, Perl 5 works too. You may need to install an extra module or two.

#!/usr/bin/perl
use strict;
use DBI;
my $dbh = DBI->connect('dbi:Pg:dbname=mydatabase');
$dbh->{RaiseError} = 1;

my $sql = '
    select last_name, first_name from employees
    where id = ?
';

my $sth = $dbh->prepare($sql);
$sth->bind_param(1, 1030);
$sth->execute;
DBI::dump_results($sth);

$ ./bar
'Wooledge', 'Gregory'
1 rows

Or in Python:

#!/usr/bin/env python3

import mysql.connector
conn = mysql.connector.connect(host="localhost", user="someuser", passwd="somepass", database="somedb")
cursor = conn.cursor()
id = 1
# The value is passed separately from the SQL text, as a bind parameter.
cursor.execute("SELECT * from test where id = %(id)s", {'id': id})
res = cursor.fetchall()

for row in res:
    print(row[1], row[2])

$ ./baz
Firstname Surname

Arithmetic Expansion

One of the most insidious forms of arbitrary code execution occurs within bash itself, when it evaluates an arithmetic expression. This is extremely subtle, and not widely known.

Here's an example:

$ x='a[$(date >&2) 0]'
$ echo "$(( x ))"
Tue May 12 13:29:04 EDT 2020
0

What's happening here? It's a lot more complex than one might guess. The $(( )) tells bash to evaluate an expression arithmetically (this begins a math context). The x inside the math context is treated as a variable name, and the variable's value is evaluated recursively, to try to generate a numeric value. In the next recursive step, we have a[...] which is an array variable expansion. Since a was never declared as an associative array, it's treated as an indexed array, and the code inside the square brackets is also evaluated in a math context. Bash explicitly allows command substitution to occur there, so the date >&2 command is executed.

If we assume that echo "$(( x ))" is a shell script, and x is a variable containing user input, then we can see how easily a user's input could trigger unwanted code execution.

This issue is not restricted to the $(( )) expansion. It occurs anywhere bash goes into a math context:

The $(( )) arithmetic substitution.
The let or (( )) command.
The index inside [ ] in an indexed array variable expansion.
The start and length parameters in ${parameter:start:length} substitution.
The [[ command with -gt or other numeric operators.

The solution is to validate user input before allowing it to be used in a math context.
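What such validation looks like depends on what the script expects. Here is one possible sketch, assuming the input should be a non-negative decimal integer; the helper names is_uint and safe_double are hypothetical:

```shell
#!/usr/bin/env bash
# is_uint is a hypothetical helper: it succeeds only for a string made
# entirely of decimal digits.
is_uint() {
  [[ $1 =~ ^[0-9]+$ ]]
}

# safe_double is also hypothetical; it refuses anything is_uint rejects.
safe_double() {
  is_uint "$1" || { echo "not a number: $1" >&2; return 1; }
  # 10# forces base 10, so a leading zero is not read as octal.
  echo "$(( 10#$1 * 2 ))"
}

safe_double 21
safe_double 'a[$(date >&2)0]' || echo "rejected"
```

The hostile string never reaches the math context, because the check happens in a plain string comparison first.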

Associative Array Index Multiple Expansions

There are several instances where bash performs multiple expansions of associative array indices. As you might guess, this can be dangerous, possibly leading to code injection if the index contains command substitution syntax.

Here are some examples:

$ declare -A hash
$ key='x$(date >&2)'
$ [[ -v hash[$key] ]]
Mon Mar  1 09:24:20 EST 2021

$ (( hash[$key]++ ))
Mon Mar  1 09:24:58 EST 2021
Mon Mar  1 09:24:58 EST 2021

In some cases (e.g. the [[ -v command), single-quoting the array[$index] syntax as 'array[$index]' will work. However, this is insufficient for the arithmetic context cases. For those, you must understand that the arithmetic context treats its content as if it were double-quoted already, and therefore single quotes have no special meaning. You can, however, use a backslash to quote the $ in the index:

$ (( hash[\$key]++ ))

See Pitfall 62 for more explanation.

It's also worth pointing out that bash 5.0 and higher has an assoc_expand_once shell option that may suppress the multiple expansions in some cases. The effects of this option as a mitigation have not been fully explored at the time of writing, so you may need to run your own experiments if you try to use it.
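For simple counters, another way to sidestep the problem (not taken from the text above) is to keep the subscript out of the arithmetic context entirely: read the old value with an ordinary parameter expansion, do the math on a plain variable, and store the result with an ordinary assignment. Here is a sketch, assuming hypothetical names count and bump:

```shell
#!/usr/bin/env bash
declare -A count

# bump is a hypothetical helper that increments count[KEY].
bump() {
  local key=$1 old
  old=${count[$key]:-0}        # ordinary expansion: key is expanded once
  count[$key]=$(( old + 1 ))   # only our own integer is evaluated as math
}

key='x$(date >&2)'
bump "$key"
bump "$key"
printf '%s=%s\n' "$key" "${count[$key]}"
```

This is more verbose than (( count[\$key]++ )), but it does not depend on quoting rules inside the arithmetic context.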

Indirect Assignments

As we discussed earlier, one of the major bash weaknesses is the inability to return a value from a function to the caller. Some people try to work around this by passing a variable name as an argument to the function, and then having the function assign to this variable using eval or some form of indirect assignment.

Those approaches aren't wrong, but the variable name argument must be sanitized/validated, or else there is a possibility of code injection. For example,

f() {
  declare -g "$1=return value"
}

g() {
  printf -v "$1" %s "return value"
}

Both of these functions are subject to code injection if the caller passes an argument that contains untrusted user input:

$ f 'x[$(date >&2)0]'
Wed Jun  3 16:19:57 EDT 2020
$ g 'x[$(date >&2)0]'
Wed Jun  3 16:19:58 EDT 2020

As you might have guessed by now, the solution to this problem is to sanitize user input before passing it as a variable name (or array index) to any kind of assignment.
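Here is a sketch of one way to do that check, assuming the caller should only ever pass a plain variable name (no array subscripts). The names is_varname and g_checked are hypothetical:

```shell
#!/usr/bin/env bash
# is_varname is a hypothetical helper: it succeeds only for a plain
# variable name (no subscripts, no expansions).
is_varname() {
  [[ $1 =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]
}

# g_checked is a hypothetical variant of g that validates its argument.
g_checked() {
  is_varname "$1" || { echo "invalid variable name: $1" >&2; return 1; }
  printf -v "$1" %s "return value"
}

g_checked result && echo "$result"
g_checked 'x[$(date >&2)0]' || echo "rejected"
```

Rejecting subscripts outright is a design choice; a function that must assign to array elements would need a stricter check on the index as well.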