<- Collating with associative arrays | Avoiding code injection | Data structures ->

Avoiding code injection

Code injection is a type of bug in which a user's input (or other uncontrolled information) is executed as a command. Well-written programs will not be subject to such exploits. This page will describe ways to make sure your program is not vulnerable to them.

Contents

Avoiding code injection
1. Explanation
Methods for avoiding code injection
Causes of code injection

Explanation

Code injection typically occurs when you have multiple layers of code interpretation nested inside each other. For example, if you write a bash program that calls awk, you have two layers of interpretation: bash and awk. Awk receives its script as an argument from bash. If you enclose the awk script in single quotes (and if there are no other layers involved), there is no chance of code injection. However, if you enclose the awk script in double quotes, with some variable expansions inside it (because you are attempting to pass information from bash to awk), then you have a potential code injection vulnerability.

# BAD!  Code injection vulnerability.
IFS= read -rp 'Enter a search value: ' value
awk "/$value/{print \$1}" "$file"

In this example, the intent was to allow the user to specify some value (a string, or a regular expression), then to ask awk to print the first field of each line which matches the user's input. This "works" as long as the user follows the rules, and only gives you strings that you expect. For example, if the user types fred, then awk sees this as its script:

/fred/{print $1}

But the user could type anything. You didn't validate or sanity check the input at all. If the user's input contains punctuation characters that are meaningful to awk, then awk may see something like this:

//{system("echo HAHA")} /fred/{print $1}

Now awk will execute the shell command echo HAHA for every line of the file, in addition to printing the first field of each line matching fred. You've opened up a door for the user to cause all kinds of chaos.

The bash/awk layering is extremely common, but there are many other ways that multiple layers of interpretation can happen. Perhaps the most notorious is the eval command, which explicitly requests a second shell interpretation. There are valid reasons to do this, but it must be done with extreme care.

Other layering combination examples include find -exec, calling ssh to run a command remotely, or passing an SQL command to a database interface such as mysql or psql. The SQL example was illustrated (literally) in an xkcd comic.

The fundamental way that you avoid code injections is to pass data separately from code. You have many choices for how to do this, depending on the data. You could use environment variables, or you could pass it on an open file descriptor (stdout), or you could put it in a file, or you could pass it as an additional argument, separate from the argument that contains the code.

Incidentally here, awk also has that misfeature that arguments that come after the awk code (here the expansion of $file) are treated as variable assignments if they contain at least one = character and if what's to the left of the first occurrence is a valid awk variable name.

Some awk implementations such as older versions of busybox awk have also been known to accept options after non-options (and while that has been fixed, there are still many utilities, especially GNU ones, that accept options after non-options).

So in general, and for awk especially, if possible, it's safer to feed input over its stdin rather than via filenames passed as arguments.

Here, if the contents of $file also came from uncontrolled data, that "$file" argument would also constitute a code injection vulnerability.

For instance, with file='ORS=\nHAHA\n', the "$file" would be treated as an assignment to the ORS awk variable (the output record separator) and awk would read from stdin instead of the file whose path is stored in $file.

With file='-eBEGIN{system("reboot")}', with older versions of busybox, that would reboot.

Methods for avoiding code injection

Environment variables

Environment variables are a convenient way to pass small to medium amounts of data to a child process, regardless of the child process's language. There are system limitations on the size of the environment, so this isn't suitable for very large amounts of data, but for things like user input (with perhaps some sanity check on the size), it's hard to beat.

Virtually every programming language gives you some way to read environment variables. In bash, they appear just like regular string variables, primed and ready. In many other scripting languages, they appear in a hash, dictionary or array with a specific name.

Using awk as our example, we could write our program like this:

IFS= read -rp 'Enter a search value: ' value
((${#value} <= 1000)) || die "No, I don't think so"
search="$value" awk '$0 ~ ENVIRON["search"] {print $1}' < "$file"

Here, we create an environment variable named "search" in the temporary execution environment of awk. This environment variable contains the user's input, from the bash variable value. The awk script finds the "search" element in the ENVIRON array (awk's arrays are like bash's associative arrays). This variable is used in a regular expression match against the entire input line ($0 in awk's syntax).

This approach is perfectly safe no matter what crazy input the user types.

The major limitation of the environment variable approach is that it only works when your next interpretation layer is a child process on the same system. If you need to pass data to a remote system (e.g. over ssh), this won't work unless you configure both the ssh client to send (ssh -o SendEnv=variablename) and sshd server to accept (AcceptEnv variablename configuration directive) those variables.

Environment variables have that advantage over command line arguments that they are private on most systems so suitable to pass secret information while command line arguments are public on most systems.

Awk variables

As we've seen, awk can accept data in environment variables. However, for awk specifically, a still widely used way is to create an awk variable with awk's -v option.

Unfortunately, variables passed with awk -v undergo backslash interpretation by awk. In order to pass the content correctly, we need to double up any backslashes in the data.

IFS= read -rp 'Enter a search value: ' value
awk -v search="${value//\\/\\\\}" '$0 ~ search {print $1}' < "$file"

Here, we tell awk to create a variable named search, which has the user's input, in the argument that follows -v. The awk script uses the search variable in a regular expression match, just like in the environment variable example.

WARNING: This approach does not work with all awk implementations; nawk will refuse to run if $value contains a newline (nawk: newline in string). In the above example, that does not matter much because value is set by a read command so it will never contain a newline, but you should be aware of that if you want to pass an arbitrary value, and you want your script to be portable; as a workaround to this problem, you could try adding an extra PE to replace newlines with \n after replacing \ with \\, but it is probably easier, and more reliable to just use environment variables instead of awk -v if you want portability.

WARNING: gawk also interprets strings that start with @/ and end in / as literal regular expression strings, so -v foo=@/bar/ will set foo to bar, not @/bar/; you could add yet another replacement at the end that replaces @ at the start with \x40 to also handle this case, or just use an environment variable instead of awk -v.

WARNING: also, the shell and awk need to agree on what constitutes a backslash as that can be affected by the locale's character encoding. Byte 0x5c that is the encoding of \ in ASCII (and all supersets thereof) can also be found in the multibyte encoding of other characters, like that of α in the BIG5HKSCS character encoding. If in a locale that uses such a charmap, bash's ${value//\\/\\\\} will not double the 0x5c byte if found in the encoding of α, while some awk implementations will consider those 0x5c bytes as a backslash. Example:

$ LANG=zh_HK.big5hkscs luit bash --norc
bash-5.2$ locale charmap
BIG5-HKSCS
bash-5.2$ value='αb'
bash-5.2$ printf %s "$value" | od -vtx1
0000000 a3 5c 62
0000003
bash-5.2$ LC_ALL=C printf '%q\n' "$value" "${value//\\/\\\\}"
$'\243\\b'
$'\243\\b'

${value//\\/\\\\} rightfully did not change the 0x5c byte to 0x5c 0x5c, as it was part of the α character, not the encoding of \. But:

bash-5.2$ gawk -v search="${value//\\/\\\\}" 'BEGIN{printf "%s", search}' | LC_ALL=C od -An -vtc
 243   \   b
bash-5.2$ mawk -v search="${value//\\/\\\\}" 'BEGIN{printf "%s", search}' | LC_ALL=C od -An -vtc
 243  \b
bash-5.2$ busybox awk -v search="${value//\\/\\\\}" 'BEGIN{printf "%s", search}' | LC_ALL=C od -An -vtc
 243  \b

gawk did not treat that 0x5c character as a backslash either but mawk and busybox awk both did, they treated the sequence of bytes 0x5c 0x62 as the \ and b characters and expanded \b to a backspace (aka BS, ^H, encoded as 0x08) character.

So to sum up, doing proper escaping is a tricky business especially when it needs to be done differently for different implementations of the same application and awk's -v is best avoided for passing arbitrary data.

Additional arguments

This approach is mostly used when your next interpretation layer is a shell that you invoke with bash -c or sh -c, though it can also be used in other situations. Instead of embedding some variable expansion inside a double-quoted script after the -c, you pass the variable as an additional argument to the shell. Then you simply retrieve the argument as a parameter.

Let's suppose we wanted to do something like this:

# BAD!  Code injection vulnerability.
IFS= read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c "mv \"\$@\" \"$dest\"" sh {} +

The intent here is to have find execute the following command:

mv "$@" "$dest"

However, that's not what happened (and wouldn't have worked anyway unless you had exported the dest variable). Instead, it asked sh to interpret code that looked like:

mv "$@" "what the user entered when prompted"

this doesn't work safely. The user could type a double quote as part of the input. That literal double quote, plus whatever comes after it, would be embedded inside the script that sh runs. For instance, if the user entered ";reboot # or $(reboot), the code interpreted by sh would become:

mv "$@" ""; reboot #"
mv "$@" "$(reboot)"

Now, there are many ways to work around this, but for the purpose of illustrating the "additional arguments" approach, let's pass the destination directory as an extra argument to the script. Due to the limitations of find (which requires that {} be the last thing before the +), we have to put it before the {}. That leaves us with two choices. Either we pass "$dest" after the sh that becomes $0, or we pass it as $0. Since we're not using $0 for anything else (though the shell may use it itself for its error messages), we might as well use it for the destination directory.

IFS= read -rp 'Enter destination directory: ' dest
find . -name '*.txt' -exec sh -c 'mv -- "$@" "$0"' "$dest/" {} +

(also note the -- which is needed for GNU mv or other mv implementations that accept options after non-options; we may also want to replace $dest with ${dest%/}/ to make sure mv does a move-into and not a move-to)

The other way is a bit longer, but it makes sure $0 remains a sensible value for use in error messages by the shell, and may be a useful example when you adapt it to other problems (e.g. if you need to send more than one variable):

find . -name '*.txt' -exec sh -c 'dest="$1"; shift; mv -- "$@" "$dest"' sh "$dest" {} +

Finally, awk can also retrieve arguments in a similar way:

IFS= read -rp 'Enter a search value: ' value
awk -- 'BEGIN {search = ARGV[1]; ARGV[1] = ""} $0 ~ search {print $1}' "$value" < "$file"

That has however little advantage over using environment variables. That could however be used to pass arrays:

array=( foo bar 'x y' $'yes\nno' )
awk -v n="${#array[@]}" -- '
  BEGIN {for (i = 1; i <= n; i++) {array[i] = ARGV[i]; ARGV[i] = ""}}
  {print FILENAME, array[FNR], $0}' "${array[@]" file1 file2

Passing arguments to ssh

Please note that it is not safe to pass additional arguments to an ssh command in the obvious way. ssh does not maintain your argument vector across the remote connection; instead, it mashes all the arguments together into a single string, and then passes that entire string as one argument to the remote shell for re-parsing.

So, for example, you can't safely do this:

# BAD!
ssh user@host fooscript "$argument"

However, if we can't use the environment variable passing method (SendEnv/AcceptEnv mentioned above) and we know what shell the account is using on the remote system, we can quote the arguments in a way that allows the remote shell to retrieve them. If the remote account's shell is bash for example, and we know it will be invoked in the same locale, using the same C library and version thereof and locale data (for the same reasons as those discussed for awk above about the encoding of backslash), we can use the @Q parameter expansion form (from bash 4.4) on the client:

# Requires bash 4,4 on the client, and any version of bash as the
# user account's shell on the server, with same locale, C library and
# version thereof as on the client
ssh user@host "fooscript ${argument@Q}"

(beware that if the argument variable is not set, ${argument@Q} expands to nothing, while it expands to '' if it's set but empty).

The nicest thing about the @Q expansion is that we can apply it to the entire set of positional parameters (or an array) at once: "${@@Q}" or "${array[@]@Q}".

As alternatives, one can use:

# Bash 3.1 on client for printf -v, and any version of bash as the
# user account's shell on the server, with same locale, C library and
# version thereof as on the client
printf -v args "${1+ %q}" "$@"
ssh user@host "fooscript $args"

Or:

# Bash 2.0 on client and server for arrays, and same locale, C library and
# version thereof on the server as on the client
args=( "$@" )
ssh user@host "$(typeset -p args);" 'fooscript "${args[@]}"'

If the remote user account's shell isn't bash, but is known to be any shell in the Bourne family, we can use "sh quoting" (convert all single-quotes to the 4 characters '\'' and then wrap the whole thing in single-quotes). That quoting method is also much safer than the one employed by @q, printf %q or typeset -p and can be used if the shell is bash without the restrictions mentioned above.

# Requires any Bourne family shell as the remote user account's shell.
q=\' b=\\
ssh user@host "fooscript '${argument//$q/$q$b$q$q}'"

Any of the quoting forms shown above becomes especially useful when we need to pass the commands to ssh on standard input. Of course, having the script reside on the remote host so that we can call it by name is much simpler, but many people seem to desire a way to generate a script dynamically on the client, to be executed on the server. The script ties up standard input, which means we can't send data over that channel. In these cases, we can use a technique like this:

# Use "sh quoting" for this example.  Other forms are possible,
# depending on the client and server configuration.
q=\' b=\\
ssh user@host "bash -s -- '${argument//$q/$q$b$q$q}'" <<'EOF'
long and complicated script
goes here
EOF

This way, we maintain the separation of code and data, which is fundamental for avoiding code injections.

Standard input

Data can also be given to a program on standard input, or any other open FileDescriptor. This approach is commonly used when you need to send data to a remote system. Opening a new file descriptor just for this special data is a great idea, but there are many cases where only one file descriptor is available; for example, ssh only provides one for input. This creates limitations, but you have to work with the world as you find it.

Apart from a shortage of unused file descriptors, the main limitation of this approach is that you need to serialize the data in a stream which can be parsed by the receiving layer. If at all possible, the preferred way to serialize data that doesn't contain NUL characters is to put a NUL byte after each piece (variable). This requires that the receiver be able to handle such an input stream. Bash can do it, as we've seen previously. BSD/GNU/busybox/toybox/ast-open/Solaris (and now POSIX since the 2024 edition) xargs -r0 can also handle it.

Since Bash FAQ 96 already goes over the basics, we'll need something a bit more elaborate to justify including it here.

Suppose we need to send three variables to a script on a remote host. The script doesn't need any other inputs from the client system, so it's OK if we tie up stdin for this purpose. We could do it like this:

printf '%s\0' "$v1" "$v2" "$v3" |
ssh user@host '
  IFS= LC_ALL=C read -rd "" v1 &&
  IFS= LC_ALL=C read -rd "" v2 &&
  IFS= LC_ALL=C read -rd "" v3 &&
  fooscript "$v1" "$v2" "$v3"
'

For this to work, the login shell of user on host must be bash, zsh or recent versions of ksh93u+m (LC_ALL=C only needed in bash to work around a bug in some versions). read -d is a ksh93 feature, copied and extended by bash to understand the empty string as NUL, and also found in zsh, mksh, NetBSD sh and now specified by POSIX for the standard read utility (but not implemented by all shells yet). ssh will send us back the remote host's stdout and stderr, separately, and we can do whatever we like with those.

Note that one must not use the ssh -t option (which would have to be passed twice to be effective as stdin is not a tty device) as that introduces a pseudo-terminal layer on the remote shell which would mangle the input (and the output).

We could also use xargs -r0 (-r being implicit in NetBSD's) for this example, because it's such a simple case. xargs -0 will read the input variables into memory, then pass them all at once (one hopes!) to a specified command. -r skips running the command if the input is empty. As long as the input variables are small enough, not exceeding the system's ARG_MAX, xargs -0 should run just the command once.

printf '%s\0' "$v1" "$v2" "$v3" | ssh user@host 'xargs -r0 fooscript'

This version has the advantage of not requiring a specific login shell on the remote host, but it requires an xargs command which accepts the (only recently made standard) -r and -0 options. You can't have it all.

If the size of the arguments is too big, xargs will run fooscript several times with a subset of the arguments each time, while the read-based approach will fail. You can also tell xargs the number of arguments you want it to pass the command with -n and tell it to fail if it can't with -x to get a similar behaviour as with the read -d '' approach.

printf '%s\0' "$v1" "$v2" "$v3" |
  ssh user@host 'xargs -0 -x -n 3 fooscript'

(it would however still run fooscript if fewer than 3 records were found on stdin like when the ssh connection is lost midway through the transfer).

SQL bind variables

When constructing an SQL command, often some part of it is based on data only known at run time. For example, you might need to retrieve the name, department, phone number and pager number of an employee, given the employee ID number. The employee ID number is only available in a variable at run time, so you can't simply put it inside the SQL command. If you try, you create a code injection vulnerability.

# BAD!  Code injection vulnerability.
sql="select last_name, first_name, dept, phone, pager
  from employees where id = $id"
psql "$sql"

As you may have guessed, a clever user (like Bobby's mom) could put SQL syntax in the id variable, and cause the psql command to do anything that the user has permissions to do. (Setting up database permissions is outside the scope of this document. Look for the GRANT command in your database documentation.)

Now, I'm afraid I've got some bad news for you. The solution to this problem is to use something called a bind variable. This is a feature of SQL application programming interfaces provided by and for each database. Bash does not have a database API. In the absence of a specialized tool designed to send queries with bind variables to a database, you cannot do this safely in bash. It is a bash weakness, and it's a reason to switch to a different language.

I've included this section here because too many people don't understand what bind variables are, or how to use them. Even though you can't use them in bash, they are an important concept that you need to understand. You'll probably work with a database at some point, in some language, and while the syntax may differ a bit, the basic ideas are the same. Sadly, most of the SQL SELECT examples you see on the Internet do not include bind variables. Let's do better.

Here's how Tcl 8.6 does it:

   1 #!/usr/local/bin/tclsh8.6
   2 package require tdbc::postgres
   3 tdbc::postgres::connection create conn -db mydatabase
   4 
   5 set sql {
   6     select last_name, first_name from employees
   7     where id = :id
   8 }
   9 
  10 set d [dict create id 1030]
  11 puts [conn allrows -- $sql $d]

This is a real program talking to a real database on localhost; I changed the database name in this example, and nothing else. The output:

$ ./foo
{last_name Wooledge first_name Gregory}

If you aren't a Tcl fan, Perl 5 works too. You may need to install an extra module or two.

   1 #!/usr/bin/perl
   2 use strict;
   3 use DBI;
   4 my $dbh = DBI->connect('dbi:Pg:dbname=mydatabase');
   5 $dbh->{RaiseError} = 1;
   6 
   7 my $sql = '
   8     select last_name, first_name from employees
   9     where id = ?
  10 ';
  11 
  12 my $sth = $dbh->prepare($sql);
  13 $sth->bind_param(1, 1030);
  14 $sth->execute;
  15 DBI::dump_results($sth);

$ ./bar
'Wooledge', 'Gregory'
1 rows

Or in python:

   1 #!/bin/python3
   2 
   3 import mysql.connector
   4 conn = mysql.connector.connect( host="localhost", user="someuser", passwd="somepass", database="somedb")
   5 cursor = conn.cursor()
   6 id = 1
   7 cursor.execute("SELECT * from test where id = %(id)s", { 'id': id })
   8 res = cursor.fetchall()
   9 
  10 for row in res:
  11     print(row[1],row[2])

$ ./baz
Firstname Surname

Causes of code injection

This is not a comprehensive list. These are just some of the surprising ways that bash scripts may exhibit code injection issues, without even invoking external commands.

Arithmetic Expansion

One of the most insidious forms of arbitrary code execution occurs within bash itself, when it evaluates an ArithmeticExpression. This is extremely subtle, and not widely known.

Here's an example:

$ x='a[$(date >&2) 0]'
$ echo "$(( x ))"
Tue May 12 13:29:04 EDT 2020
0

What's happening here? It's a lot more complex than one might guess. The $(( )) tells bash to evaluate an expression arithmetically (this begins a math context). The x inside the math context is treated as a variable name, and the variable's value is evaluated recursively, to try to generate a numeric value. In the next recursive step, we have a[...] which is an array variable expansion. Since a was never declared as an associative array, it's treated as an indexed array, and the code inside the square brackets is also evaluated in a math context. Bash explicitly allows command substitution to occur there, so the date >&2 command is executed.

If we assume that echo "$(( x ))" is a shell script, and x is a variable containing user input, then we can see how easily a user's input could trigger unwanted code execution.

This issue is not restricted to the $(( )) expansion. It occurs anywhere bash goes into a math context including:

The $(( )) arithmetic substitution.
The let or (( )) command.
The index inside [ ] in an indexed array variable expansion or assignment.
By extension to the above, and as discussed in more details in separate sections below, any builtin or construct that takes a variable name as argument for assignment or otherwise as in unset -v -- "$varname", printf -v "$varname", [ -v "$varname" ], ${!varnname}, typeset -n var="$varname", etc.
The start and length parameters in ${parameter:start:length} substitution.
The [[ command with -gt or other numeric operators.
Assignments to integer shell variables: OPTIND, SECONDS, RANDOM, SRANDOM (and possibly others), or any variable that has the -i attribute set.

The solution is to validate user input before allowing it to be used in a math context.

Associative Array Index Multiple Expansions

There are several instances where bash performs multiple expansions of associative array indices. As you might guess, this can be dangerous, possibly leading to code injection if the index contains command substitution syntax.

Here are some examples:

$ declare -A hash
$ key='x$(date >&2)'
$ [[ -v hash[$key] ]]
Mon Mar  1 09:24:20 EST 2021

$ (( hash[$key]++ ))
Mon Mar  1 09:24:58 EST 2021
Mon Mar  1 09:24:58 EST 2021

Prior to bash 5.2, there were quoting-based workarounds for both of these cases -- for the [[ -v command, single-quoting the hash[$key] syntax as 'hash[$key]' would work, while for the arithmetic context cases you could use a backslash to quote the $ in the index.

With bash 5.2, all workarounds for associative arrays within the (( )) command have become invalid. The only safe thing to do is never use an associative array reference inside (( )) at all.

Please see Pitfall 62 for more explanations.

It's also worth pointing out that bash 5.0 and higher has an assoc_expand_once shell option that may suppress the multiple expansions in some cases. The effects of this option as a mitigation have not been fully explored at the time of writing, so you may need to run your own experiments if you try to use it.

Indirect Assignments

As we discussed earlier, one of the major bash weaknesses is the inability to return a value from a function to the caller. Some people try to work around this by passing a variable name as an argument to the function, and then having the function assign to this variable using eval or some form of indirect assignment.

Those approaches aren't wrong, but the variable name argument must be sanitized/validated, or else there is a possibility of code injection. For example,

f() {
  declare -g "$1=return value"
}

g() {
  printf -v "$1" %s "return value"
}

Both of these functions are subject to code injection if the caller passes an argument that contains untrusted user input:

$ f 'x[$(date >&2)0]'
Wed Jun  3 16:19:57 EDT 2020
$ g 'x[$(date >&2)0]'
Wed Jun  3 16:19:58 EDT 2020

As you might have guessed by now, the solution to this problem is to sanitize user input before passing it as a variable name (or array index) to any kind of assignment.

Variable Inspection

Even checking whether a given string is a variable can lead to code injection:

$ v='x[0$(date >&2)]'
$ test -v "$v"
Thu Oct 28 18:25:34 EDT 2021
$ [[ -v $v ]]
Thu Oct 28 18:26:55 EDT 2021

(This code injection occurs in bash versions 4.3 and later; the -v option was introduced in 4.2.)