<- Working with files | Collating with associative arrays | Avoiding code injection ->

This page is based on Bash FAQ 116, which was one of my original motivations for writing a more comprehensive bash programming document.

Collating with associative arrays

Many problems fall under the general umbrella of "I've a got two lists of things. I want to match up the things in the first list against the things in the second list." Sometimes we just want to remove duplicates. Sometimes one is a blocklist or passlist, and should cause the removal of items in the second list. Sometimes one list is a key/value database, and we want to use the values corresponding to the first list's keys while traversing the second list.

All of these problems are straightforward when you use an associative array to hold one of the lists.

The basic method:

  1. Look at the inputs. Figure out how to extract the desired information from each input file/list. Use the analysis techniques that were covered in the "Tool selection" page, or come up with your own unique algorithm.
  2. Only read each input file once, if at all possible. Select the list that is the "blocklist" or "passlist", or the "key/value index", if your task has such a thing, to read first. If you're doing duplicate removal, it may not matter which file you read first.
  3. Use an associative array to store the data from the first file/list in memory while you traverse the second file/list. (Or if it's duplicate removal, all of the inputs share the array equally.)


Remove the IPs in my blocklist file, from my logfile

In this entirely contrived example, here is the blocklist file:

This file uses simple colon-delimited fields, so we can let IFS=: read split them for us.

And here's the logfile that we want to modify: - - [01/Apr/2017:11:12:13 -0500] "GET / HTTP/1.1" 200 2355 "-" "-" - - [01/Apr/2017:11:13:02 -0500] "GET /favicon.ico HTTP/1.1" 200 24536 "-" "-"

Looks like an Apache logfile, with the IP address followed by a space, and then a bunch of fields we don't care about, except that we have to be able to reproduce them in the new logfile.

A non-programmer who is told to do this would probably think, "I know sed -i can edit a file, but how do I get sed -i to make all of these edits...?" This is a dead-end. This is not how we approach this kind of problem. We don't actually want to edit the file. We want to produce a new file, which is a copy of the original, but with some lines omitted.

We might write the script like this:

   1 #!/bin/bash
   3 # Read blocklisted IPs into the "bad" array.
   4 declare -A bad
   5 while IFS=: read -r ip _; do
   6   bad["$ip"]=1
   7 done < blocklist
   9 # Create a temp file to hold the output.
  10 # Use the mktemp(1) command.  This is not portable.
  11 unset tmp
  12 trap '[[ $tmp ]] && rm -f "$tmp"' EXIT
  13 tmp=$(mktemp) || exit
  15 # Read logfile line by line.  Copy non-blocklisted lines
  16 # to the temp file.
  17 while read -r ip rest; do
  18   [[ ${bad["$ip"]} ]] && continue
  19   printf '%s %s\n' "$ip" "$rest"
  20 done < logfile > "$tmp"
  22 mv "$tmp" logfile

The first part of the script reads the blocklist file, and stores the IP address fields in an associative array. The IP address is the key, by which the array is indexed. The value of each element is irrelevant; I use "1", but it could be any non-empty string. The mere existence of the index in the array is what matters.

The second part creates a temporary file, using a nonportable command. Creation of temporary files has a whole FAQ page to itself, because there isn't a simple right answer. In this example, we're only concerned with getting it to work on one platform, so we use the tool that we know will work on that platform, and then we document this decision with a comment. Now, when someone tries to run the script on their Solaris web server 6 months from now, and it doesn't work, they'll know why.

The third part reads the logfile line by line, which we expect will be very slow in bash. If this is a problem, we can rewrite in awk or perl. For now, we'll let it be.

We assume there is only a single space after the IP address, so that our printf will reconstruct the original line. We make this assumption based on our analysis of the logfile format. If this were not the case, then we would have to read the entire line into one variable, to save it for spitting out later, and then perform a second step to extract the IP address. But in our example, this is not necessary, so we use the simpler technique, letting read split the fields for us.

Find duplicate files in a directory hierarchy

This example isn't contrived. People actually ask for this all the time. Expanding the question: search recursively through a directory hierarchy, and report all pairs of files that have the same MD5 sum.

Our input in this case is a directory hierarchy of files. However, we're also going to be using a tool named md5sum (a Linux-specific command), and parsing its output, so we will also need to understand how that output looks. If we were writing for a system that uses md5 (BSD) or openssl md5 (closest thing we have to portable) instead of Linux's md5sum, then the output would be different, and we'd have to write the script differently.

Given multiple files as arguments, md5sum writes output like this:

30952c6fe0b925fc71c49d361e8cb7e5  .bashrc
bae979259924efee3797b1a090d6d57f  .profile

All(?) versions of GNU coreutils md5sum mangle filenames by default. Newlines are converted to the two-character string \n, and backslashes to the two-character string \\. However, the most recent versions of md5sum have a -z option to stop the mangling, and use NUL bytes as filename terminators. So, we'll use the -z option for our script. This means that the script will only work on systems with a sufficiently recent version of GNU coreutils.

It would be possible to write a similar script that unmangles the filenames after md5sum (without -z) is done mangling them, but in the interests of brevity, we won't pursue that in this chapter.

We can use read's field splitting (together with the -d '' option) to parse this input format. The hash value in the first field won't contain spaces. The filename might, but that's OK. If we only give two variable names to read, there will be no field splitting of the second variable.

So, how does the associative array help us? Well, each time we read an MD5 value as input, we want to know if we've already seen it. Thus, we will use the MD5 value as the key of our associative array.

If we use "1" as the value, then we will only be able to report the second filename when we see a match. We will have forgotten the first filename. So, let's store the filename as the value. That way, when we get a match, we'll have the filename we just read from md5sum, and we'll have the previous filename from the array, and we can report both filenames.

   1 #!/bin/bash
   3 # Usage: finddups [directory]
   4 # If no directory is specified, start in .
   6 # This script uses the GNU md5sum(1) command.
   8 declare -A seen
   9 while read -r -d '' md5 file; do
  10   if [[ ${seen["$md5"]} ]]; then
  11     printf 'Matching MD5: "%s" and "%s"\n' "${seen["$md5"]}" "$file"
  12   fi
  13   seen["$md5"]=$file
  14 done < <(find "${1:-.}" -type f -exec md5sum -z {} +)

We begin with a ProcessSubstitution that reads from find. This will give us the entire stream of MD5/filename pairs, no matter how many times find has to run md5sum.

Each time we see an MD5 hash and its filename, we check whether we've already seen this MD5 before. If so, we print a line. Then we put the MD5/filename pair into the array. Simple!

The only "problem" is that we just report pairs of files. If there are 3 or more files with the same hash, we don't report them all together at once. We could store multiple filenames for each MD5 value, but that would require some data structure for storing an associative array of lists -- and bash doesn't provide that. Thus, we would have to hack up some sort of serialized list format to store multiple filenames in a string value. At this point, you should begin to realize that bash is not the most suitable tool for the job. If you want to report 3 or more matches together, it would be better to rewrite in some other language, which can store hashes of lists naturally.

This task could also be solved with sort | uniq (with several options to uniq), or with awk. In fact, awk is so good at this kind of thing, that it's worth seeing it in action here.

The weakness of awk for this particular task is that it can't easily split the md5sum lines into hash + filename, since the filename may contain whatever field separator characters we try to use. So, we'll have to read the entire line and then split it manually.

We need to use GNU awk to handle the NUL-delimited input records, because RS="\0" is not portable.

   1 #!/bin/sh
   3 # Requires GNU md5sum and GNU awk.
   5 find "${1:-.}" -type f -exec md5sum -z {} + |
   6 awk '
   7   BEGIN {RS="\0"}
   8   {
   9     file = substr($0, 35)
  10     if (seen[$1] != "") {print "Matching MD5: \"" seen[$1] "\" and \"" file "\""}
  11     seen[$1]=file
  12   }
  13 '

The algorithm is exactly the same as in the bash version; only the syntax differs. Since we're not actually using any bash features in this script, a shebang of #!/bin/sh is acceptable.

<- Working with files | Collating with associative arrays | Avoiding code injection ->

BashProgramming/04 (last edited 2020-06-19 20:55:24 by GreyCat)