
This page is based on Bash FAQ 116, which was one of my original motivations for writing a more comprehensive bash programming document.

Collating with associative arrays

Many problems fall under the general umbrella of "I've got two lists of things. I want to match up the things in the first list against the things in the second list." Sometimes we just want to remove duplicates. Sometimes one is a blacklist or whitelist, and should cause the removal of items in the second list. Sometimes one list is a key/value database, and we want to add the values corresponding to the keys in the second list.

All of these problems are straightforward when you use an associative array to hold one of the lists.

The basic method:

  1. Look at the inputs. Figure out how to extract the desired information from each input file/list. Use the analysis techniques that were covered in the "Tool selection" page, or come up with your own unique algorithm.
  2. Read each input file only once, if at all possible. If your task has a "blacklist", "whitelist", or "key/value index", read that list first. If you're doing duplicate removal, it may not matter which file you read first.
  3. Use an associative array to store the data from the first file/list in memory while you traverse the second file/list. (Or if it's duplicate removal, all of the inputs share the array equally.)
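
In skeleton form, the method looks something like this (a sketch only; "list1" and "list2" are placeholder filenames, and how you extract a key from each line depends entirely on your input format):

#!/bin/bash
# Load the first list into an associative array.
declare -A first
while read -r key; do
  first["$key"]=1
done < list1

# Traverse the second list, checking each key against the array.
while read -r key; do
  [[ ${first["$key"]} ]] && printf '%s is in both lists\n' "$key"
done < list2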

Examples:

Remove the IPs in my blacklist file from my logfile

In this entirely contrived example, here is the blacklist file:

192.168.1.3:larry:42
192.168.2.1:sue:17
192.168.1.15:bob:0

This file uses simple colon-delimited fields, so we can let IFS=: read split them for us.
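
To see how that splitting behaves, here's a throwaway demonstration (not part of the script):

IFS=: read -r ip user count <<< '192.168.1.3:larry:42'
echo "$ip"     # 192.168.1.3
echo "$user"   # larry
echo "$count"  # 42

The IFS assignment is a prefix to the read command only, so it doesn't change IFS for the rest of the script.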

And here's the logfile that we want to modify:

10.5.8.42 - - [01/Apr/2017:11:12:13 -0500] "GET / HTTP/1.1" 200 2355 "-" "-"
192.168.2.1 - - [01/Apr/2017:11:13:02 -0500] "GET /favicon.ico HTTP/1.1" 200 24536 "-" "-"

Looks like an Apache logfile, with the IP address followed by a space, and then a bunch of fields we don't care about, except that we have to be able to reproduce them in the new logfile.

A non-programmer who is told to do this would probably think, "I know sed -i can edit a file, but how do I get sed -i to make all of these edits...?" That's a dead end, and not how we approach this kind of problem. We don't actually want to edit the file. We want to produce a new file, which is a copy of the original, but with some lines omitted.

We might write the script like this:

#!/bin/bash

# Read blacklisted IPs into the "bad" array.
declare -A bad
while IFS=: read -r ip _; do
  bad["$ip"]=1
done < blacklist

# Create a temp file to hold the output.
# Use the mktemp(1) command.  This is not portable.
unset tmp
trap '[[ $tmp ]] && rm -f "$tmp"' EXIT
tmp=$(mktemp) || exit

# Read logfile line by line.  Copy non-blacklisted lines
# to the temp file.
while read -r ip rest; do
  [[ ${bad["$ip"]} ]] && continue
  printf '%s %s\n' "$ip" "$rest"
done < logfile > "$tmp"

mv "$tmp" logfile

The first part of the script reads the blacklist file, and stores the IP address fields in an associative array. The IP address is the key, by which the array is indexed. The value of each element is irrelevant; I use "1", but it could be any non-empty string. The mere existence of the index in the array is what matters.
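
For example, once the array is loaded from the blacklist above, a lookup is just an existence test:

ip=192.168.2.1
[[ ${bad["$ip"]} ]] && echo "blacklisted"   # prints "blacklisted"
ip=10.5.8.42
[[ ${bad["$ip"]} ]] && echo "blacklisted"   # prints nothing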

The second part creates a temporary file, using a nonportable command. Creation of temporary files has a whole FAQ page to itself, because there isn't a simple right answer. In this example, we're only concerned with getting it to work on one platform, so we use the tool that we know will work on that platform, and then we document this decision with a comment. Now, when someone tries to run the script on their Solaris web server 6 months from now, and it doesn't work, they'll know why.

The third part reads the logfile line by line, which we expect will be very slow in bash. If this is a problem, we can rewrite in awk or perl. For now, we'll let it be.

We assume there is only a single space after the IP address, so that our printf will reconstruct the original line. We make this assumption based on our analysis of the logfile format. If this were not the case, then we would have to read the entire line into one variable, to save it for spitting out later, and then perform a second step to extract the IP address. But in our example, this is not necessary, so we use the simpler technique, letting read split the fields for us.
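
If the assumption didn't hold, the loop might look more like this sketch, which preserves each line byte-for-byte and extracts the IP address as a separate step:

# Sketch: keep the whole line intact; pull out the IP afterward.
while IFS= read -r line; do
  ip=${line%% *}                 # everything before the first space
  [[ ${bad["$ip"]} ]] && continue
  printf '%s\n' "$line"
done < logfile > "$tmp"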

Find duplicate files in a directory hierarchy

This example isn't contrived. People actually ask for this all the time. Expanding the question: search recursively through a directory hierarchy, and report all pairs of files that have the same MD5 sum.

Our input in this case is a directory hierarchy of files. However, we're also going to be using a tool named md5sum (a Linux-specific command), and parsing its output, so we will also need to understand how that output looks. If we were writing for a system that uses md5 (BSD) or openssl md5 (closest thing we have to portable) instead of Linux's md5sum, then the output would be different, and we'd have to write the script differently.

Given multiple files as arguments, md5sum writes output like this:

30952c6fe0b925fc71c49d361e8cb7e5  .bashrc
bae979259924efee3797b1a090d6d57f  .profile

We can use read's field splitting for this. The hash value in the first field won't contain spaces. The filename might, but that's OK. If we only give two variable names to read, there will be no field splitting of the second variable.
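
A quick demonstration, with a filename containing spaces (note that read also swallows md5sum's extra separator space for us, as leading whitespace of the remainder):

read -r md5 file <<< '30952c6fe0b925fc71c49d361e8cb7e5  my file name'
echo "$md5"    # 30952c6fe0b925fc71c49d361e8cb7e5
echo "$file"   # my file name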

We do have a restriction, however: the output of md5sum is line-oriented. This means we can't handle filenames with newlines in them. A quick look at the md5sum(1) man page doesn't reveal any option to use NUL delimiters. So we'll just have to accept that our program will not be able to report filenames with newlines in them; we'll have to specifically exclude those.

So, how does the associative array help us? Well, each time we read an MD5 value as input, we want to know if we've already seen it. Thus, we will use the MD5 value as the key of our associative array.

If we use "1" as the value, then we will only be able to report the second filename when we see a match. We will have forgotten the first filename. So, let's store the filename as the value. That way, when we get a match, we'll have the filename we just read from md5sum, and we'll have the previous filename from the array, and we can report both filenames.

#!/bin/bash

# Usage: finddups [directory]
# If no directory is specified, start in .

# This script uses the Linux-specific md5sum(1) command.

declare -A seen
while read -r md5 file; do
  if [[ ${seen["$md5"]} ]]; then
    printf 'Matching MD5: "%s" and "%s"\n' "${seen["$md5"]}" "$file"
  fi
  seen["$md5"]=$file
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)

We begin with a ProcessSubstitution that reads from find. This will give us the entire stream of MD5/filename pairs, no matter how many times find has to run md5sum. We've disallowed filenames with newlines in them, for the reasons stated earlier.

Each time we see an MD5 hash and its filename, we check whether we've already seen this MD5 before. If so, we print a line. Then we put the MD5/filename pair into the array. Simple!

The only "problem" is that we just report pairs of files. If there are 3 or more files with the same hash, we don't report them all together at once. We could store multiple filenames for each MD5 value (in a newline-delimited pseudo-list), but that still means we'd end up reporting the match multiple times, first with two files, then three, then four, and so on. If you want to "fix" this "problem", you might suppress all the printing until the end, and then iterate over the whole array and print only those values that contain a newline. (This is left as an exercise.)

This task could also be solved with sort | uniq (with several options to uniq), or with awk. In fact, awk is so good at this kind of thing, that it's worth seeing it in action here.
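
For instance, with GNU uniq (a reasonable assumption here, since we already depend on the Linux-specific md5sum), -w32 compares only the 32-character hash, and --all-repeated=separate prints each group of duplicates as a blank-line-separated block:

find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} + |
sort | uniq -w32 --all-repeated=separate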

The weakness of awk for this particular task is that it can't easily split the md5sum lines into hash + filename, since the filename may contain whatever field separator characters we try to use. So, we'll have to read the entire line and then split it manually.

#!/bin/bash

find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} + |
awk '{
  # Skip the 32-character hash and the 2-character separator.
  file = substr($0, 35)
  if (seen[$1] != "") {print "Matching MD5: \"" seen[$1] "\" and \"" file "\""}
  seen[$1]=file
}'

The algorithm is exactly the same as in the bash version; only the syntax differs. In fact, we could convert this to an sh script. The only bashism we're using is the $'*\n*' argument, to avoid embedding a literal newline in a string constant.

# sh-compatible version: build the pattern without $'...'
badname='*
*'
find "${1:-.}" -name "$badname" -prune -o -type f -exec md5sum {} +
# ... then pipe into the same awk program as above
