## page was renamed from GreyCat/BashProgramming/04
<- [[../03|Working with files]] | '''Collating with associative arrays''' | [[../05|Avoiding code injection]] ->

This page is based on [[BashFAQ/116|Bash FAQ 116]], which was one of my original motivations for writing a more comprehensive bash programming document.

<<TableOfContents>>

= Collating with associative arrays =

Many problems fall under the general umbrella of "I've got two lists of things. I want to match up the things in the first list against the things in the second list." Sometimes we just want to remove duplicates. Sometimes one list is a blocklist or passlist, and should cause the removal of items from the second list. Sometimes one list is a key/value database, and we want to use the values corresponding to the first list's keys while traversing the second list.

All of these problems are straightforward when you use an associative array to hold one of the lists. The basic method:

 1. Look at the inputs. Figure out how to extract the desired information from each input file/list. Use the analysis techniques that were covered in the "Tool selection" page, or come up with your own unique algorithm.
 1. Read each input file only once, if at all possible. If your task has such a thing as a "blocklist", "passlist" or "key/value index", select that list to read first. If you're doing duplicate removal, it may not matter which file you read first.
 1. Use an associative array to store the data from the first file/list in memory while you traverse the second file/list. (Or, if it's duplicate removal, all of the inputs share the array equally.)

Examples:

== Remove the IPs in my blocklist file from my logfile ==

In this entirely contrived example, here is the blocklist file:

{{{
192.168.1.3:larry:42
192.168.2.1:sue:17
192.168.1.15:bob:0
}}}

This file uses simple colon-delimited fields, so we can let `IFS=: read` split them for us. And here's the logfile that we want to modify:

{{{
10.5.8.42 - - [01/Apr/2017:11:12:13 -0500] "GET / HTTP/1.1" 200 2355 "-" "-"
192.168.2.1 - - [01/Apr/2017:11:13:02 -0500] "GET /favicon.ico HTTP/1.1" 200 24536 "-" "-"
}}}

Looks like an Apache logfile, with the IP address followed by a space, and then a bunch of fields we don't care about, except that we have to be able to reproduce them in the new logfile.

A non-programmer who is told to do this would probably think, "I know sed -i can edit a file, but how do I get sed -i to make all of these edits...?" This is a dead end. This is not how we approach this kind of problem. We don't actually want to ''edit'' the file. We want to produce a new file, which is a copy of the original, but with some lines omitted. We might write the script like this:

{{{#!highlight bash
#!/bin/bash
# Read blocklisted IPs into the "bad" array.
declare -A bad
while IFS=: read -r ip _; do
    bad["$ip"]=1
done < blocklist

# Create a temp file to hold the output.
# Use the mktemp(1) command.  This is not portable.
unset tmp
trap '[[ $tmp ]] && rm -f "$tmp"' EXIT
tmp=$(mktemp) || exit

# Read logfile line by line.  Copy non-blocklisted lines
# to the temp file.
while read -r ip rest; do
    [[ ${bad["$ip"]} ]] && continue
    printf '%s %s\n' "$ip" "$rest"
done < logfile > "$tmp"

mv "$tmp" logfile
}}}

The first part of the script reads the blocklist file, and stores the IP address fields in an associative array. The IP address is the key by which the array is indexed. The value of each element is irrelevant; I use "1", but it could be any non-empty string. The mere ''existence'' of the index in the array is what matters.
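
As an aside: because only existence matters, you can also test for the key directly with the `-v` operator (bash 4.3 and later), instead of relying on a non-empty value. Here is a minimal standalone sketch of both tests, separate from the script above:

{{{#!highlight bash
#!/bin/bash
# Standalone sketch: two ways to test membership in an associative array.
declare -A bad
bad["192.168.2.1"]=1

ip=192.168.2.1

# Relies on our convention of storing a non-empty value:
[[ ${bad["$ip"]} ]] && echo "blocked (non-empty value test)"

# Tests for the key itself (bash 4.3+); works even if the stored value is empty:
[[ -v bad["$ip"] ]] && echo "blocked (key existence test)"
}}}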
The second part creates a [[BashFAQ/062|temporary file]], using a nonportable command. Creation of temporary files has a whole FAQ page to itself, because there isn't a simple right answer. In this example, we're only concerned with getting it to work on one platform, so we use the tool that we know will work on that platform, and then we '''document this decision with a comment.''' Now, when someone tries to run the script on their Solaris web server 6 months from now, and it doesn't work, they'll know why.

The third part reads the logfile line by line, which we expect will be very slow in bash. If this is a problem, we can rewrite in awk or perl. For now, we'll let it be.

We assume there is only a single space after the IP address, so that our `printf` will reconstruct the original line. We make this assumption based on our analysis of the logfile format. If this were not the case, then we would have to read the entire line into one variable, to save it for spitting out later, and then perform a second step to extract the IP address. But in our example, this is not necessary, so we use the simpler technique, letting `read` split the fields for us.
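
For illustration, here's a sketch of that more defensive variant, under the hypothetical assumption that the rest of the line must be preserved exactly. It's a drop-in replacement for the third loop of the script above (it reuses the `bad` array and `tmp` variable from that script):

{{{#!highlight bash
# Hypothetical variant of the copy loop: keep each line intact,
# and extract the IP address as a separate step.
while IFS= read -r line; do
    ip=${line%% *}                 # everything before the first space
    [[ ${bad["$ip"]} ]] && continue
    printf '%s\n' "$line"          # reproduce the original line unchanged
done < logfile > "$tmp"
}}}

Here `IFS= read -r line` is the standard idiom for reading a line verbatim; it prevents `read` from trimming leading and trailing whitespace.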
== Find duplicate files in a directory hierarchy ==

This example isn't contrived. People actually ask for this all the time. Expanding the question: ''search recursively through a directory hierarchy, and report all pairs of files that have the same MD5 sum''.

Our input in this case is a directory hierarchy of files. However, we're also going to be using a tool named `md5sum` (a Linux-specific command), and parsing its output, so we will also need to understand how that output looks. If we were writing for a system that uses `md5` (BSD) or `openssl md5` (the closest thing we have to portable) instead of Linux's `md5sum`, then the output would be different, and we'd have to write the script differently.

Given multiple files as arguments, `md5sum` writes output like this:

{{{
30952c6fe0b925fc71c49d361e8cb7e5  .bashrc
bae979259924efee3797b1a090d6d57f  .profile
}}}

All(?) versions of GNU coreutils `md5sum` mangle filenames by default. Newlines are converted to the two-character string `\n`, and backslashes to the two-character string `\\`. However, the most recent versions of `md5sum` have a `-z` option to stop the mangling, and use NUL bytes as filename terminators. So, we'll use the `-z` option for our script. This means that the script will only work on systems with a sufficiently recent version of GNU coreutils. It would be possible to write a similar script that unmangles the filenames after `md5sum` (without `-z`) is done mangling them, but in the interests of brevity, we won't pursue that in this chapter.

We can use `read`'s field splitting (together with the `-d ''` option) to parse this input format. The hash value in the first field won't contain spaces. The filename ''might'', but that's OK. If we only give two variable names to `read`, there will be no field splitting of the second variable.

So, how does the associative array help us? Well, each time we read an MD5 value as input, we want to know if we've already seen it. Thus, we will use the MD5 value as the key of our associative array. If we use "1" as the value, then we will only be able to report the second filename when we see a match. We will have forgotten the first filename. So, let's store the filename as the value. That way, when we get a match, we'll have the filename we just read from `md5sum`, and we'll have the previous filename from the array, and we can report both filenames.

{{{#!highlight bash
#!/bin/bash
# Usage: finddups [directory]
# If no directory is specified, start in .
# This script uses the GNU md5sum(1) command.

declare -A seen
while read -r -d '' md5 file; do
    if [[ ${seen["$md5"]} ]]; then
        printf 'Matching MD5: "%s" and "%s"\n' "${seen["$md5"]}" "$file"
    fi
    seen["$md5"]=$file
done < <(find "${1:-.}" -type f -exec md5sum -z {} +)
}}}

We begin with a ProcessSubstitution that reads from [[UsingFind|find]]. This will give us the entire stream of MD5/filename pairs, no matter how many times `find` has to run `md5sum`. Each time we see an MD5 hash and its filename, we check whether we've already seen this MD5 before. If so, we print a line. Then we put the MD5/filename pair into the array. Simple!

The only "problem" is that we just report pairs of files. If there are 3 or more files with the same hash, we don't report them all together at once. We could store multiple filenames for each MD5 value, but that would require some data structure for storing an associative array of lists -- and bash doesn't provide that. Thus, we would have to hack up some sort of serialized list format to store multiple filenames in a string value. At this point, you should begin to realize that bash is ''not'' the most suitable tool for the job. If you want to report 3 or more matches together, it would be better to rewrite in some other language, which can store hashes of lists naturally.

This task could also be solved with `sort | uniq` (with several options to `uniq`), or with awk. In fact, awk is ''so good'' at this kind of thing that it's worth seeing it in action here. The weakness of awk for this particular task is that it can't easily split the `md5sum` lines into hash + filename, since the filename may contain whatever field separator characters we try to use. So, we'll have to read the entire line and then split it manually. We need to use GNU awk to handle the NUL-delimited input records, because `RS="\0"` is not portable.

{{{#!highlight bash
#!/bin/sh
# Requires GNU md5sum and GNU awk.
find "${1:-.}" -type f -exec md5sum -z {} + | awk '
    BEGIN {RS="\0"}
    {
        file = substr($0, 35)
        if (seen[$1] != "") {print "Matching MD5: \"" seen[$1] "\" and \"" file "\""}
        seen[$1] = file
    }
'
}}}

The algorithm is exactly the same as in the bash version; only the syntax differs. Since we're not actually using any bash features in this script, a shebang of `#!/bin/sh` is acceptable.

<- [[../03|Working with files]] | '''Collating with associative arrays''' | [[../05|Avoiding code injection]] ->