I have two files. The first one contains bad IP addresses (plus other fields). I want to remove all of these bad addresses from a second file.

This is a more generalized form of one of the questions from FAQ 36 (where the entire line is significant in each file). In this form, we're only using part of each line as a key. We're going to show how to approach this kind of problem using an associative array.

The basic concepts here are:

  1. We look at the files, using our natural intelligence. We figure out the structure of each file, and choose a parsing method which is suited to the input. This could be field splitting on a delimiter, or extracting from column J to column K, or whatever it takes.

  2. We only want to read each file one time. It would be extremely inefficient to search for each key in the second file, re-reading the entire file each time.

  3. We don't actually modify the second file. We write a new (temporary) file, containing only the desired lines, and then move it to the original second file.

  4. We use an associative array to hold a list of good/bad/ugly keys from the first file. When we read the second file, we can quickly check whether each key exists in the associative array, so we know whether to write the line to the temp file.

Let's suppose each file consists of delimited fields. This is fairly common. Here is the first file, containing the bad IP addresses:

192.168.1.3:larry:42
192.168.2.1:sue:17
192.168.1.15:bob:0

The IP address, which is all that we care about, is the first field, ending in a : (colon) character. We can ignore the rest of the fields.
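
To see the splitting in isolation, here is a throwaway one-liner (not part of the script, just an illustration of how IFS=: carves up one of those lines):

    # Set IFS to a colon for this read only, so the line splits on colons;
    # ip gets the first field, and _ soaks up everything after it.
    IFS=: read -r ip _ <<< '192.168.1.3:larry:42'
    printf '%s\n' "$ip"    # prints 192.168.1.3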

Here is the second file:

10.5.8.42 - - [01/Apr/2017:11:12:13 -0500] "GET / HTTP/1.1" 200 2355 "-" "-"
192.168.2.1 - - [01/Apr/2017:11:13:02 -0500] "GET /favicon.ico HTTP/1.1" 200 24536 "-" "-"

In this file, the IP address is in the first field, ending with a space character. We can ignore the rest of the fields. See those quoted fields with internal spaces? Yeah, we don't care about those. Our parsing is complete before we get that far in the line. So we simply ignore them.
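
The same kind of quick check works here (again, just an illustration, not part of the script):

    # With the default IFS, read splits on whitespace; ip gets the first
    # word, and rest gets the remainder of the line, quoted fields and all.
    read -r ip rest <<< '192.168.2.1 - - [01/Apr/2017:11:13:02 -0500] "GET /favicon.ico HTTP/1.1" 200 24536 "-" "-"'
    printf '%s\n' "$ip"    # prints 192.168.2.1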

So, here's the first half of the script:

    #!/usr/bin/env bash

    declare -A bad
    while IFS=: read -r ip _; do
        bad["$ip"]=1
    done < first.file

This uses the basic techniques from FAQ 1 to read the first file, line by line, splitting it into fields as we go. After this loop, we have an associative array named bad which contains all the IP addresses that we want to purge from the second file.
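
If you want to convince yourself that the array really holds what you think it holds, dump it with declare -p (a debugging aid only, not part of the script; the exact output format varies a bit between bash versions):

    declare -p bad
    # Prints something like:
    # declare -A bad=(["192.168.1.15"]="1" ["192.168.2.1"]="1" ["192.168.1.3"]="1" )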

(At this point, some people may be thinking how IPv6 will completely break this script, and whatever script produced that input file. This is quite true. Good luck with that.)

Since the IP address is the index of the associative array, we can quickly check whether the IP address is in the array, or not. Do not keep a list of addresses, and then search through the entire list each time. Use the correct data structure for the problem.
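
The check itself is just an array lookup, so it costs the same no matter how many bad addresses there are. Here are two ways to write it (illustrative snippets only; the -v form tests whether the key exists at all, regardless of its value, and needs a newer bash than associative arrays themselves do):

    # Non-empty value means the IP was loaded from the first file.
    [[ ${bad["$ip"]} ]] && echo "skip $ip"

    # Alternatively, test for the key's existence directly:
    [[ -v bad["$ip"] ]] && echo "skip $ip"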

The second half is not much more difficult:

    # Use the mktemp(1) command.  This is not portable.
    # There is no portable version.  The world sucks.
    unset tmp
    trap '[[ $tmp ]] && rm -f "$tmp"' EXIT
    tmp=$(mktemp) || exit

    while read -r ip rest; do
        [[ ${bad["$ip"]} ]] && continue
        printf '%s %s\n' "$ip" "$rest"
    done < second.file > "$tmp"

    mv "$tmp" second.file

FAQ 62 discusses the creation of temporary files, including some attempts to provide semi-portable alternatives. SignalTrap discusses the use of an EXIT trap to clean up temporary files in case of unexpected exits.

Here, we simply read each line of the file (using FAQ 1 again). We assume there is only a single space after the IP address, so that our printf will reconstruct the original line. We make this assumption based on our intelligent analysis of the second input file. If this were not the case, then we would have to read the entire line into one variable, to save it for spitting out later, and then perform a second step to extract the IP address. But in our example, this is not necessary, so we use the simpler technique.
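
For completeness, that more defensive variant might look something like this (a sketch only, reusing the same tmp file and bad array as above, and peeling the IP off with parameter expansion):

    while IFS= read -r line; do
        ip=${line%% *}               # everything before the first space
        [[ ${bad["$ip"]} ]] && continue
        printf '%s\n' "$line"        # the original line, spacing untouched
    done < second.file > "$tmp"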

Associative arrays were introduced in Bash 4.0. If you are using an older version of bash, or a POSIX shell, you can't solve this kind of problem efficiently in the shell itself. Switch to awk, or perl, or any other scripting language that has the correct data structure to solve your problem.
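
In awk, for instance, the whole job collapses into a few lines (a sketch under the same assumptions about the two file layouts; temporary-file handling as above):

    tmp=$(mktemp) || exit
    awk 'NR==FNR { sub(/:.*/, ""); bad[$0]; next }   # first file: keep only the IP, remember it
         !($1 in bad)                                # second file: print lines whose IP is not bad
    ' first.file second.file > "$tmp" &&
    mv "$tmp" second.file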
