Diff for "BashFAQ/036"

Differences between revisions 1 and 4 (spanning 3 versions)

How can I get all lines that are in both of two files (set intersection), or in only one of two files (set subtraction)?

Use the comm(1) command. It expects both inputs to be sorted, which is why the examples below sort each file first.

  # intersection of file1 and file2
  comm -12 <(sort file1) <(sort file2)
  # subtraction of file1 from file2
  comm -13 <(sort file1) <(sort file2)

Read the comm(1) manpage for details.
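
As a quick illustration, here is a minimal sketch that runs the commands above on two tiny sample files. The file names and contents are made up for the example, and the files are sorted into temporary copies rather than through process substitution, so the sketch should also run in a plain POSIX sh. It additionally shows comm -23, which gives the subtraction in the other direction.

```shell
# Hypothetical sample data in a scratch directory (names are made up).
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/file1"
printf '%s\n' banana cherry date  > "$dir/file2"

# comm needs sorted input; sort each file into a temporary copy.
sort "$dir/file1" > "$dir/sorted1"
sort "$dir/file2" > "$dir/sorted2"

both=$(comm -12 "$dir/sorted1" "$dir/sorted2")       # lines in both files
only_in_2=$(comm -13 "$dir/sorted1" "$dir/sorted2")  # file2 minus file1
only_in_1=$(comm -23 "$dir/sorted1" "$dir/sorted2")  # file1 minus file2

printf 'in both:\n%s\n' "$both"
printf 'only in file2:\n%s\n' "$only_in_2"
printf 'only in file1:\n%s\n' "$only_in_1"

rm -rf "$dir"
```

The digits after the dash name the columns to suppress: column 1 holds lines unique to the first file, column 2 lines unique to the second, and column 3 lines common to both.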

If for some reason you lack the core comm(1) program, you can use these other methods:

An amazingly simple and fast implementation using grep. It took just 20 seconds to match a 30k-line file against a 400k-line file for me.
- It has grep read one of the sets as a pattern list from a file (-f), and interpret the patterns as plain strings, not regexps (-F), matching only whole lines (-x).
- Note that the file specified with -f will be loaded into RAM, so it doesn't scale to very large files.
- It should work with any POSIX grep; on older systems you may need to use fgrep rather than grep -F.
```
  # intersection of file1 and file2
  grep -xF -f file1 file2
  # subtraction of file1 from file2
  grep -vxF -f file1 file2
```
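
To see why -F and -x matter, here is a small sketch with made-up sample data. It compares a plain grep -f against the -xF form; the file contents are chosen only to trigger the difference.

```shell
# Hypothetical sample data (contents chosen to show the pitfalls).
dir=$(mktemp -d)
printf '%s\n' 'a.c' 'foo'                 > "$dir/file1"
printf '%s\n' 'abc' 'a.c' 'foobar' 'foo'  > "$dir/file2"

# Without -F and -x, "a.c" is a regexp that also matches "abc",
# and "foo" also matches inside "foobar".
loose=$(grep -f "$dir/file1" "$dir/file2")

# With -xF, only lines identical to some line of file1 are printed.
exact=$(grep -xF -f "$dir/file1" "$dir/file2")

# Subtraction: lines of file2 that do not appear in file1.
rest=$(grep -vxF -f "$dir/file1" "$dir/file2")

printf 'loose match:\n%s\n' "$loose"
printf 'intersection:\n%s\n' "$exact"
printf 'file2 minus file1:\n%s\n' "$rest"

rm -rf "$dir"
```

Unlike the comm approach, the output keeps the order of the lines in file2.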

An implementation using sort and uniq:

  # intersection of file1 and file2
  # (assumes neither file1 nor file2 contains repeated lines)
  sort file1 file2 | uniq -d
  # subtraction: file1 - file2 (assumes file1 contains no repeated lines)
  sort file1 file2 file2 | uniq -u
  # likewise for file2 - file1: list file1 twice instead of file2
  sort file1 file2 file1 | uniq -u
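
The no-repeated-lines assumption is easy to trip over. The sketch below uses made-up sample data to show the wrong answer a repeated line produces, and one possible workaround: de-duplicating each file with sort -u before combining them. The workaround is a suggestion, not part of the original recipe.

```shell
# Hypothetical sample data: file1 contains the line "a" twice.
dir=$(mktemp -d)
printf '%s\n' a a b > "$dir/file1"
printf '%s\n' b c   > "$dir/file2"

# Naive intersection: "a" is reported although it is not in file2,
# because uniq -d only sees that the line occurs more than once.
naive=$(sort "$dir/file1" "$dir/file2" | uniq -d)

# Possible workaround: remove repeats within each file first.
deduped=$( { sort -u "$dir/file1"; sort -u "$dir/file2"; } | sort | uniq -d )

printf 'naive:\n%s\n' "$naive"
printf 'de-duplicated first:\n%s\n' "$deduped"

rm -rf "$dir"
```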

Another implementation of subtraction:
```
  # file1 - file2: with file1 listed twice, lines found only in file1 have a count of exactly 2
  cat file1 file1 file2 | sort | uniq -c |
  awk '{ if ($1 == 2) { $1 = ""; print; } }'
```
- This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
- Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
- Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
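
The awk associative-array idea mentioned above could look something like the following. This is a minimal sketch, not a tested drop-in: the sample files and the array name seen are made up. It keeps the original line order and any repeated lines of file1, but it holds all of file2 in memory, and the NR == FNR test misbehaves if file2 is empty.

```shell
# Hypothetical sample data: unsorted, with a repeated line in file1.
dir=$(mktemp -d)
printf '%s\n' pear apple fig apple > "$dir/file1"
printf '%s\n' fig kiwi             > "$dir/file2"

# While reading the first argument (NR == FNR), store each line of
# file2 as an array key. For the second argument, print the lines of
# file1 that are not keys (subtraction) or that are keys (intersection).
subtraction=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
    "$dir/file2" "$dir/file1")
intersection=$(awk 'NR == FNR { seen[$0]; next } $0 in seen' \
    "$dir/file2" "$dir/file1")

printf 'file1 minus file2:\n%s\n' "$subtraction"
printf 'in both:\n%s\n' "$intersection"

rm -rf "$dir"
```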

Revision 1 as of 2007-05-02 23:25:20
  Size: 1960
  Editor: redondos
Revision 4 as of 2007-12-18 22:45:15
  Size: 2099
  Editor: 74-140-178-145

See also: http://www.pixelbeat.org/cmdline.html#sets