Differences between revisions 1 and 2
Revision 1 as of 2007-05-02 23:25:20
Size: 1960
Editor: redondos
Comment:
Revision 2 as of 2007-05-15 19:52:44
Size: 1985
Editor: GreyCat
Comment: clean up
Deletions are marked with "-". Additions are marked with "+".
Line 10: Line 10:
-   comm -13 <(sort file1) <(sort file2)
- }}}
+   comm -13 <(sort file1) <(sort file2)}}}
Line 17: Line 16:
- an amazingly simple and fast implementation, that took just 20 seconds to match a 30k line file against a 400k line file for me.
+  1. An amazingly simple and fast implementation, that took just 20 seconds to match a 30k line file against a 400k line file for me.
Line 19: Line 18:
- note that it probably only works with GNU grep, and that the file specified with -f is will be loaded into ram, so it doesn't scale for very large files.
+   * Note that it probably only works with GNU grep, and that the file specified with -f is will be loaded into ram, so it doesn't scale for very large files.
Line 21: Line 20:
- it has grep read one of the sets as a pattern list from a file (-f), and interpret the patterns as plain strings not regexps (-F), matching only whole lines (-x).
+   * It has grep read one of the sets as a pattern list from a file (-f), and interpret the patterns as plain strings not regexps (-F), matching only whole lines (-x).
Line 23: Line 22:
- {{{
+  {{{
Line 26: Line 25:
-   # substraction of file1 from file2
-   grep -vxF -f file1 file2
- }}}
+   # subtraction of file1 from file2
+   grep -vxF -f file1 file2}}}
Line 30: Line 28:
- an implementation using sort and uniq
+  1. An implementation using sort and uniq
Line 32: Line 30:
- {{{
+  {{{
Line 38: Line 36:
-   sort file1 file2 file1 | uniq -u
- }}}
+   sort file1 file2 file1 | uniq -u}}}
Line 41: Line 38:
- another implementation of substraction:
- {{{
+  1. Another implementation of subtraction:
+  {{{
Line 44: Line 41:
-   awk '{ if ($1 == 2) { $1 = ""; print; } }'
- }}}
+   awk '{ if ($1 == 2) { $1 = ""; print; } }'}}}
Line 47: Line 43:
- This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
+   * This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
Line 49: Line 45:
- Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
+   * Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
Line 51: Line 47:
- Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
+   * Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.


How can I get all lines that are in both of two files (set intersection), or in only one of two files (set subtraction)?

Use the comm(1) command.

  # intersection of file1 and file2
  comm -12 <(sort file1) <(sort file2)
  # subtraction of file1 from file2
  comm -13 <(sort file1) <(sort file2)

Read the comm(1) manpage for details.
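
To make the direction of the subtraction concrete, here is a small sketch on made-up data. The temporary directory, the file contents and the variable names are placeholders, not part of the original answer. It sorts into temporary files instead of using process substitution, so it should also run in a plain POSIX sh (assuming mktemp is available).

```shell
# Hypothetical sample data; file names and contents are placeholders.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$tmpdir/file1"
printf '%s\n' banana date cherry > "$tmpdir/file2"

# comm expects sorted input, so sort each file first.
sort "$tmpdir/file1" > "$tmpdir/sorted1"
sort "$tmpdir/file2" > "$tmpdir/sorted2"

# -12 suppresses columns 1 and 2: only lines common to both files remain.
intersection=$(comm -12 "$tmpdir/sorted1" "$tmpdir/sorted2")
# -13 suppresses columns 1 and 3: only lines unique to file2 remain.
only_in_file2=$(comm -13 "$tmpdir/sorted1" "$tmpdir/sorted2")

printf 'intersection:\n%s\n' "$intersection"
printf 'file2 minus file1:\n%s\n' "$only_in_file2"

rm -rf "$tmpdir"
```

With this data the intersection is banana and cherry, and the subtraction of file1 from file2 leaves only date.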

If for some reason you lack the core comm(1) program, you can use these other methods:

  1. An amazingly simple and fast implementation using grep, which took just 20 seconds to match a 30k line file against a 400k line file for me.
    • Note that it probably only works with GNU grep, and that the file specified with -f will be loaded into RAM, so it doesn't scale to very large files.
    • It has grep read one of the sets as a pattern list from a file (-f), interpret the patterns as plain strings rather than regexps (-F), and match only whole lines (-x).
      # intersection of file1 and file2
      grep -xF -f file1 file2
      # subtraction of file1 from file2
      grep -vxF -f file1 file2
  2. An implementation using sort and uniq:
      # intersection of file1 and file2
      # (assumes that neither file1 nor file2 contains repeated lines)
      sort file1 file2 | uniq -d
      # subtraction of file2 from file1 (file1 - file2)
      sort file1 file2 file2 | uniq -u
      # for file2 - file1, change the last file2 to file1
      sort file1 file2 file1 | uniq -u
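  3. Another implementation of subtraction:
      # prints the lines of file1 that are not in file2
      cat file1 file1 file2 | sort | uniq -c |
      awk '{ if ($1 == 2) { $1 = ""; print; } }'
    • This leaves an extra space at the start of each line; if that's a problem, just strip it away. Assigning to $1 also makes awk rebuild the line, which collapses runs of whitespace between fields into single spaces.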
    • Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
    • Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
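
The awk associative array idea mentioned in the last note could look something like the sketch below. This is one possible approach under stated assumptions, not the FAQ's own method: the sample data, the temporary directory and the array name seen are all made up for illustration. It keeps file2's original order and its duplicates, but like the grep method it holds all of file1 in memory.

```shell
# Hypothetical sample data; file names and contents are placeholders.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$tmpdir/file1"
printf '%s\n' date banana cherry date elderberry > "$tmpdir/file2"

# subtraction of file1 from file2, without sorting
# While reading the first file argument, record each line as an array key
# (merely referencing seen[$0] creates the element). For the second file,
# print only the lines that were never recorded.
result=$(awk '
  FILENAME == ARGV[1] { seen[$0]; next }
  !($0 in seen)
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Dropping the ! from the second rule prints the lines of file2 that also appear in file1, which gives the intersection instead.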

BashFAQ/036 (last edited 2017-05-02 18:53:07 by GreyCat)