Differences between revisions 10 and 11
Revision 10 as of 2010-06-25 20:06:53
Size: 2531
Editor: MatthiasPopp
Comment:
Revision 11 as of 2010-09-22 20:39:43
Size: 2786
Editor: GreyCat
Comment:

How can I get all lines that are in both of two files (set intersection), or in only one of two files (set subtraction)?

Use the comm(1) command:

# Bash
# Intersection of file1 and file2
comm -12 <(sort file1) <(sort file2)

# Subtraction of file1 from file2
comm -13 <(sort file1) <(sort file2)

Read the comm man page for details. Those are process substitutions you see up there.
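
For readers who want to see the output, here is a minimal sketch on made-up sample data. It assumes a shell without process substitution, so it sorts into temporary files first; the directory, file names, and contents are hypothetical, and it relies on mktemp -d being available. It also shows comm -23 (lines only in file1), which the examples above do not use.

```shell
# Hypothetical sample data, made up for this sketch.
dir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$dir/file1"
printf '%s\n' banana date cherry  > "$dir/file2"

# comm needs sorted input; sort into temporary files instead of using <(...)
sort "$dir/file1" > "$dir/sorted1"
sort "$dir/file2" > "$dir/sorted2"

both=$(comm -12 "$dir/sorted1" "$dir/sorted2")    # lines in both files
only2=$(comm -13 "$dir/sorted1" "$dir/sorted2")   # lines only in file2
only1=$(comm -23 "$dir/sorted1" "$dir/sorted2")   # lines only in file1

printf '%s\n' "in both:" "$both"
printf '%s\n' "only in file2:" "$only2"
printf '%s\n' "only in file1:" "$only1"

rm -rf "$dir"
```

With this sample data, banana and cherry are in both files, date is only in file2, and apple is only in file1.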

If for some reason you lack the core comm program, you can use these other methods. (Actually, you really should NOT use any of these. They were written by people who didn't know about comm yet. But people love slow, arcane alternatives!)

  1. An amazingly simple and fast implementation that took just 20 seconds to match a 30k-line file against a 400k-line file for me:
      # intersection of file1 and file2
      grep -xF -f file1 file2
    
      # subtraction of file1 from file2
      grep -vxF -f file1 file2
    • It has grep read one of the sets as a pattern list from a file (-f), interpret the patterns as plain strings rather than regexps (-F), and match only whole lines (-x).
    • Note that the file specified with -f will be loaded into memory, so it doesn't scale for very large files.
    • It should work with any POSIX grep; on older systems you may need to use fgrep rather than grep -F.

  2. An implementation using sort and uniq:
      # intersection of file1 and file2
      # (assumes that neither file1 nor file2 contains repeated lines)
      sort file1 file2 | uniq -d
    
      # subtraction: file1 - file2 (lines only in file1)
      sort file1 file2 file2 | uniq -u
    
      # file2 - file1 works the same way; list file1 twice instead of file2
      sort file1 file2 file1 | uniq -u
  3. Another implementation of subtraction (file1 - file2):
      sort file1 file1 file2 | uniq -c |
      awk '{ if ($1 == 2) { $1 = ""; print; } }'
    • This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
    • Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
    • Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
  4. These are subtraction and intersection with awk, regardless of whether the input files are sorted or contain duplicates:
      # prints lines only in file1 but not in file2. Reverse the arguments to get the other way round
      awk 'NR==FNR{a[$0];next} !($0 in a)' file2 file1
    
      # prints lines that are in both files; order of arguments is not important
      awk 'NR==FNR{a[$0];next} $0 in a' file1 file2
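
The "no duplicates" caveat on the sort and uniq methods is easy to overlook, so here is a small sketch of what goes wrong. The sample files are hypothetical and made up for this illustration: file1 deliberately repeats one line, and the results are compared with the awk versions from item 4.

```shell
# Hypothetical sample data; "apple" is repeated in file1 on purpose.
dir=$(mktemp -d)
printf '%s\n' apple apple banana > "$dir/file1"
printf '%s\n' banana cherry      > "$dir/file2"

# sort/uniq intersection: also reports "apple", because it occurs twice in file1
uniq_both=$(sort "$dir/file1" "$dir/file2" | uniq -d)

# sort/uniq subtraction (file1 - file2): loses "apple", because it is not unique
uniq_sub=$(sort "$dir/file1" "$dir/file2" "$dir/file2" | uniq -u)

# The awk versions are not confused by the repeated line
awk_both=$(awk 'NR==FNR{a[$0];next} $0 in a' "$dir/file1" "$dir/file2")
awk_sub=$(awk 'NR==FNR{a[$0];next} !($0 in a)' "$dir/file2" "$dir/file1")

printf '%s\n' "uniq intersection:" "$uniq_both"
printf '%s\n' "uniq subtraction:"  "$uniq_sub"
printf '%s\n' "awk intersection:"  "$awk_both"
printf '%s\n' "awk subtraction:"   "$awk_sub"

rm -rf "$dir"
```

Here the uniq intersection wrongly lists apple next to banana and the uniq subtraction comes out empty, while awk reports banana as the intersection and prints apple once per occurrence in file1 for the subtraction.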

See also: http://www.pixelbeat.org/cmdline.html#sets



BashFAQ/036 (last edited 2017-05-02 18:53:07 by GreyCat)