How can I get all lines that are in both of two files (set intersection), or in only one of two files (set subtraction)?
Use the comm(1) command:
{{{
# Bash
# intersection of file1 and file2
comm -12 <(sort file1) <(sort file2)

# subtraction of file1 from file2
comm -13 <(sort file1) <(sort file2)
}}}
Read the comm manpage for details.
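As a quick sanity check, here is the same `comm` recipe run on two tiny sample files (the file names set_a and set_b are arbitrary, made up for this demo):

```shell
# Two small, unsorted sample files (names are arbitrary)
printf '%s\n' banana apple cherry > set_a
printf '%s\n' cherry date apple > set_b

# intersection: lines present in both files
comm -12 <(sort set_a) <(sort set_b)

# subtraction: lines in set_b that are not in set_a
comm -13 <(sort set_a) <(sort set_b)
```

The intersection prints apple and cherry; the subtraction prints date. Note that `comm` produces sorted output, since it works on the sorted streams.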
If for some reason you lack the core comm program, you can use these other methods:
- An amazingly simple and fast implementation that took just 20 seconds to match a 30k-line file against a 400k-line file for me.
{{{
# intersection of file1 and file2
grep -xF -f file1 file2

# subtraction of file1 from file2
grep -vxF -f file1 file2
}}}
- It has grep read one of the sets as a pattern list from a file (-f), and interpret the patterns as plain strings rather than regexps (-F), matching only whole lines (-x).
- Note that the file specified with -f will be loaded into memory, so it doesn't scale for very large files.
- It should work with any POSIX grep; on older systems you may need to use fgrep rather than grep -F.
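For comparison with the `comm` demo, the same arbitrary sample files run through the grep variant (note that grep keeps the line order of the second file, unlike `comm`):

```shell
# Two small sample files (names are arbitrary)
printf '%s\n' banana apple cherry > set_a
printf '%s\n' cherry date apple > set_b

# intersection: lines of set_b that also appear in set_a
grep -xF -f set_a set_b

# subtraction: lines of set_b that do not appear in set_a
grep -vxF -f set_a set_b
```

The intersection prints cherry and apple (in set_b's order); the subtraction prints date.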
- An implementation using sort and uniq:
{{{
# intersection of file1 and file2
sort file1 file2 | uniq -d

# subtraction: file1 - file2
sort file1 file2 file2 | uniq -u

# for file2 - file1, change the last file2 to file1:
sort file1 file2 file1 | uniq -u
}}}
(Both the intersection and the subtractions assume that neither file1 nor file2 contains repeated lines.)
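A small demonstration of the sort/uniq approach, again with made-up file names set_a and set_b:

```shell
# Two small sample files (names are arbitrary), no duplicates within each
printf '%s\n' apple banana cherry > set_a
printf '%s\n' cherry date apple > set_b

# intersection: lines duplicated in the merged sorted stream
sort set_a set_b | uniq -d

# subtraction set_a - set_b: with set_b listed twice, only lines
# unique to set_a appear exactly once
sort set_a set_b set_b | uniq -u
```

The intersection prints apple and cherry; the subtraction prints banana.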
- Another implementation of subtraction:
{{{
# subtraction: file1 - file2
sort file1 file1 file2 | uniq -c | awk '{ if ($1 == 2) { $1 = ""; print; } }'
}}}
- This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
- Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
- Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
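The uniq -c pipeline above can be tried on small sample files (the names set_a and set_b are arbitrary); with set_a listed twice, a line only in set_a is counted exactly twice, a line in both files three times, and a line only in set_b once:

```shell
# Sample files (names are arbitrary)
printf '%s\n' apple banana cherry > set_a
printf '%s\n' cherry date > set_b

# set_a - set_b: keep lines whose count is exactly 2
sort set_a set_a set_b | uniq -c | awk '{ if ($1 == 2) { $1 = ""; print; } }'
```

This prints apple and banana, each with the extra leading space mentioned above.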
- These awk one-liners perform subtraction and intersection, and work regardless of whether the input files are sorted or contain duplicates:
{{{
# prints lines that are only in file1, not in file2.
# Reverse the arguments to get the other way round.
awk 'NR==FNR{a[$0];next} !($0 in a)' file2 file1

# prints lines that are in both files; the order of arguments is not important
awk 'NR==FNR{a[$0];next} $0 in a' file1 file2
}}}
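The awk approach can be exercised on deliberately unsorted input with a duplicate, to show that it preserves both order and duplicates (file names set_a and set_b are again arbitrary):

```shell
# Unsorted sample files; set_a contains a duplicate line
printf '%s\n' cherry apple apple banana > set_a
printf '%s\n' banana cherry date > set_b

# set_a - set_b: awk loads set_b into the array a, then prints
# each set_a line that is not a key of a
awk 'NR==FNR{a[$0];next} !($0 in a)' set_b set_a

# intersection: prints each set_b line that is a key of a (loaded from set_a)
awk 'NR==FNR{a[$0];next} $0 in a' set_a set_b
```

The subtraction prints apple twice, in set_a's original order; the intersection prints banana and cherry in set_b's order.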
See also: http://www.pixelbeat.org/cmdline.html#sets