<- Example 3: Grouping integers into ranges | Example 4: Searching for backronyms
Searching for backronyms
A #bash user presented the following fun problem:
<asker> I have a list of capitalized strings, and i want to search a file to see if any of the string matches the first letter of a sequence of words. For example it the capitalized string is ‘TPR’ it should find ‘Total Physical Response’ in the text. But the length of the search string is variable. What would be the best way to do this? <asker> all the punctuation and crap is alread taken care of... each sentence is on its own line with any punctuation.. <asker> with=without <helper> Is the "sentence"/"line" thing relevant? Are you supposed to NOT count a sequence if it spans a line boundary? <asker> yes if it crosses the line boundary it dose not need to count it..
This is not really well-suited to bash, but as a simple programming exercise in most other languages, it would be easy. It's still pretty easy in bash, although we'll have to do a bit of extra work that most languages could handle for us.
I'm going to add two additional specifications to make the problem slightly more challenging, and the resulting script more useful in real life:
Punctuation is not already taken care of, and should be discarded.
Input words are not already capitalized.
Input and sanitization
We're still accepting the asker's assertion that the input file has one sentence per line, or at least, that we don't care about backronyms that span a line boundary. Therefore, the obvious first step is to read the input file line by line, and to break each line into "words" which we store in an array. For now, let's assume the input file is fed via stdin, and the acronym/initialism is given as an argument.
Now, we need to decide what we want to do about punctuation. In normal English prose, words may have various punctuation characters attached to their fronts or backs, and some may have hyphens or apostrophes embedded within them. For the purposes of this script, I'm going to strip all leading and trailing punctuation, but leave internal punctuation alone.
Stripping leading and trailing characters from a string is best done with an extglob, so we'll turn that on. We can use bash 4.0's parameter expansion syntax to capitalize the first letter of each word as well.
At this point, we've read a single line, broken it into words on whitespace, stripped leading and trailing punctuation from each word, capitalized the first letter, and stuffed the sanitized words into an array.
Next, we might want to think about how we're going to search for the acronym argument. It's presented as a single string, like "TPR". It would make sense to generate a string containing the first letters of all of the words in our input line, so that we can simply search for "TPR" in this string. So, let's add another variable:
Searching
At this point, we have an acronym that we're searching for (e.g. TPR), and a string of letters reflecting the words from our input line. We can simply check whether the acronym appears in the larger string:
if [[ $letters = *"$1"* ]]; then
Now, this is where bash falls a bit short of what we want. In most other languages, there is some function like strstr() which will not only tell you whether a substring match exists, but where. If we're going to report the backronym that we've found, we need to know which words to print. Without a function like strstr to tell us where the substring begins, we'll have to do it ourselves.
We could in fact remove the $letters = *"$1"* check entirely, and only use the manual substring search. But my gut tells me that doing it this way may be slightly faster -- we will presumably not find a match on most input lines, so avoiding the manual substring search, which is likely to be pretty slow in bash, is desirable.
For reinventing a subtring search, remember that we were told that we don't have to search the whole file at once -- only single lines. Inputs will be small. So, anything complicated like Boyer-Moore would be a net loss; the additional overhead would cost more time than the algorithm would save. We'll just use a brute force approach. My first thought is to use a sliding window, so let's try that first.
if [[ $letters = *"$1"* ]]; then # Find the beginning of the match. for ((i=0; i <= ${#letters} - ${#1}; i++)); do if [[ ${letters:i:${#1}} = "$1" ]]; then printf '%s\n' "${words[*]:i:${#1}}" fi done fi
We'll want to be especially careful to check for off-by-one errors here. Be sure to test with inputs where the match occurs at the beginning or end of the line. For that matter, also be sure to test with inputs where the match is the entire line, and where the line is too small for a match to be possible.
We can also move some of those ${#foo} expansions into variables, which will serve two purposes: it will make the code easier to read, and it may even improve speed.
As written, this code should print multiple matches from a single line; we'll also want to test that.
Testing
Let's take a moment to write an input file for testing. We can have some fun here.
"That plonker! Really!!" The pollen really messed with my sinuses. Nobody expects the Polish Resistance! These lines don't match. Ugh.
As noted previously, I'm testing lines that match at the beginning, lines that match at the end, lines that have nothing but matching words, lines that don't match, and lines that are too short to possibly match. I've made sure there's leading and trailing punctuation to be discarded (multiple characters in at least one case), and that matching words have a mixture of upper- and lower-case first letters.
Here's the final cleaned-up version of our script, with variables added to hold string lengths:
1 #!/bin/bash
2 shopt -s extglob
3
4 len=${#1}
5
6 while read -ra tmp; do
7 words=()
8 letters=
9 for w in "${tmp[@]}"; do
10 w=${w##+([![:alpha:]])}
11 w=${w%%+([![:alpha:]])}
12 w=${w^}
13 words+=("$w")
14 letters+=${w:0:1}
15 done
16
17 if [[ $letters = *"$1"* ]]; then
18 # Find the beginning of the match.
19 n=${#letters}
20 for ((i=0; i <= n - len; i++)); do
21 if [[ ${letters:i:len} = "$1" ]]; then
22 printf '%s\n' "${words[*]:i:len}"
23 fi
24 done
25 fi
26 done
And testing it:
$ ./backronym TPR < foo That Plonker Really The Pollen Really The Polish Resistance $ ./backronym MW < foo Messed With
Well, that seems to work. In real life, we'd also need to add some basic error checking -- print a usage summary if no acronym argument is given, at the very least. We may also want to capitalize the argument, as currently a lower-case argument will never be matched. But these improvements are... well, you know.
<- Example 3: Grouping integers into ranges | Example 4: Searching for backronyms