<- Example 3: Grouping integers into ranges | Example 4: Searching for backronyms

Searching for backronyms

A #bash user presented the following fun problem:

<asker> I have a list of capitalized strings, and i want to search a file
        to see if any of the string matches the first letter of a sequence
        of words. For example it the capitalized string is ‘TPR’ it should
        find ‘Total Physical Response’ in the text. But the length of the
        search string is variable. What would be the best way to do this?
<asker> all the punctuation and crap is alread taken care of... each sentence
        is on its own line with any punctuation..
<asker> with=without
<helper> Is the "sentence"/"line" thing relevant?  Are you supposed to NOT count
         a sequence if it spans a line boundary?
<asker> yes if it crosses the line boundary it dose not need to count it..

This is not really well-suited to bash, but as a simple programming exercise in most other languages, it would be easy. It's still pretty easy in bash, although we'll have to do a bit of extra work that most languages could handle for us.

I'm going to add two additional specifications to make the problem slightly more challenging, and the resulting script more useful in real life:

  1. Punctuation is not already taken care of, and should be discarded.

  2. Input words are not already capitalized.

Input and sanitization

We're still accepting the asker's assertion that the input file has one sentence per line, or at least, that we don't care about backronyms that span a line boundary. Therefore, the obvious first step is to read the input file line by line, and to break each line into "words" which we store in an array. For now, let's assume the input file is fed via stdin, and the acronym/initialism is given as an argument.

Now, we need to decide what we want to do about punctuation. In normal English prose, words may have various punctuation characters attached to their fronts or backs, and some may have hyphens or apostrophes embedded within them. For the purposes of this script, I'm going to strip all leading and trailing punctuation, but leave internal punctuation alone.

Stripping leading and trailing characters from a string is best done with an extglob, so we'll turn that on. We can use bash 4.0's parameter expansion syntax to capitalize the first letter of each word as well.

   1 #!/bin/bash
   2 shopt -s extglob
   3 
   4 while read -ra tmp; do
   5   words=()
   6   for w in "${tmp[@]}"; do
   7     w=${w##+([![:alpha:]])}
   8     w=${w%%+([![:alpha:]])}
   9     w=${w^}
  10     words+=("$w")
  11   done

At this point, we've read a single line, broken it into words on whitespace, stripped leading and trailing punctuation from each word, capitalized the first letter, and stuffed the sanitized words into an array.

Next, we might want to think about how we're going to search for the acronym argument. It's presented as a single string, like "TPR". It would make sense to generate a string containing the first letters of all of the words in our input line, so that we can simply search for "TPR" in this string. So, let's add another variable:

   1 #!/bin/bash
   2 shopt -s extglob
   3 
   4 while read -ra tmp; do
   5   words=()
   6   letters=
   7   for w in "${tmp[@]}"; do
   8     w=${w##+([![:alpha:]])}
   9     w=${w%%+([![:alpha:]])}
  10     w=${w^}
  11     words+=("$w")
  12     letters+=${w:0:1}
  13   done

Searching

At this point, we have an acronym that we're searching for (e.g. TPR), and a string of letters reflecting the words from our input line. We can simply check whether the acronym appears in the larger string:

  if [[ $letters = *"$1"* ]]; then

Now, this is where bash falls a bit short of what we want. In most other languages, there is some function like strstr() which will not only tell you whether a substring match exists, but where. If we're going to report the backronym that we've found, we need to know which words to print. Without a function like strstr to tell us where the substring begins, we'll have to do it ourselves.

We could in fact remove the $letters = *"$1"* check entirely, and only use the manual substring search. But my gut tells me that doing it this way may be slightly faster -- we will presumably not find a match on most input lines, so avoiding the manual substring search, which is likely to be pretty slow in bash, is desirable.

For reinventing a subtring search, remember that we were told that we don't have to search the whole file at once -- only single lines. Inputs will be small. So, anything complicated like Boyer-Moore would be a net loss; the additional overhead would cost more time than the algorithm would save. We'll just use a brute force approach. My first thought is to use a sliding window, so let's try that first.

  if [[ $letters = *"$1"* ]]; then
    # Find the beginning of the match.
    for ((i=0; i <= ${#letters} - ${#1}; i++)); do
      if [[ ${letters:i:${#1}} = "$1" ]]; then
        printf '%s\n' "${words[*]:i:${#1}}"
      fi
    done
  fi

We'll want to be especially careful to check for off-by-one errors here. Be sure to test with inputs where the match occurs at the beginning or end of the line. For that matter, also be sure to test with inputs where the match is the entire line, and where the line is too small for a match to be possible.

We can also move some of those ${#foo} expansions into variables, which will serve two purposes: it will make the code easier to read, and it may even improve speed.

As written, this code should print multiple matches from a single line; we'll also want to test that.

Testing

Let's take a moment to write an input file for testing. We can have some fun here.

"That plonker!  Really!!"
The pollen really messed with my sinuses.
Nobody expects the Polish Resistance!
These lines don't match.
Ugh.

As noted previously, I'm testing lines that match at the beginning, lines that match at the end, lines that have nothing but matching words, lines that don't match, and lines that are too short to possibly match. I've made sure there's leading and trailing punctuation to be discarded (multiple characters in at least one case), and that matching words have a mixture of upper- and lower-case first letters.

Here's the final cleaned-up version of our script, with variables added to hold string lengths:

   1 #!/bin/bash
   2 shopt -s extglob
   3 
   4 len=${#1}
   5 
   6 while read -ra tmp; do
   7   words=()
   8   letters=
   9   for w in "${tmp[@]}"; do
  10     w=${w##+([![:alpha:]])}
  11     w=${w%%+([![:alpha:]])}
  12     w=${w^}
  13     words+=("$w")
  14     letters+=${w:0:1}
  15   done
  16 
  17   if [[ $letters = *"$1"* ]]; then
  18     # Find the beginning of the match.
  19     n=${#letters}
  20     for ((i=0; i <= n - len; i++)); do
  21       if [[ ${letters:i:len} = "$1" ]]; then
  22         printf '%s\n' "${words[*]:i:len}"
  23       fi
  24     done
  25   fi
  26 done

And testing it:

$ ./backronym TPR < foo
That Plonker Really
The Pollen Really
The Polish Resistance
$ ./backronym MW < foo
Messed With

Well, that seems to work. In real life, we'd also need to add some basic error checking -- print a usage summary if no acronym argument is given, at the very least. We may also want to capitalize the argument, as currently a lower-case argument will never be matched. But these improvements are... well, you know.

<- Example 3: Grouping integers into ranges | Example 4: Searching for backronyms

BashProgramming/Ex04 (last edited 2020-01-24 18:04:59 by GreyCat)