<- [[../Ex03|Example 3: Grouping integers into ranges]] | '''Example 4: Searching for backronyms''' = Searching for backronyms = A #bash user presented the following fun problem: {{{ I have a list of capitalized strings, and i want to search a file to see if any of the string matches the first letter of a sequence of words. For example it the capitalized string is ‘TPR’ it should find ‘Total Physical Response’ in the text. But the length of the search string is variable. What would be the best way to do this? all the punctuation and crap is alread taken care of... each sentence is on its own line with any punctuation.. with=without Is the "sentence"/"line" thing relevant? Are you supposed to NOT count a sequence if it spans a line boundary? yes if it crosses the line boundary it dose not need to count it.. }}} This is not really well-suited to bash, but as a simple programming exercise in most other languages, it would be '''easy'''. It's still pretty easy in bash, although we'll have to do a bit of extra work that most languages could handle for us. I'm going to add two additional specifications to make the problem slightly more challenging, and the resulting script more useful in real life: 1. Punctuation is '''not''' already taken care of, and should be discarded. 1. Input words are '''not ''' already capitalized. == Input and sanitization == We're still accepting the asker's assertion that the input file has one sentence per line, or at least, that we don't care about backronyms that span a line boundary. Therefore, the obvious first step is to read the input file line by line, and to break each line into "words" which we store in an array. For now, let's assume the input file is fed via stdin, and the acronym/initialism is given as an argument. Now, we need to decide what we want to do about punctuation. In normal English prose, words may have various punctuation characters attached to their fronts or backs, and some may have hyphens or apostrophes embedded within them. For the purposes of this script, I'm going to strip all ''leading'' and ''trailing'' punctuation, but leave ''internal'' punctuation alone. Stripping leading and trailing characters from a string is best done with an [[glob|extglob]], so we'll turn that on. We can use bash 4.0's parameter expansion syntax to capitalize the first letter of each word as well. {{{#!highlight bash #!/bin/bash shopt -s extglob while read -ra tmp; do words=() for w in "${tmp[@]}"; do w=${w##+([![:alpha:]])} w=${w%%+([![:alpha:]])} w=${w^} words+=("$w") done }}} At this point, we've read a single line, broken it into words on whitespace, stripped leading and trailing punctuation from each word, capitalized the first letter, and stuffed the sanitized words into an array. Next, we might want to think about how we're going to ''search'' for the acronym argument. It's presented as a single string, like "TPR". It would make sense to generate a string containing the first letters of all of the words in our input line, so that we can simply search for "TPR" in this string. So, let's add another variable: {{{#!highlight bash #!/bin/bash shopt -s extglob while read -ra tmp; do words=() letters= for w in "${tmp[@]}"; do w=${w##+([![:alpha:]])} w=${w%%+([![:alpha:]])} w=${w^} words+=("$w") letters+=${w:0:1} done }}} == Searching == At this point, we have an acronym that we're searching for (e.g. TPR), and a string of letters reflecting the words from our input line. We can simply check whether the acronym appears in the larger string: {{{ if [[ $letters = *"$1"* ]]; then }}} Now, this is where bash falls a bit short of what we want. In most other languages, there is some function like `strstr()` which will not only tell you whether a substring match exists, but ''where''. If we're going to report the backronym that we've found, we need to know which words to print. Without a function like `strstr` to tell us where the substring begins, we'll have to do it ourselves. We could in fact remove the `$letters = *"$1"*` check entirely, and ''only'' use the manual substring search. But my gut tells me that doing it this way may be slightly faster -- we will presumably ''not'' find a match on most input lines, so avoiding the manual substring search, which is likely to be pretty slow in bash, is desirable. For reinventing a subtring search, remember that we were told that we ''don't'' have to search the whole file at once -- only single lines. Inputs will be small. So, anything complicated like [[WikiPedia:Boyer–Moore string-search algorithm|Boyer-Moore]] would be a net loss; the additional overhead would cost more time than the algorithm would save. We'll just use a brute force approach. My first thought is to use a sliding window, so let's try that first. {{{ if [[ $letters = *"$1"* ]]; then # Find the beginning of the match. for ((i=0; i <= ${#letters} - ${#1}; i++)); do if [[ ${letters:i:${#1}} = "$1" ]]; then printf '%s\n' "${words[*]:i:${#1}}" fi done fi }}} We'll want to be especially careful to check for off-by-one errors here. Be sure to test with inputs where the match occurs at the beginning or end of the line. For that matter, also be sure to test with inputs where the match ''is'' the entire line, and where the line is too small for a match to be possible. We can also move some of those `${#foo}` expansions into variables, which will serve two purposes: it will make the code easier to read, and it may even improve speed. As written, this code should print multiple matches from a single line; we'll also want to test that. == Testing == Let's take a moment to write an input file for testing. We can have some fun here. {{{ "That plonker! Really!!" The pollen really messed with my sinuses. Nobody expects the Polish Resistance! These lines don't match. Ugh. }}} As noted previously, I'm testing lines that match at the beginning, lines that match at the end, lines that have nothing ''but'' matching words, lines that ''don't match'', and lines that are too short to possibly match. I've made sure there's leading and trailing punctuation to be discarded (multiple characters in at least one case), and that matching words have a mixture of upper- and lower-case first letters. Here's the final cleaned-up version of our script, with variables added to hold string lengths: {{{#!highlight bash #!/bin/bash shopt -s extglob len=${#1} while read -ra tmp; do words=() letters= for w in "${tmp[@]}"; do w=${w##+([![:alpha:]])} w=${w%%+([![:alpha:]])} w=${w^} words+=("$w") letters+=${w:0:1} done if [[ $letters = *"$1"* ]]; then # Find the beginning of the match. n=${#letters} for ((i=0; i <= n - len; i++)); do if [[ ${letters:i:len} = "$1" ]]; then printf '%s\n' "${words[*]:i:len}" fi done fi done }}} And testing it: {{{ $ ./backronym TPR < foo That Plonker Really The Pollen Really The Polish Resistance $ ./backronym MW < foo Messed With }}} Well, that seems to work. In real life, we'd also need to add some basic error checking -- print a usage summary if no acronym argument is given, at the very least. We may also want to capitalize the argument, as currently a lower-case argument will never be matched. But these improvements are... well, you know. <- [[../Ex03|Example 3: Grouping integers into ranges]] | '''Example 4: Searching for backronyms'''