Differences between revisions 1 and 2
Revision 1 as of 2019-04-15 18:56:07
Size: 5298
Editor: GreyCat
Comment: Exercise 3: Grouping integers into ranges
Revision 2 as of 2020-01-24 18:06:08
Size: 5404
Editor: GreyCat
Comment: link to Ex04
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
<- [[../Ex02|Example 2: Unraveling the X-Y problem]] | '''Example 3: Grouping integers into ranges''' <- [[../Ex02|Example 2: Unraveling the X-Y problem]] | '''Example 3: Grouping integers into ranges''' | [[../Ex04|Example 4: Searching for backronyms]] ->
Line 108: Line 108:
<- [[../Ex02|Example 2: Unraveling the X-Y problem]] | '''Example 3: Grouping integers into ranges''' <- [[../Ex02|Example 2: Unraveling the X-Y problem]] | '''Example 3: Grouping integers into ranges''' | [[../Ex04|Example 4: Searching for backronyms]] ->

<- Example 2: Unraveling the X-Y problem | Example 3: Grouping integers into ranges | Example 4: Searching for backronyms ->

Grouping integers into ranges

This is a pure programming exercise. For an experienced programmer, it is probably easy. For someone who is more accustomed to the kinds of problems that bash normally deals with, it might be medium difficulty.

<asker> Hello I have a long list of numbers on each line. I have sorted them
        and made them unique. I am intrested in seeing the ranges of numbers
        I have in this format for example... 3-500, 505-570, 600, 800-1024,
        3000-4095.
<asker> they are sorted and unique. they only go from 1 to 4095

For the purposes of this page, I ignored the claim that the input is already sorted. The solution I wrote does its own sorting, just in case.

I also assumed that each input line is formatted with whitespace between the numbers.

Sorting the input

I thought of two ways to approach this problem. The first way is to count from 1 to 4095 (lowest to highest possible values), and see whether each integer is present in the input. This approach may work well for input files with small ranges, like this one, but what happens if the numbers could be much larger? This would be inefficient.

The second approach is to sort the numbers on each line, then iterate over the sorted sets, comparing each number against the previous number. If the difference between two adjacent numbers is greater than one, then we've found a gap in the sequence.

So, we begin by reading a line at a time from the input, and sorting the numbers on that line:

while read -ra tmp; do
  # Sort the numbers.
  mapfile -t tmp < <(printf '%s\n' "${tmp[@]}" | sort -n)

Bash has no realistic way to sort an array other than by piping it to sort(1), so there we have it. After the mapfile command, our array will contain a list of the numbers, sorted from smallest to largest.

Generating a list of ranges

Next, we iterate over the numbers and find the gaps which will tell us where each range, or block, starts and ends. This is somewhat tedious, but it's the kind of operation that a programmer needs to be able to do.

Think for a moment about what information you need when you're looking at a number in a list. We only want to make one pass over the list, so what do we have to remember about the past numbers, when we're looking at the current number?

Essentially, we need to know whether the current number is the continuation of a block, or the start of a new block. So, we need to know the value of the previous number in the list (or have a way to get that value).

Next, what happens if we're starting a new block? We need to be able to write out the endpoints of the current block. Therefore, we also need to remember the starting point of the current block. The end of the current block is simply the previous value, which we already have.

Thus:

  blockstart=
  prev=
  out=()
  for i in "${tmp[@]}"; do
    if [[ ! $prev ]]; then
      # Start of the list = start of a new block.
      prev=$i
      blockstart=$i
      continue
    fi

    if ((i - prev > 1)); then
      # We've found a gap.  Report the previous block and start a new one.
      if ((prev == blockstart)); then
        # Block of one value.
        out+=("$blockstart")
      else
        # Block of more than one value.
        out+=("$blockstart-$prev")
      fi
      blockstart=$i
    fi

    prev=$i
  done

But remember: once we're done traversing the list of numbers, we still have an unprinted block sitting around waiting to be reported. So, we repeat a chunk of code one more time:

  # Report the final block.
  if ((prev == blockstart)); then
    out+=("$blockstart")
  else
    out+=("$blockstart-$prev")
  fi

After the loop, and then this final repeated if statement, we have an array named out which contains all of the "blocks" of numbers. Now we just need to print that in the desired format.

Printing the output

The output format was given as a list of blocks with a comma and a space between them. Since this delimiter is two characters long, we can't use the ${out[*]} expansion with IFS -- that only allows a single-character delimiter.

Instead, we'll use the implicit looping of printf, which will leave us with an extra delimiter at the end, which we can remove.

  # Print the list of blocks.
  printf -v tmp "%s, " "${out[@]}"
  tmp=${tmp%, }
  printf '%s\n' "$tmp"
done

And there we have it -- the program is complete.

I chose to iterate over the list of numbers by value (i.e. using "${tmp[@]}") rather than by index (using "${!tmp[@]}"). This means I needed to store the previous value in a separate variable (prev). Had I chosen to iterate by index, I could have used "${tmp[i]}" and "${tmp[i-1]}". This would also entail several more array element expansions where I used simple string variable expansions.

Each approach has its advantages and disadvantages. Feel free to try writing it both ways.

<- Example 2: Unraveling the X-Y problem | Example 3: Grouping integers into ranges | Example 4: Searching for backronyms ->

BashProgramming/Ex03 (last edited 2020-01-24 18:06:08 by GreyCat)