This is a .torrent file parser. It was written for fun and as a challenge, and is not necessarily intended for any serious use.

Torrent files are "bencoded". Writing a bencode parser in bash poses several difficult challenges, and pushes against the limitations of the language.
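
For reference, bencode has only four constructs: strings are length-prefixed (<length>:<bytes>), integers look like i<digits>e, and lists and dictionaries are l...e and d...e wrappers around their elements, with dictionary keys stored as sorted strings. Here is a small hand-made sample (not taken from the torrent used below), followed by the flat keys the parser builds from it:

d8:announce20:http://example.net/a4:infod6:lengthi1083e4:name8:file.txtee

.announce                http://example.net/a
.info.length             1083
.info.name               file.txt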

The approach used here is to parse the data stream and store the contents as fields within an associative array. The structure of the fields is determined dynamically during the parsing, which is by necessity recursive (a dictionary can contain a list, which can contain a dictionary, which can contain a list, etc.). Once the associative array has been built, all the fields become available for any purpose, assuming the application can determine how to identify the field(s) of interest.

This approach would also work well for parsing an XML document, I think. I haven't actually tried it yet.

First, here is the output. You can see the hierarchical structure of the field-namespace contained within the .torrent file. The dots between namespace elements were chosen arbitrarily, and could be replaced by another delimiter. I chose the dot partly because the .torrent namespace is reminiscent of the window namespace in Tk. The way I've represented lists with numeric namespace components is a design choice I made.

I chose to sort the output lexicographically, to keep related elements together; the order of the fields shown here is almost the same as the order inside the .torrent file (the dictionaries must be sorted lexicographically, but my use of .n. for lists, without zero-padding, means the lists are printed out of sequence if they have more than 10 elements).

$ ben /stuff/tori2007-09-11.torrent 
.announce                http://tracker.toritraders.com/announce.php
.comment                 2007-09-11
.created by              BitsOnWheels 1.0.6
.creation date           1192017330
.encoding                UTF-8
.info.files.0.length     1083
.info.files.0.path.0     Tori Amos 2007-09-11.txt
.info.files.1.length     1276
.info.files.1.path.0     tori2007-09-11.ffp
.info.files.10.length    40970187
.info.files.10.path.0    Tori2007-09-11d1t09.flac
.info.files.11.length    2100576
.info.files.11.path.0    Tori2007-09-11d1t10.flac
.info.files.12.length    37919531
.info.files.12.path.0    Tori2007-09-11d1t11.flac
.info.files.13.length    39882119
.info.files.13.path.0    Tori2007-09-11d1t12.flac
.info.files.14.length    32640182
.info.files.14.path.0    Tori2007-09-11d1t13.flac
.info.files.15.length    21839105
.info.files.15.path.0    Tori2007-09-11d1t14.flac
.info.files.16.length    35343829
.info.files.16.path.0    Tori2007-09-11d2t01.flac
.info.files.17.length    10502680
.info.files.17.path.0    Tori2007-09-11d2t02.flac
.info.files.18.length    22704770
.info.files.18.path.0    Tori2007-09-11d2t03.flac
.info.files.19.length    27933648
.info.files.19.path.0    Tori2007-09-11d2t04.flac
.info.files.2.length     36546205
.info.files.2.path.0     Tori2007-09-11d1t01.flac
.info.files.20.length    37142439
.info.files.20.path.0    Tori2007-09-11d2t05.flac
.info.files.21.length    48347811
.info.files.21.path.0    Tori2007-09-11d2t06.flac
.info.files.22.length    42226577
.info.files.22.path.0    Tori2007-09-11d2t07.flac
.info.files.23.length    38228003
.info.files.23.path.0    Tori2007-09-11d2t08.flac
.info.files.3.length     44009544
.info.files.3.path.0     Tori2007-09-11d1t02.flac
.info.files.4.length     27210926
.info.files.4.path.0     Tori2007-09-11d1t03.flac
.info.files.5.length     28364865
.info.files.5.path.0     Tori2007-09-11d1t04.flac
.info.files.6.length     29017426
.info.files.6.path.0     Tori2007-09-11d1t05.flac
.info.files.7.length     31116854
.info.files.7.path.0     Tori2007-09-11d1t06.flac
.info.files.8.length     26280337
.info.files.8.path.0     Tori2007-09-11d1t07.flac
.info.files.9.length     29748219
.info.files.9.path.0     Tori2007-09-11d1t08.flac
.info.name               Tori2007-09-11
.info.name.utf8          Tori2007-09-11
.info.piece length       524288

And here is the script:

#!/bin/bash
export LC_ALL=C

# Filename ($1)
# Results go into global associative array ben
declare -A ben
benparse() {
  local data skip p max
  [[ -r $1 ]] || { echo "cannot read file '$1'"; return 1; }
  IFS= read -rd '' data < <(tr \\0 \\1 <"$1")
  max=${#data}

  # Begin parsing file at offset 0, namespace ""
  bp_parse 0 ""
}

# Starting offset ($1), starting namespace ($2)
# Uses global AA ben, module variable data
# Return parsed data in module var p
# Return total char length of parsed data in module var skip
bp_parse() {
  (($1 >= max)) && return
  case "${data:$1:1}" in
  d)
    # Data dictionary, terminated by "e".  Get pairs.
    local i=$1 j=$(($1 + 1)) key value
    while ((j < max)) && [[ ${data:j:1} != e ]]; do
      bp_parse $j "$2."
      key=$p
      ((j+=skip))
      bp_parse $j "$2.$key"
      value=$p
      ((j+=skip))
      [[ $value ]] && ben["$2.$key"]=$value
    done
    p=""        # We populate the AA ourselves, rather than passing data back
    skip=$((j-i+1))
    ;;
  i)
    # Integer, terminated by "e"
    local i=$1 j=$(($1 + 1))
    while [[ ${data:j:1} != e ]]; do
      ((j++))
    done
    p=${data:i+1:j-i-1}
    skip=$((j-i+1))
    ;;
  l)
    # List, concatenated elements, terminated by "e"
    local i=$1 j=$(($1 + 1)) k=0 value
    while [[ ${data:j:1} != e ]]; do
      bp_parse $j "$2.$k"
      [[ $p ]] && ben["$2.$k"]=$p
      ((k++, j+=skip))
    done
    p=""
    skip=$((j-i+1))
    ;;
  *)
    # String, length-prefixed (integer, colon).  Get the length first.
    local n n_len
    bp_getnum $1
    n_len=${#n}
    p=${data:$1+n_len+1:n}
    skip=$((n_len+1+n))
    ;;
  esac
}

# Find an integer in data, beginning at offset ($1)
# Return value in upstream variable n
bp_getnum() {
  local i=$1 j=$1
  while [[ ${data:j:1} = [[:digit:]-] ]]; do
    ((j++))
  done
  n=${data:i:j-i}
}

benparse "$1"
# Dump the AA, indices in string-sorted order
# Skip the big binary blob
printf "%s\n" "${!ben[@]}" | sort | while IFS= read -r idx; do
  ((${#ben["$idx"]} > 80)) && continue
  printf "%-24.24s %s\n" "$idx" "${ben["$idx"]}"
done

Here are some points of interest in the code:

  • We override the locale (LC_ALL=C) because we're reading binary data from a file, and we don't want Bash doing anything funny like treating multiple bytes as a single character. (There's a short illustration of this after this list.)

  • The associative array requires Bash 4.0 or higher. The machine on which this was written and tested has Bash 4.1, which means the associative array (ben) has to be declared in the global scope. In Bash 4.2, the AA could have been declared inside the benparse function using declare -g (there's a sketch of this after the list).

  • There are a lot of function calls in this parser, and the functions often need to return multiple pieces of data -- both strings and integers. Both for performance reasons, and because multiple returned elements are needed, I decided to return the data in variables that would be available to the function's caller (rather than some sort of foo=$(myfunc) capturing, which has many issues).

    • In order to permit this, the variables to be populated by the function must be global, or must be declared at a scope outside of the function. Bash uses dynamic scoping, meaning that in a function call tree f -> g -> h, variables declared in g are available to h. If a variable by the same name is declared in f, or at the global scope, the definition in g will shadow it.

    • So, we declare all the variables that our functions will populate as return values (p, skip, etc.) in the caller. The last sketch after this list shows this pattern in isolation.

    • The variables declared at the top of the benparse function (the first function in the call chain) are available to every function in the parser, but not to the global scope. This means every function "shares" the same data variable (the full bencoded stream), along with max, p, and skip; each function simply indexes data as a string as needed.

  • We can't read NUL bytes from the data file into a Bash variable, and .torrent files most definitely contain them (within the data dictionary!). So we munge all the NULs into \1s during the read. The resulting munged data is no longer usable for its real purpose. You can't write an actual BitTorrent client in Bash, or at least, not using this approach.

  • When we dump the AA at the end of the script, we sort the indices (piping to sort and reading them back in), and omit any field whose value is longer than 80 characters. In my .torrent file, there's a gigantic binary blob called pieces (.info.pieces in my namespace). This is the field whose NUL bytes were munged, and which is thus ruined for its purpose anyway.
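
Here are a few stand-alone sketches expanding on the points above. First, a minimal illustration (not part of the parser) of why the locale matters: under a UTF-8 locale Bash counts characters, so a multi-byte sequence has length 1, while under LC_ALL=C it counts bytes, which is what we need when computing offsets into binary data. This assumes an en_US.UTF-8 locale is installed on the system.

s=$'\xc3\xa9'            # the two bytes of the UTF-8 encoding of "é"
LC_ALL=en_US.UTF-8
echo "${#s}"             # prints 1 -- one character
LC_ALL=C
echo "${#s}"             # prints 2 -- two bytes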
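
Next, the Bash 4.2 alternative mentioned above: declare -g creates a variable at the global scope even when used inside a function, so the array wouldn't have to be declared at the top level. A tiny stand-alone demonstration (the setup function name is made up):

setup() {
  declare -gA ben        # without -g, declare inside a function would make ben local
  ben[example]=1
}
setup
echo "${ben[example]}"   # prints 1; a plain "declare -A ben" in setup would have vanished on return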
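
Finally, the return-values-in-caller-variables pattern in isolation. The names f, g and result are made up for illustration; the point is that the callee's assignment lands in the caller's local variable, thanks to dynamic scoping, and never reaches the global scope.

g() {
  # No "local result" here: the assignment goes to the nearest enclosing
  # declaration, which is f's local.
  result="computed by g from '$1'"
}
f() {
  local result           # declared in the caller, populated by the callee
  g "some input"
  echo "f sees: $result"
}
f                                           # prints: f sees: computed by g from 'some input'
echo "global result is '${result-unset}'"   # prints: global result is 'unset'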
