This is a .torrent file parser. It was written for fun/challenge purposes, and is not necessarily intended to be useful for any serious purpose.
Torrent files are "bencoded". Writing a bencode parser in bash poses several difficult challenges, and pushes against the limitations of the language.
The approach used here is to parse the data stream and store the contents as fields within an associative array. The structure of the fields is determined dynamically during the parsing, which is by necessity recursive (a dictionary can contain a list, which can contain a dictionary, which can contain a list, etc.). Once the associative array has been built, all the fields become available for any purpose, assuming the application can determine how to identify the field(s) of interest.
This approach would also work well for parsing an XML document, I think. I haven't actually tried it yet.
First, here is the output. You can see the hierarchical structure of the field-namespace contained within the .torrent file. The dots between namespace elements were chosen arbitrarily, and could be replaced by another delimiter. I chose the dot partly because the .torrent namespace is reminiscent of the window namespace in Tk. The way I've represented lists with numeric namespace components is a design choice I made.
I chose to sort the output lexicographically, to keep related elements together; the order of the fields shown here is almost the same as the order inside the .torrent file (the dictionaries must be sorted lexicographically, but my use of .n. for lists, without zero-padding, means the lists are printed out of sequence if they have more than 10 elements).
$ ben /stuff/tori2007-09-11.torrent .announce http://tracker.toritraders.com/announce.php .comment 2007-09-11 .created by BitsOnWheels 1.0.6 .creation date 1192017330 .encoding UTF-8 .info.files.0.length 1083 .info.files.0.path.0 Tori Amos 2007-09-11.txt .info.files.1.length 1276 .info.files.1.path.0 tori2007-09-11.ffp .info.files.10.length 40970187 .info.files.10.path.0 Tori2007-09-11d1t09.flac .info.files.11.length 2100576 .info.files.11.path.0 Tori2007-09-11d1t10.flac .info.files.12.length 37919531 .info.files.12.path.0 Tori2007-09-11d1t11.flac .info.files.13.length 39882119 .info.files.13.path.0 Tori2007-09-11d1t12.flac .info.files.14.length 32640182 .info.files.14.path.0 Tori2007-09-11d1t13.flac .info.files.15.length 21839105 .info.files.15.path.0 Tori2007-09-11d1t14.flac .info.files.16.length 35343829 .info.files.16.path.0 Tori2007-09-11d2t01.flac .info.files.17.length 10502680 .info.files.17.path.0 Tori2007-09-11d2t02.flac .info.files.18.length 22704770 .info.files.18.path.0 Tori2007-09-11d2t03.flac .info.files.19.length 27933648 .info.files.19.path.0 Tori2007-09-11d2t04.flac .info.files.2.length 36546205 .info.files.2.path.0 Tori2007-09-11d1t01.flac .info.files.20.length 37142439 .info.files.20.path.0 Tori2007-09-11d2t05.flac .info.files.21.length 48347811 .info.files.21.path.0 Tori2007-09-11d2t06.flac .info.files.22.length 42226577 .info.files.22.path.0 Tori2007-09-11d2t07.flac .info.files.23.length 38228003 .info.files.23.path.0 Tori2007-09-11d2t08.flac .info.files.3.length 44009544 .info.files.3.path.0 Tori2007-09-11d1t02.flac .info.files.4.length 27210926 .info.files.4.path.0 Tori2007-09-11d1t03.flac .info.files.5.length 28364865 .info.files.5.path.0 Tori2007-09-11d1t04.flac .info.files.6.length 29017426 .info.files.6.path.0 Tori2007-09-11d1t05.flac .info.files.7.length 31116854 .info.files.7.path.0 Tori2007-09-11d1t06.flac .info.files.8.length 26280337 .info.files.8.path.0 Tori2007-09-11d1t07.flac .info.files.9.length 29748219 .info.files.9.path.0 Tori2007-09-11d1t08.flac .info.name Tori2007-09-11 .info.name.utf8 Tori2007-09-11 .info.piece length 524288
And here is the script:
1 #!/bin/bash
2 export LC_ALL=C
3
4 # Filename ($1)
5 # Results go into global associative array ben
6 declare -A ben
7 benparse() {
8 local data skip p max
9 [[ -r $1 ]] || { echo "cannot read file '$1'"; return 1; }
10 IFS= read -rd '' data < <(tr \\0 \\1 <"$1")
11 max=${#data}
12
13 # Begin parsing file at offset 0, namespace ""
14 bp_parse 0 ""
15 }
16
17 # Starting offset ($1), starting namespace ($2)
18 # Uses global AA ben, module variable data
19 # Return parsed data in module var p
20 # Return total char length of parsed data in module var skip
21 bp_parse() {
22 (($1 >= max)) && return
23 case "${data:$1:1}" in
24 d)
25 # Data dictionary, terminated by "e". Get pairs.
26 local i=$1 j=$(($1 + 1)) key value
27 while ((j < max)) && [[ ${data:j:1} != e ]]; do
28 bp_parse $j "$2."
29 key=$p
30 ((j+=skip))
31 bp_parse $j "$2.$key"
32 value=$p
33 ((j+=skip))
34 [[ $value ]] && ben["$2.$key"]=$value
35 done
36 p="" # We populate the AA ourselves, rather than passing data back
37 skip=$((j-i+1))
38 ;;
39 i)
40 # Integer, terminated by "e"
41 local i=$1 j=$(($1 + 1))
42 while [[ ${data:j:1} != e ]]; do
43 ((j++))
44 done
45 p=${data:i+1:j-i-1}
46 skip=$((j-i+1))
47 ;;
48 l)
49 # List, concatenated elements, terminated by "e"
50 local i=$1 j=$(($1 + 1)) k=0 value
51 while [[ ${data:j:1} != e ]]; do
52 bp_parse $j "$2.$k"
53 [[ $p ]] && ben["$2.$k"]=$p
54 ((k++, j+=skip))
55 done
56 p=""
57 skip=$((j-i+1))
58 ;;
59 *)
60 # String, length-prefixed (integer, colon). Get the length first.
61 local n n_len
62 bp_getnum $1
63 n_len=${#n}
64 p=${data:$1+n_len+1:n}
65 skip=$((n_len+1+n))
66 ;;
67 esac
68 }
69
70 # Find an integer in data, beginning at offset ($1)
71 # Return value in upstream variable n
72 bp_getnum() {
73 local i=$1 j=$1
74 while [[ ${data:j:1} = [[:digit:]-] ]]; do
75 ((j++))
76 done
77 n=${data:i:j-i}
78 }
79
80 benparse "$1"
81 # Dump the AA, indices in string-sorted order
82 # Skip the big binary blob
83 printf "%s\n" "${!ben[@]}" | sort | while IFS= read -r idx; do
84 ((${#ben["$idx"]} > 80)) && continue
85 printf "%-24.24s %s\n" "$idx" "${ben["$idx"]}"
86 done
Here are some points of interest in the code:
We override the locale (LC_ALL=C) because we're reading binary data from a file, and we don't want Bash doing anything funny like treating multiple bytes as a single character.
The associative array requires Bash 4.0 or higher. The machine on which this was written and tested has Bash 4.1, which means the associative array (ben) has to be declared in the global scope. In Bash 4.2, the AA could have been declared inside the benparse function using declare -g.
There are a lot of function calls in this parser, and the functions often need to return multiple pieces of data -- both strings and integers. Both for performance reasons, and because multiple returned elements are needed, I decided to return the data in variables that would be available to the function's caller (rather than some sort of foo=$(myfunc) capturing, which has many issues).
In order to permit this, the variables to be populated by the function must be global, or must be declared at a scope outside of the function. Bash uses dynamic scoping, meaning that in a function call tree f -> g -> h, variables declared in g are available to h. If a variable by the same name is declared in f, or at the global scope, the definition in g will shadow it.
So, we declare all the variables that our functions will populate as return values (p, skip, etc.) in the caller.
The variables declared at the top of the benparse function (the first function in the call chain) are available to every function in the parser, but not to the global scope. This means every function "shares" the same data (the full bencoded stream), and so on. They simply index it as a string as needed.
We can't read NUL bytes from the data file, and .torrent files most definitely contain them (within the data dictionary!). So we munge all the NULs into \1s during the read. The resulting munged data will not be usable for whatever it does. You can't write an actual Bit Torrent client in Bash, or at least, not using this approach.
When we dump the AA at the end of the script, we sort the indices (piping to sort and reading them back in), and omit any long fields. In my .torrent file, there's a gigantic binary blob called pieces (.info.pieces in my namespace). This is the data that has munged NUL bytes and is thus ruined for its purpose.
- If your version of 'sort' supports the -V (natural sort of version number within text), then this solves the 'lists are printed out of sequence if they have more than 10 elements' problem. However, it does then alter the order of other elements. Strangely, this is fixed by piping the output of the script to `sort -V' again! --Nemo