BashFAQ/113 - Greg's Wiki

How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on (it leads to undesired results). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

lynx

You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]

Add -image_links to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

I tried my luck and found a way. Not well tested. We can use --rejected-log and a --reject-regex that always matches. We use --spider to not save the file.

$ wget -q --spider -r --rejected-log=rejected --reject-regex=^ http://mywiki.wooledge.org/
$ cat rejected
REASON  U_URL   U_SCHEME        U_HOST  U_PORT  U_PATH  U_PARAMS        U_QUERY U_FRAGMENT      P_URL   P_SCHEME        P_HOST  P_PORT  P_PATH  P_PARAMS        P_QUERY P_FRAGMENT
REGEX   http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/common/js/common.js      http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                              
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css   SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/common.css                                http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                              
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css   SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/screen.css                                http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                              
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css    SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/print.css                         http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                              
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css       SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/projection.css                            http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                              
[...]

To extract the links to stdout:

$ wget -q --spider -r --rejected-log=/dev/stdout --reject-regex=^ http://mywiki.wooledge.org/ | tail -n +2 | cut -f 2
http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css
[...]

xmllint

Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }

Simple tricks:

Print the first salary tag:

$ xmllint --xpath 'string(//salary)' foo.xml
70000

Print all salary tags (note that this is not particularly useful in this form):
- ```
$ xmllint --xpath '//salary/text()' foo.xml
7000090000
```

Count the number of person tags:

$ xmllint --xpath 'count(//person)' foo.xml
2

Print each person's salary separately:

$ xmllint --xpath '//person[1]/salary/text()' foo.xml
70000
$ xmllint --xpath '//person[2]/salary/text()' foo.xml
90000

Print bob's salary:

$ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml 
70000

Print the second person's name:

$ xmllint --xpath 'string(//person[2]/@name)' foo.xml
sue

Namespaces

The above examples show that it is fairly easy to parse XML when you have a decent XML parser, but this defeats the purpose of XML, which is to make everyone miserable. Therefore some clever people introduced XML namespaces.

An example of such technology is a typical maven build file, called pom.xml, which looks something like this

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>

There will usually be a few hundred lines dedicated to dependencies too, but let's keep it short.

With the examples from the previous chapter, we know that extracting the version from this file will simply be to use the xpath /project/version/text():

$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty

Well no, because the author has cleverly added a default namespace for this xmlns="http://maven.apache.org/POM/4.0.0", so now you first have to specify that exact url before you can address that you want the text inside the version element inside the project element.

xmllint --shell

xmllint's --xpath option does not allow a way to specify the namespace, so it's now off the table (unless you edit the file and remove the namespace declaration). Its shell feature does allow setting the namespace though

xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / >  -------
1.0-SNAPSHOT
/ >

Yea! We got the version number ... plus some prompts and crap from the xmllint shell which will have to be removed afterwards.

xmlstarlet

xmlstarlet is a bit easier to use for this

$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT

python

python bundles with an xml parser too, and is generally more available than xmllint and xmlstarlet. It also allows dealing with namespaces in a cludgy fashion.

$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT

xsltproc

xsltproc happens to be installed on most linux systems. For example to extract titles and urls of a podcast:

xslt() {
cat << 'EOX'
<?xml version="1.0"?>
<x:stylesheet version="1.0" xmlns:x="http://www.w3.org/1999/XSL/Transform">
<x:output method="text" />
<x:template match="/">
<x:for-each select="//item">
        <x:text># </x:text>
        <x:value-of select="./title/text()" /><x:text>&#10;<!-- newline --></x:text>
        <x:value-of select="enclosure/@url" /><x:text>&#10;</x:text>
</x:for-each>
</x:template>
</x:stylesheet>
EOX
}

curl -s http://podcasts.files.bbci.co.uk/p02nq0lx.rss | xsltproc <(xslt) -