How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on (it leads to undesired results). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

lynx

You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]

Add -image_links to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

xmllint

Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }

Simple tricks:

Namespaces

The above examples show that it is fairly easy to parse XML when you have a decent XML parser, but this defeats the purpose of XML, which is to make everyone miserable. Therefore some clever people introduced XML namespaces.

An example of such technology is a typical maven build file, called pom.xml, which looks something like this

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>

There will usually be a few hundred lines dedicated to dependencies too, but let's keep it short.

With the examples from the previous chapter, we know that extracting the version from this file will simply be to use the xpath /project/version/text():

$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty

Well no, because the author has cleverly added a default namespace for this xmlns="http://maven.apache.org/POM/4.0.0", so now you first have to specify that exact url before you can address that you want the text inside the version element inside the project element.

xmllint --shell

xmllint's --xpath option does not allow a way to specify the namespace, so it's now off the table (unless you edit the file and remove the namespace declaration). Its shell feature does allow setting the namespace though

xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / >  -------
1.0-SNAPSHOT
/ >

Yea! We got the version number ... plus some prompts and crap from the xmllint shell which will have to be removed afterwards.

xmlstarlet

xmlstarlet is a bit easier to use for this

$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT

python

python bundles with an xml parser too, and is generally more available than xmllint and xmlstarlet. It also allows dealing with namespaces in a cludgy fashion.

$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT

xsltproc

xsltproc happens to be installed on most linux systems. For example to extract titles and urls of a podcast:

xslt() {
cat << 'EOX'
<?xml version="1.0"?>
<x:stylesheet version="1.0" xmlns:x="http://www.w3.org/1999/XSL/Transform">
<x:output method="text" />
<x:template match="/">
<x:for-each select="//item">
        <x:text># </x:text>
        <x:value-of select="./title/text()" /><x:text>&#10;<!-- newline --></x:text>
        <x:value-of select="enclosure/@url" /><x:text>&#10;</x:text>
</x:for-each>
</x:template>
</x:stylesheet>
EOX
}

curl -s http://podcasts.files.bbci.co.uk/p02nq0lx.rss | xsltproc <(xslt) -

BashFAQ/113 (last edited 2024-05-27 21:55:46 by 107)