Differences between revisions 1 and 3 (spanning 2 versions)
Revision 1 as of 2016-04-11 20:38:30
Size: 2931
Editor: GreyCat
Comment: How do I extract data from an HTML or XML file?
Revision 3 as of 2017-05-02 18:35:29
Size: 5521
Editor: geirha
Comment: The joy of namespaces
Deletions are marked like this. Additions are marked like this.
Line 68: Line 68:
 * Print the second person's name:
  {{{
$ xmllint --xpath 'string(//person[2]/@name)' foo.xml
sue
}}}

=== Namespaces ===

The above examples show that it is fairly easy to parse XML when you have a decent XML parser, but this defeats the purpose of XML, which is to make everyone miserable. Therefore some clever people introduced XML namespaces.

An example of such technology is a typical maven build file, called `pom.xml`, which looks something like this

{{{#!xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>
}}}

There will usually be a few hundred lines dedicated to dependencies too, but let's keep it short.

With the examples from the previous chapter, we know that extracting the version from this file will simply be to use the xpath `/project/version/text()`:

{{{
$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty
}}}

Well no, because the author has cleverly added a default namespace for this xmlns="http://maven.apache.org/POM/4.0.0", so now you first have to specify that exact url before you can address that you want the text inside the version element inside the project element.

==== xmllint --shell ====

xmllint's --xpath option does not allow a way to specify the namespace, so it's now off the table (unless you edit the file and remove the namespace declaration). Its shell feature does allow setting the namespace though

{{{
xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / > -------
1.0-SNAPSHOT
/ >
}}}

Yea! We got the version number ... plus some prompts and crap from the xmllint shell which will have to be removed afterwards.

==== xmlstarlet ====

xmlstarlet is a bit easier to use for this

{{{
$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT
}}}

==== python ====

python bundles with an xml parser too, and is generally more available than xmllint and xmlstarlet. It also allows dealing with namespaces in a cludgy fashion.

{{{
$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT
}}}

How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on (it leads to undesired results). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

lynx

You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

Add -image_links to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.

xmllint

Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }

Simple tricks:

  • Print the first salary tag:
    • $ xmllint --xpath 'string(//salary)' foo.xml
      70000
  • Print all salary tags (note that this is not particularly useful in this form):
    • $ xmllint --xpath '//salary/text()' foo.xml
      7000090000
  • Count the number of person tags:
    • $ xmllint --xpath 'count(//person)' foo.xml
      2
  • Print each person's salary separately:
    • $ xmllint --xpath '//person[1]/salary/text()' foo.xml
      70000
      $ xmllint --xpath '//person[2]/salary/text()' foo.xml
      90000
  • Print bob's salary:
    • $ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml 
      70000
  • Print the second person's name:
    • $ xmllint --xpath 'string(//person[2]/@name)' foo.xml
      sue

Namespaces

The above examples show that it is fairly easy to parse XML when you have a decent XML parser, but this defeats the purpose of XML, which is to make everyone miserable. Therefore some clever people introduced XML namespaces.

An example of such technology is a typical maven build file, called pom.xml, which looks something like this

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>

There will usually be a few hundred lines dedicated to dependencies too, but let's keep it short.

With the examples from the previous chapter, we know that extracting the version from this file will simply be to use the xpath /project/version/text():

$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty

Well no, because the author has cleverly added a default namespace for this xmlns="http://maven.apache.org/POM/4.0.0", so now you first have to specify that exact url before you can address that you want the text inside the version element inside the project element.

xmllint --shell

xmllint's --xpath option does not allow a way to specify the namespace, so it's now off the table (unless you edit the file and remove the namespace declaration). Its shell feature does allow setting the namespace though

xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / >  -------
1.0-SNAPSHOT
/ >

Yea! We got the version number ... plus some prompts and crap from the xmllint shell which will have to be removed afterwards.

xmlstarlet

xmlstarlet is a bit easier to use for this

$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT

python

python bundles with an xml parser too, and is generally more available than xmllint and xmlstarlet. It also allows dealing with namespaces in a cludgy fashion.

$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT

BashFAQ/113 (last edited 2022-03-03 01:20:25 by emanuele6)