How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on: those tools work on lines and regular expressions, not on nested markup, so the results will not be what you want. In many cases, your best option is to write in a language that has real support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

lynx

You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

Add -image_links to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.
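For example, to keep only the attachment links from the listing above, pipe the output through grep (a minimal sketch; substitute whatever pattern matches the links you actually care about):

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/ | grep -F 'action=AttachFile'
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]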

xmllint

Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }
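
With the wrapper loaded in an interactive shell, the output ends cleanly instead of running into the next prompt (a small illustration, assuming the input file shown above is saved as foo.xml):

$ command xmllint --xpath 'string(//salary)' foo.xml    # the real binary; note the glued prompt below
70000$ xmllint --xpath 'string(//salary)' foo.xml       # the wrapper; adds the newline
70000
$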

Simple tricks:

  • Print the first salary tag:
      $ xmllint --xpath 'string(//salary)' foo.xml
      70000
  • Print all salary tags (note that this is not particularly useful in this form):
      $ xmllint --xpath '//salary/text()' foo.xml
      7000090000
  • Count the number of person tags:
      $ xmllint --xpath 'count(//person)' foo.xml
      2
  • Print each person's salary separately (these pieces combine nicely in a loop; see the sketch after this list):
      $ xmllint --xpath '//person[1]/salary/text()' foo.xml
      70000
      $ xmllint --xpath '//person[2]/salary/text()' foo.xml
      90000
  • Print bob's salary:
      $ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml
      70000
  • Print the second person's name:
      $ xmllint --xpath 'string(//person[2]/@name)' foo.xml
      sue
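
The count() and string() forms above combine nicely in a shell loop. Here is a rough sketch (bash) that prints every person's name and salary from the same foo.xml; it calls xmllint once per field, which is slow on large files, but it keeps the logic obvious:

n=$(xmllint --xpath 'count(//person)' foo.xml)    # number of person tags; 2 here
for ((i = 1; i <= n; i++)); do
    name=$(xmllint --xpath "string(//person[$i]/@name)" foo.xml)
    salary=$(xmllint --xpath "string(//person[$i]/salary)" foo.xml)
    printf '%s earns %s\n' "$name" "$salary"
done

This should print:

bob earns 70000
sue earns 90000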
