Differences between revisions 1 and 2
Revision 1 as of 2016-04-11 20:38:30
Size: 2931
Editor: GreyCat
Comment: How do I extract data from an HTML or XML file?
Revision 2 as of 2016-04-15 20:42:50
Size: 3039
Editor: GreyCat
Comment: Apparently you can use /@x to print the x="y" fields of a tag. Thanks geirha.

How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on. These tools work on lines and regular expressions, not on nested markup, so a script built on them breaks as soon as the formatting of the input changes. In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.


You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

Add -image_links to the lynx command to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.
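
As an illustration of that filtering step, here is a minimal sketch that keeps only the links ending in .pdf. The function name pdf_links and the choice of suffix are made up for this example; adjust the pattern to whatever you are looking for.

```shell
# pdf_links: read one URL per line on stdin and print only those
# ending in .pdf (hypothetical helper, not part of lynx).
pdf_links() {
  while IFS= read -r url; do
    case $url in
      *.pdf|*.PDF) printf '%s\n' "$url" ;;
    esac
  done
}

# Intended use, with the lynx command shown above:
#   lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/ | pdf_links
```

A case statement is used instead of grep so that the pattern is a simple shell glob and no extra process is needed, but grep would work just as well on this one-URL-per-line input.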


Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }
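
In a script, no wrapper is needed, because a command substitution strips trailing newlines anyway. The following is a minimal sketch of capturing a value that way. The helper name salary_of is hypothetical, and it assumes the name you pass contains no double quotes, since it is pasted directly into the XPath expression.

```shell
# salary_of NAME FILE: print the salary of the person called NAME.
# Hypothetical helper. Assumes NAME contains no double quotes, because
# it is inserted into the XPath expression as-is.
salary_of() {
  xmllint --xpath "string(//person[@name=\"$1\"]/salary)" "$2"
}

# Example call, assuming foo.xml is a well-formed file of person elements:
#   bob=$(salary_of bob foo.xml)
#   echo "bob earns $bob"
```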

Simple tricks:

  • Print the first salary tag:
    • $ xmllint --xpath 'string(//salary)' foo.xml
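  • Print the text of all salary tags (note that this is not particularly useful in this form: the values are unlabeled and, depending on the xmllint version, may run together with no separator):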
    • $ xmllint --xpath '//salary/text()' foo.xml
  • Count the number of person tags:
    • $ xmllint --xpath 'count(//person)' foo.xml
  • Print each person's salary separately:
    • $ xmllint --xpath '//person[1]/salary/text()' foo.xml
      $ xmllint --xpath '//person[2]/salary/text()' foo.xml
  • Print bob's salary:
    • $ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml 
  • Print the second person's name:
    • $ xmllint --xpath 'string(//person[2]/@name)' foo.xml
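
The count and index tricks above can be combined to walk over every person in a script. This is only a sketch: the function name print_staff is made up, and it runs xmllint twice per person, which is fine for small files but slow for large ones.

```shell
# print_staff FILE: print "name salary" for every person element in FILE.
# Hypothetical helper. The body runs in a subshell (note the parentheses)
# so that its variables do not leak into the caller.
print_staff() (
  file=$1
  # count(//person) gives the number of person elements.
  n=$(xmllint --xpath 'count(//person)' "$file") || exit
  i=1
  while [ "$i" -le "$n" ]; do
    # Select the i-th person by position, then read its attribute and child.
    name=$(xmllint --xpath "string(//person[$i]/@name)" "$file")
    salary=$(xmllint --xpath "string(//person[$i]/salary)" "$file")
    printf '%s %s\n' "$name" "$salary"
    i=$((i + 1))
  done
)

# Example call:
#   print_staff foo.xml
```

If you find yourself needing much more than this, that is usually the point at which a language with real XML support becomes the better choice.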

BashFAQ/113 (last edited 2018-02-10 09:39:50 by ip5f5ac798)