How do I extract data from an HTML or XML file?

Do not attempt this with sed, awk, grep, and so on: those tools work on lines and regular expressions, not on nested markup, so the results will not be what you want. In many cases, your best option is to write in a language that has real support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

lynx

You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]

You'd think wget would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

Add -image_links to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.
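For example, to keep only the attachment links from the listing above, pipe the output through grep (a minimal sketch; substitute whatever pattern matches the links you actually care about):

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/ | grep -F 'action=AttachFile'
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]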

xmllint

Perhaps the best choice for most XML processing is xmllint. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

xmllint() { command xmllint "$@"; echo; }
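
With the wrapper loaded in an interactive shell, the output ends cleanly instead of running into the next prompt (a small illustration, assuming the input file shown above is saved as foo.xml):

$ command xmllint --xpath 'string(//salary)' foo.xml    # the real binary; note the glued prompt below
70000$ xmllint --xpath 'string(//salary)' foo.xml       # the wrapper; adds the newline
70000
$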

Simple tricks:

  • Print the first salary tag:
      $ xmllint --xpath 'string(//salary)' foo.xml
      70000
  • Print all salary tags (note that this is not particularly useful in this form):
      $ xmllint --xpath '//salary/text()' foo.xml
      7000090000
  • Count the number of person tags:
      $ xmllint --xpath 'count(//person)' foo.xml
      2
  • Print each person's salary separately (these pieces combine nicely in a loop; see the sketch after this list):
      $ xmllint --xpath '//person[1]/salary/text()' foo.xml
      70000
      $ xmllint --xpath '//person[2]/salary/text()' foo.xml
      90000
  • Print bob's salary:
      $ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml
      70000
  • Print the second person's name:
      $ xmllint --xpath 'string(//person[2]/@name)' foo.xml
      sue
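
The count() and string() forms above combine nicely in a shell loop. Here is a rough sketch (bash) that prints every person's name and salary from the same foo.xml; it calls xmllint once per field, which is slow on large files, but it keeps the logic obvious:

n=$(xmllint --xpath 'count(//person)' foo.xml)    # number of person tags; 2 here
for ((i = 1; i <= n; i++)); do
    name=$(xmllint --xpath "string(//person[$i]/@name)" foo.xml)
    salary=$(xmllint --xpath "string(//person[$i]/salary)" foo.xml)
    printf '%s earns %s\n' "$name" "$salary"
done

This should print:

bob earns 70000
sue earns 90000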
