== How do I extract data from an HTML or XML file? ==
'''Do not''' attempt this with sed, awk, grep, and so on (it leads to [[http://xrl.us/p0ny|undesired results]]). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

=== lynx ===
You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It's particularly good at extracting links from a document and printing them for you:

{{{
$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]
}}}

Add `-image_links` to include image links, if those are what you seek. Filtering the links according to your needs should be relatively simple now that each one is on a separate line with no HTML in the way.

You'd think `wget` would also be good at this, right? I mean, it has that recursive mirroring mode, so it obviously does this internally. Good luck finding a way to get it to print those URLs for you instead of downloading them all.

One way that seems to work, though it is not well tested: combine `--rejected-log` with a `--reject-regex` that always matches, and add `--spider` so that no files are saved.
{{{
$ wget -q --spider -r --rejected-log=rejected --reject-regex=^ http://mywiki.wooledge.org/
$ cat rejected
REASON U_URL U_SCHEME U_HOST U_PORT U_PATH U_PARAMS U_QUERY U_FRAGMENT P_URL P_SCHEME P_HOST P_PORT P_PATH P_PARAMS P_QUERY P_FRAGMENT
REGEX http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js SCHEME_HTTP mywiki.wooledge.org 80 moin_static198/common/js/common.js http%3A//mywiki.wooledge.org/ SCHEME_HTTP mywiki.wooledge.org 80
REGEX http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css SCHEME_HTTP mywiki.wooledge.org 80 moin_static198/modernized/css/common.css http%3A//mywiki.wooledge.org/ SCHEME_HTTP mywiki.wooledge.org 80
REGEX http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css SCHEME_HTTP mywiki.wooledge.org 80 moin_static198/modernized/css/screen.css http%3A//mywiki.wooledge.org/ SCHEME_HTTP mywiki.wooledge.org 80
REGEX http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css SCHEME_HTTP mywiki.wooledge.org 80 moin_static198/modernized/css/print.css http%3A//mywiki.wooledge.org/ SCHEME_HTTP mywiki.wooledge.org 80
REGEX http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css SCHEME_HTTP mywiki.wooledge.org 80 moin_static198/modernized/css/projection.css http%3A//mywiki.wooledge.org/ SCHEME_HTTP mywiki.wooledge.org 80
[...]
}}}

To extract the links to stdout:

{{{
$ wget -q --spider -r --rejected-log=/dev/stdout --reject-regex=^ http://mywiki.wooledge.org/ | tail -n +2 | cut -f 2
http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css
[...]
}}}

=== xmllint ===
Perhaps the best choice for most XML processing is `xmllint`.
Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introductions. Here are a few simple tricks. They are shown using the following input file:

{{{
<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>
}}}

Note that xmllint does not add a newline to its output. If you're capturing with a CommandSubstitution this is not an issue. If you're testing in an interactive shell, it will quickly become annoying. You may want to consider writing a wrapper function, like:

{{{
xmllint() { command xmllint "$@"; echo; }
}}}

Simple tricks:

 * Print the first salary tag: {{{
$ xmllint --xpath 'string(//salary)' foo.xml
70000
}}}
 * Print all salary tags (note that this is not particularly useful in this form): {{{
$ xmllint --xpath '//salary/text()' foo.xml
7000090000
}}}
 * Count the number of person tags: {{{
$ xmllint --xpath 'count(//person)' foo.xml
2
}}}
 * Print each person's salary separately: {{{
$ xmllint --xpath '//person[1]/salary/text()' foo.xml
70000
$ xmllint --xpath '//person[2]/salary/text()' foo.xml
90000
}}}
 * Print bob's salary: {{{
$ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml
70000
}}}
 * Print the second person's name: {{{
$ xmllint --xpath 'string(//person[2]/@name)' foo.xml
sue
}}}

=== Namespaces ===
The above examples show that it is fairly easy to parse XML when you have a decent XML parser, but this defeats the purpose of XML, which is to make everyone miserable. Therefore some clever people introduced XML namespaces.

An example of such technology is a typical Maven build file, called `pom.xml`, which looks something like this:

{{{#!xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>
}}}

There will usually be a few hundred lines dedicated to dependencies too, but let's keep it short.
Going by the examples in the previous section, extracting the version from this file should be as simple as using the XPath `/project/version/text()`:

{{{
$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty
}}}

Well, no. The author has cleverly declared a default namespace, {{{xmlns="http://maven.apache.org/POM/4.0.0"}}}, so you first have to specify that exact URL before you can ask for the text inside the version element inside the project element.

==== xmllint --shell ====
xmllint's `--xpath` option offers no way to specify a namespace, so it's now off the table (unless you edit the file and remove the namespace declaration). Its shell feature does allow setting the namespace, though:

{{{
$ xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / >  -------
1.0-SNAPSHOT
/ > 
}}}

Yay! We got the version number ... plus some prompts and crap from the xmllint shell, which will have to be removed afterwards.

==== xmlstarlet ====
xmlstarlet is a bit easier to use for this:

{{{
$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT
}}}

==== python ====
Python ships with an XML parser too, and is generally more widely available than xmllint and xmlstarlet. It also lets you deal with namespaces, in a kludgy fashion:

{{{
$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT
}}}

==== xsltproc ====
xsltproc happens to be installed on most Linux systems. For example, to extract the titles and URLs of a podcast:

{{{
xslt() {
cat << 'EOX'
# 
EOX
}

curl -s http://podcasts.files.bbci.co.uk/p02nq0lx.rss | xsltproc <(xslt) -
}}}
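
For comparison, the same kind of extraction can be sketched in Python with the bundled `xml.etree.ElementTree` parser. This is only a sketch under assumptions: it assumes a plain RSS 2.0 feed in which each `item` carries a `title` and an `enclosure` with a `url` attribute, and `SAMPLE_RSS` and `podcast_entries` are hypothetical names made up for this illustration (the sample feed stands in for a real download).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded podcast feed (assumed RSS 2.0 layout).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.com/ep1.mp3" type="audio/mpeg" length="1"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="http://example.com/ep2.mp3" type="audio/mpeg" length="1"/>
    </item>
  </channel>
</rss>
"""


def podcast_entries(rss_text):
    """Return (title, url) pairs for every <item> that has an <enclosure>."""
    root = ET.fromstring(rss_text)
    entries = []
    # iter() walks the whole tree, so the exact nesting of <item> does not matter.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        enclosure = item.find("enclosure")
        if enclosure is not None:
            entries.append((title, enclosure.get("url")))
    return entries


if __name__ == "__main__":
    for title, url in podcast_entries(SAMPLE_RSS):
        print("{}\t{}".format(title, url))
```

To try it on a real feed, one option is to pipe the `curl -s` output into the script and pass `sys.stdin.buffer.read()` to `podcast_entries` instead of `SAMPLE_RSS`.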