XQuery/Page scraping and Yahoo Weather

Background

Yahoo provides a worldwide weather forecast service through a REST API that delivers RSS. The service is described in the API documentation.

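However, the key to the feed for each UK town is a Yahoo Location ID, such as UKXX0953, and no service is available to convert location names to Yahoo codes. Yahoo does provide alphabetical index pages of locations, which contain links to the feeds themselves.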

Yahoo Pipe

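This task can be accomplished, as far as extracting the Location ID, with a Yahoo Pipe written by Paul Daniel. However, the inherent instability of HTML markup has caused the current pipe to fail.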

XQuery

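This script takes a location parameter, extracts the first letter of the location, constructs the URL of the Yahoo Weather index page for that letter (for example, the index page for the letter B), and fetches the page via the httpclient module in eXist. The page is not valid XHTML, but the httpclient:get function tidies the markup so that the result is well-formed XML.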

HTML page

The page structure can be seen in a tree view.

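Next, the script navigates this XML to locate the li element containing the location and extracts the code for that location from the link. Finally, it appends this code to the base URL of the RSS service to create the URL of the RSS feed for that location, and redirects to that URL.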

This process can be visualized with a data flow diagram.

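import module namespace request = "http://exist-db.org/xquery/request";
import module namespace response = "http://exist-db.org/xquery/response";
import module namespace httpclient = "http://exist-db.org/xquery/httpclient";

declare variable $yahooIndex := "http://weather.yahoo.com/regional/UKXX";
declare variable $yahooWeather := "http://weather.yahooapis.com/forecastrss?u=c&amp;p=";

let $location := request:get-parameter("location","Bristol")
let $letter := upper-case(substring($location,1,1))
(: the index page for A has no letter suffix :)
let $suffix := if ($letter eq 'A') then '' else concat('_',$letter)
let $index := xs:anyURI(concat($yahooIndex,$suffix,".html"))
(: fetch the page; httpclient:get tidies the HTML into well-formed XML :)
let $page := httpclient:get($index,true(),())
let $href := $page//div[@id="yw-regionalloc"]//li/a[. = $location]/@href
(: the code lies between 'forecast/' and the first '.' of the link :)
let $code := substring-after(substring-before($href,'.'),'forecast/')
let $rss := xs:anyURI(concat($yahooWeather,$code))

return
   response:redirect-to($rss)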
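
Because the special case for the letter A is easy to miss, it may help to isolate the URL construction so it can be checked without fetching anything. The following is a minimal sketch, not part of the original script; the function name local:index-url is hypothetical, and only standard XQuery 1.0 functions are used.

```xquery
xquery version "1.0";

declare variable $yahooIndex := "http://weather.yahoo.com/regional/UKXX";

(: Hypothetical helper: builds the index page URL for a location name. :)
declare function local:index-url($location as xs:string) as xs:string {
  let $letter := upper-case(substring($location, 1, 1))
  (: the page for A has no letter suffix; all other letters use _X :)
  let $suffix := if ($letter eq "A") then "" else concat("_", $letter)
  return concat($yahooIndex, $suffix, ".html")
};

(
  local:index-url("Bristol"),
  local:index-url("aberdeen")
)
```

Evaluating the query yields the two strings http://weather.yahoo.com/regional/UKXX_B.html and http://weather.yahoo.com/regional/UKXX.html, so both branches of the suffix rule are exercised.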

Notes

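Although the index page is not valid XHTML (why?) and needs tidying, Yahoo helps the scraper by using ids on the sections. This allows the XPath expression to select the relevant section by id and then select the li containing the location. However, such markup is not stable; indeed, the id was recently changed from browse to the current yw-regionalloc. Note also that extra handling is needed because the URL of the page for the letter A differs from those for the other letters, a feature that is not easy to spot or test.
eXist is not ideally suited to this task, because the page must first be stored in the database so that XPath expressions can be executed using the structural indexes. An in-memory XQuery engine such as Saxon would be expected to perform better here. Performance is currently a little slow, but the new 1.3 release improves the situation.
Using a regular expression to extract the code from the string would be cleaner, but XQuery 1.0 provides no simple matching function that returns the matched pattern. An XQuery function that wraps some XSLT to do this is described in analyse-string.
The script uses the eXist function response:redirect-to to redirect the browser to the URL constructed for the RSS feed.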
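
As a possible workaround that stays within standard XQuery 1.0, the code can be pulled out with fn:replace and a capture group, guarded by fn:matches because replace returns its input unchanged when nothing matches. This is only a sketch: the function name local:location-code is hypothetical, and the sample href assumes the relative link shape implied by the script above.

```xquery
xquery version "1.0";

(: Hypothetical helper: returns the location code from an index page link,
   or the empty sequence if the link does not have the expected shape. :)
declare function local:location-code($href as xs:string) as xs:string? {
  (: capture the text between 'forecast/' and '.html' :)
  let $pattern := "^.*forecast/([^./]+)\.html.*$"
  return
    if (matches($href, $pattern))
    then replace($href, $pattern, "$1")
    else ()
};

(: assumed link shape, using the example ID from the Background section :)
local:location-code("/forecast/UKXX0953.html")
```

For the sample link this returns UKXX0953; a link without the forecast/ segment returns the empty sequence rather than a misleading string.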

XSLT

For comparison, here is the equivalent XSLT script, which uses xsl:analyze-string.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:param name="location"/>
    <xsl:variable name="html2xml">
        <xsl:text>http://www.html2xml.nl/Services/html2xml/version1/Html2Xml.asmx/Url2XmlNode?urlAddress=</xsl:text>
    </xsl:variable>
    <xsl:variable name="yahooIndex">
        <xsl:text>http://weather.yahoo.com/regional/UKXX</xsl:text>
    </xsl:variable>
    <xsl:variable name="yahooWeather">
        <xsl:text>http://weather.yahooapis.com/forecastrss?u=c&amp;p=</xsl:text>
    </xsl:variable>
    <xsl:template match="/">
        <xsl:variable name="letter" select="upper-case(substring($location,1,1))"/>
        <xsl:variable name="suffix" select="if($letter eq 'A') then '' else concat('_',$letter)"></xsl:variable>
        <xsl:variable name="page" select="doc(concat ($html2xml,$yahooIndex,$suffix,'.html'))"/>
        <xsl:variable name="href" select="$page//div[@id='yw-regionalloc']//li/a[.= $location]/@href"/>
        <xsl:variable name="code">
            <!-- capture the location code between 'forecast/' and '.html' -->
            <xsl:analyze-string select="$href" regex="forecast/(.*)\.html">
                <xsl:matching-substring>
                    <xsl:value-of select="regex-group(1)"/>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:variable name="rssurl" select="concat($yahooWeather,$code)"/>
        <xsl:copy-of select="doc($rssurl)"/>
    </xsl:template>
</xsl:stylesheet>

Bristol weather, but this is currently broken.

XPL

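Another approach is to use XPL, developed by Erik Bruchez and Alessandro Vernet at Orbeon, to describe the sequence of transformations as a pipeline. Here the pipeline is extended to create a custom HTML page from the RSS feed.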

<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.cems.uwe.ac.uk/xpl"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  >
    <p:output id="weatherPage"/>
    <p:processor name="xslt">
        <p:annotation>construct the index page url from the parameter</p:annotation>
        <p:input name="parameter" id="location"/>
        <p:input name="xml">
            <dummy/>
        </p:input>
        <p:input name="xslt">
            <xsl:template match="/">
                <xsl:text>http://weather.yahoo.com/regional/UKXX_</xsl:text>
                <xsl:value-of select="upper-case(substring($location,1,1))"/>
                <xsl:text>.html</xsl:text>
            </xsl:template>
        </p:input>
        <p:output name="result" id="indexUrl"/>
    </p:processor>
    <p:processor name="tidy">
         <p:annotation>tidy the index page</p:annotation>
        <p:input name="url" id="indexUrl"/>
        <p:output name="xhtml" id="indexXhtml"/>
    </p:processor>
    <p:processor name="xslt">
         <p:annotation>parse the index page and construct the URL for the RSS feed</p:annotation>
        <p:input name="xml" id="indexXhtml"/>
        <p:input name="parameter" id="location"/>
        <p:input name="xslt">
            <xsl:template match="/">
                <xsl:variable name="href" select="//div[@id='yw-regionalloc']//li/a[.= $location]/@href"/>
                <xsl:text>http://weather.yahooapis.com/forecastrss?u=c%26p=</xsl:text>
                <xsl:value-of select="substring-before(substring-after($href,'forecast/'),'.html')"
                />
            </xsl:template>
        </p:input>
        <p:output name="result" id="rssUrl"/>
    </p:processor>
    <p:processor name="fetch">
        <p:annotation>fetch the RSS feed</p:annotation>
        <p:input name="url" id="rssUrl"/>
        <p:output name="result" id="RSSFeed"/>
    </p:processor>
    <p:processor name="xslt">
        <p:annotation>Convert RSS to an HTML page</p:annotation>
        <p:input name="xml" id="RSSFeed"/>
        <p:input name="xslt" href="http://www.cems.uwe.ac.uk/xmlwiki/weather/yahooRSS2HTML.xsl"/>
        <p:output name="result" id="weatherPage"/>
    </p:processor>
</p:pipeline>

Given an implementation of each named processor type, this pipeline can be executed (although rather slowly in this prototype XQuery processor).
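
To illustrate the idea of running a strictly sequential pipeline, the following sketch folds a list of steps over an initial input. It is not the prototype engine described here: it assumes an XQuery 3.0 processor with higher-order functions, the name local:run is hypothetical, and the two steps are simple stand-ins for real processors.

```xquery
xquery version "3.0";

(: Hypothetical runner: applies each step to the output of the previous one. :)
declare function local:run($steps as function(*)*, $input as item()*) as item()* {
  fold-left($steps, $input, function($acc, $step) { $step($acc) })
};

(: Stand-in steps; a real engine would map processor names such as
   "xslt", "tidy" and "fetch" to implementations. :)
let $steps := (
  function($location) { upper-case(substring($location, 1, 1)) },
  function($letter) { concat("http://weather.yahoo.com/regional/UKXX_", $letter, ".html") }
)
return local:run($steps, "Bristol")
```

With the input Bristol, the two steps produce http://weather.yahoo.com/regional/UKXX_B.html, mirroring the first processor of the pipeline above.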

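This is work in progress: at present, this XPL engine is only a very simple partial prototype, and even this simple sequential example does not conform to the XPL schema (hence the local namespace).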

The pipeline can be visualized using GraphViz.

The intention is to generate an additional image map to support links to the underlying processes, and to support the full XPL language.