XQuery/全球气温记录
最近,英国气象局发布了全球约 1600 个站点的温度记录。每个站点记录都可以在线以文本文件形式获取,例如 斯托诺韦.
本案例研究描述了一个将这些数据作为 XML 提供的项目。主页是 http://www.cems.uwe.ac.uk/xmlwiki/Climate/index.html
第一步是将纯文本转换为 XML。主页解释了此文本文件的格式。代码 030260 是世界气象组织定义的站点代码。这些文件似乎存储在国家代码目录中。(实际上,这些在世界气象组织术语中被称为区块。)
使用 HTTP 获取远程数据文件的任务是常见任务,XQuery 模块中已经存在用于执行此任务的函数。
此模块声明了一个用于解析的常量
declare variable $csv:newline:= " ";
以及用于获取文本的基本函数,该函数可以是纯文本或 Base64 编码
declare function csv:get-data ($uri as xs:string , $binary as xs:boolean) as xs:string? { (:~ : Get a file via HTTP and convert the body of the HTTP response to text : force the script to get the latest version using the HTTP Pragma header : @param uri - URI of the text file to read : @param binary - true if data is base64 encoded : @return - the body of the response as text or null :) let $headers := element headers { element header {attribute name {"Pragma" }, attribute value {"no-cache"}}} let $response := httpclient:get(xs:anyURI($uri), true(), $headers) return if ($response/@statusCode eq "200") then let $raw := $response/httpclient:body return if ($binary) then util:binary-to-string($raw) else xmldb:decode($raw) else () };
我们将创建一个包含用于执行解析的函数的 XQuery 模块
module namespace met = "http://www.cems.uwe.ac.uk/xmlwiki/met";
需要导入 csv 模块
import module namespace csv = "http://www.cems.uwe.ac.uk/xmlwiki/csv" at "../lib/csv.xqm";
现在是用于解析 MET 气候数据的函数
declare function met:station-to-xml ($station as xs:string) as element(TemperatureRecord)? { (:~ : GET and parse a MET office temperature record as documented in : http://www.metoffice.gov.uk/climatechange/science/monitoring/subsets.html : @param the station number : @return the temperature record as an adhoc XML structure matched closely to the terms used in the original record :) let $country := substring($station,1,2) (: this is the directory for all temperature records in a country :) (: construct the URI for the corresponding record :) let $uri := concat("http://www.metoffice.gov.uk/climatechange/science/monitoring/reference/",$country,"/",$station) (:GET and convert to plain text :) let $data := csv:get-data($uri,false()) return if (empty($data)) then () else (: split into two sections :) let $headertext := substring-before($data,"Obs:") (: the first section contains the meta data in the form of name=value statements :) let $headers := tokenize($headertext,$csv:nl) (: the second section is the temperature record, year by year :) let $temperatures := substring-after ($data,"Obs:") let $years := tokenize($temperatures, $csv:nl) return element TemperatureRecord { element sourceURI {$uri}, (: the original temperature record :) for $header in $headers (: split each line into a name and its value :) let $name := replace(substring-before($header,"=")," ","") (: to create a valid XML name, just remove any spaces :) let $value := normalize-space(substring-after ($header,"=")) where $name ne "" return element {$name} { (:create an XML element with the name :) if ($name = ("Normals","Standarddeviations")) (: these names have values which are a list of temperatures :) then for $temp in tokenize($value,"\s+") (: temperatures are space-separated :) return element temp_C {$temp} else if ($name = ("Name","Country")) (: these names contain redundant hyphens :) then replace ($value,"-","") else if ($name = "Long") (: the convention for signing longitudes in this data is the reverse of the usual E +, W - convention :) then - xs:decimal($value) else $value }, for $year in $years let $value := tokenize($year,"\s+") where $year ne "" return element monthlyAverages { attribute year {$value[1]}, (: the first value in the row is the year :) for $i in (2 to 13) (: the remainder are the temperatures for the months Jan to Dec :) let $temp := $value[$i] return element temp_C { if ($temp ne '-99.0') (: generate all months, but those with no reading indicated by -99 will be empty :) then $temp else () } } } };
主脚本使用这些函数来转换给定站点的记录
(:~ : convert climate file to XML : @param station id of station :) import module namespace met = "http://www.cems.uwe.ac.uk/xmlwiki/met" at "met.xqm"; let $station := request:get-parameter("station",()) return local:station-to-xml($station,false())
站点 ID 基于世界气象组织定义的 ID。可以在线以 文本文件形式获取所有站点的完整列表,并附有 支持文档.
典型的记录是
00;000;PABL;Buckland, Buckland Airport;AK;United States;4;65-58-56N;161-09-07W;;;7;;
这些记录的格式是
- 区块号 表示世界气象组织分配的区块的 2 位数字。
- 站点号 表示世界气象组织分配的站点的 3 位数字。
- ICAO 地点指示符 4 个字母数字字符,并非所有此文件中的站点都已分配地点指示符。对于未分配地点指示符的站点,使用“----”。
- 地点名称 站点位置的通用名称。
- 州 2 个字符的缩写(仅适用于位于美国的站点)。
- 国家名称 国家名称为 ISO 简短英文形式。
- 世界气象组织区域 1 到 6 位数字,表示相应的世界气象组织区域,7 代表世界气象组织南极洲区域。
- 站点纬度 DD-MM-SSH,其中 DD 为度,MM 为分,SS 为秒,H 为 N 代表北半球或 S 代表南半球。对于秒值未知的站点,秒值被省略。
- 站点经度 DDD-MM-SSH,其中 DDD 为度,MM 为分,SS 为秒,H 为 E 代表东半球或 W 代表西半球。对于秒值未知的站点,秒值被省略。
- 高空纬度 DD-MM-SSH,其中 DD 为度,MM 为分,SS 为秒,H 为 N 代表北半球或 S 代表南半球。对于秒值未知的站点,秒值被省略。
- 高空经度 DDD-MM-SSH,其中 DDD 为度,MM 为分,SS 为秒,H 为 E 代表东半球或 W 代表西半球。对于秒值未知的站点,秒值被省略。
- 站点海拔(Ha) 站点海拔(米)。如果未知,则省略该值。
- 高空海拔(Hp) 高空海拔(米)。如果未知,则省略该值。
- RBSN 指示符 如果站点由世界气象组织定义为属于区域基本天气观测网,则为 P,否则省略。
需要一个函数来将纬度和经度从 DD-MM-SSH 格式转换。这会因该格式的变体而变得复杂。这些变体全部出现在数据中
- DD-MMH
- DD-MH
- DD-MM-SH
- DD-MM-SSH
由于这种格式出现在其他数据中,因此已将其添加到通用地理函数模块中。
declare function geo:lz ($n as xs:string?) as xs:integer { xs:integer(concat (string-pad("0",2 - string-length($n)),$n)) }; declare function geo:dms-to-decimal($s as xs:string) as xs:decimal { (:~ : @param $s - input string in the format of DD-MMH, DD-MH, DD-MM-SH,* DD-MM-SSH : where H is NSE or W : @return decimal degrees :) let $hemi := substring($s,string-length($s),1) let $rest := substring($s,1, string-length($s)-1) let $f := tokenize($rest,"-") let $deg := geo:lz($f[1]) let $min:= geo:lz($f[2]) let $sec := geo:lz($f[3]) let $dec :=$deg + ($min + $sec div 60) div 60 let $dec := round-half-to-even($dec,6) return if ($hemi = ("S","W")) then - $dec else $dec };
必须导入 geo 模块
import module namespace geo = "http://www.cems.uwe.ac.uk/xmlwiki/geo" at "../lib/geo.xqm";
解析站点数据。
declare function met:WMO-to-xml ($station as xs:string ) as element (station) { (:~ : @param $station string describing a station : Upper Air data is ignored at present. :) let $f := tokenize(normalize-space($station),";") let $cid := concat($f[1],$f[2],"0") (: this constructs the equivalent id used in the temperature records :) return element station{ element block {$f[1]}, element number {$f[2]}, element id {$cid}, if ($f[3] eq "----") then () else element ICAO {$f[3]}, element placeName {$f[4]}, if ($f[5] ne "") then element state {$f[5]} else (), element country {$f[6]}, element WMORegion {$f[7]}, element latitude {geo:dms-to-dec($f[8])}, element longitude {geo:dms-to-dec($f[9])}, if ($f[12] ne "") then element elevation {$f[12]} else (), if ($f[14] = "P") then element RBSN {} else () } };
XQuery 脚本获取文本文件并将每行转换为 XML 站点元素。然后将这些元素逐个插入到一个空的 XML 文件中。
import module namespace met = "http://www.cems.uwe.ac.uk/xmlwiki/met" at "met.xqm"; import module namespace csv = "http://www.cems.uwe.ac.uk/xmlwiki/csv" at "../csv.xqm"; <results> { (: create the empty XML document :) let $login := xmldb:login("/db/Wiki/Climate","user","password") let $store := xmldb:store("/db/Wiki/Climate/Stations","metstations.xml",<stations/>) let $doc := doc($store)/stations (: get the text list of stations and convert :) let $station-list := "http://weather.noaa.gov/data/nsd_bbsss.txt" let $csv := csv:get-data($station-list,false()) for $data in tokenize($csv,$nl) where $station ne "" return let $station := met:WMO-station-to-xml($data) let $update := update insert $station into $doc return <station>{$xml/id}</station> } </results>
总共有 11000 多个站点。需要对其进行索引以实现高效访问。在 eXist 中,索引在每个集合(目录)的配置文件中定义。对于将写入站点 XML 文档的集合,配置文件为
<collection xmlns="http://exist-db.org/collection-config/1.0"> <index> <create qname="id" type="xs:string"/> <create qname="country" type="xs:string"/> </index> </collection>
这意味着集合中的所有 XML 文档都将在 qname id 和 country 中被索引,无论这些 qname 出现在 XML 结构中的何处。索引将在将文档添加到集合或更新现有文档时执行。如果需要,可以强制重新索引。
如果站点数据存储在集合 /db/Wiki/Climate/Stations 中,那么此配置文件将存储在 /db/system/config/db/Wiki/Climate/Stations 中,名为 configuration.xconf
由于代码将在多个地方引用此集合,因此我们向库模块添加一个常量来引用车站集
declare variable $met:WMOStations := doc ("/db/Wiki/Climate/Stations/metstations.xml")//station;
需要完整列出车站才能提供索引。此数据不以简单文件的形式提供,而是以 JavaScript 数组的形式编码在 HTML 页面上。
locations[1]=["409380|Afghanistan, Islamic State Of / Afghanistan, Etat Islamique D'|Herat"",409480|Afghanistan, Islamic State Of / Afghanistan, Etat Islamique D'|Kabul Airport","409900|Afghanistan, Islamic State Of / Afghanistan, Etat Islamique D'|Kandahar Airport"]; ...
但是这里没有位置数据,因此我们将从 WMO 车站列表中获取这些数据
将此数据转换为 XML 的方法是
- 查看 HTML 页面的源代码
- 找到车站列表
- 复制文本
- 将文本文件保存在 eXist 数据库中
- 脚本读取此文件并将其解析为 XML
- 结果 XML 通过 WMO 车站数据增加了经度和纬度。
- 最终 XML 文档存储在数据库中的同一 Station 目录中
(:~ : convert the text representation of MET stations from the WMO list to XML :) <stationList> { (: get the raw data from a text file stored as base64 in the eXist dataabse :) let $text := util:binary-to-string(util:binary-doc("/db/Wiki/Climate/cstations.txt")) (: ; separates the stations in each country :) for $country in tokenize($text,";") (: the station list is the array element content i.e. the string between =[ and ] :) let $stationlist := substring-before(substring-after($country,"=["),"]") (: The stations in each country are comma-separated, but commas are also used within the names of countries and stations. However a comma followed by a double quote is the required separator. :) let $stations := tokenize($stationlist,',"') for $station in $stations (: some cleanup of names is needed :) let $data :=replace ( replace($station,'"',"")," ","") (: Each station is in the format of Stationid | English name / French name :) let $f := tokenize($data,"\|") let $id := $f[1] let $country := tokenize($f[2],"/") let $WMOStation := $met:WMOStations[id=$id] (: create a station element containing the id , country and english station name :) return element station { element id {$f[1]}, element country {normalize-space($country[1])}, element location {$f[3]}, $WMOStation/latitude, $WMOStation/longitude } } </stationList>
将此文件存储在同一 Stations 集合中意味着它将根据与完整 WMO 车站数据相同的元素名称、id 和 country 进行索引。
此车站集也将引用多个地方,因此我们定义了一个变量
declare variable $met:tempStations := doc ("/db/Wiki/Climate/Stations/tempstations.xml")//station;
我们将使用 XSLT 将此 XML 转换为车站位置和温度图表的形式。最初的样式表由 Dave Challender 开发。
(将添加解释)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" exclude-result-prefixes="msxsl">
<!-- Authored by Dave Callender, minor mods by Chris Wallace -->
<xsl:output method="html"/>
<xsl:param name="start-year" select="1000"/>
<xsl:param name="end-year" select="3000"/>
<xsl:template match="Station">
<html>
<head>
<script type="text/javascript" src="http://www.google.com/jsapi"/>
<title>
<xsl:value-of select="station/placeName"/>
<xsl:text> </xsl:text>
<xsl:value-of select="station/country"/>
</title>
</head>
<body>
<xsl:apply-templates select="station"/>
<xsl:apply-templates select="TemperatureRecord" mode="googlevis"/>
<xsl:apply-templates select="TemperatureRecord" mode="table"/>
<xsl:apply-templates select="TemperatureRecord" mode="smoothed"/>
</body>
</html>
</xsl:template>
<!-- Visualization of the full temperature record -->
<xsl:template match="TemperatureRecord" mode="googlevis">
<p/>
<p>Google visualization timeline (takes no account of standard deviation etc.)</p>
<div id="chart_div" style="width: 700px; height: 440px;"/>
<p/>
<script type="text/javascript">
google.load('visualization', '1', {'packages':['annotatedtimeline']});
google.setOnLoadCallback(drawChart);
function drawChart() {
var data = new google.visualization.DataTable();
data.addColumn('date', 'Date');
data.addColumn('number', 'temp');
data.addRows([
<xsl:apply-templates select="monthlyAverages[@year][@year >= $start-year][@year <= $end-year]" mode="googlevis"/>
[null,null]
]);
var chart = new google.visualization.AnnotatedTimeLine(document.getElementById('chart_div'));
chart.draw(data, {displayAnnotations: true});
}
</script>
</xsl:template>
<xsl:template match="temp_C" mode="googlevis">
<xsl:if test="(node())">
<xsl:text>[new Date(</xsl:text>
<xsl:value-of select="../@year"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="position() - 1 "/>
<!-- Google viz uses 0-based arrays -->
<xsl:text>,15),</xsl:text>
<xsl:value-of select="."/>
<xsl:text>],
</xsl:text>
</xsl:if>
</xsl:template>
<!-- Vizualisation of the smoothed data -->
<xsl:template match="TemperatureRecord" mode="smoothed">
<p/>
<p>Almost totally meaningless - sum all temps for a year and divide by 12 (only do if all 12
data points) but shows a bit of playing with data</p>
<p/>
<div id="smoothed_chart_div" style="width: 700px; height: 440px;"/>
<script type="text/javascript">
google.load('visualization', '1', {'packages':['annotatedtimeline']});
google.setOnLoadCallback(drawChartSmoothed);
function drawChartSmoothed()
{
var data = new google.visualization.DataTable();
data.addColumn('date', 'Date');
data.addColumn('number', 'temp');
data.addRows([
<xsl:apply-templates select="monthlyAverages[@year][@year >= $start-year][@year <=$end-year]" mode="smoothed"/>
[null,null]
]);
var chart = new google.visualization.AnnotatedTimeLine(document.getElementById('smoothed_chart_div'));
chart.draw(data, {displayAnnotations: true});
}
</script>
</xsl:template>
<xsl:template match="monthlyAverages" mode="smoothed">
<xsl:if test="count(temp_C[node()])=12">
<xsl:text>[new Date(</xsl:text>
<xsl:value-of select="@year"/>
<xsl:text>,5,15),</xsl:text>
<xsl:value-of select="sum(temp_C[node()]) div 12"/>
<xsl:text>],
</xsl:text>
</xsl:if>
</xsl:template>
<!-- Data tabulated -->
<xsl:template match="TemperatureRecord" mode="table">
<table border="1">
<tr>
<td>Year</td>
<td>Jan</td>
<td>Feb</td>
<td>Mar</td>
<td>Apr</td>
<td>May</td>
<td>Jun</td>
<td>Jul</td>
<td>Aug</td>
<td>Sep</td>
<td>Oct</td>
<td>Nov</td>
<td>Dec</td>
<tr/>
</tr>
<xsl:apply-templates
select="monthlyAverages[@year][@year >= $start-year][@year < $end-year]"
mode="table"/>
</table>
</xsl:template>
<xsl:template match="monthlyAverages" mode="table">
<tr>
<td>
<xsl:value-of select="@year"/>
</td>
<xsl:apply-templates select="temp_C" mode="table"/>
</tr>
</xsl:template>
<xsl:template match="temp_C" mode="table">
<td>
<xsl:value-of select="."/>
</td>
</xsl:template>
<xsl:template match="Number">
<p> Station Number:  <xsl:value-of select="."/>
</p>
</xsl:template>
<xsl:template match="station">
<h1>
<xsl:value-of select="placeName"/>
<xsl:text>, </xsl:text>
<xsl:value-of select="country"/>
<xsl:text> </xsl:text>
</h1>
<a href="http://maps.google.com/maps?q={latitude},{longitude}">
<img
src="http://maps.google.com/maps/api/staticmap?zoom=11&maptype=hybrid&size=400x300&sensor=false&key=ABQIAAAAVehr0_0wqgw_UOdLv0TYtxSGVrvsBPWDlNZ2fWdNTHNT32FpbBR1ygnaHxJdv-8mkOaL2BJb4V_yOQ&markers=color:blue|{latitude},{longitude}"
alt="{placeName}"/>
</a>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
我们希望呈现原始 XML 或 HTML 可视化页面。我们可以使用两个脚本,或者将它们组合成一个脚本,并使用参数来指示如何呈现输出。eXist 函数允许动态设置输出的序列化和 mime 类型。
import module namespace met = "http://www.cems.uwe.ac.uk/xmlwiki/met" at "met.xqm"; let $id := request:get-parameter("station",()) let $render := request:get-parameter("render",()) let $station := doc ("/db/Wiki/Climate/Stations/metstations.xml")//station[id = $id] let $tempStation := doc("/db/Wiki/Climate/Stations/tempstations.xml")//station[id = $id] let $temp := if ($tempStaion) then met:station-to-xml($id) else () let $station := <Station> {$station} {$temp} </Station> return if ($render="HTML") then let $ss := doc("/db/Wiki/Climate/FullHTMLMet-V2.xsl") let $options := util:declare-option("exist:serialize","method=xhtml media-type=text/html") let $start-year := request:get-parameter("start","1000") let $end-year := request:get-parameter("end","2100") let $params := <parameters> <param name="start-year" value="{$start-year}"/> <param name="end-year" value="{$end-year}"/> </parameters> return transform:transform($station,$ss,$params) else let $header := response:set-header("Access-Control-Allow-Origin","*") return $station
我们可以使用存储的车站列表来创建一个简单的 HTML 索引。
import module namespace met = "http://www.cems.uwe.ac.uk/xmlwiki/met" at "met.xqm"; declare option exist:serialize "method=xhtml media-type=text/html"; <html> <head> <title>Index of Temperature Record Stations </title> </head> <body> <h1>Index of Temperature Record Stations </h1> { for $country in distinct-values($met:tempStations/country) order by $country return <div> <h3>{$country} </h3> {for $station in $met:tempStations[country=$country] let $id := $station/id order by $station/location return <span><a href="station.xq?station={$id}&render=HTML">{string($station/location)}</a> </span> } </div> } </body> </html>
我们还可以生成一个(大型)KML 叠加层,其中包含指向每个车站页面的链接。
我们需要一个函数将车站转换为具有指向 HTML 车站页面的链接的 PlaceMark
declare function met:station-to-placemark ($station) { let $description := <div> <a href="http://www.cems.uwe.ac.uk/xmlwiki/Climate/station.xq?station={$station/id}&render=HTML">Temperature Record</a> </div> return <Placemark> <name>{string($station/location)}, {string($station/country)}</name> <description>{util:serialize($description,"method=xhtml")} </description> <Point> <coordinates>{string($station/longitude)},{string($station/latitude)},0</coordinates> </Point> </Placemark> };
然后主脚本遍历所有气温站以生成完整的 KML 文件。
import module namespace met ="http://www.cems.uwe.ac.uk/xmlwiki/met" at "met.xqm"; declare option exist:serialize "method=xml media-type=application/vnd.google-earth.kml+xml indent=yes omit-xml-declaration=yes"; let $x := response:set-header('Content-Disposition','attachment;filename=country.kml') return <kml xmlns="http://www.opengis.net/kml/2.2"> <Folder> <name>Stations</name> { for $station in $met:tempStations return met:station-to-placemark($station) } </Folder> </kml>
- 资源 URI
- RDF